Here is the usage information for running fc_ovlp_to_graph.py:
usage: fc_ovlp_to_graph.py [-h] [--min_len MIN_LEN] [--min_idt MIN_IDT] [--lfc] overlap_file a example string graph assembler that is desinged for handling diploid genomes positional arguments: overlap_file a file that contains the overlap information. optional arguments: -h, --help show this help message and exit --min_len MIN_LEN minimum length of the reads to be considered for assembling --min_idt MIN_IDT minimum alignment identity of the reads to be considered for assembling --lfc use local flow constraint method rather than best overlap method to resolve knots in string graph
In some case, you might want to lower the min_idt to keep more overlap or increase min_len to reduce the number of
overlap used for constructing the contig after the overlap filtering step. The
--lfc toggles the rule for resolving
local knots in the graph. If
--lfc is not specified, “the best overlapped edge” will be kept when there are multiple
in- or out- edges from a node while the others will be removed.
The first stage of the assembly is to construct the initial string graph and classify each edges in the string graph.
sg_edges_list contained the information of the information of the edges in the full string graph and the
classification. For example, 5 edges are shown in the five lines of the file below
$ head -5 sg_edges_list 000017363:B 000007817:E 000007817 10841 28901 10841 99.52 TR 000015379:E 000004331:B 000004331 6891 0 18178 99.35 TR 000006813:B 000000681:E 000000681 7609 23795 7616 99.72 TR 000002258:E 000002505:B 000002505 5850 0 17215 99.62 TR 000013449:B 000012565:B 000012565 3317 0 20570 99.72 G
The first two columns indicates the in and out node of the edge. The node notation contains two files operated by :.
The first field is the read identifier. The second field is either B or E. B is the
5' end of the read and E is the
3' end of the reads. The next three field indicates the corresponding sequences of the edges. In this example, the
edge in the first line contains the sequence from read
[10841, 28901). If the second coordinate is
smaller than the first one, it means the corresponded sequence is reverse complimented. The next two column are the
number of overlapped base and the overlap identity. The final column is the classification. Currently, there are 4
S. An edge with type
G is used for the final string graph. A
means the edge is transitive reducible.
R means the edge is removed during the local repeat resolution and
means the edge is likely to be a “spur” which only one ends is connected.
The initial string graph is further to be simplified into a set of “unitig” edges. The
utg_data file contains the
details of each unitig. Each line in the file represents a unitig. The first three fields are “start node”, “via node”,
and “end node”. Two untigs might have the same “start node” and “end node”, so we need another “via node” to uniquely
identify the unitigs. Here is an example of the utg_data files:
$ head -10 utg_data 000015696:B 000009028:B 000016941:B contained 16438 134865 000015696:B~000006612:B~000002456:B~000014643:B~000007407:B~000015939:E~000009028:B~000016941:B 000010623:B 000015633:B 000014991:B contained 30158 18666 000010623:B~000015633:B~000014991:B 000015636:B 000002245:B 000010757:E contained 15402 40356 000015636:B~000002245:B~000010757:E 000014184:E NA 000012028:E compound 14895 56928 000014184:E~000012765:E~000012028:E|000014184:E~000007953:B~000012028:E 000010757:B NA 000015636:E compound 15402 40356 000010757:B~000002245:E~000015636:E|000010757:B~000014783:E~000015636:E 000014184:E 000007953:B 000012028:E contained 14792 32932 000014184:E~000007953:B~000012028:E 000010623:B NA 000014991:B compound 30148 163627 000010623:B~000015633:B~000014991:B|000010623:B~000001407:B~000014991:B 000012028:B 000012765:B 000014184:B contained 19137 56928 000012028:B~000000382:E~000012765:B~000014184:B 000016941:B 000003353:B 000008783:B simple 88381 615439 000016941:B~000003353:B~000010261:B~000011789:E~000017006:B~000016307:B~... 000014991:B 000013790:E 000002926:B simple 392373 2274104 000014991:B~000013790:E~000004614:B~000003329:B~000004898:B~000000461:B~000017105:E~...
The forth field indicates the type of the unitigs, the fifth field is the estimate length of the unitig and the six
field is the total number of overlapped bases in the unitig. There are three kinds of unitigs: “simple”, “contained”,
and “compound”. “Simple” unitigs are those unitigs which are just a simple path (every node has one in- and one
out-edge except the begining and ending nodes of the path.) It is represented by a list of nodes which each node
is separated by ~ characters in the 7th column. The “contained” contigs are simple path but those unitigs are also
part of other “compound” paths. The “compound” unitigs represents bubble-like subgraph in the graph. While it is not
“simple”, it has well defined in- and out- nodes and they are treated as a single unit when the contigs are
constructed. The structure inside a “compound” unitig can be from biological nature or sequencing/alignment errors.
Each edge in the “compound” unitig subgraph are encoded explicitly as a collection of simple contained unitigs in
the 7th column. The contained unitigs within a compound unitig are separated by the
The file ctg_paths encodes the graph for each contig after the unitigs are analyzed and put into contigs. Each line
has 7 columns. The first column is the contig ID. The contig ID are just the serial numbers followed by R or F.
Two contigs with same serial number but different endings are “dual” to each other. Namely, they are constructed
from “dual” edges and they are mostly reverse complemented to each other except near the ends of the contigs.
The second column is the type of contig. If a unitig is circular (the beginning node and the ending node are the
same), then it will be marked as
ctg_circular. Everything else will be
ctg_linear. In some case, even a contig
is marked as
ctg_linear, it can be still a circular contig if the beginning node and the ending node are the
same but it is not a “simple” path. One can detect that by checking the beginning and ending nodes if necessary.
The third field indicates the first unitig in the contig in the form of begin_node~via_node~end_node. The fourth
field is the ending node of the contig. The 5th and 6th fields are the estimated length and the overlapped
based of the contig respectively. The final column are the unitigs in the contig. The three node format unitig
IDs are separated by