These documents refer to an obsolete way of installing and running FALCON. They will remain up for historical context and for individuals still using the older version of FALCON/FALCON_unzip.
The current PacBio Assembly suite documentation, which includes new bioconda instructions for installing FALCON, FALCON_unzip, and their associated dependencies, can be found at pb_assembly.
FALCON and FALCON-Unzip are de novo genome assemblers for PacBio long reads, also known as
single-molecule real-time (SMRT) sequences.
FALCON is a diploid-aware assembler
which follows the hierarchical genome assembly process (HGAP) and is optimized for
large genome assembly (e.g. non-microbial).
FALCON produces a set of primary contigs (p-contigs) and a set of associate contigs (a-contigs),
which represent divergent allelic variants. Each a-contig is associated with a homologous
genomic region on a p-contig.
FALCON-Unzip is a true diploid assembler. It takes the contigs from
FALCON and phases the reads based on heterozygous SNPs identified in the initial
assembly. It then produces a set of partially phased primary contigs and fully phased
haplotigs which represent divergent haplotypes.
The hierarchical genome assembly process proceeds in two rounds. The first round of assembly involves the selection of seed reads, the longest reads in the dataset (set by the user-defined length_cutoff). All shorter reads are aligned to the seed reads in order to generate consensus sequences with high accuracy. We refer to these as pre-assembled reads (preads), but they can also be thought of as “error corrected” reads. During the pre-assembly process, seed reads may be split or trimmed at regions of low read coverage (user-defined min_cov for falcon_sense_option). The performance of the pre-assembly process is captured in the pre-assembly stats file.
In the next round of HGAP, the preads are aligned to each other and assembled into genomic contigs.
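The seed selection step of the first round can be illustrated with a minimal sketch. It is not FALCON's implementation: the function name `split_seed_reads`, the read names, and the choice to treat reads exactly at the cutoff as seeds are all assumptions made for illustration. Only the idea of a user-defined `length_cutoff` comes from the text above.

```python
def split_seed_reads(read_lengths, length_cutoff):
    """Partition reads into seed reads and shorter reads by length.

    read_lengths: dict mapping read name -> read length in bases.
    Reads at or above length_cutoff are treated as seeds (an assumption;
    the exact boundary behaviour in FALCON is not stated in this document).
    """
    seeds = {name: n for name, n in read_lengths.items() if n >= length_cutoff}
    shorter = {name: n for name, n in read_lengths.items() if n < length_cutoff}
    return seeds, shorter


# Hypothetical read lengths, for illustration only.
reads = {"read_a": 18000, "read_b": 4200, "read_c": 12500, "read_d": 9000}
seeds, shorter = split_seed_reads(reads, length_cutoff=10000)
print(sorted(seeds))    # names of the reads selected as seeds
print(sorted(shorter))  # names of the reads that would be aligned to the seeds
```

In the real pipeline the shorter reads are then aligned to the seeds to build consensus sequences; this sketch only covers the partitioning.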
For more complex genomes assembled with FALCON,
“bubbles” in the contig-assembly graph that result from structural variation between haplotypes may be resolved as associate
and primary contigs. The unzip process will extend haplotype phasing beyond “bubble” regions, increasing the amount of phased
contig sequence. It is important to note that
while individual haplotype blocks are phased, phasing does not extend between haplotigs. Thus, in part C) of the
figure above, haplotig_1 and haplotig_2 may originate from different parental haplotypes. Additional information is
needed to phase the haplotype blocks with each other.
Associate contig IDs contain the name of their primary contig, but the precise location of alignment must be determined with third-party
tools such as NUCmer. For example, in a
FALCON assembly, 000123F-010-01 is an associate contig of primary contig
000123F. In a
FALCON-Unzip assembly, 000123F_001 is a haplotig of primary contig 000123F.
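The naming convention can be turned into a small lookup helper. The sketch below is hypothetical: the regular expressions are inferred only from the two example IDs given here, and real assemblies may carry additional suffixes that these patterns do not cover.

```python
import re

# Patterns inferred from the example IDs 000123F-010-01 and 000123F_001.
# They are assumptions, not a specification of FALCON's naming scheme.
ASSOCIATE_RE = re.compile(r"^(\d+F)-\d+-\d+$")
HAPLOTIG_RE = re.compile(r"^(\d+F)_\d+$")
PRIMARY_RE = re.compile(r"^\d+F$")


def classify_contig(contig_id):
    """Return (kind, primary_id) for a contig ID.

    kind is one of "associate", "haplotig", "primary", or "unknown".
    primary_id is the name of the primary contig, or None if unknown.
    """
    match = ASSOCIATE_RE.match(contig_id)
    if match:
        return "associate", match.group(1)
    match = HAPLOTIG_RE.match(contig_id)
    if match:
        return "haplotig", match.group(1)
    if PRIMARY_RE.match(contig_id):
        return "primary", contig_id
    return "unknown", None


for contig_id in ("000123F", "000123F-010-01", "000123F_001"):
    print(contig_id, classify_contig(contig_id))
```

The helper recovers only the parent contig name. As noted above, the position of the associate contig or haplotig on its primary contig still has to come from an alignment.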
Below are examples of alignments between associate and primary contigs from
FALCON, and between haplotigs and primary contigs from
FALCON-Unzip. Alignments were built with NUCmer and visualized with Assemblytics. Precise coordinates
may be obtained with the show-coords utility from MUMmer.
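Once coordinates have been exported, they can be summarized per haplotig or associate contig. The following sketch assumes tab-delimited rows with nine columns in the order reference start, reference end, query start, query end, reference alignment length, query alignment length, percent identity, reference ID, query ID. This is the layout expected from `show-coords -T -H`, but it should be checked against your own MUMmer version. The example row is invented for illustration.

```python
from collections import namedtuple

Alignment = namedtuple(
    "Alignment",
    "ref_start ref_end qry_start qry_end ref_len qry_len identity ref_id qry_id",
)


def parse_coords(lines):
    """Parse tab-delimited coordinate rows into Alignment tuples.

    Rows with fewer than nine fields, or whose first field is not an
    integer (for example header lines), are skipped.
    """
    alignments = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or not fields[0].isdigit():
            continue
        s1, e1, s2, e2, len1, len2 = (int(x) for x in fields[:6])
        alignments.append(
            Alignment(s1, e1, s2, e2, len1, len2, float(fields[6]), fields[7], fields[8])
        )
    return alignments


def placement(alignments, qry_id):
    """Return (ref_id, min_start, max_end) spanned by one query contig, or None."""
    hits = [a for a in alignments if a.qry_id == qry_id]
    if not hits:
        return None
    starts = [min(a.ref_start, a.ref_end) for a in hits]
    ends = [max(a.ref_start, a.ref_end) for a in hits]
    return hits[0].ref_id, min(starts), max(ends)


# Invented example row: haplotig 000123F_001 aligned to primary contig 000123F.
example = ["15001\t42000\t1\t27000\t27000\t27000\t99.20\t000123F\t000123F_001\n"]
print(placement(parse_coords(example), "000123F_001"))
```

The `placement` helper reports the outer span only, and it assumes all hits for a query fall on the same reference contig. Gaps or rearrangements within that span are better inspected in the alignment plots.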
Choosing an Assembler: HGAP4 vs FALCON vs FALCON-Unzip
HGAP4, part of the SMRT Link web-based analysis suite, is suitable for genomes of known complexity no larger than
human (3 Gb or smaller), although the
compute resources for your SMRT Link instance will influence performance and feasibility. The assembly process for
HGAP4 in the SMRT Link GUI (graphical user interface) is identical to
FALCON at the command line, besides
compute resource configuration and minor differences in directory structure. The
HGAP4 pipeline by default includes a round of polishing,
which employs the resequencing pipeline.
HGAP4 RESULTS ARE NOT COMPATIBLE WITH
FALCON-Unzip AT THIS TIME!
HGAP4 inputs are a PacBio subread BAM dataset, either Sequel or RSII. The FASTA and FASTQ files output from
HGAP4 are a concatenation of the primary
and associate contigs, which are output from
FALCON as separate files.
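A concatenated FASTA of this kind can be separated back into primary and associate contigs by inspecting the record IDs. The sketch below is a hypothetical helper, not part of HGAP4 or FALCON. It assumes that associate contig IDs contain a hyphen (as in 000123F-010-01) while primary contig IDs do not, and it ignores any text after a `|` in the ID in case a polishing step has appended a suffix. Both assumptions should be checked against your own output.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_primary_associate(lines):
    """Split FASTA records into (primary, associate) dicts keyed by header."""
    primary, associate = {}, {}
    for header, seq in read_fasta(lines):
        # Keep only the ID itself: first whitespace-delimited token,
        # minus any "|suffix" (assumed to be added by polishing).
        contig_id = header.split()[0].split("|")[0]
        target = associate if "-" in contig_id else primary
        target[header] = seq
    return primary, associate


# Tiny invented example standing in for a concatenated HGAP4 FASTA.
example = [">000123F\n", "ACGTACGT\n", "ACGT\n", ">000123F-010-01\n", "ACGTTCGT\n"]
primary, associate = split_primary_associate(example)
print(sorted(primary), sorted(associate))
```

To run it on a file, pass an open file handle in place of the `example` list, since the function only needs an iterable of lines.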
Users more comfortable at the command line may use
FALCON for genomes of any size
or complexity. Command line inputs are FASTA files of Sequel or RSII subreads. Command-line
FALCON does not automatically polish the assembly. If a user
wishes, assembly polishing may
be run using the
resequencing pipeline of pbsmrtpipe (available for command-line installation using the SMRT_Link download, see
installation instructions). Resequencing requires PacBio subread BAM inputs.
We recommend the
FALCON-Unzip module for heterozygous or outbred organisms that are diploid or higher ploidy. Users wishing to run
FALCON-Unzip must do so only after running
FALCON on the command line.
HGAP4 IS NOT COMPATIBLE WITH FALCON-Unzip. The
FALCON-Unzip module requires both FASTA and PacBio BAM inputs for subreads.