Caution

These documents refer to an obsolete way of installing and running FALCON. They will remain up for historical context and for individuals still using the older version of FALCON/FALCON_unzip.

Attention

The current PacBio Assembly suite documentation, which includes new bioconda instructions for installing FALCON, FALCON_unzip, and their associated dependencies, can be found at pb_assembly.

FALCON Assembler

About FALCON

Overview

FALCON and FALCON-Unzip are de novo genome assemblers for PacBio long reads, also known as single-molecule real-time (SMRT) sequences. FALCON is a diploid-aware assembler which follows the hierarchical genome assembly process (HGAP) and is optimized for large genome assembly (e.g. non-microbial). FALCON produces a set of primary contigs (p-contigs) as the primary assembly and a set of associate contigs (a-contigs), which represent divergent allelic variants. Each a-contig is associated with a homologous genomic region on a p-contig.

FALCON-Unzip is a true diploid assembler. It takes the contigs from FALCON and phases the reads based on heterozygous SNPs identified in the initial assembly. It then produces a set of partially phased primary contigs and fully phased haplotigs, which represent divergent haplotypes.

Detailed Description

The hierarchical genome assembly process proceeds in two rounds. The first round of assembly involves the selection of seed reads, which are the longest reads in the dataset (set by the user-defined length_cutoff). All shorter reads are aligned to the seed reads in order to generate consensus sequences with high accuracy. We refer to these as pre-assembled reads (preads), but they can also be thought of as “error corrected” reads. During the pre-assembly process, seed reads may be split or trimmed at regions of low read coverage (user-defined min_cov for falcon_sense_option). The performance of the pre-assembly process is captured in the pre-assembly stats file.
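The seed-selection step described above can be illustrated with a minimal Python sketch. This is not FALCON's implementation: the function name `select_seed_reads`, the use of a name-to-length dictionary, and the choice of `>=` for the comparison are all assumptions made for illustration only.

```python
def select_seed_reads(read_lengths, length_cutoff):
    """Split reads into seed reads and shorter reads by length.

    read_lengths: dict mapping a read name to its length in bases.
    length_cutoff: minimum length for a read to count as a seed
                   (assumed inclusive here; hypothetical choice).
    """
    seeds = {}
    shorter = {}
    for name, length in read_lengths.items():
        if length >= length_cutoff:
            seeds[name] = length
        else:
            shorter[name] = length
    return seeds, shorter


# Made-up read lengths, for illustration only.
reads = {"read_a": 15000, "read_b": 4000, "read_c": 22000, "read_d": 9000}
seeds, shorter = select_seed_reads(reads, length_cutoff=10000)
print("seed reads:", sorted(seeds))
print("reads aligned to seeds:", sorted(shorter))
```

In the process described above, the shorter reads would then be aligned to the seeds to build consensus sequences; that alignment step is outside the scope of this sketch.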

In the next round of HGAP, the preads are aligned to each other and assembled into genomic contigs.

_images/HGAP.png _images/Fig1.png

For more complex genomes assembled with FALCON, “bubbles” in the contig-assembly graph that result from structural variation between haplotypes may be resolved as associate and primary contigs. The unzip process will extend haplotype phasing beyond “bubble” regions, increasing the amount of phased contig sequence. It is important to note that while individual haplotype blocks are phased, phasing does not extend between haplotigs. Thus, in part C) of the figure above, haplotig_1 and haplotig_2 may originate from different parental haplotypes. Additional information is needed to phase the haplotype blocks with each other.

Associate contig IDs contain the name of their primary contig, but the precise location of alignment must be determined with third-party tools such as NUCmer. For example, in a FALCON assembly, 000123F-010-01 is an associate contig of primary contig 000123F. In a FALCON-Unzip assembly, 000123F_001 is a haplotig of primary contig 000123F.
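Because the primary contig name is embedded in the ID, the relationship can be recovered with simple pattern matching. The sketch below is a hypothetical helper, not part of FALCON; the regular expressions are inferred only from the two example IDs above and may not cover every ID a given FALCON version emits.

```python
import re

# Patterns inferred from the example IDs in the text (assumptions):
#   000123F         primary contig
#   000123F-010-01  associate contig of 000123F (FALCON)
#   000123F_001     haplotig of 000123F (FALCON-Unzip)
PRIMARY_RE = re.compile(r"^\d+F$")
ASSOCIATE_RE = re.compile(r"^(\d+F)-\d+-\d+$")
HAPLOTIG_RE = re.compile(r"^(\d+F)_\d+$")


def classify_contig(contig_id):
    """Return (kind, primary_id) for a contig ID, or ("unknown", None)."""
    if PRIMARY_RE.match(contig_id):
        return ("primary", contig_id)
    match = ASSOCIATE_RE.match(contig_id)
    if match:
        return ("associate", match.group(1))
    match = HAPLOTIG_RE.match(contig_id)
    if match:
        return ("haplotig", match.group(1))
    return ("unknown", None)


for cid in ["000123F", "000123F-010-01", "000123F_001"]:
    print(cid, classify_contig(cid))
```

This only groups contigs by name. Where an associate contig or haplotig aligns on its primary contig still has to come from an aligner such as NUCmer.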

Below are examples of alignments between associate and primary contigs from FALCON, and between haplotigs and primary contigs from FALCON-Unzip. Alignments were built with NUCmer and visualized with Assemblytics. Precise coordinates may be obtained with the show-coords utility from MUMmer.

_images/dotplots.png
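As a rough sketch of how coordinates might be pulled out programmatically, the Python below parses tab-delimited show-coords output. It assumes the output was produced with the `-T` (tab-delimited) option, that the first four columns are the reference and query start/end positions, the seventh is percent identity, and the last two are the reference and query sequence names. The `Alignment` type, the function name, and the example line are made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class Alignment(NamedTuple):
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int
    identity: float
    ref_id: str
    qry_id: str


def parse_show_coords(lines: Iterable[str]) -> List[Alignment]:
    """Parse tab-delimited show-coords lines, skipping header lines."""
    alignments = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # too few columns: header or blank line
        try:
            s1, e1, s2, e2 = (int(x) for x in fields[:4])
            identity = float(fields[6])
        except ValueError:
            continue  # non-numeric columns: column-title line
        alignments.append(
            Alignment(s1, e1, s2, e2, identity, fields[-2], fields[-1])
        )
    return alignments


# Made-up example line: a haplotig (query) aligned to its primary contig.
example = ["1000\t5000\t1\t4001\t4001\t4001\t99.50\t000123F\t000123F_001\n"]
hits = parse_show_coords(example)
print(hits[0].ref_id, hits[0].ref_start, hits[0].ref_end, hits[0].qry_id)
```

Taking the sequence names from the last two columns keeps the parser tolerant of extra length or coverage columns, should those options be enabled.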

Choosing an Assembler: HGAP4 vs FALCON vs FALCON-Unzip

HGAP4

We recommend HGAP4, part of the SMRT Link web-based analysis suite, for genomes of known complexity that are no larger than human (3 Gb or smaller), although the underlying compute resources for your SMRT Link instance will influence performance and feasibility. The assembly process for HGAP4 in the SMRT Link GUI (graphical user interface) is identical to FALCON at the command line, apart from differences in compute resource configuration and minor differences in directory structure. By default, the HGAP4 pipeline includes a round of genome “polishing” which employs the resequencing pipeline.

HGAP4 RESULTS ARE NOT COMPATIBLE WITH FALCON-Unzip AT THIS TIME!

HGAP4 inputs are a PacBio subread BAM dataset, either Sequel or RSII. The FASTA and FASTQ files output from HGAP4 are a concatenation of the primary and associate contigs, which are output from FALCON as separate files.

Command Line

Users more comfortable at the command line may use FALCON for genomes of any size or complexity. Command-line inputs are FASTA files of Sequel or RSII subreads. Command-line FALCON does not automatically polish the assembly. If desired, assembly polishing may be run using the resequencing pipeline of pbsmrtpipe (available for command-line installation using the SMRT_Link download; see SMRT_Tools_Reference_Guide for installation instructions). Resequencing requires PacBio subread BAM inputs.

We recommend the FALCON-Unzip module for heterozygous or outbred organisms that are diploid or higher ploidy. Users wishing to run FALCON-Unzip must do so only after running FALCON on the command line. HGAP4 IS NOT COMPATIBLE WITH FALCON-UNZIP! The FALCON-Unzip module requires both FASTA and PacBio BAM inputs for subreads.