These documents refer to an obsolete way of installing and running FALCON. They will remain up for historical context and for individuals still using the older version of FALCON/FALCON_unzip.
The current PacBio Assembly suite documentation, which includes new bioconda instructions for installing FALCON, FALCON_unzip, and their associated dependencies, can be found here: pb_assembly
A FALCON job can be broken down into 3 steps:
- Overlap detection and error correction of raw reads
- Overlap detection between corrected reads
- String Graph assembly of corrected reads
Each step is performed in its own subdirectory within the FALCON job:
falcon_job/
├── 0-rawreads/      # Raw read error correction directory
├── 1-preads_ovl/    # Corrected read overlap detection
├── 2-asm-falcon/    # String Graph Assembly
├── mypwatcher/      # Job scheduler logs
├── scripts/
└── sge_log/         # deprecated
The assembly process is driven by the script fc_run.py, which should be sent to the scheduler or run on a head node, as it needs to persist throughout the entire assembly process. It takes as input a single config file, typically named fc_run.cfg, which references a list of fasta input files. The config file can be configured to run locally or to submit to a job scheduler. However, if your dataset is anything larger than a bacterial-sized genome, and unless you have tuned your system specifically for the organism you are assembling, you should most likely run on a cluster in order to leverage your computational resources more effectively.
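As a small illustration, the referenced list of fasta inputs is a plain file-of-file-names (FOFN). The sketch below builds one from toy placeholder fastas (the file names are hypothetical; substitute your own subread fastas):

```shell
# Toy setup: two tiny FASTA files standing in for real PacBio subread fastas.
# (These names are placeholders, not part of any real dataset.)
printf ">read/1/0_10\nACGTACGTAC\n" > subreads.1.fasta
printf ">read/2/0_10\nTTGGCCAATT\n" > subreads.2.fasta

# Build the file-of-file-names (input.fofn) that fc_run.cfg references.
# Absolute paths are safest, since cluster jobs may run from other directories.
ls "$PWD"/*.fasta > input.fofn
wc -l < input.fofn
```

One absolute path per line is all FALCON needs; the FOFN itself carries no other metadata.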
The configuration file also allows you to control other aspects of your job, such as how your compute resources are distributed, and to set many parameters to help you reach an “optimized” assembly for the nature of your input data. Unfortunately, at this point there is no “magic” way to auto-tune the parameters, so you should probably spend some time in the Configuration section to understand what options are available to you. Some example configuration files can be found here.
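A rough sketch of what a minimal local-run fc_run.cfg might look like is shown below. The option names appear in typical FALCON example configs, but every value here is an illustrative placeholder, not a tuned recommendation; consult the Configuration section before adopting any of them:

```ini
[General]
# run everything on the local machine rather than submitting to a scheduler
job_type = local

input_fofn = input.fofn
input_type = raw

# seed read length cutoff for error correction (placeholder value)
length_cutoff = 5000
# length cutoff applied to corrected reads (preads) before assembly
length_cutoff_pr = 8000

# database partitioning and daligner options for the raw-read stage
pa_DBsplit_option = -x500 -s200
pa_HPCdaligner_option = -v -B4 -t16 -e.70 -l1000 -s1000

# same, for the lower-error corrected-read (pread) overlap stage
ovlp_DBsplit_option = -x500 -s200
ovlp_HPCdaligner_option = -v -B4 -t32 -h60 -e.96 -l500 -s1000

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200
overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 2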
Step 1: Overlap detection and error correction of raw reads¶
The first step of the pipeline is to identify all overlaps in the raw reads. Currently this is performed with a modified version of Gene Myers’ DALIGNER.
In order to identify overlaps, your raw reads must first be converted from fasta format into a dazzler
database. This is a very I/O intensive process and will be run from the node where
fc_run.py was executed. If this
is an issue, you should submit the command with a wrapper script to your grid directly.
Once the database has been created and partitioned according to the parameters set in your
fc_run.cfg, an all-vs-all comparison of the reads must be performed. Due to the
all-vs-all nature of the search, this is the most time-consuming step in the assembly process. To walk through the
actual steps of this part of the pipeline, take a look at the generated run scripts.
Essentially it consists of running:
- fasta2DB to format the database
- DBsplit to partition the database
- HPC.daligner to generate the daligner commands necessary for all-vs-all comparison
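Hand-run, those three commands look roughly like the following. This is a sketch with illustrative flag values, and it assumes the DAZZ_DB and DALIGNER tools are on your PATH; in practice fc_run.py generates and submits these commands for you:

```shell
# Build the dazzler database from the FOFN of raw-read fastas
fasta2DB raw_reads -finput.fofn

# Partition into blocks (-s: block size in Mb; -x: drop reads shorter than 500 bp)
DBsplit -x500 -s200 raw_reads

# Emit the daligner/LAmerge command list for the all-vs-all comparison
HPC.daligner -v -t16 -e.70 -l1000 -s1000 raw_reads > run_jobs.sh
```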
After overlaps have been detected, you will be left with many
job_* directories full of alignment files
containing the information about the overlaps. After merging the alignment files (see
m_* directories), the
next step is to error correct the reads leveraging the overlap information. In the
0-rawreads/preads directory you
will find a series of scripts for
performing the error correction. The process basically consists of using
LA4Falcon with a length cutoff and piping the
output to fc_consensus.py to generate a fasta file with corrected reads.
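Schematically, a single consensus job has the shape below. The flag values are illustrative assumptions rather than recommendations, and the block filename is hypothetical; the real versions are written into the scripts under 0-rawreads/preads:

```shell
# Stream the overlaps for one merged alignment block, keeping only seed reads
# above the length cutoff, and pipe them into the consensus caller
LA4Falcon -H$(cat length_cutoff) -fso raw_reads.db raw_reads.1.las \
  | fc_consensus.py --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200 \
  > cns_00001.fasta
```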
0-rawreads/
├── job_*/               # dirs for all of the daligner jobs
├── m_*/                 # dirs for all of the LA4Merge jobs
├── preads/              # sub-dir for preads generation
├── report/              # pre-assembly stats
├── cns-scatter/         # dir of scripts for falcon-consensus jobs
├── daligner-scatter/    # dir of scripts for daligner jobs
├── merge-scatter/       # dir of scripts for LAMerge jobs
├── merge-gather/        # dir of scripts for gathering LAMerge inputs
├── raw-gather/          # dir of scripts for gathering daligner jobs for merging
├── input.fofn           # list of your input *.fasta files
├── length_cutoff        # text file with length cutoff for seed reads
├── pwatcher.dir/        # stderr and stdout of individual pipeline jobs
├── prepare_rdb.sh       # env wrapper script
├── raw_reads.db         # dazzler DB file
├── raw-fofn-abs/        # dir of scripts for gathering raw read inputs
├── rdb_build_done       # database construction sentinel file
├── run_jobs.sh          # listing of all overlap step commands
├── run.sh               # master job script
├── run.sh.done          # sentinel file for all jobs
├── task.json            # json file specifying inputs, outputs, and params
└── task.sh              # script to run json file
The following parameters affect this step directly:
Step 2: Overlap detection of corrected reads¶
The only conceptual difference between the first and second overlap steps is that consensus calling is
not performed in the second step. After pread overlap detection, it’s simply a
matter of extracting the information from the corrected reads database with
DB2Falcon -U preads.
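The pread stage can be sketched with the same dazzler commands as before, now pointed at the corrected reads. Again, flag values are illustrative assumptions and fc_run.py normally generates the real commands:

```shell
# Build and partition a database of the corrected reads (preads)
fasta2DB preads -fpreads.fofn
DBsplit -x500 -s200 preads

# Overlap with settings suited to low-error preads (note the higher -e identity)
HPC.daligner -v -t32 -h60 -e.96 -l500 -s1000 preads > run_jobs.sh

# Dump the untrimmed preads back out to fasta for the assembly step
DB2Falcon -U preads
```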
Depending on how well the error-correction step proceeded, as well as how much initial coverage was fed into the pipeline (e.g. via length_cutoff), the input data for this step should be significantly reduced; thus, the second overlap detection step will proceed significantly faster.
The commands in this step of the pipeline are very similar to before, albeit with different parameter settings to account
for the reduced error rate of the preads. See the driver script for
details on the actual parameter settings used.
1-preads_ovl/
├── job_*/               # directories for daligner jobs
├── m_*/                 # directories for LA4Merge jobs
├── db2falcon/           # dir of scripts for formatting preads for falcon
├── gathered-las/        # dir of scripts for gathering daligner jobs
├── merge-gather/        # dir of scripts for gathering LAMerge inputs
├── merge-scatter/       # dir of scripts for LAMerge jobs
├── daligner-scatter/    # dir of scripts for daligner jobs
├── pdb_build_done       # sentinel file for pread DB building
├── preads.db            # preads dazzler DB
├── prepare_pdb.sh       # env wrapper script
├── pwatcher.dir/        # stderr and stdout of individual pipeline jobs
├── run_jobs.sh          # listing of all pread overlap job commands
├── run.sh               # master job script
├── run.sh.done          # sentinel file for all jobs
├── task.json            # json file specifying inputs, outputs, and params
└── task.sh              # script to run json file
The following parameters affect this step directly:
Step 3: String Graph assembly¶
The final step of the FALCON Assembly pipeline is generation of the final String Graph assembly and output of contig sequences in fasta format. Four commands are run in the final phase of FALCON:
- fc_ovlp_filter - Filters overlaps based on the criteria provided in fc_run.cfg
- fc_ovlp_to_graph - Constructs an overlap graph of reads larger than the length cutoff
- fc_graph_to_contig - Generates fasta files for contigs from the overlap graph.
- fc_dedup_a_tigs - Removes duplicate associated contigs
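Strung together, the four commands look roughly like this. The option values are illustrative placeholders (the real ones come from overlap_filtering_setting and length_cutoff_pr in fc_run.cfg), and the exact flag spellings may differ between FALCON versions:

```shell
# Filter overlaps by the coverage and divergence criteria from fc_run.cfg
fc_ovlp_filter --db preads.db --fofn las.fofn \
    --max_diff 100 --max_cov 100 --min_cov 2 > preads.ovl

# Build the string graph from overlaps between sufficiently long preads
fc_ovlp_to_graph --min_len 8000 preads.ovl > fc_ovlp_to_graph.log

# Walk the graph and emit contig sequences (p_ctg.fa, a_ctg_all.fa)
fc_graph_to_contig

# Drop duplicated associated contigs, producing a_ctg.fa
fc_dedup_a_tigs
```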
You can see the details on the parameters used by inspecting the driver script in the 2-asm-falcon/ directory.
This step of the pipeline is very fast relative to the overlap detection steps. Sometimes it may be useful to run
several iterations of this step with different parameter settings in order to identify a “best” assembly.
The final output of this step is a fasta file of all of the primary contigs,
p_ctg.fa, as well as an associated contig file,
a_ctg.fa, that consists of all of the structural variants from the primary contig assembly.
2-asm-falcon/
├── a_ctg_all.fa           # all associated contigs, including duplicates
├── a_ctg_base.fa          #
├── a_ctg_base_tiling_path #
├── a_ctg.fa               # de-duplicated associated contig fasta file
├── a_ctg_tiling_path      # tiling path information for each associated contig
├── falcon_asm_done        # FALCON assembly sentinel file
├── p_ctg.fa               # fasta file of all primary contigs
├── p_ctg_tiling_path      # tiling path of preads through each primary contig
├── c_path                 #
├── ctg_paths              # corrected read paths for each contig
├── fc_ovlp_to_graph.log   # logfile for converting overlaps to the assembly graph
├── utg_data               #
├── sg_edges_list          # list of all string graph edges
├── chimers_nodes          #
├── preads.ovl             # list of all overlaps between preads
├── run_falcon_asm.sh      # env wrapper script
├── task.json              # json file specifying inputs, outputs, and params
├── task.sh                # script to run json file
├── run.sh.done            # sentinel file for all jobs
└── run.sh                 # assembly driver script
The following parameters affect this step directly:
FALCON_unzip operates from a completed FALCON job directory. After tracking the raw reads back to the contigs, a FALCON_unzip job can be broken down into 3 steps:
3-unzip/
├── 0-phasing/        # contig phasing jobs
├── 1-hasm/           # contig graph assembly information
├── read_maps/        # rawread_to_contigs; read_to_contig_map
├── reads/            # raw read fastas for each contig
├── all_p_ctg.fa      # partially phased primary contigs
├── all_h_ctg.fa      # phased haplotigs
├── all_p_ctg_edges   # primary contig edge list
├── all_h_ctg_edges   # haplotig edge list
├── all_h_ctg_ids     # haplotig id index
└── all_phased_reads  # table of all phased raw reads
Step 1: Identify SNPs and assign phases¶
In the 0-phasing/ directory you will find a number of subdirectories, one per contig. Each contains the scripts
to map the raw reads to the contigs and subsequently identify SNPs. The generated SNP tables can
then be used to assign phases to the reads.
Step 2: Graph annotation and haplotig extraction¶
In the 1-hasm/ directory you can find the driver script
hasm.sh, which contains the commands necessary to
filter overlaps, traverse the assembly graph paths, and output phased contig sequence.
Assembly graphs for each contig, as well as fasta files for the partially phased primary contigs and fully phased
haplotigs, can be found in the per-contig subdirectories.
Step 3: Call Consensus (Optional)¶
The FALCON_unzip pipeline can optionally be used to run quiver and call high-quality consensus. This step
takes as input the primary contig and haplotig sequences output in the previous step. For convenience, these files
have all been concatenated together into a single input file.
The final consensus output can be found in the job's consensus output directory.
To run the consensus step as part of the FALCON_unzip pipeline, you need to provide the input_bam_fofn option in your fc_unzip.cfg.
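For example, the relevant fragment of fc_unzip.cfg might look like the following (the file names are placeholders for your own FOFNs):

```ini
[Unzip]
input_fofn = input.fofn
# FOFN listing the raw subread BAM files used for quiver consensus calling
input_bam_fofn = input_bam.fofn
```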