Caution

These documents refer to an obsolete way of installing and running FALCON. They will remain up for historical context and for individuals still using the older version of FALCON/FALCON_unzip.

Attention

The current PacBio Assembly suite documentation, which includes new bioconda instructions for installing FALCON, FALCON_unzip, and their associated dependencies, can be found here: pb_assembly

Falcon Assembler

Parameters

Configuration

Here are some example fc_run.cfg and fc_unzip.cfg files. We make no guarantee that they will work with your dataset and cluster configuration; we merely provide them as starting points that have proven themselves on internal datasets. Much of your success will depend on the quality of the input data before the FALCON pipeline is even engaged. Also, these particular configs were designed for our SGE compute cluster, so some tuning will likely be necessary on your part. You should consult with your HPC administrator for help in tuning them to your cluster.

FALCON Parameter sets

fc_run_fungal.cfg - Has worked well on a 40Mb fungal genome

fc_run_human.cfg - Has worked well on at least one human dataset

fc_run_bird.cfg - Has worked well on at least one avian dataset

fc_run_yeast.cfg - Has worked well on at least one yeast dataset

fc_run_dipteran.cfg - Has worked well on at least one dipteran (insect) dataset

fc_run_mammal.cfg - Has worked well on at least one mammalian dataset

fc_run_mammalSequel.cfg - Has worked well on at least one mammalian Sequel dataset

fc_run_plant.cfg - Has worked well on at least one plant (Ranunculales) dataset

fc_run_arabidopsis.cfg - Configuration for the Arabidopsis assembly in Chin et al. 2016

fc_run_ecoli.cfg - Configuration for test E. coli dataset

fc_run_ecoli_local.cfg - Configuration for test E. coli dataset run locally

FALCON_unzip Parameter sets

fc_unzip.cfg - General all-purpose unzip config

Available Parameters

fc_run.cfg

input_fofn <str>
Filename for the file-of-filenames (fofn). Each line is a FASTA filename. Any relative paths are relative to the location of the input_fofn.
input_type <str>
“raw” or “preads”
genome_size <int>
estimated number of base-pairs in haplotype
seed_coverage <int>
Requested coverage for the auto-calculated cutoff.
length_cutoff <int>
Raw reads shorter than this cutoff won’t be considered in the assembly process. If ‘-1’, then auto-calculate the cutoff based on genome_size and seed_coverage.
length_cutoff_pr <int>
Minimum length of pre-assembled reads (preads) used in the “overlap” stage.
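
As a concrete sketch, the input and cutoff parameters above might appear together in the [General] section of an fc_run.cfg like this (the values are illustrative placeholders, not recommendations):

    input_fofn = input.fofn
    input_type = raw
    genome_size = 40000000
    seed_coverage = 30
    # -1 auto-calculates the seed-read cutoff from genome_size and seed_coverage
    length_cutoff = -1
    # minimum pread length carried into the overlap stage
    length_cutoff_pr = 8000
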
target <str>
“assembly” or “preads”. If “preads”, then the pre-assembly stage is skipped and the input is assumed to be preads.
default_concurrent_jobs <int>
Maximum concurrency. This applies even to “local” (non-distributed) jobs.
pa_concurrent_jobs <str>
Concurrency settings for pre-assembly
cns_concurrent_jobs <str>

Concurrency settings for consensus calling

One can use cns_concurrent_jobs to control the maximum number of concurrent consensus jobs submitted to the job management system. The out.XXXXX.fasta files produced are used as input for the next step in the pipeline.

ovlp_concurrent_jobs <str>
Concurrency settings for overlap detection.
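
Taken together, the concurrency knobs might be set like this (illustrative values; tune them to what your compute resources can absorb):

    default_concurrent_jobs = 24
    pa_concurrent_jobs = 24
    cns_concurrent_jobs = 24
    ovlp_concurrent_jobs = 24
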
job_type <str>
Grid submission system, or “local”. Supported types: “sge”, “lsf”, “pbs”, “torque”, “slurm”, “local” (case-insensitive).
job_queue <str>
Grid job-queue name. Can be overridden with section-specific sge_option_*.
sge_option_da <str>
Grid submission settings for the initial daligner steps in 0-rawreads/.
sge_option_la <str>
Grid submission settings for the initial las-merging in 0-rawreads/.
sge_option_cns <str>
Grid submission settings for error-correction consensus calling.
sge_option_pda <str>
Grid submission settings for daligner on preads in 1-preads_ovl/.
sge_option_pla <str>
Grid submission settings for las-merging on preads in 1-preads_ovl/.
sge_option_fc <str>
Grid submission settings for stage 2 in 2-asm-falcon/.
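
On an SGE cluster, the distribution settings above could be filled in roughly as follows (queue name and slot counts are placeholders to adapt with your HPC administrator):

    job_type = sge
    job_queue = myqueue
    sge_option_da = -pe smp 8 -q myqueue
    sge_option_la = -pe smp 2 -q myqueue
    sge_option_cns = -pe smp 8 -q myqueue
    sge_option_pda = -pe smp 8 -q myqueue
    sge_option_pla = -pe smp 2 -q myqueue
    sge_option_fc = -pe smp 24 -q myqueue
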
pa_DBdust_option <str>
Passed to DBdust. Used only if dust = true.
pa_DBsplit_option <str>
Passed to DBsplit during pre-assembly stage.
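
For example, a common pre-assembly DBsplit setting (assuming the usual DAZZ_DB semantics, where -x is the minimum read length in bp and -s is the block size in Mb) might be:

    # ignore reads under 500 bp; split the database into ~50 Mb blocks
    pa_DBsplit_option = -x500 -s50
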
pa_HPCdaligner_option <str>

Passed to HPC.daligner during the pre-assembly stage. We will add -H based on length_cutoff.

The -dal option also controls the number of jobs being spawned: it determines how many blocks are compared to each other in a single job. A larger number will spawn fewer, larger jobs, while a smaller number will give you many small jobs. The right setting depends on the compute resources you have available.

In this workflow, the trace points generated by daligner are not used. (To be efficient, one should use the trace points, but one has to know how to pull them out correctly first.) The -s1000 argument makes the trace points sparse to save some disk space (not much, though). We also ignore all reads below a certain threshold by specifying a length cutoff with -l1000.

The biggest difference between this parameter and ovlp_HPCdaligner_option is the error-rate switch -e: here it must be relaxed because the alignment is performed on uncorrected raw reads, whereas the overlap stage aligns error-corrected preads and can use a much more stringent value.
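
Putting this together, a pair of daligner option strings in the spirit of the shipped example configs might read as follows (values are illustrative and not tuned for any particular dataset):

    # raw-read stage: relaxed error rate, sparse trace points, 1 kb length cutoff
    pa_HPCdaligner_option = -v -dal4 -t16 -e.70 -l1000 -s1000
    # pread stage: stringent error rate, since preads are already corrected
    ovlp_HPCdaligner_option = -v -dal4 -t32 -h60 -e.96 -l500 -s1000
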

pa_dazcon_option <str>
Passed to dazcon. Used only if dazcon = true.
falcon_sense_option <str>
Passed to fc_consensus. Ignored if dazcon = true.
falcon_sense_skip_contained <str>
Causes -s to be passed to LA4Falcon. Rarely needed.
ovlp_DBsplit_option <str>
Passed to DBsplit during overlap stage.
ovlp_HPCdaligner_option <str>
Passed to HPC.daligner during overlap stage.
overlap_filtering_setting <str>
Passed to fc_ovlp_filter during assembly stage.
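
As a sketch, consensus and overlap-filtering settings along the lines of the shipped example configs (again illustrative, not tuned for your data) might look like:

    falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200
    overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 20 --bestn 10
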
fc_ovlp_to_graph_option <str>
Passed to fc_ovlp_to_graph.
skip_check <bool>
If “true”, then skip LAcheck during LAmerge/LAsort. (Actually, LAcheck is run, but failures are ignored.) When daligner bugs are finally fixed, this will be unnecessary.
dust <bool>
If true, then run DBdust before pre-assembly.
dazcon <bool>
If true, then use dazcon (from pbdagcon repo).
stop_all_jobs_on_failure <bool>
DEPRECATED. This was used for the old pypeFLOW refresh-loop used by run0.py. (This is not an option to let jobs currently in SGE, etc., keep running; that is still TODO.)
use_tmpdir <bool>
Whether to run each job in TMPDIR and copy results back to NFS. If “true”, use TMPDIR. (Actually, tempfile.tempdir; see the standard Python docs: https://docs.python.org/2/library/tempfile.html ) If the value looks like a path, then that path is used instead of TMPDIR.

fc_unzip.cfg

job_type <str>
Same as above: grid submission system, or “local”. Supported types: “sge”, “lsf”, “pbs”, “torque”, “slurm”, “local” (case-insensitive).
input_fofn <str>
This should be the same input file you used in your fc_run.cfg.
input_bam_fofn <str>
List of movie BAM files. Only necessary if performing the consensus-calling step at the end.
smrt_bin <str>
Path to the bin directory containing samtools, blasr, and the various GenomicConsensus utilities.
jobqueue <str>
Queue to submit SGE jobs to.
sge_phasing <str>
Phasing grid settings. Example: -pe smp 12 -q %(jobqueue)s
sge_quiver <str>
Consensus-calling grid settings. Example: -pe smp 24 -q %(jobqueue)s
sge_track_reads <str>
Read-tracking grid settings. Example: -pe smp 12 -q %(jobqueue)s
sge_blasr_aln <str>
blasr alignment grid settings. Example: -pe smp 24 -q %(jobqueue)s
sge_hasm <str>
Final haplotype-assembly grid settings. Example: -pe smp 48 -q %(jobqueue)s
unzip_concurrent_jobs <int>
Number of unzip jobs to run concurrently.
quiver_concurrent_jobs <int>
Number of concurrent consensus calling jobs to run
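
Pulling the pieces together, a skeletal fc_unzip.cfg for an SGE cluster might look like the following (the section layout mirrors the shipped example configs; paths, queue name, and slot counts are placeholders):

    [General]
    job_type = sge

    [Unzip]
    input_fofn = input.fofn
    input_bam_fofn = input_bam.fofn
    smrt_bin = /path/to/smrtcmds/bin
    jobqueue = myqueue
    sge_phasing = -pe smp 12 -q %(jobqueue)s
    sge_quiver = -pe smp 24 -q %(jobqueue)s
    sge_track_reads = -pe smp 12 -q %(jobqueue)s
    sge_blasr_aln = -pe smp 24 -q %(jobqueue)s
    sge_hasm = -pe smp 48 -q %(jobqueue)s
    unzip_concurrent_jobs = 12
    quiver_concurrent_jobs = 12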