lazypipe.pl runs your metagenomic analysis step-by-step.
For example, to run preprocessing, assembling, read realignment, annotation and reporting type:
perl lazypipe.pl -1 data/samples/M15_R1.fq.gz --pipe pre,ass,rea,ann,rep -v
Table1: Lazypipe commandline options
Short | Long | Value | Default | Description |
---|---|---|---|---|
-1 | --read1 | file | Paired-end reads, fastq with forward reads (can be gzipped) | |
-2 | --read2 | file | guess from --read1 | Paired-end reads, fastq with reverse reads |
--se | false | Input reads are SE-reads. Any --read2 file will be ignored | ||
-r | --res | dir | results | Results will be printed to res-dir/sample-dir/ |
-s | --sample | str | --read1 prefix | Results will be printed to res-dir/sample-dir/ |
-t | --numth | num | 8 | Number of threads |
--pre | str | fastp | Use fastp|trimm|none to preprocess reads | |
--ass | str | megahit | Assembler: megahit|spades | |
--ann | str | sans | Homology search used for contig annotation: blastp|sans|centrifuge|minimap | |
--hostgen | file | *.fna file containing host genome. Filtering is turned on by --hostgen file -p flt | ||
--hgtaxid | taxid | NCBI taxon id for the host genome taxon. When given, hostgen filtered reads will be assigned to this taxon | ||
-w | --weights | str | bitscore | Model for abundance estimation: taxacount|bitscore|bitscore2 |
--config | file | config.yaml | Configuration file with default options | |
-v | false | Verbal mode | ||
--clean | false | Delete intermediate files after each step | ||
-p | --pipe | str | main | Comma-separated list of steps to perform, e.g. --pipe pre,flt,ass,ann,realign,sta,pack |
pre|preprocess | Preprocess reads, i.e. filter low quality reads | |||
flt|filter | Filter reads mapping to host genome using --hostgen file | |||
ass|assemble | Assemble reads to contigs | |||
rea|realign | Realign reads to contigs | |||
ann|annotate | Annotate contigs with blastp/sans/centrifuge/minimap against blastp_db/UniProtKB/centrifuge_db/minimap_db | |||
blastv | Annotate virus contigs with blastn against custom virus database (blastn_vi_db). | |||
blastu | Annotate unmapped contigs with blastn against custom virus database (blastn_vi_db) | |||
rep|report | Create reports: abundance/annotation tables + Krona graph + sort contigs by taxa | |||
sta|stats | Create assembly stats + QC plots | |||
pack | Pack results to sample.tar.gz | |||
clean | Clean up all intermediate and temporary files. | |||
main | Run main steps: pre,flt,ass,rea,ann,rep,sta,pack,clean | |||
all | Run all steps |
Default options and additional settings are defined in config.yaml file.
Note that command line options take precedence over options in config.yaml file.
Table 2: Additional options in config.yaml
Option | Value | Description |
---|---|---|
GENERAL PARAMETERS | ||
R_call | str | Rscript or similar for calling R |
hostgen | file | Path to host genome in fasta/fasta.gz format. Set to 0 to switch off hostgen filtering. |
hostgen_taxid | num | NCBI taxon id for the host genome taxon. When given, hostgen filtered reads will be assigned to this taxon |
hostgen_flt_th | num | Minimum alignment score for filtering host genome reads |
min_gene_length | num | Minimum ORF sequence length for reporting/mapping |
min_sans_bits | num | Minimum alignment score for mapping ORFs with SANS |
min_blastp_bits | num | Minimum alignment score for mapping ORFs with BLASTP |
min_cent_bits | num | Minimum alignment score for mapping contigs with Centrifuge |
min_minimap_DPpeak_score | num | Minimum DP alignment score for mapping contigs with Minimap2 |
realign_read_th | num | Minimum alignment score for mapping reads to contigs with BWA MEM |
tail | percent | Remove taxa that correspond to this percentile in abundance estimation. Set to zero to keep all predictions |
trimm_par | str | Trimmomatic parameters. NOTE: please ensure that $TM env var is pointing to Trimmomatic installation root |
fastp_par | str | Fastp parameters |
res | dir | Results directory root. Results will be printed to res-dir/sample-dir/ |
logs | dir | Logs directory root. Logs will be printed to logs-dir/sample-dir/ |
tmpdir | dir | Temporary directory root. Each run will create a designated temporary directory at this location |
keep_tmpdir | 0/1 | Set to 1 to keep temporary directories |
DATABASES | ||
blastp_db | path | Path to local NCBI blastp nr database |
blastn_vi_db | path | Path to local blastn nucleotide virus database. |
blastn_vi_db_url | url | URL to blastn_vi_db resource |
centrifuge_db | path | Path to local Centrifuge nucleotide database: eg h+a+b+v. |
minimap_db | path | Path to local minimap2 nucleotide database |
taxonomy | path | Path to local NCBI taxonomy database. Database will be installed on demand |
taxonomy_update | 0/1 | Set 1 to update NCBI taxonomy db |
taxonomy_update_time | num | NCBI taxonomy update frequency in days |
taxonomy_url | str | URL to NCBI taxonomy (taxdump.tar.gz) |
SNAKEMAKE PARAMETERS | ||
datain | input fastq files | List of input fastq libraries ordered by sample name, e.g. "M15: M15_R1.fastq". Note: only list forward reads, reverse reads are guessed by substituting _R1 suffix with _R2. |
blastn_contigs_vi | 0/1 | Annotate viral contigs with blastn against blastn_vi_db. |
blastn_contigs_un | 0/1 | Annotate unmapped contigs with blastn against blastn_vi_db. |
threads_max | num | Maximum number of threads to use |
For this example, we will use data/samples/M15small_R*.fastq (PE Illumina reads from mink feces env sample).
Run main steps with default options:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p main
By default, all results are printed to ./res-dir/sample-dir/, in this case to ./results/M15/:
Assembled contigs and predicted ORFs
File/Directory | Description |
---|---|
contigs | Contigs sorted by taxa |
contigs.fa | Contigs in a single fasta file |
contigs_un.fa | Contigs with no annotation by the main homology search |
contigs_vi.fa | Contigs annotated as virus sequences by the main homology search |
ORFs.gtf | Predicted ORFs in GTF2.2 format |
ORFs.aa.fa | Predicted ORFs as aa sequences |
ORFs.nt.fa | Predicted ORFs as nt sequences |
scaffolds.fa | Scaffolds, if available |
abund_table.xlsx
Abundancies are displayed in separate tables for eukaryot viruses, bacteria, bacteriophages and eukaryots. For each domain abundancies are displayed at three taxonomic levels: species, genus and family.
For raw abundance data see abund_table.tsv.
Columns in abund_table.xlsx
Column | Description |
---|---|
readn | read pairs assigned to this taxon |
readn_pc | percentage of reads pairs assigned to this taxon |
csum | cumulative read distribution score (percentage of reads mapped to this taxon and more abundant taxa) |
csumq | confidences score based on csum (1 ~ reliable, 2 ~ intermediate, 3 ~ unreliable) |
contign | contigs assigned to this taxon |
species | species name (NCBI taxonomy) |
species_id | species taxid (NCBI taxonomy) |
genus | genus name |
genus_id | genus taxid |
family | family name |
family_id | family taxid |
contigs_annot.xlsx
Spreadsheets are displayed separately for viruses (excluding bacteriophages), bacteria, bacteriophages and eukaryots. Columns displayed depend on the applied homology search (sans/blastp/minimap2).
Running -p blastv will also print blastn annotation for contigs_vi.fa to contigs_vi.annot.xlsx.
Running -p blastu will also print blastn annotation for contigs_un.fa to contigs_un.annot.xlsx.
For raw annotation data see contigs[_un|_vi].annot.tsv.
Key columns in contigs[_un|_vi].annot.xslx:
Column | Description |
---|---|
contig | contig id |
coverage | contig coverage |
length | contig length |
ORF | orf description in start-end:strand format |
sseqid | subject sequence id |
bitscore | alignment score |
qcov | query coverage |
scov | subject coverage |
qlen | query sequence length |
slen | subject sequence length |
pide | percent identity |
lali | alignment length |
desc | subject description |
staxid | assigned taxid |
species | assigned species |
genus | assigned genus |
family | assigned family |
Quality Control (QC) plots include length histograms for reads and contigs, and survival plots. The survival plots track retained reads after each pipeline step.
File | Description |
---|---|
qc.read1.jpeg | Length histogram for forward reads |
qc.read2.jpeg | Length histogram for reverse reads |
qc.contigs.jpeg | Length histogram for contigs |
qc.readsurv.jpeg | Read survival plots |
Snakemake works by declaring the end file you wish to produce.
Start by listing your input fastq files under datain key in config.yaml file. Pretend each file with sample id.
For this example, we will use data/samples/M15small. In your config.yaml type:
datain:
M15: data/samples/M15small_R1.fastq
Run main pipeline steps with default options:
snakemake --cores 8 results/M15.tar.gz
Start by unzipping your source fasta:
gunzip -k results/M15/trimmed_paired*_hostflt.fq.gz
Retrieve all reads mapped to contig k141.100 in sample M15
bin/retrieve_reads -c k141.100 -r results/M15 -v
Retrieve all reads mapped to Mamastrovirus (taxid 1239574) in sample M15:
bin/retrieve_reads -t 1239574 -r results/M15 -v