User guide
Running Metagenomic Analysis with Lazypipe
Running Lazypipe with lazypipe.pl

lazypipe.pl runs your metagenomic analysis step-by-step.

For example, to run preprocessing, assembling, read realignment, annotation and reporting type:

perl lazypipe.pl -1 data/samples/M15_R1.fq.gz --pipe pre,ass,rea,ann,rep -v

Table1: Lazypipe commandline options

Short Long Value Default Description
-1 --read1 file   Paired-end reads, fastq with forward reads (can be gzipped)
-2 --read2 file guess from --read1 Paired-end reads, fastq with reverse reads
  --se   false Input reads are SE-reads. Any --read2 file will be ignored
-r --res dir results Results will be printed to res-dir/sample-dir/
-s --sample str --read1 prefix Results will be printed to res-dir/sample-dir/
-t --numth num 8 Number of threads
  --pre str fastp Use fastp|trimm|none to preprocess reads
  --ass str megahit Assembler: megahit|spades
  --ann str sans Homology search used for contig annotation: blastp|sans|centrifuge|minimap
  --hostgen file   *.fna file containing host genome. Filtering is turned on by --hostgen file -p flt
  --hgtaxid taxid   NCBI taxon id for the host genome taxon. When given, hostgen filtered reads will be assigned to this taxon
-w --weights str bitscore Model for abundance estimation: taxacount|bitscore|bitscore2
  --config file config.yaml Configuration file with default options
-v     false Verbal mode
  --clean   false Delete intermediate files after each step
-p --pipe str main Comma-separated list of steps to perform, e.g. --pipe pre,flt,ass,ann,realign,sta,pack
    pre|preprocess   Preprocess reads, i.e. filter low quality reads
    flt|filter   Filter reads mapping to host genome using --hostgen file
    ass|assemble   Assemble reads to contigs
    rea|realign   Realign reads to contigs
    ann|annotate   Annotate contigs with blastp/sans/centrifuge/minimap against blastp_db/UniProtKB/centrifuge_db/minimap_db
    blastv   Annotate virus contigs with blastn against custom virus database (blastn_vi_db).
    blastu   Annotate unmapped contigs with blastn against custom virus database (blastn_vi_db)
    rep|report   Create reports: abundance/annotation tables + Krona graph + sort contigs by taxa
    sta|stats   Create assembly stats + QC plots
    pack   Pack results to sample.tar.gz
    clean   Clean up all intermediate and temporary files.
    main   Run main steps: pre,flt,ass,rea,ann,rep,sta,pack,clean
    all   Run all steps

 

Default options and additional settings are defined in config.yaml file.

Note that command line options take precedence over options in config.yaml file.

Table 2: Additional options in config.yaml

Option Value Description
GENERAL PARAMETERS    
R_call str Rscript or similar for calling R
hostgen file Path to host genome in fasta/fasta.gz format. Set to 0 to switch off hostgen filtering.
hostgen_taxid num NCBI taxon id for the host genome taxon. When given, hostgen filtered reads will be assigned to this taxon
hostgen_flt_th num Minimum alignment score for filtering host genome reads
min_gene_length num Minimum ORF sequence length for reporting/mapping
min_sans_bits num Minimum alignment score for mapping ORFs with SANS
min_blastp_bits num Minimum alignment score for mapping ORFs with BLASTP
min_cent_bits num Minimum alignment score for mapping contigs with Centrifuge
min_minimap_DPpeak_score num Minimum DP alignment score for mapping contigs with Minimap2
realign_read_th num Minimum alignment score for mapping reads to contigs with BWA MEM
tail percent Remove taxa that correspond to this percentile in abundance estimation. Set to zero to keep all predictions
trimm_par str Trimmomatic parameters. NOTE: please ensure that $TM env var is pointing to Trimmomatic installation root
fastp_par str Fastp parameters
res dir Results directory root. Results will be printed to res-dir/sample-dir/
logs dir Logs directory root. Logs will be printed to logs-dir/sample-dir/
tmpdir dir Temporary directory root. Each run will create a designated temporary directory at this location
keep_tmpdir 0/1 Set to 1 to keep temporary directories
DATABASES    
blastp_db path Path to local NCBI blastp nr database
blastn_vi_db path Path to local blastn nucleotide virus database.
blastn_vi_db_url url URL to blastn_vi_db resource
centrifuge_db path Path to local Centrifuge nucleotide database: eg h+a+b+v.
minimap_db path Path to local minimap2 nucleotide database
taxonomy path Path to local NCBI taxonomy database. Database will be installed on demand
taxonomy_update 0/1 Set 1 to update NCBI taxonomy db
taxonomy_update_time num NCBI taxonomy update frequency in days
taxonomy_url str URL to NCBI taxonomy (taxdump.tar.gz)
SNAKEMAKE PARAMETERS    
datain input fastq files List of input fastq libraries ordered by sample name, e.g. "M15: M15_R1.fastq".
Note: only list forward reads, reverse reads are guessed by substituting _R1 suffix with _R2.
blastn_contigs_vi 0/1 Annotate viral contigs with blastn against blastn_vi_db.
blastn_contigs_un 0/1 Annotate unmapped contigs with blastn against blastn_vi_db.
threads_max num Maximum number of threads to use
Example 1: analyzing sample data

For this example, we will use data/samples/M15small_R*.fastq (PE Illumina reads from mink feces env sample).

Run main steps with default options:

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p main

Example 1: generated reports

By default, all results are printed to ./res-dir/sample-dir/, in this case to ./results/M15/:

Assembled contigs and predicted ORFs

File/Directory Description
contigs Contigs sorted by taxa
contigs.fa Contigs in a single fasta file
contigs_un.fa Contigs with no annotation by the main homology search
contigs_vi.fa Contigs annotated as virus sequences by the main homology search
ORFs.gtf Predicted ORFs in GTF2.2 format
ORFs.aa.fa Predicted ORFs as aa sequences
ORFs.nt.fa Predicted ORFs as nt sequences
scaffolds.fa Scaffolds, if available

abund_table.xlsx

Abundancies are displayed in separate tables for eukaryot viruses, bacteria, bacteriophages and eukaryots. For each domain abundancies are displayed at three taxonomic levels: species, genus and family.

For raw abundance data see abund_table.tsv.

Columns in abund_table.xlsx

Column Description
readn read pairs assigned to this taxon
readn_pc percentage of reads pairs assigned to this taxon
csum cumulative read distribution score (percentage of reads mapped to this taxon and more abundant taxa)
csumq confidences score based on csum (1 ~ reliable, 2 ~ intermediate, 3 ~ unreliable)
contign contigs assigned to this taxon
species species name (NCBI taxonomy)
species_id species taxid (NCBI taxonomy)
genus genus name
genus_id genus taxid
family family name
family_id family taxid

contigs_annot.xlsx

Spreadsheets are displayed separately for viruses (excluding bacteriophages), bacteria, bacteriophages and eukaryots. Columns displayed depend on the applied homology search (sans/blastp/minimap2).

Running -p blastv will also print blastn annotation for contigs_vi.fa to contigs_vi.annot.xlsx.

Running -p blastu will also print blastn annotation for contigs_un.fa to contigs_un.annot.xlsx.

For raw annotation data see contigs[_un|_vi].annot.tsv.

Key columns in contigs[_un|_vi].annot.xslx:

Column Description
contig contig id
coverage contig coverage
length contig length
ORF orf description in start-end:strand format
sseqid subject sequence id
bitscore alignment score
qcov query coverage
scov subject coverage
qlen query sequence length
slen subject sequence length
pide percent identity
lali alignment length
desc subject description
staxid assigned taxid
species assigned species
genus assigned genus
family assigned family

Krona graph

Quality Control (QC) plots include length histograms for reads and contigs, and survival plots. The survival plots track retained reads after each pipeline step.

File Description
qc.read1.jpeg Length histogram for forward reads
qc.read2.jpeg Length histogram for reverse reads
qc.contigs.jpeg Length histogram for contigs
qc.readsurv.jpeg Read survival plots
Example 2: analyze sample data with Snakemake interface

Snakemake works by declaring the end file you wish to produce.

Start by listing your input fastq files under datain key in config.yaml file. Pretend each file with sample id.

For this example, we will use data/samples/M15small. In your config.yaml type:

datain:
M15: data/samples/M15small_R1.fastq

Run main pipeline steps with default options:

snakemake --cores 8 results/M15.tar.gz

Retrieving reads for a contig or taxid

Start by unzipping your source fasta:

gunzip -k results/M15/trimmed_paired*_hostflt.fq.gz

Retrieve all reads mapped to contig k141.100 in sample M15

bin/retrieve_reads -c k141.100 -r results/M15 -v

Retrieve all reads mapped to Mamastrovirus (taxid 1239574) in sample M15:

bin/retrieve_reads -t 1239574 -r results/M15 -v