User guide

Running Metagenomic Analysis with Lazypipe
Running Lazypipe

For up-to-date User Guides please see Lazypipe Wiki:

Example 1

In this example we will use a sample paired-end (PE) library that is included with the repository (data/samples/M15small_R*.fastq).
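Before starting, it can be worth confirming that both mate files hold the same number of reads. The following is a minimal sketch, not part of Lazypipe: fastq_read_count and check_pairs are hypothetical helper names, and the count assumes uncompressed FASTQ with exactly four lines per record and a trailing newline.

```shell
# Count reads in an uncompressed FASTQ file (assumes 4 lines per record).
fastq_read_count() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Compare the read counts of two mate files; return non-zero on mismatch.
check_pairs() {
    n1=$(fastq_read_count "$1")
    n2=$(fastq_read_count "$2")
    if [ "$n1" -eq "$n2" ]; then
        echo "OK: $n1 read pairs"
    else
        echo "MISMATCH: $n1 forward vs $n2 reverse reads" >&2
        return 1
    fi
}

# Usage (assumes the reverse mate is named M15small_R2.fastq):
# check_pairs data/samples/M15small_R1.fastq data/samples/M15small_R2.fastq
```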

Preprocess reads with fastp:

perl -1 data/samples/M15small_R1.fastq --pipe pre -t 8 -v

Download the Neovison vison genome and use it to filter host reads. Note that the first host-filtering run with a newly downloaded genome takes extra time, because the genome has to be indexed:

mkdir -p $data/hostgen
wget… -P $data/hostgen/
perl -1 data/samples/M15small_R1.fastq --pipe flt --hostgen $data/hostgen/GCA_900108605.1_NNQGG.v01_genomic.fna.gz -t 8 -v

Run assembly with Megahit and realign reads to the assembly:

perl -1 data/samples/M15small_R1.fastq -p ass,rea --ass megahit -t 8 -v

Run 1st round annotation with Minimap2 against your local minimap.refseq database:

perl -1 data/samples/M15small_R1.fastq -p ann1 --ann1 minimap.refseq -t 8 -v

Run 1st round annotation with SANSparallel against UniProt TrEMBL. Note that SANSparallel runs on a remote server and requires an internet connection. Append results to the Minimap2 annotations from the previous step:

perl -1 data/samples/M15small_R1.fastq -p ann1 --ann1 sans --append -t 8 -v

Now run a more complex 1st round annotation. Start by mapping contigs with Minimap2, then map the remaining unmapped contigs with SANSparallel, and finally map contigs that are still unmapped with BLASTN against database. Note that without the --append flag this will overwrite existing 1st round annotations:

perl -1 data/samples/M15small_R1.fastq -p ann1 --ann1 minimap.refseq,sans, -t 8 -v

Run 2nd round annotation. In the second round you can target archaeal+bacterial (=ab), bacteriophage (=ph), viral (=vi) and unmapped (=un) contigs, based on the labeling from the 1st round. Local databases for the 2nd round annotations are defined in the ann2.databases section of config.yaml. For example, to map viral contigs with BLASTN and BLASTP against local viral databases, type:

perl -1 data/samples/M15small_R1.fastq --pipe ann2 --ann2, -t 8 -v

Run 2nd round annotation for bacteria with BLASTN. Append results to BLASTN and BLASTP annotations from the previous step:

perl -1 data/samples/M15small_R1.fastq --pipe ann2 --ann2 blastn.ab.refseq --append -t 8 -v

You can also combine these runs in any order. For example:

perl -1 data/samples/M15small_R1.fastq --pipe ann2 --ann2 blastn.ab.refseq,, -t 8 -v

The most common combinations of 1st and 2nd round annotations can be saved to config.yaml in the ann.strategies section. Each annotation strategy is saved as a key-value pair. Several annotation strategies are predefined:

  • -- run only the 1st round with Minimap2 against RefSeq.abv
  • abv.nt -- 1st round: Minimap2 against NT.abv, 2nd round: BLASTN viral reads against and archaeal+bacterial reads against NT.ab
  • abv.refseq -- 1st round: Minimap2 against RefSeq.abv, 2nd round: BLASTN viral reads against and archaeal+bacterial reads against RefSeq.ab
  • abv.extend -- 1st round: Minimap2 against NT.abv + SANSparallel unmapped reads against TrEMBL, 2nd round: BLASTN viral reads against and archaeal+bacterial reads against NT.ab, additionally BLASTP viral reads against and archaeal+bacterial reads against UniRef100.ab
  • vi.nt -- 1st round: Minimap2 against, 2nd round: BLASTN viral reads against
  • vi.refseq -- 1st round: Minimap2 against, 2nd round: BLASTN viral reads against

Generate reports based on created annotations:

perl -1 data/samples/M15small_R1.fastq --pipe rep -t 8 -v

Generate assembly stats, pack for sharing and remove temporary files:

perl -1 data/samples/M15small_R1.fastq -p stats,pack,clean -t 8 -v

For convenience, routine analysis steps (pre,flt,ass,rea,ann1,ann2,rep,sta,pack,clean) can be called with the main tag. To run the main analysis with the norm annotation strategy, type:

perl -1 data/samples/M15small_R1.fastq -p main --anns norm -t 8 -v

Example 1: generated reports

Results are written to $res/$sample. The default value of $res is set in config.yaml, and the default value of $sample is derived from the name of the input read file. Both can be changed at runtime with --res mydir --sample mysample.


Assembled contigs and predicted ORFs

File or directory   Description
contigs             contigs sorted by taxa
contigs.fa          contigs in a single fasta file
contigs.ann1.ab.fa  archaeal+bacterial contigs (based on 1st round annotation) bacteriophage contigs (1st round) viral contigs (1st round)
contigs.ann1.un.fa  unmapped contigs (1st round)
contigs.ann2.ab.fa  archaeal+bacterial contigs (2nd round) bacteriophage contigs (2nd round) viral contigs (2nd round)
contigs.ann2.un.fa  unmapped contigs (2nd round)
contigs.orfs.aa.fa  predicted ORFs as aa sequences
contigs.orfs.nt.fa  predicted ORFs as nt sequences
scaffolds.fa        scaffolds, if available

Table 1: Lazypipe results: contigs and ORFs.
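A quick way to inspect these files is to count the sequences in each FASTA file. The sketch below is an illustration only: fasta_summary is a hypothetical helper, and it assumes that the files from Table 1 sit directly under the result directory.

```shell
# Print the number of sequences in each FASTA file of a result directory.
# Assumes the files listed in Table 1 are located directly in that directory.
fasta_summary() {
    dir=$1
    for f in contigs.fa contigs.orfs.aa.fa contigs.orfs.nt.fa scaffolds.fa; do
        if [ -f "$dir/$f" ]; then
            # Each FASTA record starts with a ">" header line.
            # grep -c exits non-zero on a zero count, hence "|| true".
            n=$(grep -c '^>' "$dir/$f" || true)
            printf '%s\t%s\n' "$f" "$n"
        else
            printf '%s\tmissing\n' "$f"
        fi
    done
}

# Usage: fasta_summary "$res/$sample"
```

Files that are absent, such as scaffolds.fa when the assembler produces no scaffolds, are reported as missing rather than treated as an error.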


Taxon Abundances

Spreadsheets with taxon abundances are printed to abund_table.xlsx. Abundances are displayed in separate tables for viruses (excluding bacteriophages), bacteria, bacteriophages and eukaryotes. For each domain, abundances are displayed at three taxonomic levels: species, genus and family.

For raw abundance data see abund_table.tsv.
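The raw table can be filtered from the command line. The sketch below assumes that abund_table.tsv is tab-separated and starts with a header row that uses the column names listed in Table 2; this guide does not state either, so check your file first. filter_by_column is a hypothetical helper name.

```shell
# Print the header and all rows of a tab-separated table whose value in
# column COL (looked up by name in the header row) equals VALUE.
# Usage: filter_by_column FILE COL VALUE
filter_by_column() {
    awk -F '\t' -v col="$2" -v val="$3" '
        NR == 1 {
            # Find the index of the requested column in the header.
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) { print "column not found: " col > "/dev/stderr"; exit 2 }
            print
            next
        }
        $c == val
    ' "$1"
}

# Example: keep only taxa with the most reliable confidence score.
# filter_by_column "$res/$sample/abund_table.tsv" csumq 1
```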

Column      Description
readn       read pairs assigned to this taxon
readn_pc    percentage of read pairs assigned to this taxon
csum        cumulative read distribution score (percentage of reads mapped to this taxon and more abundant taxa)
csumq       confidence score based on csum (1 ~ reliable, 2 ~ intermediate, 3 ~ unreliable)
contign     contigs assigned to this taxon
species     species name (NCBI taxonomy)
species_id  species taxid (NCBI taxonomy)
genus       genus name
genus_id    genus taxid
family      family name
family_id   family taxid

Table 2: Columns in abund_table.xlsx
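The csum description can be illustrated with a small calculation. The sketch below follows the wording of Table 2 literally: taxa are sorted from most to least abundant, and csum is the running total of readn_pc. It is not Lazypipe's own code, the input format (taxon and read count separated by a tab) is an assumption made for the example, and the thresholds that turn csum into csumq are not given here, so they are not reproduced.

```shell
# Read "taxon<TAB>readn" lines and print taxon, readn, readn_pc and csum,
# with taxa ordered from most to least abundant.
cumulative_share() {
    tab=$(printf '\t')
    sort -t "$tab" -k2,2nr "$1" | awk -F '\t' '
        { name[NR] = $1; n[NR] = $2; total += $2 }
        END {
            if (total == 0) exit          # nothing to report
            for (i = 1; i <= NR; i++) {
                pc = 100 * n[i] / total   # share of this taxon
                csum += pc                # this taxon plus all more abundant taxa
                printf "%s\t%d\t%.1f\t%.1f\n", name[i], n[i], pc, csum
            }
        }'
}

# Usage: cumulative_share counts.tsv
```

For three taxa with 60, 30 and 10 read pairs, the readn_pc values are 60.0, 30.0 and 10.0, and the csum values are 60.0, 90.0 and 100.0.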


Contig Annotations

Spreadsheets with contig annotations are printed to contigs_annot.xlsx. Spreadsheets are displayed separately for viruses (excluding bacteriophages), bacteria, bacteriophages and eukaryotes.

For raw annotation data see contigs_annot.tsv.

Column    Description
search    applied database search (e.g. blastn)
db        applied database (e.g.
dbtype    nucl for nucleotide and prot for protein databases
contig    contig id
orf       orf description in start-end:strand format
clen      contig length
sseqid    subject sequence id
bitscore  alignment score
alen      alignment length
pident    percent identity
qlen      query sequence length
qcov      query coverage
slen      subject sequence length
scov      subject coverage
staxid    subject sequence taxid
sname     subject sequence name
bphage    yes for bacteriophage staxids
species   assigned species
genus     assigned genus
family    assigned family
order     assigned order
class     assigned class

Table 3: Columns in contigs_annot.xlsx
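The raw annotation table can be summarised per family. The sketch below assumes that contigs_annot.tsv is tab-separated with a header row that uses the column names from Table 3, and that a contig may appear on several rows (for example once per ORF or per search). contigs_per_family is a hypothetical helper, not a Lazypipe command.

```shell
# Count distinct contigs per assigned family in a tab-separated annotation table.
# Usage: contigs_per_family FILE
contigs_per_family() {
    awk -F '\t' '
        NR == 1 {
            # Locate the "contig" and "family" columns by header name.
            for (i = 1; i <= NF; i++) {
                if ($i == "contig") c = i
                if ($i == "family") f = i
            }
            if (!c || !f) { print "missing contig or family column" > "/dev/stderr"; exit 2 }
            next
        }
        # Skip rows without a family; count each (family, contig) pair once.
        $f != "" && !seen[$f, $c]++ { count[$f]++ }
        END { for (fam in count) printf "%s\t%d\n", fam, count[fam] }
    ' "$1" | sort
}

# Usage: contigs_per_family "$res/$sample/contigs_annot.tsv"
```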


Krona Graph and Quality Control Plots

Quality Control (QC) plots include length histograms for reads and contigs, and survival plots. The survival plots track retained reads after each pipeline step.

File              Description
qc.read1.jpeg     length histogram for forward reads
qc.read2.jpeg     length histogram for reverse reads
qc.contigs.jpeg   length histogram for contigs
qc.readsurv.jpeg  read survival plots

Table 4: Quality Control plots