User guide | Lazypipe | University of Helsinki

Running Lazypipe

For up-to-date user guides, please see the Lazypipe Wiki.

Example 1

In this example we use a sample paired-end (PE) library that is included with the repository (data/samples/M15small_R*.fastq).

Preprocess reads with fastp:

perl lazypipe.pl -1 data/samples/M15small_R1.fastq --pipe pre -t 8 -v

Download the Neovison vison genome and use it to filter host reads. Note that the first host-filtering run with a newly downloaded genome takes extra time, because the genome must be indexed:

mkdir -p $data/hostgen

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/108/605/GCA_900108605.1_NNQGG… -P $data/hostgen/

perl lazypipe.pl -1 data/samples/M15small_R1.fastq --pipe flt --hostgen $data/hostgen/GCA_900108605.1_NNQGG.v01_genomic.fna.gz -t 8 -v
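The download step above can be made repeatable, so that re-running the example does not fetch the genome again. The following is a minimal sketch, not part of Lazypipe: the fetch_hostgen helper is hypothetical, and the genome URL is passed in as an argument because the full URL is not reproduced in this guide.

```shell
# Hypothetical helper: download a host genome only if it is not already present.
# $1 = full URL of the genome file (supplied by the user), $2 = target directory.
fetch_hostgen() {
    url="$1"; dir="$2"
    target="$dir/$(basename "$url")"
    mkdir -p "$dir"
    if [ -s "$target" ]; then
        # A non-empty file with the same name exists, so skip the download.
        echo "using existing $target"
    else
        wget "$url" -P "$dir" || return 1
        echo "downloaded $target"
    fi
}
```

For example, fetch_hostgen "$HOSTGEN_URL" "$data/hostgen", where HOSTGEN_URL is a variable you set to the full download URL.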

Run assembly with MEGAHIT and realign reads to the assembly:

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -p ass,rea --ass megahit -t 8 -v

Run 1st round annotation with Minimap2 against your local minimap.refseq database:

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -p ann1 --ann1 minimap.refseq -t 8 -v

Run 1st round annotation with SANSparallel against UniProt TrEMBL. Note that SANSparallel runs on a remote server and requires an internet connection. Append the results to the Minimap2 annotations from the previous step:

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -p ann1 --ann1 sans --append -t 8 -v

Now run a more complex 1st round annotation. Start by mapping contigs with Minimap2, then map the remaining unmapped contigs with SANSparallel, and finally map contigs that are still unmapped with BLASTN against the blastn.vi database. Note that without the --append flag this will overwrite existing 1st round annotations:

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -p ann1 --ann1 minimap.refseq,sans,blastn.vi -t 8 -v

Run 2nd round annotation. In the 2nd round you can target archaeal+bacterial (=ab), bacteriophage (=ph), viral (=vi) and unmapped (=un) contigs, based on the labeling from the 1st round. Local databases for 2nd round annotations are defined in the ann2.databases section of config.yaml. For example, to map viral contigs with BLASTN and BLASTP against local viral databases, type:

perl lazypipe.pl -1 data/samples/M15small_R1.fastq --pipe ann2 --ann2 blastn.vi.refseq,blastp.vi -t 8 -v

Run 2nd round annotation for bacteria with BLASTN. Append results to BLASTN and BLASTP annotations from the previous step:

perl lazypipe.pl -1 data/samples/M15small_R1.fastq --pipe ann2 --ann2 blastn.ab.refseq --append -t 8 -v

You can also combine these runs in any order. For example:

perl lazypipe.pl -1 data/samples/M15small_R1.fastq --pipe ann2 --ann2 blastn.ab.refseq,blastn.vi.refseq,blastp.vi -t 8 -v

The most common combinations of 1st and 2nd round annotations can be saved to config.yaml in the ann.strategies section. Each annotation strategy is saved as a key-value pair. Several annotation strategies are predefined:

abv.fast -- run only the 1st round with Minimap2 against RefSeq.abv
abv.nt -- 1st round: Minimap2 against NT.abv, 2nd round: BLASTN viral reads against NT.vi and archaeal+bacterial reads against NT.ab
abv.refseq -- 1st round: Minimap2 against RefSeq.abv, 2nd round: BLASTN viral reads against RefSeq.vi and archaeal+bacterial reads against RefSeq.ab
abv.extend -- 1st round: Minimap2 against NT.abv + SANSparallel unmapped reads against TrEMBL, 2nd round: BLASTN viral reads against NT.vi and archaeal+bacterial reads against NT.ab, additionally BLASTP viral reads against UniRef100.vi and archaeal+bacterial reads against UniRef100.ab
vi.nt -- 1st round: Minimap2 against NT.vi, 2nd round: BLASTN viral reads against NT.vi
vi.refseq -- 1st round: Minimap2 against RefSeq.vi, 2nd round: BLASTN viral reads against RefSeq.vi

Generate reports based on created annotations:

perl lazypipe.pl -1 data/samples/M15small_R1.fastq --pipe rep -t 8 -v

Generate assembly stats, pack for sharing and remove temporary files:

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -p stats,pack,clean -t 8 -v

For convenience, the routine analysis steps (pre,flt,ass,rea,ann1,ann2,rep,sta,pack,clean) can be called with the main tag. To run the main analysis with the normal annotation strategy (--anns norm), type:

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -p main --anns norm -t 8 -v
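When finer control than the main tag is needed, the individual steps shown earlier can be chained in a small wrapper that stops at the first failing step. This is a minimal sketch under stated assumptions: the run_step and run_all helpers and the DRY_RUN variable are hypothetical (not part of Lazypipe), the script is assumed to run from the repository root, host filtering is left out, and -p is assumed to be interchangeable with --pipe, as the commands in this guide suggest.

```shell
# Hypothetical wrapper around the commands from this guide.
# READS and THREADS mirror the values used above; DRY_RUN is this sketch's own switch.
READS="${READS:-data/samples/M15small_R1.fastq}"
THREADS="${THREADS:-8}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run_step() {
    # $1 = value for -p, remaining arguments = extra options for that step
    step="$1"; shift
    set -- perl lazypipe.pl -1 "$READS" -p "$step" "$@" -t "$THREADS" -v
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@" || { echo "step '$step' failed" >&2; return 1; }
    fi
}

run_all() {
    # Each step runs only if the previous one succeeded.
    run_step pre &&
    run_step ass,rea --ass megahit &&
    run_step ann1 --ann1 minimap.refseq &&
    run_step ann2 --ann2 blastn.vi.refseq,blastp.vi &&
    run_step rep
}
```

With the default DRY_RUN=1, calling run_all only prints the five commands, which makes it easy to review them before setting DRY_RUN=0.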

Example 1: generated reports

Results are written to $res/$sample. The default value of $res is set in config.yaml, and the default value of $sample is derived from the name of the input reads. Both can be overridden at runtime with --res mydir --sample mysample.
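Setting --res and --sample explicitly is convenient when several libraries are processed in one go. The sketch below only prints one command per forward-read file. The batch_commands helper is hypothetical, and its rule for naming samples (the file name up to "_R1") is this sketch's own convention, not necessarily how Lazypipe derives its default.

```shell
# Hypothetical helper: print one Lazypipe command per forward-read file.
# $1 = results directory, remaining arguments = forward-read files.
batch_commands() {
    res="$1"; shift
    for reads in "$@"; do
        sample=$(basename "$reads")
        sample=${sample%%_R1*}    # strip "_R1" and everything after it
        echo "perl lazypipe.pl -1 $reads -p main --anns norm --res $res --sample $sample -t 8 -v"
    done
}
```

For example, batch_commands mydir data/samples/M15small_R1.fastq prints a single command whose results would go to mydir/M15small.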

Assembled contigs and predicted ORFs

File or Directory	Description
contigs	contigs sorted by taxa
contigs.fa	contigs in a single fasta file
contigs.ann1.ab.fa	archaeal+bacterial contigs (based on 1st round annotation)
contigs.ann1.ph.fa	bacteriophage contigs (1st round)
contigs.ann1.vi.fa	viral contigs (1st round)
contigs.ann1.un.fa	unmapped contigs (1st round)
contigs.ann2.ab.fa	archaeal+bacterial contigs (2nd round)
contigs.ann2.ph.fa	bacteriophage contigs (2nd round)
contigs.ann2.vi.fa	viral contigs (2nd round)
contigs.ann2.un.fa	unmapped contigs (2nd round)
contigs.orfs.aa.fa	predicted ORFs as aa sequences
contigs.orfs.nt.fa	predicted ORFs as nt sequences
scaffolds.fa	scaffolds, if available

Table 1: Lazypipe results: contigs and ORFs.
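A quick way to see how contigs were distributed across the categories in Table 1 is to count FASTA header lines in each file. This is a minimal sketch: the count_contigs helper is hypothetical, and it assumes the listed files sit directly in the directory passed as its argument.

```shell
# Hypothetical helper: count sequences (FASTA header lines) per contig file.
# $1 = directory holding the files from Table 1 (assumed layout).
count_contigs() {
    resdir="$1"
    for f in "$resdir"/contigs.fa "$resdir"/contigs.ann1.*.fa "$resdir"/contigs.ann2.*.fa; do
        [ -f "$f" ] || continue          # skip files that were not produced
        n=$(grep -c '^>' "$f")           # one header line per sequence
        printf '%s\t%s\n' "$(basename "$f")" "$n"
    done
}
```

For example, count_contigs "$res/$sample" prints one tab-separated line per file that exists.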

Taxon Abundances

Spreadsheets with taxon abundances are printed to abund_table.xlsx. Abundances are displayed in separate tables for viruses (excluding bacteriophages), bacteria, bacteriophages and eukaryotes. For each domain, abundances are displayed at three taxonomic levels: species, genus and family.

For raw abundance data see abund_table.tsv.

column	description
readn	read pairs assigned to this taxon
readn_pc	percentage of read pairs assigned to this taxon
csum	cumulative read distribution score (percentage of reads mapped to this taxon and more abundant taxa)
csumq	confidence score based on csum (1 ~ reliable, 2 ~ intermediate, 3 ~ unreliable)
contign	contigs assigned to this taxon
species	species name (NCBI taxonomy)
species_id	species taxid (NCBI taxonomy)
genus	genus name
genus_id	genus taxid
family	family name
family_id	family taxid

Table 2: Columns in abund_table.xlsx
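The raw table can be filtered on the command line, for instance to keep only taxa whose csumq is 1 (reliable). The following is a minimal sketch: the reliable_taxa helper is hypothetical, and it assumes that abund_table.tsv is tab-separated with a header row that uses the column names from Table 2.

```shell
# Hypothetical helper: print species, readn and readn_pc for rows with csumq equal to 1.
# $1 = path to abund_table.tsv (assumed: tab-separated, header row named as in Table 2).
reliable_taxa() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header name to column index
        $(col["csumq"]) == 1 {
            print $(col["species"]) "\t" $(col["readn"]) "\t" $(col["readn_pc"])
        }
    ' "$1"
}
```

Looking columns up by header name, rather than by position, keeps the filter working even if the column order differs from Table 2.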

Contig Annotations

Spreadsheets with contig annotations are printed to contigs_annot.xlsx. Spreadsheets are displayed separately for viruses (excluding bacteriophages), bacteria, bacteriophages and eukaryotes.

For raw annotation data see contigs_annot.tsv.

column	description
search	applied database search (e.g. blastn)
db	applied database (e.g. UniRef100.vi)
dbtype	nucl for nucleotide and prot for protein databases
contig	contig id
orf	orf description in start-end:strand format
clen	contig length
sseqid	subject sequence id
bitscore	alignment score
alen	alignment length
pident	percent identity
qlen	query sequence length
qcov	query coverage
slen	subject sequence length
scov	subject coverage
staxid	subject sequence taxid
sname	subject sequence name
bphage	yes for bacteriophage staxids
species	assigned species
genus	assigned genus
family	assigned family
order	assigned order
class	assigned class

Table 3: Columns in contigs_annot.xlsx
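The raw annotation table lends itself to the same kind of command-line filtering, for example to list non-bacteriophage hits above a percent-identity cutoff. This is a minimal sketch: the strong_hits helper and its default cutoff of 90 are hypothetical, and contigs_annot.tsv is assumed to be tab-separated with a header row that uses the column names from Table 3.

```shell
# Hypothetical helper: list contigs with a non-bacteriophage hit at or above a minimum pident.
# $1 = path to contigs_annot.tsv, $2 = minimum percent identity (default 90, an arbitrary choice).
strong_hits() {
    file="$1"; min_pident="${2:-90}"
    awk -F '\t' -v min="$min_pident" '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header name to column index
        $(col["bphage"]) != "yes" && $(col["pident"]) + 0 >= min + 0 {
            print $(col["contig"]) "\t" $(col["sname"]) "\t" $(col["pident"])
        }
    ' "$file"
}
```

For example, strong_hits "$res/$sample/contigs_annot.tsv" 95 prints the contig id, subject name and percent identity of each matching row.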

Krona Graph and Quality Control Plots

Quality Control (QC) plots include length histograms for reads and contigs, and survival plots. The survival plots track retained reads after each pipeline step.

file	description
qc.read1.jpeg	length histogram for forward reads
qc.read2.jpeg	length histogram for reverse reads
qc.contigs.jpeg	length histogram for contigs
qc.readsurv.jpeg	read survival plots

Table 4: Quality Control plots
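After a run it can be useful to confirm that all four plots from Table 4 were produced. The sketch below is a hypothetical helper, not part of Lazypipe, and it assumes the plots are written directly to the result directory passed as its argument.

```shell
# Hypothetical helper: report QC plots that are missing or empty.
# $1 = result directory (assumed to contain the files from Table 4).
check_qc_plots() {
    resdir="$1"; missing=0
    for f in qc.read1.jpeg qc.read2.jpeg qc.contigs.jpeg qc.readsurv.jpeg; do
        if [ ! -s "$resdir/$f" ]; then
            echo "missing or empty: $f" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]    # succeed only if every plot is present and non-empty
}
```

For example, check_qc_plots "$res/$sample" exits with a non-zero status and names the affected files if any plot is absent.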