For up-to-date user guides, please see the Lazypipe Wiki.
In this example we use a sample paired-end (PE) library that is included with the repository (data/samples/M15small_R*.fastq).
Preprocess reads with fastp:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq --pipe pre -t 8 -v
Download the Neovison vison genome and use it to filter host reads. Note that the first host-filtering run with a newly downloaded genome takes extra time, because the genome must be indexed first:
mkdir -p $data/hostgen
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/108/605/GCA_900108605.1_NNQGG.v01/GCA_900108605.1_NNQGG.v01_genomic.fna.gz -P $data/hostgen/
perl lazypipe.pl -1 data/samples/M15small_R1.fastq --pipe flt --hostgen $data/hostgen/GCA_900108605.1_NNQGG.v01_genomic.fna.gz -t 8 -v
Run assembly with MEGAHIT and realign reads to the assembly:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -p ass,rea --ass megahit -t 8 -v
Run 1st round annotation with Minimap2 against your local minimap.refseq database:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -p ann1 --ann1 minimap.refseq -t 8 -v
Run 1st round annotation with SANSparallel against UniProt TrEMBL. Note that SANSparallel runs on a remote server and requires an internet connection. Append results to the Minimap2 annotations from the previous step:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -p ann1 --ann1 sans --append -t 8 -v
Now run a more complex 1st round annotation. Start by mapping contigs with Minimap2, then map the remaining unmapped contigs with SANSparallel, and finally map the still-unmapped contigs with BLASTN against the blastn.vi database. Note that without the --append flag this will overwrite existing 1st round annotations:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -p ann1 --ann1 minimap.refseq,sans,blastn.vi -t 8 -v
Run 2nd round annotation. In the second round you can target archaeal+bacterial (=ab), bacteriophage (=ph), viral (=vi) and unmapped (=un) contigs, based on labeling from the 1st round. Local databases for the 2nd round annotations are defined in the ann2.databases section of config.yaml. For example, to map viral contigs with BLASTN and BLASTP against local viral databases, type:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq --pipe ann2 --ann2 blastn.vi.refseq,blastp.vi -t 8 -v
Run 2nd round annotation for bacteria with BLASTN. Append results to BLASTN and BLASTP annotations from the previous step:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq --pipe ann2 --ann2 blastn.ab.refseq --append -t 8 -v
You can also combine these runs in any order. For example:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq --pipe ann2 --ann2 blastn.ab.refseq,blastn.vi.refseq,blastp.vi -t 8 -v
The most common combinations of 1st and 2nd round annotations can be saved to config.yaml in the ann.strategies section. Each annotation strategy is saved as a key-value pair. Several annotation strategies are predefined:
abv.fast -- run only the 1st round with Minimap2 against RefSeq.abv
abv.nt -- 1st round: Minimap2 against NT.abv, 2nd round: BLASTN viral reads against NT.vi and archaeal+bacterial reads against NT.ab
abv.refseq -- 1st round: Minimap2 against RefSeq.abv, 2nd round: BLASTN viral reads against RefSeq.vi and archaeal+bacterial reads against RefSeq.ab
abv.extend -- 1st round: Minimap2 against NT.abv + SANSparallel unmapped reads against TrEMBL, 2nd round: BLASTN viral reads against NT.vi and archaeal+bacterial reads against NT.ab, additionally BLASTP viral reads against UniRef100.vi and archaeal+bacterial reads against UniRef100.ab
vi.nt -- 1st round: Minimap2 against NT.vi, 2nd round: BLASTN viral reads against NT.vi
vi.refseq -- 1st round: Minimap2 against RefSeq.vi, 2nd round: BLASTN viral reads against RefSeq.vi
Generate reports based on created annotations:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq --pipe rep -t 8 -v
Generate assembly stats, pack for sharing and remove temporary files:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -p stats,pack,clean -t 8 -v
For convenience, routine analysis steps (pre,flt,ass,rea,ann1,ann2,rep,sta,pack,clean) can be called with the main tag. To run the main analysis with the normal annotation strategy, type:
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -p main --anns norm -t 8 -v
Results are output to $res/$sample. The default value for $res is set in config.yaml, and the default value for $sample is derived from the name of the input reads. Both can be changed at runtime with --res mydir --sample mysample.
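When several libraries need the same main analysis, it can help to generate the command lines first and inspect them before running anything. The sketch below is one way to do that in POSIX shell. It uses only options shown in this guide (-1, -p main, --anns, --res, --sample, -t, -v). The helper names sample_name and main_cmd are hypothetical, and the rule that strips _R1.fastq from the file name is our own convention for this example, not necessarily how Lazypipe derives its default $sample.

```shell
# sample_name READS: derive a sample label from a forward-read file name.
# ASSUMPTION: this naming rule is our own convention for the example,
# not Lazypipe's built-in default for $sample.
sample_name() {
    name=$(basename "$1")
    name=${name%.gz}       # drop optional .gz
    name=${name%.fastq}    # drop .fastq
    name=${name%_R1}       # drop the forward-read suffix
    printf '%s\n' "$name"
}

# main_cmd READS RESDIR STRATEGY: print (do not run) the main-pipeline command
# with an explicit result directory and sample name.
main_cmd() {
    reads=$1
    res=$2
    anns=$3
    printf 'perl lazypipe.pl -1 %s -p main --anns %s --res %s --sample %s -t 8 -v\n' \
        "$reads" "$anns" "$res" "$(sample_name "$reads")"
}
```

For example, `main_cmd data/samples/M15small_R1.fastq results abv.refseq` prints a command that would write its results to results/M15small. Calling main_cmd inside a `for` loop over several `*_R1.fastq` files gives a list of commands to review, which can then be piped to `sh` to execute them.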
File or Directory | Description |
---|---|
contigs | contigs sorted by taxa |
contigs.fa | contigs in a single fasta file |
contigs.ann1.ab.fa | archaeal+bacterial contigs (based on 1st round annotation) |
contigs.ann1.ph.fa | bacteriophage contigs (1st round) |
contigs.ann1.vi.fa | viral contigs (1st round) |
contigs.ann1.un.fa | unmapped contigs (1st round) |
contigs.ann2.ab.fa | archaeal+bacterial contigs (2nd round) |
contigs.ann2.ph.fa | bacteriophage contigs (2nd round) |
contigs.ann2.vi.fa | viral contigs (2nd round) |
contigs.ann2.un.fa | unmapped contigs (2nd round) |
contigs.orfs.aa.fa | predicted ORFs as aa sequences |
contigs.orfs.nt.fa | predicted ORFs as nt sequences |
scaffolds.fa | scaffolds, if available |
Table 1: Lazypipe results: contigs and ORFs.
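A quick sanity check after a run is to count how many contigs ended up in each of the FASTA files listed in Table 1. The following is a minimal sketch, assuming the contigs*.fa files sit directly in the result directory ($res/$sample); count_contigs is a hypothetical helper name, not part of Lazypipe.

```shell
# count_contigs DIR: print "file<TAB>number of sequences" for each contigs*.fa in DIR.
# ASSUMPTION: the FASTA files from Table 1 are located directly in DIR.
count_contigs() {
    dir=$1
    for fa in "$dir"/contigs*.fa; do
        [ -e "$fa" ] || continue              # skip if the glob matched nothing
        # Each FASTA record starts with ">"; grep -c exits non-zero on 0 matches,
        # so "|| true" keeps the loop going for empty files.
        n=$(grep -c '^>' "$fa" || true)
        printf '%s\t%s\n' "$(basename "$fa")" "$n"
    done
}
```

Called as `count_contigs "$res/$sample"`, this prints one line per file, so an unexpectedly empty contigs.ann1.vi.fa or a very large contigs.ann1.un.fa is easy to spot.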
Spreadsheets with taxon abundances are printed to abund_table.xlsx. Abundances are displayed in separate tables for viruses (excluding bacteriophages), bacteria, bacteriophages and eukaryotes. For each domain, abundances are displayed at three taxonomic levels: species, genus and family.
For raw abundance data see abund_table.tsv.
column | description |
---|---|
readn | read pairs assigned to this taxon |
readn_pc | percentage of read pairs assigned to this taxon |
csum | cumulative read distribution score (percentage of reads mapped to this taxon and more abundant taxa) |
csumq | confidence score based on csum (1 ~ reliable, 2 ~ intermediate, 3 ~ unreliable) |
contign | contigs assigned to this taxon |
species | species name (NCBI taxonomy) |
species_id | species taxid (NCBI taxonomy) |
genus | genus name |
genus_id | genus taxid |
family | family name |
family_id | family taxid |
Table 2: Columns in abund_table.xlsx
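The csumq column makes it straightforward to pull out only the taxa flagged as reliable. Below is a minimal awk sketch for the raw table. It assumes that abund_table.tsv is tab-separated and starts with a header row that uses the column names from Table 2; this has not been checked against the actual file layout. The helper name reliable_taxa is hypothetical.

```shell
# reliable_taxa FILE: print species and readn for rows with csumq == 1.
# ASSUMPTION: FILE is tab-separated with a header row named as in Table 2.
reliable_taxa() {
    awk -F '\t' -v OFS='\t' '
        NR == 1 {
            # Map header names to column indices so column order does not matter.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!(("csumq" in col) && ("species" in col) && ("readn" in col))) {
                print "expected columns csumq, species, readn not found" > "/dev/stderr"
                exit 1
            }
            print "species", "readn"
            next
        }
        $col["csumq"] == 1 { print $col["species"], $col["readn"] }
    ' "$1"
}
```

Looking columns up by name rather than by position keeps the filter working if the column order differs from Table 2. Change the comparison to `<= 2` to also keep taxa with intermediate confidence.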
Spreadsheets with contig annotations are printed to contigs_annot.xlsx. Spreadsheets are displayed separately for viruses (excluding bacteriophages), bacteria, bacteriophages and eukaryotes.
For raw annotation data see contigs_annot.tsv.
column | description |
---|---|
search | applied database search (e.g. blastn) |
db | applied database (e.g. UniRef100.vi) |
dbtype | nucl for nucleotide and prot for protein databases |
contig | contig id |
orf | orf description in start-end:strand format |
clen | contig length |
sseqid | subject sequence id |
bitscore | alignment score |
alen | alignment length |
pident | percent identity |
qlen | query sequence length |
qcov | query coverage |
slen | subject sequence length |
scov | subject coverage |
staxid | subject sequence taxid |
sname | subject sequence name |
bphage | yes for bacteriophage staxids |
species | assigned species |
genus | assigned genus |
family | assigned family |
order | assigned order |
class | assigned class |
Table 3: Columns in contigs_annot.xlsx
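The raw annotation table can be summarised on the command line, for example by counting how many distinct contigs were assigned to each family. The sketch below assumes that contigs_annot.tsv is tab-separated with a header row using the column names from Table 3, and that one contig may occupy several rows (for instance one per ORF), so contig ids are de-duplicated per family. Both points are assumptions, and contigs_per_family is a hypothetical helper name.

```shell
# contigs_per_family FILE: print "family<TAB>number of distinct contigs", sorted by family.
# ASSUMPTION: FILE is tab-separated with a header row named as in Table 3.
contigs_per_family() {
    awk -F '\t' -v OFS='\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!(("family" in col) && ("contig" in col))) {
                print "expected columns family and contig not found" > "/dev/stderr"
                exit 1
            }
            next
        }
        {
            fam = $col["family"]
            key = fam SUBSEP $col["contig"]
            if (!(key in seen)) {       # count each contig once per family
                seen[key] = 1
                count[fam]++
            }
        }
        END { for (f in count) print f, count[f] }
    ' "$1" | sort
}
```

The same header-lookup pattern can be reused to filter on other columns from Table 3, such as search, db or pident.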
Quality Control (QC) plots include length histograms for reads and contigs, and survival plots. The survival plots track retained reads after each pipeline step.
file | description |
---|---|
qc.read1.jpeg | length histogram for forward reads |
qc.read2.jpeg | length histogram for reverse reads |
qc.contigs.jpeg | length histogram for contigs |
qc.readsurv.jpeg | read survival plots |
Table 4: Quality Control plots