Examples
Lazypipe Usage and Examples

Running Lazypipe on CSC

Lazypipe is available as a preinstalled module on the Puhti server at the Finnish Center for Scientific Computing (CSC). To start using Lazypipe, log in to Puhti and type:

module load r-env-singularity
module load biokit
module load lazypipe
sbatch-lazypipe -1 mydata/reads_R1.fastq -2 mydata/reads_R2.fastq --hostgen mydata/host_genome.fna.gz --res resdir --sample sampleid --pipe main

The script will prompt you for the accounting project, the maximum duration of the job, the memory reservation (at least 4 GB × number_of_cores is recommended) and the number of cores to reserve. It then creates a job script and submits it to the sbatch job system.
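The memory recommendation above is simple arithmetic. The following is a minimal sketch, assuming the reservation is a total for the whole job rather than a per-core value; the function name recommended_mem_gb and the core count are hypothetical and not part of Lazypipe.

```shell
# Rule of thumb from the text: reserve at least 4 GB per reserved core.
# recommended_mem_gb is a hypothetical helper, not a Lazypipe command.
recommended_mem_gb() {
    local cores="$1"
    echo $(( 4 * cores ))
}

cores=8   # hypothetical number of cores you plan to reserve
echo "Reserve at least $(recommended_mem_gb "$cores") GB for $cores cores"
```

The printed value can be typed in when sbatch-lazypipe asks for the memory reservation.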

Running with lazypipe.pl

For these examples, we will use data/samples/M15small_R*.fastq (paired-end Illumina reads from an environmental mink feces sample).

Run main analysis steps with default options

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p main

Run preprocessing with Trimmomatic

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p pre --pre trimm -v

Filter host reads with Neovison vison genome

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/108/605/GCA_900108605.1_NNQGG.v01/GCA_900108605.1_NNQGG.v01_genomic.fna.gz
mv GCA_900108605.1_NNQGG.v01_genomic.fna.gz $data/hostgen/
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p flt --hostgen $data/hostgen/GCA_900108605.1_NNQGG.v01_genomic.fna.gz

Run assembling with SPAdes + realign reads to assembly

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p ass,rea --ass spades -v

Run annotation with minimap2 + update reports

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p ann,rep --ann minimap -t 16 -v

Confirm virus contigs with local blastn

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p blastv -t 16 -v

Search for viruses in unmapped contigs with local blastn

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p blastu -t 16 -v

Pack results to *.tar.gz

perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p pack -v
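If you prefer step-wise control over "-p main", the individual commands above can be chained in a small wrapper. The following is a sketch only: the step order follows the order in which the examples appear above and is an assumption, and the helper names run_step and lazypipe_steps as well as the DRY_RUN variable are hypothetical. By default the wrapper only prints the commands.

```shell
READS="data/samples/M15small_R1.fastq"   # sample reads used in the examples above
SAMPLE="M15"
DRY_RUN="${DRY_RUN:-1}"                  # assumption: print commands only unless DRY_RUN=0

# Print the command; execute it only when DRY_RUN=0.
run_step() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Commands copied from the examples above; "&&" stops at the first failing step.
lazypipe_steps() {
    run_step perl lazypipe.pl -1 "$READS" -s "$SAMPLE" -p pre --pre trimm -v &&
    run_step perl lazypipe.pl -1 "$READS" -s "$SAMPLE" -p ass,rea --ass spades -v &&
    run_step perl lazypipe.pl -1 "$READS" -s "$SAMPLE" -p ann,rep --ann minimap -t 16 -v &&
    run_step perl lazypipe.pl -1 "$READS" -s "$SAMPLE" -p pack -v
}

lazypipe_steps
```

Review the printed commands first, then set DRY_RUN=0 in the environment to actually execute them.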

Running with Snakemake

Snakemake works by declaring the end file you wish to produce.

Start by listing your input fastq files under the datain key in the config.yaml file. Prepend each file with its sample id.

For this example, we will use data/samples/M15small. In your config.yaml type:

datain:
  M15: data/samples/M15small_R1.fastq
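With several samples, the datain block can be generated rather than typed by hand. The following is a minimal sketch, assuming each sample is given as a "sample_id=path" pair; the helper name make_datain, the sample id M16 and its path are hypothetical.

```shell
# Print a "datain:" block for config.yaml from "sample_id=path" pairs.
# make_datain is a hypothetical helper, not part of Lazypipe.
make_datain() {
    local pair
    echo "datain:"
    for pair in "$@"; do
        # ${pair%%=*} is the sample id, ${pair#*=} is the path to the R1 fastq file
        printf '  %s: %s\n' "${pair%%=*}" "${pair#*=}"
    done
}

# M15 is the sample from the text; M16 and its path are made up for illustration.
make_datain \
    "M15=data/samples/M15small_R1.fastq" \
    "M16=data/samples/M16small_R1.fastq"
```

The output can be pasted into config.yaml in place of the single-sample block.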

Run main steps with default options

snakemake --cores 8 results/M15.tar.gz

Run preprocessing with Trimmomatic. Overwrite any trimmed reads produced by previous runs with --force:

snakemake --config pre="trimm" --cores 8 results/M15/trimmed_paired1.fq --force -p

Run assembling with SPAdes. Overwrite any contigs produced by previous runs with --force:

snakemake --config ass="spades" --cores 8 results/M15/contigs.fa --force -p

Redo annotation with minimap2:

snakemake --config ann="minimap" --cores 16 results/M15/M15.tar.gz --force -p

Confirm viral contigs with local blastn:

snakemake --config blastv=1 --cores 16 results/M15/contigs_vi.annot.xlsx -p

Search for viruses in unmapped contigs with local blastn

snakemake --config blastu=1 --cores 16 results/M15/contigs_un.annot.xlsx -p

Analyzing SARS2 SRA data

In this example we will analyze public Illumina HiSeq/MiSeq libraries sequenced from five patients at the early stage of the SARS2 outbreak in Wuhan, China. For more information see NCBI BioProject PRJNA605983. This example is written for the CSC Puhti environment; other Unix environments follow similar steps.

Download data

In this example we will use the NCBI SRA Toolkit to download NGS libraries. SRA Toolkit is available on CSC as part of the biokit module. Other users can install SRA Toolkit from the NCBI website.

Start by configuring SRA Toolkit with the vdb-config utility (included in the kit). Set the SRA Toolkit download directory to $data/sra/ or any other convenient location:

module load biokit
vdb-config -i

Download any SRA library from project PRJNA605983 (NCBI accession numbers SRR11092056-SRR11092064). The following example uses SRR11092062, sequenced from sample WIV04-2. After downloading, dump the fastq files to $data/sra/reads or any other convenient location (set the $data environment variable or substitute your own path):

module load biokit
prefetch SRR11092062
mkdir -p $data/sra/reads/
fasterq-dump --split-files --outdir $data/sra/reads/ SRR11092062
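To fetch every library in the project rather than a single run, the accession range given above can be expanded in a loop. This is a sketch under stated assumptions: the helper names list_accessions and print_download_commands are hypothetical, and the script only prints the prefetch and fasterq-dump commands so that they can be reviewed before anything is downloaded.

```shell
# List the run accessions SRR11092056 ... SRR11092064 named in the text.
list_accessions() {
    local n=11092056
    while [ "$n" -le 11092064 ]; do
        echo "SRR${n}"
        n=$(( n + 1 ))
    done
}

# Print (do not run) the download and dump commands for each accession.
# The output directory mirrors the location used above; \$data is left
# unexpanded so it is resolved when the printed commands are executed.
print_download_commands() {
    local acc
    for acc in $(list_accessions); do
        echo "prefetch ${acc}"
        echo "fasterq-dump --split-files --outdir \$data/sra/reads/ ${acc}"
    done
}

print_download_commands
```

Once the list looks right, the printed commands can be saved to a file and run with bash after loading the biokit module.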

Set up your config.yaml:

datain:
  sars2_wiv04_2: "$data/sra/reads/SRR11092062_1.fastq"
hostgen_sm:
  sars2_wiv04_2: "$data/hostgen/GCA_000001405.15_GRCh38_genomic.fna.gz"
hostgen_taxid_sm:
  sars2_wiv04_2: 9606

If you are using Slurm, create an sbatch execution file, example2.bash:

#!/bin/bash
#SBATCH --job-name=sars2_wiv04_2
#SBATCH --account=my_project
#SBATCH --time=06:00:00
#SBATCH --mem-per-cpu=4G
#SBATCH --cpus-per-task=32
#SBATCH --partition=small

conda activate lazypipe
snakemake -p --cores $SLURM_CPUS_PER_TASK results/sars2_wiv04_2.tar.gz

Execute by calling sbatch:

sbatch example2.bash

Result summary

The following table lists the virus abundances reported by Lazypipe for the SARS-positive libraries (PRJNA605983).

| SRA run | SRA experiment | Platform | Library | Virus | Taxid | readn | readn% | csumq | contign |
|---|---|---|---|---|---|---|---|---|---|
| SRR11092063 | SRX7730880 | RNA-Seq Illumina HiSeq 3000 | WIV02-2 | Severe acute respiratory syndrome-related coronavirus | 694009 | 559 | 0.3685% | 1 | 23 |
| SRR11092057 | SRX7730886 | RNA-Seq Illumina MiSeq | WIV04 | Severe acute respiratory syndrome-related coronavirus | 694009 | 732 | 13.0878% | 1 | 15 |
| SRR11092062 | SRX7730881 | RNA-Seq Illumina HiSeq 1000 | WIV04-2 | Severe acute respiratory syndrome-related coronavirus | 694009 | 5918 | 3.0027% | 1 | 1 |
| SRR11092062 | SRX7730881 | RNA-Seq Illumina HiSeq 1000 | WIV04-2 | Influenza A virus | 11320 | 274 | 0.1390% | 1 | 2 |
| SRR11092062 | SRX7730881 | RNA-Seq Illumina HiSeq 1000 | WIV04-2 | Autographa californica multiple nucleopolyhedrovirus | 307456 | 205 | 0.1040% | 1 | 2 |
| SRR11092061 | SRX7730882 | RNA-Seq Illumina HiSeq 3000 | WIV05 | Severe acute respiratory syndrome-related coronavirus | 694009 | 234 | 0.0510% | 1 | 20 |
| SRR11092061 | SRX7730882 | RNA-Seq Illumina HiSeq 3000 | WIV05 | Saccharomyces 20S RNA narnavirus | 186772 | 135 | 0.0294% | 2 | 1 |
| SRR11092060 | SRX7730883 | RNA-Seq Illumina HiSeq 3000 | WIV06-2 | Severe acute respiratory syndrome-related coronavirus | 694009 | 525 | 0.1417% | 1 | 22 |
| SRR11092060 | SRX7730883 | RNA-Seq Illumina HiSeq 3000 | WIV06-2 | Spodoptera frugiperda rhabdovirus | 1481139 | 165 | 0.0445% | 1 | 1 |
| SRR11092060 | SRX7730883 | RNA-Seq Illumina HiSeq 3000 | WIV06-2 | Saccharomyces 20S RNA narnavirus | 186772 | 103 | 0.0278% | 2 | 3 |
| SRR11092059 | SRX7730884 | RNA-Seq Illumina HiSeq 3000 | WIV07-2 | Influenza A virus | 11320 | 9063 | 0.0974% | 1 | 4 |
| SRR11092059 | SRX7730884 | RNA-Seq Illumina HiSeq 3000 | WIV07-2 | Saccharomyces 20S RNA narnavirus | 186772 | 3386 | 0.0364% | 1 | 1 |
| SRR11092059 | SRX7730884 | RNA-Seq Illumina HiSeq 3000 | WIV07-2 | Severe acute respiratory syndrome-related coronavirus | 694009 | 819 | 0.0088% | 2 | 16 |
| SRR11092059 | SRX7730884 | RNA-Seq Illumina HiSeq 3000 | WIV07-2 | Bamboo mosaic virus | 35286 | 325 | 0.0035% | 2 | 1 |
| SRR11092059 | SRX7730884 | RNA-Seq Illumina HiSeq 3000 | WIV07-2 | Spodoptera frugiperda rhabdovirus | 1481139 | 168 | 0.0018% | 2 | 1 |