Lazypipe is available as a preinstalled module on the Puhti server at the Finnish Center for Scientific Computing (CSC). To start using Lazypipe, log in to Puhti and type:
module load r-env-singularity
module load biokit
module load lazypipe
sbatch-lazypipe -1 mydata/reads_R1.fastq -2 mydata/reads_R2.fastq --hostgen mydata/host_genome.fna.gz --res resdir --sample sampleid --pipe main
The script will ask you for the accounting project, the maximum duration of the job, the memory reservation (at least 4 GB x number_of_cores is recommended) and the number of cores to reserve. It will then create a job script and submit it to the sbatch job system.
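As a rough sketch of the memory recommendation above, the minimum reservation can be worked out from the core count before answering the prompts. The function name `min_mem_gb` and the variables `cores` and `mem_gb` are illustrative only and are not part of Lazypipe.

```shell
# Hypothetical helper: recommended minimum memory in GB for a given core count,
# following the "4 GB x number_of_cores" rule of thumb stated above.
min_mem_gb() {
    echo $(( 4 * $1 ))
}

cores=8
mem_gb=$(min_mem_gb "$cores")
echo "Reserve at least ${mem_gb} GB for ${cores} cores"
```

With 8 cores this suggests a reservation of at least 32 GB.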
For these examples, we will use data/samples/M15small_R*.fastq (paired-end Illumina reads from a mink feces environmental sample).
Run main analysis steps with default options
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p main
Run preprocessing with Trimmomatic
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p pre --pre trimm -v
Filter host reads with Neovison vison genome
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/108/605/GCA_900108605.1_NNQGG.v01/GCA_900108605.1_NNQGG.v01_genomic.fna.gz
mv GCA_900108605.1_NNQGG.v01_genomic.fna.gz $data/hostgen/
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p flt --hostgen $data/hostgen/GCA_900108605.1_NNQGG.v01_genomic.fna.gz
Run assembling with SPAdes + realign reads to assembly
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p ass,rea --ass spades -v
Run annotation with minimap2 + update reports
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p ann,rep --ann minimap -t 16 -v
Confirm virus contigs with local blastn
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p blastv -t 16 -v
Search for viruses in unmapped contigs with local blastn
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p blastu -t 16 -v
Pack results to *.tar.gz
perl lazypipe.pl -1 data/samples/M15small_R1.fastq -s M15 -p pack -v
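The individual steps above can also be chained in a small wrapper. The following is a minimal dry-run sketch, not part of Lazypipe: the names `RUN`, `reads`, `sample` and `run_steps` are hypothetical, and it assumes the step groups can be run in the order they appear above. Host filtering (`flt`) is left out because it needs a `--hostgen` argument, and tool options such as `--pre`, `--ass`, `--ann` and `-t` are omitted, so defaults would apply.

```shell
# Dry-run sketch: print one lazypipe.pl call per step group, in the order used above.
# RUN=echo only prints the commands; set RUN="" to execute them
# (assumes lazypipe.pl is in the current directory).
RUN=echo
reads=data/samples/M15small_R1.fastq
sample=M15

run_steps() {
    for step in pre ass,rea ann,rep blastv blastu pack; do
        # Stop at the first failing step so later steps never run on incomplete results.
        $RUN perl lazypipe.pl -1 "$reads" -s "$sample" -p "$step" -v || return 1
    done
}

run_steps
```

Printing the commands first makes it easy to review the sequence before committing compute time to it.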
Snakemake works by declaring the end file you wish to produce.
Start by listing your input fastq files under the datain key in the config.yaml file. Prefix each file with a sample id.
For this example, we will use data/samples/M15small. In your config.yaml type:
datain:
  M15: data/samples/M15small_R1.fastq
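When there are many samples, the datain block can be generated rather than typed. The sketch below is a hypothetical helper, not a Lazypipe tool: `make_datain` is an invented name, and it assumes that forward reads are named `<sample>_R1.fastq` and that the file name stem is an acceptable sample id (so M15small_R1.fastq would be listed as M15small rather than M15).

```shell
# Hypothetical helper: emit a "datain" block for every *_R1.fastq file in a directory.
# Assumption: the sample id is the file name without the _R1.fastq suffix.
make_datain() {
    dir=$1
    echo "datain:"
    for f in "$dir"/*_R1.fastq; do
        [ -e "$f" ] || continue          # skip when the glob matches nothing
        sample=$(basename "$f" _R1.fastq)
        echo "  $sample: $f"
    done
}

# Print the block for the example data directory; paste the output into config.yaml.
make_datain data/samples
```

The output is printed rather than written to config.yaml so that existing settings in that file are not overwritten.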
Run main steps with default options
snakemake --cores 8 results/M15.tar.gz
Run preprocessing with Trimmomatic. Overwrite any trimmed reads produced by previous runs with --force:
snakemake --config pre="trimm" --cores 8 results/M15/trimmed_paired1.fq --force -p
Run assembling with SPAdes. Overwrite any contigs produced by previous runs with --force:
snakemake --config ass="spades" --cores 8 results/M15/contigs.fa --force -p
Redo annotation with minimap2:
snakemake --config ann="minimap" --cores 16 results/M15/M15.tar.gz --force -p
Confirm viral contigs with local blastn:
snakemake --config blastv=1 --cores 16 results/M15/contigs_vi.annot.xlsx -p
Search for viruses in unmapped contigs with local blastn
snakemake --config blastu=1 --cores 16 results/M15/contigs_un.annot.xlsx -p
In this example we will analyze public Illumina HiSeq/MiSeq libraries sequenced from five patients at the early stage of the SARS2 outbreak in Wuhan, China. For more information see NCBI BioProject PRJNA605983. This example is written for the CSC Puhti environment; other Unix environments follow similar steps.
We will use the NCBI SRA Toolkit to download the NGS libraries. SRA Toolkit is available on CSC as part of the biokit module. Other users can install SRA Toolkit from the NCBI website.
Start by configuring SRA Toolkit with the vdb-config utility (included in the kit). Set the SRA Toolkit download directory to $data/sra/ or any other convenient location:
module load biokit
vdb-config -i
Download any SRA library for project PRJNA605983 (NCBI accession numbers SRR11092056-SRR11092064). In the following example code we will use SRR11092062, sequenced from sample WIV04-2. After downloading, dump the fastq files to $data/sra/reads or any other convenient location (set the $data environment variable or substitute your own location):
module load biokit
prefetch SRR11092062
mkdir $data/sra/reads/
fasterq-dump --split-files --outdir $data/sra/reads/ SRR11092062
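To fetch every run in the accession range given above rather than a single library, the same two commands can be wrapped in a loop. This is a minimal dry-run sketch: `list_accessions` and `outdir` are hypothetical names, and the fallback to the current directory when `$data` is unset is an assumption made for illustration.

```shell
# Print the nine run accessions SRR11092056..SRR11092064 listed above.
list_accessions() {
    n=11092056
    while [ "$n" -le 11092064 ]; do
        echo "SRR$n"
        n=$((n + 1))
    done
}

# Dry run: print the download commands. Remove "echo" to execute them.
outdir="${data:-.}/sra/reads"   # assumption: fall back to the current directory if $data is unset
for acc in $(list_accessions); do
    echo prefetch "$acc"
    echo fasterq-dump --split-files --outdir "$outdir" "$acc"
done
```

Each library then needs its own entry in config.yaml before it can be analyzed.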
Set up your config.yaml:
datain:
  sars2_wiv04_2: "$data/sra/reads/SRR11092062_1.fastq"
hostgen_sm:
  sars2_wiv04_2: "$data/hostgen/GCA_000001405.15_GRCh38_genomic.fna.gz"
hostgen_taxid_sm:
  sars2_wiv04_2: 9606
If you are using Slurm, create an sbatch execution file, example2.bash:
#!/bin/bash
#SBATCH --job-name=sars2_wiv04_2
#SBATCH --account=my_project
#SBATCH --time=06:00:00
#SBATCH --mem-per-cpu=4G
#SBATCH --cpus-per-task=32
#SBATCH --partition=small
conda activate lazypipe
snakemake -p --cores $SLURM_CPUS_PER_TASK results/sars2_wiv04_2.tar.gz
Execute by calling sbatch:
sbatch example2.bash
The following table displays virus abundances reported by Lazypipe for the SARS-positive libraries (PRJNA605983).
SRA run | SRA experiment | Platform | Library | Virus | Taxid | readn | readn% | csumq | contign |
---|---|---|---|---|---|---|---|---|---|
SRR11092063 | SRX7730880 | RNA-Seq Illumina HiSeq 3000 | WIV02-2 | Severe acute respiratory syndrome-related coronavirus | 694009 | 559 | 0.3685% | 1 | 23 |
SRR11092057 | SRX7730886 | RNA-Seq Illumina MiSeq | WIV04 | Severe acute respiratory syndrome-related coronavirus | 694009 | 732 | 13.0878% | 1 | 15 |
SRR11092062 | SRX7730881 | RNA-Seq Illumina HiSeq 1000 | WIV04-2 | Severe acute respiratory syndrome-related coronavirus | 694009 | 5918 | 3.0027% | 1 | 1 |
SRR11092062 | SRX7730881 | RNA-Seq Illumina HiSeq 1000 | WIV04-2 | Influenza A virus | 11320 | 274 | 0.1390% | 1 | 2 |
SRR11092062 | SRX7730881 | RNA-Seq Illumina HiSeq 1000 | WIV04-2 | Autographa californica multiple nucleopolyhedrovirus | 307456 | 205 | 0.1040% | 1 | 2 |
SRR11092061 | SRX7730882 | RNA-Seq Illumina HiSeq 3000 | WIV05 | Severe acute respiratory syndrome-related coronavirus | 694009 | 234 | 0.0510% | 1 | 20 |
SRR11092061 | SRX7730882 | RNA-Seq Illumina HiSeq 3000 | WIV05 | Saccharomyces 20S RNA narnavirus | 186772 | 135 | 0.0294% | 2 | 1 |
SRR11092060 | SRX7730883 | RNA-Seq Illumina HiSeq 3000 | WIV06-2 | Severe acute respiratory syndrome-related coronavirus | 694009 | 525 | 0.1417% | 1 | 22 |
SRR11092060 | SRX7730883 | RNA-Seq Illumina HiSeq 3000 | WIV06-2 | Spodoptera frugiperda rhabdovirus | 1481139 | 165 | 0.0445% | 1 | 1 |
SRR11092060 | SRX7730883 | RNA-Seq Illumina HiSeq 3000 | WIV06-2 | Saccharomyces 20S RNA narnavirus | 186772 | 103 | 0.0278% | 2 | 3 |
SRR11092059 | SRX7730884 | RNA-Seq Illumina HiSeq 3000 | WIV07-2 | Influenza A virus | 11320 | 9063 | 0.0974% | 1 | 4 |
SRR11092059 | SRX7730884 | RNA-Seq Illumina HiSeq 3000 | WIV07-2 | Saccharomyces 20S RNA narnavirus | 186772 | 3386 | 0.0364% | 1 | 1 |
SRR11092059 | SRX7730884 | RNA-Seq Illumina HiSeq 3000 | WIV07-2 | Severe acute respiratory syndrome-related coronavirus | 694009 | 819 | 0.0088% | 2 | 16 |
SRR11092059 | SRX7730884 | RNA-Seq Illumina HiSeq 3000 | WIV07-2 | Bamboo mosaic virus | 35286 | 325 | 0.0035% | 2 | 1 |
SRR11092059 | SRX7730884 | RNA-Seq Illumina HiSeq 3000 | WIV07-2 | Spodoptera frugiperda rhabdovirus | 1481139 | 168 | 0.0018% | 2 | 1 |
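To pull the rows for a single taxon out of a pipe-delimited table like the one above, a short awk filter is enough. The helper below is a hypothetical sketch, not a Lazypipe tool: `rows_for_taxid` is an invented name, and it assumes the column layout shown above, with Taxid in column 6 and readn in column 7.

```shell
# Hypothetical helper: print "SRA run <tab> readn" for rows whose Taxid column matches.
# Assumes pipe-delimited rows laid out as in the table above (Taxid = column 6, readn = column 7).
rows_for_taxid() {
    awk -F' *[|] *' -v taxid="$1" '$6 == taxid { print $1 "\t" $7 }'
}

# Three rows copied from the table above, used as sample input.
rows_for_taxid 694009 <<'EOF'
SRR11092062 | SRX7730881 | RNA-Seq Illumina HiSeq 1000 | WIV04-2 | Severe acute respiratory syndrome-related coronavirus | 694009 | 5918 | 3.0027% | 1 | 1 |
SRR11092062 | SRX7730881 | RNA-Seq Illumina HiSeq 1000 | WIV04-2 | Influenza A virus | 11320 | 274 | 0.1390% | 1 | 2 |
SRR11092061 | SRX7730882 | RNA-Seq Illumina HiSeq 3000 | WIV05 | Severe acute respiratory syndrome-related coronavirus | 694009 | 234 | 0.0510% | 1 | 20 |
EOF
```

Here taxid 694009 selects the two coronavirus rows and skips the Influenza A row.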