Test ClusTRace by running all steps on sample data
perl clustrace.pl --fasta data/samples/delta-s1.fasta --res results/delta-s1 --pipe all -t 16 -v
Assign lineage to fasta sequences with Pangolin
perl clustrace.pl --fasta my/data/january.fasta --res results/January --pipe pangolin
Collect consensus sequences into multi-fasta files by assigned lineage (the --res dir must contain the lineage_report.csv generated in the previous step). Restrict the analysis to the Alpha and Beta variants of concern (VOC):
perl clustrace.pl --fasta my/data/january.fasta --res results/January --pipe collect --target B.1.1.7,B.1.351
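Before collecting, it can help to see which lineages were assigned and how many sequences each one has, for example when choosing --target values. The following is a minimal sketch and not part of ClusTRace: it assumes lineage_report.csv is comma-separated, has a header row, holds the lineage in the second column and has no commas inside sequence names. The helper name `count_lineages` and the sample rows are made up for illustration.

```shell
# Hypothetical helper (not part of ClusTRace): count sequences per lineage.
# Assumes CSV input with a header row and the lineage name in column 2.
count_lineages() {
  awk -F',' 'NR > 1 && $2 != "" { n[$2]++ }
             END { for (l in n) print n[l] "\t" l }' "$@" |
    sort -k1,1nr -k2,2
}

# Demo on made-up rows. With real data, something like:
#   count_lineages results/January/lineage_report.csv
printf 'taxon,lineage\nseq1,B.1.1.7\nseq2,B.1.351\nseq3,B.1.1.7\n' | count_lineages
```

The output is one "count, tab, lineage" line per lineage, largest first, which can be compared against the --minseqn threshold.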
Analyse collected multi-fasta: remove outliers + create MSAs + create trees
perl clustrace.pl --res results/January --pipe filter,align,tree -t 16 -v
Control outlier filtering: exclude sequences whose length deviates more than 5% from the median length or that contain more than 10% gaps
perl clustrace.pl --res results/January --minlen 95 --maxlen 105 --maxgap 10 --pipe f,a,t
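The thresholds above can be read as a simple rule: keep a sequence if its length lies between minlen% and maxlen% of the median length and its gap percentage does not exceed maxgap. The sketch below illustrates that rule only; it is not the ClusTRace implementation. It assumes that "N" and "-" characters count as gaps and that length counts every character, both of which are guesses. The helper names `seq_stats` and `filter_ids` and the toy sequences are hypothetical.

```shell
# Print "id<TAB>length<TAB>gap_percent" for every record of a FASTA input.
# Assumption: "N", "n" and "-" are treated as gap characters.
seq_stats() {
  awk '
    function emit(   g, len) {
      if (id == "") return
      len = length(seq)
      g = gsub(/[Nn-]/, "", seq)          # gsub returns the number of gaps removed
      printf "%s\t%d\t%.2f\n", id, len, (len > 0 ? 100 * g / len : 0)
    }
    /^>/ { emit(); id = substr($1, 2); seq = ""; next }
    { seq = seq $0 }
    END { emit() }
  ' "$@"
}

# Read seq_stats output on stdin and print the ids that pass the filter.
# Usage: filter_ids MINLEN MAXLEN MAXGAP
filter_ids() {
  sort -t "$(printf '\t')" -k2,2n |
    awk -F'\t' -v lo="$1" -v hi="$2" -v mg="$3" '
      { id[NR] = $1; len[NR] = $2; gap[NR] = $3 }
      END {
        if (NR == 0) exit
        # Input is sorted by length, so the median sits in the middle.
        med = (NR % 2) ? len[(NR + 1) / 2] : (len[NR / 2] + len[NR / 2 + 1]) / 2
        for (i = 1; i <= NR; i++)
          if (len[i] >= med * lo / 100 && len[i] <= med * hi / 100 && gap[i] <= mg)
            print id[i]
      }'
}

# Demo on made-up sequences: "c" is too short, "d" has too many gaps.
printf '>a\nACGTACGTAC\n>b\nACGTACGTAC\n>c\nACGT\n>d\nACNNNNGTAC\n' |
  seq_stats | filter_ids 95 105 10
```

For the statistics and filters that ClusTRace actually applied, see the lineage.fa.stats file written by the filter step.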
Extract clusters for the Alpha variant with TreeCluster. Clusters are extracted with the max-clade method at different maximum mutation rates. The pipeline also creates a summary Excel table with cluster statistics and growth rates, and Nexus trees with clusters identified by node color and label.
perl clustrace.pl --res results/January --pipe cl --tperiod week --target B.1.1.7 -v
Create cluster MSA(s), VCF (Variant Call Format) files and VCF summaries. These include both nucleotide and amino acid variants.
perl clustrace.pl --res results/January --pipe vcall --target B.1.1.7 --refvar data/lineage_variants.tab
Create lineage VCF files and summaries
perl clustrace.pl --res results/January --pipe vclineage --refvar data/lineage_variants.tab
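The --refvar file used in the two commands above is described in the options table as one line per lineage: a lineage id, a tab, then `gene: var1,var2` groups separated by semicolons. When preparing a custom file, a quick way to check it is to flatten it into one row per variant. This is a rough sketch under that reading of the format; `flatten_refvar` is a hypothetical helper and the sample line uses placeholder names rather than real mutations.

```shell
# Hypothetical helper: print "lineage<TAB>gene<TAB>variant" for each variant
# in a file laid out as: lineageid <TAB> gene1: v1,v2; gene2: v3
flatten_refvar() {
  awk -F'\t' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    NF >= 2 {
      lineage = trim($1)
      n = split($2, groups, ";")
      for (i = 1; i <= n; i++) {
        p = index(groups[i], ":")          # split each group at its first colon
        if (p == 0) continue               # skip groups without a gene name
        gene = trim(substr(groups[i], 1, p - 1))
        m = split(substr(groups[i], p + 1), vars, ",")
        for (j = 1; j <= m; j++) {
          v = trim(vars[j])
          if (v != "") print lineage "\t" gene "\t" v
        }
      }
    }' "$@"
}

# Demo on a placeholder line. With real data, something like:
#   flatten_refvar data/lineage_variants.tab
printf 'lineageX\tgeneA: v1,v2; geneB: v3\n' | flatten_refvar
```

Lines that produce no output are a hint that the tab or the colons are missing.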
Clean up temporary files and pack results into a tarball
perl clustrace.pl --res results/January --pipe cleen,pack
Option | Value | [Default] | Function |
---|---|---|---|
--fasta | file | | Input multifasta (*.fa or *.fasta) |
--res | dir | results | Output directory |
--log | dir | log | Directory for logging |
--colpan | str | dark2 | Color scheme for coloring clusters: rgb, paired or dark2 (for a preview see https://colorbrewer2.org/) |
--minseqn | int | 10 | Lineage filtering: exclude lineages with seqn < minseqn |
--minlen | int | 90 | Sequence outlier filtering: exclude sequences shorter than median_length*minlen% |
--maxlen | int | 110 | Sequence outlier filtering: exclude sequences longer than median_length*maxlen% |
--maxgap | int | 10 | Sequence outlier filtering: exclude sequences with gaps% > maxgap% |
--tree | str | iqtree | Run iqtree (IQ-Tree2 --mset GTR+F) or vftree (VeryFastTree --gtr -nt) |
--ufboot | | false | Run iqtree with ultrafast bootstrap and create consensus tree (IQ-Tree2 -B 1000 -bnni) |
--trimal_gt | num | 0.9 | trimal -gt threshold. Used to trim MSAs before tree construction |
--tperiod | str | month | Time period for cluster analysis. Accepted values: month or week |
--outgroup | file | data/NC_045512.fa | Fasta with outgroup sequence |
--refgen | file | data/NC_045512.fa | Fasta with reference genome |
--refvar | file | | File with reference lineage variants. Format: lineageid \t gene1: var1,var2,..[,varn]; gene2: var1,var2,..[,varn] \n GISAID characteristic mutations for some SARS-CoV-2 lineages are available in data/lineage_variants.tab |
--pipe | str | all | Comma-separated list of steps to perform, e.g. --pipe p,c,f,a |
p|pangolin | Assign lineages with Pangolin. Lineage report is printed to --res dir | ||
c|collect | Collect sequences for each lineage into multi-fasta | ||
f|filter | Filter lineage multi-fasta | ||
a|align | Create MSAs for each lineage multifasta in --res dir | ||
t|tree | Create ML-trees for each lineage MSA in --res dir. Use --ufboot option to create consensus trees. | ||
cl|clust | Extract clusters at various mutation rates | ||
vc|vcall | Create MSA and VCF files for all clusters. Add VCF variants to the cluster summary Excel file. | ||
vclineage | Create VCF files for all lineage MSAs in --res dir. Create an Excel summary with VCF variants for each lineage. | ||
pack | Pack results into a tarball. Tarball will be created to the root directory of --res dir. | ||
cleen | Clean up space by removing all intermediate and temporary files. | ||
all | Run all steps | ||
--target | str | false | Comma-separated list of target lineages to analyze (e.g. --target B.1.1.7). When omitted, all lineages are analyzed |
--numth | int | 8 | Number of threads |
--short | | true | Truncate sequence names to the first occurrence of "_" |
-v | | false | Run in verbose mode |
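Because --short is on by default, sequence names that differ only after the first "_" could end up identical. The check below is a small sketch, not part of ClusTRace, and assumes that truncation keeps the part of the name before the first underscore and that the name is the first whitespace-delimited word of the header. `short_name_collisions` and the sample headers are hypothetical.

```shell
# Hypothetical check: list truncated names that would occur more than once.
# Assumes truncation keeps everything before the first "_" in the name.
short_name_collisions() {
  awk '/^>/ { name = substr($1, 2); sub(/_.*/, "", name); n[name]++ }
       END { for (k in n) if (n[k] > 1) print k }' "$@" | sort
}

# Demo on made-up headers; the two "s1_..." records collide after truncation.
printf '>s1_2021-01-05\nACGT\n>s1_2021-01-12\nACGT\n>s2_2021-01-05\nACGT\n' |
  short_name_collisions
```

An empty result means the truncated names stay unique under these assumptions.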
Results will be printed to --res dir.
File | --pipe step | Description |
---|---|---|
lineage.fa | collect | Multi-fasta with sequences for each lineage. After running --pipe filter outliers are excluded from these files. |
lineage.fa.flt | filter | Multi-fasta with filtered (i.e. excluded) sequences for each lineage |
lineage.fa.stats | filter | Statistics (length, gap content, ..) and applied filters for sequences in each lineage (tab-delimited format). |
lineage.msa | align | MSA for each analysed lineage/multifasta |
lineage.ml.tree | tree | Maximum likelihood tree for each analysed lineage/multifasta, newick format. |
lineage.con.tree | tree | Bootstrap consensus tree for each analysed lineage/multifasta, newick format. Required options: --tree iqtree --ufboot |
lineage.mr=x.nex | clust | Clusters for consensus (or ml) tree at mutation rate X highlighted in different colors, nexus tree file |
lineage.cl.xlsx | clust | Clusters for consensus (or ml) tree at different mutation rates, Excel table |
lineage.cluster_summary.xlsx | clust | Cluster summary for each lineage. Includes data sheets clustSeqN, clustSeqID, clustGR_MR=X, clustMutations_MR=X and clustMutationTable_MR=X. |
sheet: clustSeqN | clust | Reports the number of sequences in each cluster for each time period |
sheet: clustSeqID | clust | Reports sequence ids assigned to each cluster at each time period |
sheet: clustGR_MR=X | clust | Reports cluster growth rates and support values |
sheet: clustMutations_MR=X | vcall | Reports nt, aa, reference aa and non-reference aa mutations for each cluster. Reporting non-reference aa mutations requires option --refvar file. |
sheet: clustMutationTable_MR=X | vcall | Reports aa mutations for the 10 fastest growing clusters in a binary matrix. Top row lists aa mutations in genomic order with non-reference mutations highlighted in bold. |
LEGEND: "period", date period (data from the first date to this date), "mr", mutation rate, "cluster", cluster id, "seqn", number of sequences assigned to this cluster, "subclustern", number of subclusters for this cluster, "support", bootstrap support | ||
lineage.vcf | vclineage | Variant Call Format file (VCF) with nt and aa variants for each analysed lineage |
lineageSummary.xlsx | vclineage | Variant summary for analysed lineages. Includes data sheet lineageMutations. |
sheet: lineageMutations | vclineage | Reports nt, aa, reference aa and non-reference aa mutations for each lineage. Reporting non-reference aa mutations requires option --refvar file. |
In-house data can be exported from an Excel file and displayed in any Nexus or Newick tree using a simple procedure:
More information at https://bitbucket.org/plyusnin/clustrace/src/master/