In this tutorial we will analyze a positive SARS-CoV-2 sample that has been sequenced using an amplicon-based sequencing protocol. We will align the FASTQ data to the SARS-CoV-2 reference genome, call variants and determine the viral lineage (i.e. Alpha, Beta, Gamma, Delta or Omicron). All the required tools and data sets have been pre-installed in a conda environment. Please open a terminal and load the conda environment.
export PATH=~/Desktop/covid19/conda/bin:${PATH}
source activate covid19
Next let's create a working directory for all the output files.
mkdir ~/Desktop/covid19/tmp
cd ~/Desktop/covid19/tmp
The FASTQ files are in the data subdirectory and the SARS-CoV-2 reference is in the ref subdirectory.
ls ../data/
ls ../ref/NC_045512.2.fa
head ../ref/NC_045512.2.fa
We can use bwa to align the sequencing data to the SARS-CoV-2 reference genome.
bwa mem ../ref/NC_045512.2.fa ../data/Plate42B2/Plate42B2.R1.fastq.gz ../data/Plate42B2/Plate42B2.R2.fastq.gz > align.sam
head align.sam
The alignment is encoded in SAM format. For mammalian-sized genomes, alignments are usually stored in a compressed binary format, called BAM, that can be indexed for random access so that, for instance, alignment viewers such as IGV can quickly visualize selected alignment regions. SAMtools is a suite of tools that can be used to work with alignment files.
samtools sort -o align.bam align.sam
samtools index align.bam
Once you have an alignment file it is always a good idea to look at the raw data in an alignment viewer such as IGV.
../IGV_Linux_2.12.3/igv.sh -g ../ref/NC_045512.2.fa
bcftools mpileup -f ../ref/NC_045512.2.fa align.bam | bcftools call -mv -Ob -o calls.bcf
bcftools view calls.bcf
bcftools view calls.bcf | grep -A 2 -m 1 "^#CHROM"
bcftools view calls.bcf | grep -v "^#" | cut -f 1-5
Based on the VEP output, we will now try to manually identify the viral lineage (Alpha, Beta, Gamma, Delta, Omicron, etc.) using the predicted amino acid changes in the spike protein (S gene). A nice overview table of the different amino acid changes occurring in the different lineages is available on the outbreak.info website.
samtools mpileup -aa -A -d 0 -B -Q 0 align.bam | ivar consensus -t 0.9 -m 20 -n N -p cons
This consensus sequence, we can use with pangolin or Nextclade to determine the lineage.
source activate pangolin
pangolin --outfile lineage.csv cons.fa
Nextclade has an online application that we can also use to upload the cons.fa file. Nextclade also integrates our sequence into a phylogenetic tree of publicly available SARS-CoV-2 sequences.