SARS-CoV-2 variant calling

SARS-CoV-2 variant calling tutorial

Tutorial URL: tobiasrausch.com/courses/sars/

Code repository: github.com/tobiasrausch/covid19

Step 1: Alignment

In this tutorial we will analyze a positive SARS-CoV-2 sample that has been sequenced using an amplicon-based sequencing protocol. We will align the FASTQ data to the SARS-CoV-2 reference genome, call variants and determine the viral lineage (i.e. Alpha, Beta, Gamma, Delta or Omicron). All the required tools and data sets have been pre-installed in a conda environment. Please open a terminal and load the conda environment.


	export PATH=~/Desktop/covid19/conda/bin:${PATH}
	source activate covid19

Next let's create a working directory for all the output files.


	mkdir ~/Desktop/covid19/tmp
	cd ~/Desktop/covid19/tmp

The FASTQ files are in the data subdirectory and the SARS-CoV-2 reference is in the ref subdirectory.


	ls ../data/
	ls ../ref/NC_045512.2.fa
	head ../ref/NC_045512.2.fa

We can use bwa to align the sequencing data to the SARS-CoV-2 reference genome.


	bwa mem ../ref/NC_045512.2.fa ../data/Plate42B2/Plate42B2.R1.fastq.gz ../data/Plate42B2/Plate42B2.R2.fastq.gz > align.sam
	head align.sam

The alignment is encoded in SAM format. For mammalian-sized genomes, alignments are usually stored in a compressed binary format, called BAM, that can be indexed for random access so that, for instance, alignment viewers such as IGV can quickly visualize selected alignment regions. SAMtools is a suite of tools that can be used to work with alignment files.


	samtools sort -o align.bam align.sam
	samtools index align.bam

Once you have an alignment file it is always a good idea to look at the raw data in an alignment viewer such as IGV.


	../IGV_Linux_2.12.3/igv.sh -g ../ref/NC_045512.2.fa

Step 2: Variant Calling

Next, we want to transform the alignment into a list of variants. Hence, we are trying to identify positions, where the sequenced sample deviates from the SARS-CoV-2 reference genome. Variants are stored in VCF format.


	bcftools mpileup -f ../ref/NC_045512.2.fa align.bam | bcftools call -mv -Ob -o calls.bcf
	bcftools view calls.bcf
	bcftools view calls.bcf | grep -A 2 -m 1 "^#CHROM"

Step 3: Variant Annotation

Tools such as the variant effect predictor (VEP) can be used to determine the consequences of genomic variants. Please upload the below list of variants that you called from the alignment to the VEP online web application.


	bcftools view calls.bcf | grep -v "^#" | cut -f 1-5

Based on the VEP output, we will now try to manually identify the viral lineage (Alpha, Beta, Gamma, Delta, Omicron, etc.) using the predicted amino acid changes in the spike protein (S gene). A nice overview table of the different amino acid changes occurring in the different lineages is available on the outbreak.info website.

Step 4: Lineage classification

After years of the SARS-CoV-2 pandemic, we have of course tools to automatically determine the viral lineage and these usually require a so-called consensus sequence of the alignment. Tools such as iVar can generate such a viral consensus genome.


	samtools mpileup -aa -A -d 0 -B -Q 0 align.bam | ivar consensus -t 0.9 -m 20 -n N -p cons

This consensus sequence, we can use with pangolin or Nextclade to determine the lineage.


	source activate pangolin
	pangolin --outfile lineage.csv cons.fa

Nextclade has an online application that we can also use to upload the cons.fa file. Nextclade also integrates our sequence into a phylogenetic tree of publicly available SARS-CoV-2 sequences.