2
$\begingroup$

I'm working with targeted Illumina sequencing data generated with DNA from diseased and healthy tissue (this is an age-related disease and is not cancer/neoplastic). My hypothesis is that the diseased tissue might contain specific somatic mutations in one particular gene (let's call it GENE), while the healthy tissue will not. The reasoning behind this: The same mutations cause autosomal-dominant versions of my disease of interest when present in the germline (my diseased tissue is from sporadic disease). My coverage varies, but is in the range of 1000x - 50'000x. My sequencing protocol employed UMIs.

I have run Mutect2 in default settings and with lower thresholds (--tumor-lod-to-emit) and haven't found any calls that correspond to these pathogenic mutations, so my results are negative. Nevertheless, I have reason to believe that my pathogenic mutations might be present at very low VAF (around 0.05-0.1 %), based on other experiments.

My question: How do I quantify the number of reads that correspond to a particular variant of interest? Let's say that base 2000 in GENE is usually an A, whereas the mutation A>T would be pathogenic. How do I count the number of Ts in position 2000?

As I'm somewhat new to bioinformatics, I'm also very open to any other ideas. Thanks!


Edit: @terdon's approach worked perfectly in cases where I used a VCF for --alleles that had been generated as an output of Mutect2, such as the following one:

VCF produced by Mutect:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT chr20 4699605 . A G . . AS_SB_TABLE=881,932|920,847;DP=3791;ECNT=1;MBQ=20,17;MFRL=248,243;MMQ=60,60;MPOS=21;POPAF=7.30;TLOD=2014.57 GT:AD:AF:DP:F1R2:F2R1:FAD:SB 0/1:1813,1767:0.469:3580:559,218:384,191:1720,1679:881,932,920,847 

Curiously, if I used a manually produced file such as the following, no output was produced, unless the variant of interest was heterozygous:

Manually produced VCF:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT chr20 4699605 rs1799990 A G . . . 

Output using the manual VCF, heterozygous sample:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT heterogyzous_sample chr20 4699605 . A G . . AS_SB_TABLE=881,932|920,847;DP=3791;ECNT=1;MBQ=20,17;MFRL=248,243;MMQ=60,60;MPOS=21;POPAF=7.30;TLOD=2014.57 GT:AD:AF:DP:F1R2:F2R1:FAD:SB 0/1:1813,1767:0.469:3580:559,218:384,191:1720,1679:881,932,920,847 

Output using the manual VCF, non-heterozygous sample:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT novariant_sample 

Manual inspection of this non-heterozygous sample revealed that it contained 1 read with the ALT variant, so this doesn't seem to depend on the variant being absent.

$\endgroup$

1 Answer 1

4
$\begingroup$

I have a little function that can report this by using samtools to look at the read pileup. Um. It isn't the most elegant piece of code, but it does sorta do the job:

variantSupport () { genome=$1 bam=$2; range=$3; samtools mpileup -f "$genome" -r "$range" "$bam" | cut -f 5 | tr '[a-z]' '[A-Z]' | fold -w 1 | sort | uniq -c | awk 'BEGIN{tot=1}$2~/,/{ $2="." } { tot+=$1; c[$2]+=$1 } $2~/[a-zA-Z]/{ varReads+=$1; } END{ for(i in c){ print i,c[i] } printf "%s of %s reads (%.1f%%), support the variant.\n",varReads,tot,varReads*100/tot }' } 

Add the function to your ~/.bashrc or just paste it directly into a terminal and you can then do:

$ variantSupport /path/to/genome/hg19.fa foo.bam chr13:28607916-28607916 [mpileup] 1 samples in 1 input files A 915 * 1 . 93 915 of 1010 reads (90.6%), support the variant. 

This is telling me that at position 28607916 of chr13, I have 1 read supporting a deletion (*, see https://www.htslib.org/doc/samtools-mpileup.html for an explanation of the symbols), 93 reads supporting the reference residue (T, in this case) and 915 supporting an A, so a T=> A variant.

However, a better approach is to use a variant caller and force it to call the variant you want. You can then investigate the support for it. With mutect2, you can use the --alleles option to force the calling and then use the -L option to limit calling to only the specific region. If you give the same vcf file for both options, that will return results for the variants you want and only those. For example, given this file foo.vcf.gz (which needs to be compressed using bgzip and indexed using tabix):

$ zcat 1.vcf.gz ##fileformat=VCFv4.3 ##reference=/genomes/hg19/hg19.fa ##FORMAT=<ID=AB,Number=A,Type=Float,Description="Allelic Balance"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Net Genotype quality across all datasets, calculated from GQ scores of callsets supporting the consensus GT, using only one callset from each dataset"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##contig=<ID=chr13,assembly=hg19,length=115169878> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG003 chr13 28607917 . G C . . . GT:AB:DP:GQ 1/1:0.6:395:318 

You can then do:

gatk Mutect2 \ -R /genomes/hg19/hg19.fa \ --alleles 1.vcf.gz \ -L 1.vcf.gz \ -I foo.bam \ -O out.vcf 

This will produce an out.vcf like this (I am excluding most headers for clarity):

 ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 331970 chr13 28607917 . G C . . DP=1107;ECNT=1;MBQ=30,0;MFRL=353,0;MMQ=60,60;MPOS=50;POPAF=7.30;TLOD=-2.716e+00 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:1043,0:9.616e-04:1043:518,0:496,0:989,54,0,0 

I can see this is not a real variant since the value for AD, the number of reads supporting the reference and the variant is 1043,0, meaning that 1043 reads support the reference and none support the variant.

This way, you can get results for your variants of interest and see if there is any evidence supporting them.

$\endgroup$
7
  • $\begingroup$ Thanks! Your answer does what I needed. However, I have noticed that the Allele Depth numbers in the VCF (out.vcf, in your code) are lower than expected. For instance, I got REF=105 and ALT=104, whereas the numbers are REF = 436 and ALT = 380 if I extract this information with bcftools and set -d to a high number. I need the total read numbers, so how can I edit your command to achieve this? $\endgroup$ Commented Mar 4, 2024 at 14:47
  • 1
    $\begingroup$ @Nereus I don't think you need the total, actually. The ones that are not counted by my function are the the ones samtools skips by default (see htslib.org/doc/samtools-mpileup.html; for example, it won't count bases with a fastq quality under 13), and the ones not included in the AD value are those the variant caller filtered out. You can't just use all reads, you must only take into account the ones that pass some basic QC filtering otherwise, you just get nonsense. If you see such huge differences, I would investigate the quality (both mapping and fastq) in those regions. $\endgroup$ Commented Mar 4, 2024 at 14:53
  • $\begingroup$ may I ask how you produced the 1.vcf.gz file in your example? $\endgroup$ Commented Mar 6, 2024 at 14:59
  • 1
    $\begingroup$ @Nereus from one I had. I just looked at a bam file I had lying around and made the vcf point to a variant in that file and ran the command I showed you. $\endgroup$ Commented Mar 6, 2024 at 15:02
  • $\begingroup$ I see. Would it also be possible to create a custom VCF if I haven't got one available, e.g. with data from Ensembl? Thank you for your time! $\endgroup$ Commented Mar 6, 2024 at 16:50

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.