I have some .vcf files and I want to filter some variants out. This is just a small part of my file: there are some header lines at the beginning of the file (starting with ##) and then the variants (one row per variant).
    ##fileformat=VCFv4.2
    ##source=combiSV-v2.2
    ##fileDate=Mon May 8 11:32:53 2023
    ##contig=<ID=chrM,length=16571>
    ##contig=<ID=chr1,length=249250621>
    ##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
    ##INFO=<ID=SVCALLERS,Number=.,Type=String,Description="SV callers that support this SV">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=DR,Number=1,Type=Integer,Description="# High-quality reference reads">
    ##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# High-quality variant reads">
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample
    1 10862 id.1 N <INS> . PASS SVTYPE=INS;SVLEN=101;END=10862;SVCALLERS=cutesv,SVIM GT:DR:DV 1/1:0:26
    1 90258 id.2 N <INS> . PASS SVTYPE=INS;SVLEN=118;END=90258;SVCALLERS=SVIM,NanoSV GT:DR:DV 1/1:0:9
    1 90259 id.3 N <INS> . PASS SVTYPE=INS;SVLEN=36;END=90259;SVCALLERS=Sniffles GT:DR:DV 0/1:44:7
    1 185824 id.4 N <DEL> . PASS SVTYPE=DEL;SVLEN=80;END=186660;SVCALLERS=Sniffles,cutesv GT:DR:DV 1/1:0:15
    1 186241 id.5 N <DEL> . PASS SVTYPE=DEL;SVLEN=418;END=186662;SVCALLERS=SVIM,NanoSV GT:DR:DV 1/1:2:12
    1 526111 id.6 N <DEL> . PASS SVTYPE=DEL;SVLEN=624;END=526735;SVCALLERS=Sniffles,cutesv GT:DR:DV 0/1:8
    2 91926078 id.3958 N <BND> . PASS SVTYPE=BND;SVLEN=.;END=;SVCALLERS=Sniffles,NanoSV GT:DR:DV 0/1:60:15

While keeping the header lines, I want to remove rows with SVLEN < 100 and rows that list only one caller in SVCALLERS. A row has to pass both checks to be kept; in other words, I want to keep only rows whose SVLEN is not below 100 and that are supported by at least two SVCALLERS.
In addition, there are some rows where ALT is <BND>, and the file does not provide an SVLEN for this type of variant (it appears as SVLEN=.). For these rows only the caller criterion should apply: I want to keep a BND row if it is supported by at least two callers.
Examples: I want to drop this variant because SVLEN is less than 100 and only one caller (SVCALLERS) detected it:

    1 90259 id.3 N <INS> . PASS SVTYPE=INS;SVLEN=36;END=90259;SVCALLERS=Sniffles GT:DR:DV 0/1:44:7

And this row as well: it has two callers, but SVLEN is less than 100:

    1 185824 id.4 N <DEL> . PASS SVTYPE=DEL;SVLEN=80;END=186660;SVCALLERS=Sniffles,cutesv GT:DR:DV 1/1:0:15

Is there an easy way to do this? Thanks.
My final file should look like this:
    ##fileformat=VCFv4.2
    ##source=combiSV-v2.2
    ##fileDate=Mon May 8 11:32:53 2023
    ##contig=<ID=chrM,length=16571>
    ##contig=<ID=chr1,length=249250621>
    ##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
    ##INFO=<ID=SVCALLERS,Number=.,Type=String,Description="SV callers that support this SV">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=DR,Number=1,Type=Integer,Description="# High-quality reference reads">
    ##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# High-quality variant reads">
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample
    1 10862 id.1 N <INS> . PASS SVTYPE=INS;SVLEN=101;END=10862;SVCALLERS=cutesv,SVIM GT:DR:DV 1/1:0:26
    1 90258 id.2 N <INS> . PASS SVTYPE=INS;SVLEN=118;END=90258;SVCALLERS=SVIM,NanoSV GT:DR:DV 1/1:0:9
    1 186241 id.5 N <DEL> . PASS SVTYPE=DEL;SVLEN=418;END=186662;SVCALLERS=SVIM,NanoSV GT:DR:DV 1/1:2:12
    1 526111 id.6 N <DEL> . PASS SVTYPE=DEL;SVLEN=624;END=526735;SVCALLERS=Sniffles,cutesv GT:DR:DV 0/1:8
    2 91926078 id.3958 N <BND> . PASS SVTYPE=BND;SVLEN=.;END=;SVCALLERS=Sniffles,NanoSV GT:DR:DV 0/1:60:15
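For what it's worth, here is a minimal sketch in Python of the filter described above; it is one possible approach, not a definitive solution. It rests on several assumptions that are not stated in the question: the columns are separated by whitespace (tabs or spaces), INFO is the 8th column, a variant with SVLEN exactly 100 is kept, and the absolute value of SVLEN is used in case a caller reports deletions as negative lengths. The names `keep_record`, `filter_vcf`, `filter_vcf_file`, `MIN_SVLEN` and `MIN_CALLERS` are made up for this illustration.

```python
MIN_SVLEN = 100    # assumption: a variant with SVLEN exactly 100 is kept
MIN_CALLERS = 2    # at least two entries required in SVCALLERS


def keep_record(line):
    """Return True if one VCF data line passes the caller and size checks."""
    fields = line.split()
    if len(fields) < 8:
        return False  # assumption: incomplete or blank lines are dropped
    # INFO (8th column) holds key=value pairs separated by ';'
    info = dict(
        item.split("=", 1) if "=" in item else (item, "")
        for item in fields[7].split(";")
    )
    callers = [c for c in info.get("SVCALLERS", "").split(",") if c]
    if len(callers) < MIN_CALLERS:
        return False
    # Breakends have no usable SVLEN, so only the caller count applies
    if info.get("SVTYPE") == "BND" or "BND" in fields[4]:
        return True
    try:
        # assumption: abs() so that negative deletion lengths also work
        svlen = abs(int(info.get("SVLEN", "")))
    except ValueError:
        return False  # missing or non-numeric SVLEN on a non-BND record
    return svlen >= MIN_SVLEN


def filter_vcf(lines):
    """Yield every header line unchanged and only the records that pass."""
    for line in lines:
        if line.startswith("#") or keep_record(line):
            yield line


def filter_vcf_file(src_path, dst_path):
    """Stream src_path through the filter and write the result to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(filter_vcf(src))
```

With placeholder file names, `filter_vcf_file("input.vcf", "filtered.vcf")` would write the filtered copy. Because the file is streamed line by line, memory use should stay small even for large VCFs. If exactly 100 should be excluded, changing `>=` to `>` in `keep_record` is the only edit needed.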
awk: Do you have an existing approach that you've already worked on and are stuck on?