How to get all columns based on two columns using awk?

Question

I have a file like this:

sample   chr   start      end        ref   alt  gene    effect
AADA-01  chr1  12336579   12336579   C     T    VPS13D  Silent
AADA-02  chr1  20009838   20009838   -     CCA  TMCO4   Missense
AADA-03  chr1  76397825   76397825   GTCA  T    ASB17   Missense
AADA-03  chr1  94548954   94548954   C     A    ABCA4   Missense
AADA-04  chr1  176762782  176762782  TCG   C    PAPPA2  Missense
AADA-04  chr1  183942764  183942764  -     T    COLGAL  Missense
AADA-05  chr1  186076063  186076063  A     TGC  HMCN1   Silent
AADA-05  chr1  186076063  186076063  A     T    HM1     Silent

I need all the lines where the 5th and 6th columns contain only one character.

And the result should look like:

sample   chr   start      end        ref  alt  gene    effect
AADA-01  chr1  12336579   12336579   C    T    VPS13D  Silent
AADA-03  chr1  94548954   94548954   C    A    ABCA4   Missense
AADA-05  chr1  186076063  186076063  A    T    HM1     Silent

I tried this:

awk -F'\t' '$5' filename | awk -F'\t' '$6' filename | wc -l

I know this is wrong. Can anyone correct my mistake, please?

You will have to explain a bit more about how the expected output is derived from the given input.
– Kusalananda ♦, Commented Apr 24, 2017 at 8:05
I think you may want to take the values of 5th and 6th columns with this awk syntax: awk 'OFS="\t" {print $5,$6}' dataframe_file, but you haven't provided enough information about how you want to filter based on these fields.
– Zumo de Vidrio, Commented Apr 24, 2017 at 8:09
We can't give you the right command if you don't tell us what you are trying to do. What is the relationship between your input and output? How do you get from the input to that output? You seem to just be printing 3 random lines. Please edit your question and explain your objective.
– terdon ♦, Commented Apr 24, 2017 at 8:11
It is clear that you want to pick out a unique line for each sample, but I can't see what criteria you use to pick the one you do.
– Kusalananda ♦, Commented Apr 24, 2017 at 8:11
And do you also want single nucleotide deletions? What if the ref is T and the alt is -, for example? Both T and - are single characters, should that line be printed?
– terdon ♦, Commented Apr 24, 2017 at 8:38

Valentin B. · Accepted Answer · 2017-04-24 08:29:25Z

awk 'NR==1{print; next} $5 ~ /^[A-Z]$/ && $6 ~ /^[A-Z]$/' input.txt

Explanation

NR==1{print; next}

This prints the first line (header) unconditionally and goes to the next line.

$5 ~ /^[A-Z]$/ && $6 ~ /^[A-Z]$/

This is a conditional expression: if the 5th and the 6th fields each consist of a single upper case letter, the line is printed (the print command is implicit here, because printing is the default action for a condition that has no action block).

$5 and $6 stand for the 5th and 6th fields (columns) of the current line.

&& is the logical operator AND.

~ is the regexp matching operator. It returns true if the argument on the left-hand side matches the regexp on the right-hand side.

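/^[A-Z]$/ is a regular expression (regexp). The character "/" is a delimiter for the regexp, "^" anchors the match to the beginning of the string (here, the field), "$" anchors it to the end, and "[A-Z]" matches any single upper case letter from A to Z. Taken together, the regexp matches a field that consists of exactly one upper case letter.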
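
The command can be tried end to end on a few rows from the question. This is a minimal sketch, assuming tab-separated input; the temporary file stands in for input.txt, and the variable names are only illustrative.

```shell
# Recreate the header and four rows from the question, tab-separated.
# printf reuses the format string for every group of eight arguments.
tmp=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  sample chr start end ref alt gene effect \
  AADA-01 chr1 12336579 12336579 C T VPS13D Silent \
  AADA-02 chr1 20009838 20009838 - CCA TMCO4 Missense \
  AADA-03 chr1 76397825 76397825 GTCA T ASB17 Missense \
  AADA-03 chr1 94548954 94548954 C A ABCA4 Missense > "$tmp"

# Print the header, then only rows whose 5th and 6th fields are one upper case letter.
result=$(awk 'NR==1{print; next} $5 ~ /^[A-Z]$/ && $6 ~ /^[A-Z]$/' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

With these rows, the header and the VPS13D and ABCA4 lines pass the filter; the TMCO4 and ASB17 lines are dropped because one of the two fields is not a single letter.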

Kusalananda · Accepted Answer · 2017-04-24 08:35:18Z

awk '$5 ~ /^[ACGT]$/ && $6 ~ /^[ACGT]$/ || NR == 1' data.in

This will, for the given data, generate

sample   chr   start      end        ref  alt  gene    effect
AADA-01  chr1  12336579   12336579   C    T    VPS13D  Silent
AADA-03  chr1  94548954   94548954   C    A    ABCA4   Missense
AADA-05  chr1  186076063  186076063  A    T    HM1     Silent

The awk script tests whether columns 5 and 6 are each one of the single characters A, C, G or T, or whether the current line is the first line of the file. If either test succeeds, it prints that line.

The test $5 ~ /^[ACGT]$/ means "see if column five matches the regular expression ^[ACGT]$". The regular expression matches a field that consists of exactly one character from the given set ([ACGT]).

^ and $ are "anchors": they match only at the very start and the very end (respectively) of the given data (column five and column six).

&& and || are the logical AND and OR operators.

NR is the ordinal number of the current input line. If NR == 1 then the current line is the header line in the file. Since the header line doesn't fulfill the criteria to be printed (ref and alt are not single letters and so wouldn't match the regular expression), this separate test is needed to make sure the header appears in the output.
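
One of the comments on the question asks whether a row with "-" in ref or alt should count, since "-" is also a single character. If it should, a length test is one possible alternative to the character class. The following is a sketch under that assumption, not part of the answer above; the rows are taken from the question, the input is assumed to be tab-separated, and the variable names strict and loose are only illustrative.

```shell
# Header plus three rows from the question, tab-separated.
tmp=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  sample chr start end ref alt gene effect \
  AADA-03 chr1 94548954 94548954 C A ABCA4 Missense \
  AADA-04 chr1 176762782 176762782 TCG C PAPPA2 Missense \
  AADA-04 chr1 183942764 183942764 - T COLGAL Missense > "$tmp"

# Character-class test from the answer: "-" is not in [ACGT], so COLGAL is dropped.
# && binds more tightly than ||, so no parentheses are needed here.
strict=$(awk '$5 ~ /^[ACGT]$/ && $6 ~ /^[ACGT]$/ || NR == 1' "$tmp")

# Length test (assumption: any single character counts, including "-").
loose=$(awk -F'\t' 'NR == 1 || (length($5) == 1 && length($6) == 1)' "$tmp")

printf 'strict:\n%s\n\nloose:\n%s\n' "$strict" "$loose"
rm -f "$tmp"
```

Which of the two is appropriate depends on whether single-character placeholders such as "-" are meant to be kept, which the question does not state.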

user218374 · Accepted Answer · 2017-04-24 08:48:04Z

perl -lane 'print if $. == 1 or 2 == grep /^[A-Z]$/, @F[4,5]' data.in
