
I have a small snippet of a file I'm working with:

ENSDARG00000032737 ENSDARP00000120731
ENSDARG00000032737 ENSDARP00000049290
ENSDARG00000061051 ENSDARP00000081062
ENSDARG00000061051
ENSDARG00000061051 ENSDARP00000129708

I only want to print the first instance of each unique value in the first column and the corresponding value in the second column, so my desired output would be:

ENSDARG00000032737 ENSDARP00000120731
ENSDARG00000061051 ENSDARP00000081062

Is there a simple way to accomplish this with awk or uniq or something similar?

Any help would be appreciated.

4
  • 3
    See Remove lines based on duplicates within one column without sort Commented Apr 3, 2020 at 21:56
  • 1
    Does this answer your question? Remove lines based on duplicates within one column without sort Commented Apr 3, 2020 at 22:19
  • @steeldriver Not a dupe of that particular question as there is no issue with using sort here. Commented Apr 3, 2020 at 22:21
  • Hi and welcome to SE ! To date you were given 3 good answers. It is customary for those who post questions to accept the answer they deem best. You can do so by selecting the green check mark to the left of the answer you want to reward with karma points. This draws the attention of other users to the fact that your query received at least one good answer. Cheers. Commented Apr 4, 2020 at 18:02

4 Answers 4

2

POSIX AWK:

m1[$1] == 0 {
  m1[$1] = 1
  print
}

For each line:

  1. see if first column exists in the "database"
  2. if not, add to "database" and print entire line
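The two steps above can be tried end to end on a small sample. This is only a sketch: the sample reuses the complete pairs from the question's snippet, the array name `seen` is an illustrative choice, and the membership test uses awk's `in` operator, a small variation on the `== 0` comparison in the answer.

```shell
# Sample input (assumption: whitespace-separated columns, as in the question).
sample='ENSDARG00000032737 ENSDARP00000120731
ENSDARG00000032737 ENSDARP00000049290
ENSDARG00000061051 ENSDARP00000081062
ENSDARG00000061051 ENSDARP00000129708'

# Step 1: check whether the first column is already in the "database" (the array).
# Step 2: if it is not, record it and print the entire line.
result=$(printf '%s\n' "$sample" | awk '!($1 in seen) { seen[$1] = 1; print }')

printf '%s\n' "$result"
```

Because lines are handled in input order and nothing is sorted, the first occurrence of each key is the one that gets printed.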
2
$ sort -s -k1,1 -u file
ENSDARG00000032737 ENSDARP00000120731
ENSDARG00000061051 ENSDARP00000081062

This sorts the file based on the first column only. While doing so, it ignores lines whose first column has already been seen.

Most implementations of sort have a non-standard -s option (used in the command above) that guarantees a "stable" sort. A stable sort does not change the relative order of entries that have identical keys (the first column in your case).
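To see what stability means in practice, the sketch below sorts a small sample on the first column only. It assumes a sort that supports the non-standard -s flag (GNU and BSD sort do); the sample lines come from the question, reordered here so that sorting actually has to move something.

```shell
# Sample input: the second gene is listed first on purpose.
sample='ENSDARG00000061051 ENSDARP00000081062
ENSDARG00000032737 ENSDARP00000120731
ENSDARG00000061051 ENSDARP00000129708
ENSDARG00000032737 ENSDARP00000049290'

# -s: stable sort, -k1,1: compare on the first column only.
# Lines that share a key keep their input order relative to each other.
stable=$(printf '%s\n' "$sample" | sort -s -k1,1)
printf '%s\n' "$stable"

# Adding -u keeps a single line per key.
unique=$(printf '%s\n' "$sample" | sort -s -k1,1 -u)
printf '%s\n' "$unique"
```

In the first result, ENSDARP00000120731 stays ahead of ENSDARP00000049290 even though it compares higher as a string, because only the first column is used as the key and ties are left in input order.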


Note, however, that the longer transcript (which both Ensembl and Havana agree 100% on) for the ENSDARG00000032737 gene is ENSDART00000049291, which codes for ENSDARP00000049290, not ENSDARP00000120731. But that's not really my business.

1
  • 1
    Thanks! I appreciate the feedback on the bioinformatics too, but your solution should work just fine for my purposes. Commented Apr 3, 2020 at 22:35
1

This idiomatic solution will work robustly using any awk in any shell on every UNIX box:

$ awk '!seen[$1]++' file
ENSDARG00000032737 ENSDARP00000120731
ENSDARG00000061051 ENSDARP00000081062
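For readers less fluent in awk, the one-liner can be unpacked. The expression `seen[$1]++` yields how many times the key was seen before the current line (0 the first time) and then increments the counter; `!` turns that 0 into true, and a true pattern with no action prints the line. The sketch below makes the counter visible; the sample data comes from the question and the variable name `count` is illustrative.

```shell
sample='ENSDARG00000032737 ENSDARP00000120731
ENSDARG00000032737 ENSDARP00000049290
ENSDARG00000061051 ENSDARP00000081062
ENSDARG00000061051 ENSDARP00000129708'

# Prefix each line with the value seen[$1]++ evaluates to:
# 0 marks the first occurrence of a key, anything else is a repeat.
trace=$(printf '%s\n' "$sample" | awk '{ count = seen[$1]++; print count, $0 }')
printf '%s\n' "$trace"

# The idiom itself prints only the lines whose counter was 0.
first_only=$(printf '%s\n' "$sample" | awk '!seen[$1]++')
printf '%s\n' "$first_only"
```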
1
  • 2
    +1 Very concise and "idiomatic", as well as strictly equivalent to StevenPenny's answer (not at all meant as a criticism), but nevertheless a little abstruse at first sight for those who do not speak "awk" fluently. ;-)) Commented Apr 4, 2020 at 18:08
0

The best solutions have already been provided; I'm just posting my attempt:

for i in $(awk '!seen[$1]++ { print $1 }' filename); do sed -n "/^$i/{p;q;}" filename; done

output

ENSDARG00000032737 ENSDARP00000120731
ENSDARG00000061051 ENSDARP00000081062
