How do I print lines for the first appearance of a unique value in a 2-column file?

Question

I have a small snippet of a file I'm working with:

ENSDARG00000032737 ENSDARP00000120731
ENSDARG00000032737 ENSDARP00000049290
ENSDARG00000061051 ENSDARP00000081062
ENSDARG00000061051
ENSDARG00000061051 ENSDARP00000129708

I only want to print the first instance of each unique value in the first column and the corresponding value in the second column, so my desired output would be:

ENSDARG00000032737 ENSDARP00000120731
ENSDARG00000061051 ENSDARP00000081062

Is there a simple way to accomplish this with awk or uniq or something similar?

Any help would be appreciated.

See Remove lines based on duplicates within one column without sort
– steeldriver, Commented Apr 3, 2020 at 21:56
Does this answer your question? Remove lines based on duplicates within one column without sort
– Chris Davies, Commented Apr 3, 2020 at 22:19
@steeldriver Not a dupe of that particular question as there is no issue with using sort here.
– Kusalananda ♦, Commented Apr 3, 2020 at 22:21
Hi and welcome to SE! To date you were given 3 good answers. It is customary for those who post questions to accept the answer they deem best. You can do so by selecting the green check mark to the left of the answer you want to reward with karma points. This draws the attention of other users to the fact that your query received at least one good answer. Cheers.
– Cbhihe, Commented Apr 4, 2020 at 18:02

Zombo · Accepted Answer · 2020-04-03 21:59:04Z

POSIX AWK:

m1[$1] == 0 {
  m1[$1] = 1
  print
}

For each line:

see if first column exists in the "database"
if not, add to "database" and print entire line
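The two steps above can also be sketched with awk's `in` operator, which tests whether a key exists without creating it. This is only an illustration: the array name `db`, the function name `first_per_key`, and the inline sample (four rows taken from the question) are my own choices, not part of the answer.

```shell
# Sample rows borrowed from the question; any two-column text works.
sample='ENSDARG00000032737 ENSDARP00000120731
ENSDARG00000032737 ENSDARP00000049290
ENSDARG00000061051 ENSDARP00000081062
ENSDARG00000061051 ENSDARP00000129708'

# first_per_key is a hypothetical helper: it reads lines on stdin and
# prints a line only the first time its first column shows up.
first_per_key() {
  awk '
    !($1 in db) {    # is the first column missing from the "database"?
      db[$1] = 1     # if so, add it to the "database" ...
      print          # ... and print the entire line
    }
  '
}

printf '%s\n' "$sample" | first_per_key
```

Input order is preserved, because each line is decided on as it is read.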

Kusalananda · Accepted Answer · 2020-04-04 13:27:27Z

$ sort -s -k1,1 -u file
ENSDARG00000032737 ENSDARP00000120731
ENSDARG00000061051 ENSDARP00000081062

This sorts the file based on the first column only. While doing so, it ignores lines whose first column has already been seen.

Most implementations of sort have a non-standard -s option (used in the command above) that guarantees a "stable" sorting algorithm. A stable sorting algorithm does not change the relative order of entries that have identical keys (the first column in your case).
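One thing that may be worth keeping in mind: since this approach sorts, the result comes out ordered by key, not in the order the keys first appear in the file. A small sketch of the difference, using made-up data (the `unsorted` variable and its rows are hypothetical) and assuming a sort that supports -s:

```shell
# Hypothetical data with the keys deliberately out of order.
unsorted='b 1
a 2
b 3
a 4'

# sort: one line per key, output ordered by the key.
# LC_ALL=C is set only to keep the collation order predictable.
printf '%s\n' "$unsorted" | LC_ALL=C sort -s -k1,1 -u

# For comparison, an awk filter that tracks seen keys:
# one line per key, in the order the keys first appear.
printf '%s\n' "$unsorted" | awk '!seen[$1]++'
```

For the file in the question the two give the same result, because the keys there are already grouped in sorted order.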

Note, however, that the longer transcript (which both Ensembl and Havana agree 100% on) for the ENSDARG00000032737 gene is ENSDART00000049291, which codes for ENSDARP00000049290, not ENSDARP00000120731. But that's not really my business.

Thanks! I appreciate the feedback on the bioinformatics too, but your solution should work just fine for my purposes.
– gpreising, Commented Apr 3, 2020 at 22:35

Ed Morton · Accepted Answer · 2020-04-04 12:48:00Z

This idiomatic solution will work robustly using any awk in any shell on every UNIX box:

$ awk '!seen[$1]++' file
ENSDARG00000032737 ENSDARP00000120731
ENSDARG00000061051 ENSDARP00000081062
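For readers less used to awk, here is a rough unpacking of what the one-liner evaluates per line. It is only a sketch: `trace_seen` and the three toy rows are hypothetical, and the trace output replaces the normal printing purely for illustration.

```shell
# trace_seen is a hypothetical helper used only for this illustration.
# seen[$1]++ yields the count *before* incrementing; an element that was
# never set reads as 0, so !seen[$1]++ is true only on the first sighting.
trace_seen() {
  awk '{
    before = seen[$1]++                        # value before the increment
    verdict = (before == 0) ? "printed" : "skipped"
    print $1, "count_before=" before, verdict
  }'
}

printf '%s\n' 'g1 p1' 'g1 p2' 'g2 p3' | trace_seen
```

A pattern with no action block defaults to printing the line, which is why the one-liner needs no explicit print.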

+1 Very concise and "idiomatic", as well as strictly equivalent to StevenPenny's answer (not at all meant as a criticism), but nevertheless a little abstruse at first sight for those who do not speak "awk" fluently. ;-))

– Cbhihe, Commented Apr 4, 2020 at 18:08

Praveen Kumar BS · Accepted Answer · 2020-04-05 18:06:38Z

The best solutions have already been provided; I'm just posting my attempt.

for i in $(awk '!seen[$1]++ {print $1}' filename); do sed -n "/^$i/{p;q;}" filename; done

output

ENSDARG00000032737 ENSDARP00000120731
ENSDARG00000061051 ENSDARP00000081062
