Suppose you have a big file whose lines have the format A, B, C, D (four comma-separated parts per line). I need a list of all lines that share the same fourth part (D is identical across those lines) but differ in the rest (A, B, C).

So, for example, duplicate lines should not appear in the output: they have the same D part, but the rest is the same as well.

Is there a way to do this?

P.S. The file has ~8M rows, so it is not possible to do this visually in a text editor.

1 Answer

 awk -F, -vD='D' '$4==D && !seen[$0]++' data 
  • -F, sets the field separator to a comma.
  • -vD='D' assigns the desired 4th-column value to the awk variable D; replace 'D' with the value you want to match.
  • $4==D && !seen[$0]++ prints a line if its 4th column equals D and that exact line has not been seen before.

If there are spaces after the commas, use this instead:

 awk -vFS=', *' -vD='D' '$4==D && !seen[$0]++' data 
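
The commands above filter for one known value of D. If you do not know the D values in advance, a two-pass variant might work. This is only a sketch, assuming the whole set of distinct lines fits in memory (worth checking with ~8M rows) and that fields are separated by a bare comma. The sample file and the array names (distinct, printed) are made up for illustration:

```shell
# Hypothetical sample data; in practice pass your real file twice instead.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
a1,b1,c1,x
a2,b2,c2,x
a1,b1,c1,x
a3,b3,c3,y
a3,b3,c3,y
a4,b4,c4,z
EOF

# Pass 1 (NR == FNR): count how many distinct lines each 4th field has.
# Pass 2: print each distinct line once if its 4th field has more than
# one distinct line, i.e. same D but different A, B, C.
result=$(awk -F, '
  NR == FNR { if (!seen[$0]++) distinct[$4]++; next }
  distinct[$4] > 1 && !printed[$0]++
' "$tmp" "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

With the sample data this prints the two distinct lines ending in x. The y lines are exact duplicates of each other and the z line has no partner, so neither group is printed.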
