List lines that are the same in the second part but differ on the first part

Question

Suppose you have a big file whose lines have the format A, B, C, D (four comma-separated parts per line). I need a list of all the lines that share the same fourth part (D is the same across those lines) but differ in the rest (A, B, C).

So, for example, duplicate lines should not appear in the output: they have the same D part, but the rest is also the same.

Is there a way to do this?

P.S. The file has ~8M rows, so it is not possible to do this visually in a text editor.

dedowsdi · Accepted Answer · 2019-05-18 01:28:01Z

 awk -F, -vD='D' '$4==D && !seen[$0]++' data

-F, splits each line into fields on ,
-vD='D' assigns the desired 4th-column value to the variable D; change 'D' to the value you are looking for.
$4==D && !seen[$0]++ prints a line if its 4th column equals the variable D and the whole line has not been seen before.
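
To make the behaviour concrete, here is a small sketch that runs the same filter on made-up data. The sample values and the temporary file are assumptions for illustration only; they are not from the question.

```shell
# Hypothetical sample data: two identical lines and one different line share X.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
a,b,c,X
a,b,c,X
d,e,f,X
g,h,i,Y
EOF

# Same filter as above, with the 4th-column value set to X.
# The repeated a,b,c,X line is printed only once; the Y line is skipped.
result=$(awk -F, -v D='X' '$4==D && !seen[$0]++' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

With this input the command should print a,b,c,X once, followed by d,e,f,X.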

If there are spaces after the commas, use this instead:

 awk -vFS=', *' -vD='D' '$4==D && !seen[$0]++' data
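
The commands above need the 4th-column value to be known in advance. If it is not, one possible approach (a sketch, not part of the original answer) is to read the file twice: the first pass counts how many distinct lines each 4th-column value has, and the second pass prints each distinct line whose value occurs with more than one distinct line. The array names distinct and printed and the sample data are made up for illustration. Note that this keeps every distinct line in memory, which may matter for a file with ~8M rows.

```shell
# Hypothetical sample data: X has two distinct lines, Y only duplicates, Z one line.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
a,b,c,X
a,b,c,X
d,e,f,X
g,h,i,Y
g,h,i,Y
j,k,l,Z
EOF

# The file is passed twice; NR==FNR is true only while reading the first copy.
result=$(awk -F, '
  NR == FNR {
    # First pass: count distinct whole lines per 4th-column value.
    if (!seen[$0]++) distinct[$4]++
    next
  }
  # Second pass: print each distinct line once if its 4th-column value
  # was seen with more than one distinct line.
  distinct[$4] > 1 && !printed[$0]++
' "$tmp" "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

Here only the X group qualifies, so the output should be a,b,c,X and d,e,f,X. The Y group is skipped because its lines are plain duplicates of each other.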
