Remove duplicate lines from file that are in a different order

Question

My file is like this:

alice, bob
bob, cat
cat, dennis
cat, bob
dennis, alice

I want to remove lines where the same words are repeated in reverse order. In this example, bob, cat and cat, bob are duplicates of each other, so cat, bob should be removed and my output should be

alice, bob
bob, cat
cat, dennis
dennis, alice

How can I do this?

Any restrictions regarding the other lines? I.e. can the fields be resorted and the lines be resorted, too? — FelixJN
– FelixJN, Commented Aug 4, 2019 at 16:18
no such restrictions. sorting can be done any number of times.. — user199046
– user199046, Commented Aug 4, 2019 at 16:28

steeldriver · Accepted Answer · 2019-08-04 15:43:32Z

You could use a hash that is keyed on the sorted elements:

$ perl -lne 'print unless $h{join ",", sort split /, /, $_}++' file
alice, bob
bob, cat
cat, dennis
dennis, alice

For exactly 2 fields, something like this might suffice:

$ awk -F', ' '!seen[$2 FS $1]; {seen[$0]++}' file
alice, bob
bob, cat
cat, dennis
dennis, alice

idk what the perl script does but that awk script will use a lot more memory than necessary, see unix.stackexchange.com/a/533876/133219 for the idiomatic awk approach. — Ed Morton
– Ed Morton, Commented Aug 4, 2019 at 22:45

Ed Morton · Accepted Answer · 2019-08-04 22:44:17Z


The idiomatic awk answer:

$ awk -F', ' '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++' file
alice, bob
bob, cat
cat, dennis
dennis, alice

The general approach for any number of fields is to sort them and use the sorted list as the index to seen[].
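A minimal sketch of that general approach, assuming the same ", " separator as in the question. The function name dedup_unordered is made up for illustration, and the insertion sort is only there because POSIX awk has no built-in sort; it is one possible way to build the sorted index, not necessarily what the answer had in mind.

```shell
# Sketch: drop lines whose fields repeat an earlier line in any order.
# dedup_unordered is a hypothetical name; ", " is the assumed separator.
dedup_unordered() {
    awk -F', ' '
    {
        # Copy the fields as strings so comparisons are never numeric.
        for (i = 1; i <= NF; i++) f[i] = $i ""
        # Insertion sort of f[1..NF].
        for (i = 2; i <= NF; i++) {
            v = f[i]
            for (j = i - 1; j >= 1 && f[j] > v; j--) f[j + 1] = f[j]
            f[j + 1] = v
        }
        # Join the sorted fields into a single index string.
        key = ""
        for (i = 1; i <= NF; i++) key = key (i > 1 ? FS : "") f[i]
    }
    # Print the line only the first time its index is seen.
    !seen[key]++
    ' "$@"
}
```

Called as dedup_unordered file, it should keep the first occurrence of each set of fields and leave the kept lines unchanged. The per-line sort costs more than the two-field comparison, so the one-liner above is still the better fit when there are exactly two fields.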


Can you please explain the logic?
– Death Metal, Commented Aug 8, 2019 at 20:28

@DeathMetal It creates a common index out of each pair of key fields by putting them in greatest-first order, so A B and B A both become the index B A. Then it just tests whether that index has been seen before: the first time either A B or B A is encountered in the input, seen["B A"]++ is 0, the 2nd time it's 1, and so on. The ! at the front ensures that the default action of printing the current input line only occurs when seen["B A"]++ is zero, i.e. the first time it's seen in the input.
– Ed Morton, Commented Aug 8, 2019 at 20:52
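
To make that explanation concrete, here is a small sketch that prints the index and the pre-increment count for every input line instead of filtering. The helper name show_keys and the tab-separated report format are assumptions added for illustration only.

```shell
# Sketch: show the index the one-liner builds for each line.
# show_keys is a hypothetical helper; ", " is the assumed separator.
show_keys() {
    awk -F', ' '{
        # Greatest-first order, exactly as in the one-liner.
        key = ($1 > $2 ? $1 FS $2 : $2 FS $1)
        # Post-increment returns the count before this line: 0 on first sight.
        n = seen[key]++
        printf "%s\tkey=%s\tcount=%d\t%s\n", $0, key, n, (n ? "dropped" : "printed")
    }' "$@"
}
```

For the lines alice, bob and bob, alice, both rows should report the index bob, alice, with count 0 on the first and count 1 on the second.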


FelixJN · Accepted Answer · 2019-08-04 21:11:00Z


This sorts the fields within every line, then sorts the file and picks unique lines only:

while read line
do
    echo $line | tr ' ,' '\n' | sort | tr '\n' ','
done < 1 | sed -e 's/^,//' -e 's/,$//' -e 's/,,/\n/g' | sort -u


It will also strip some white space, interpret escape sequences, do globbing, word splitting and file name generation, be extremely slow and rely on sed being able to operate on a non-POSIX-compliant input stream. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice and mywiki.wooledge.org/Quotes for a description of some of the issues.
– Ed Morton, Commented Aug 5, 2019 at 0:29
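
Two of those issues (stripped white space and backslash handling) can be seen in a small sketch that copies stdin to stdout in both styles. The function names and the sample line are made up for illustration; globbing is left out because its result depends on the files in the current directory.

```shell
# Sketch: copy stdin to stdout, once in the style of the loop above
# and once with the usual safeguards. Both names are hypothetical.
unsafe_copy() {
    while read line; do
        # Unquoted expansion: word splitting collapses runs of blanks.
        echo $line
    done
}

safe_copy() {
    # IFS= keeps leading and trailing blanks, -r keeps backslashes.
    while IFS= read -r line; do
        printf '%s\n' "$line"
    done
}
```

Feeding both a line such as two spaces, a\tb (a literal backslash and t), three spaces and c shows the difference: unsafe_copy loses the leading blanks, the backslash and the repeated spaces, while safe_copy returns the line unchanged. Even the safe form remains slow compared with a single awk or perl invocation.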

