Revisions to Remove duplicates csv based on first value keeping the longest line between duplicates

added 6 characters in body

edited Apr 4, 2020 at 21:30

user313992

Try with sort(1):

    sort -rt';' filename | sort -t';' -usk1,1

    Aerial Assault (USA);Aerial Assault (USA);Sega Master System;;1990;Sega;Shooter;;;;;0;;;;;
    After Burner (World);After Burner (World);Sega Master System;;1988;Sega;Flying;;;;;0;;;;;
    Air Rescue (Europe);Air Rescue (Europe);Sega Master System;;1992;Sega;Shooter;;;;;0;;;;;
    Aladdin (Europe);Aladdin (Europe);Sega Master System;;1994;Sega;Platform;;;;;0;;;;;

Both sorts use ; as the field delimiter (-t';'). The first sorts whole lines in reverse (-r), so that among lines sharing the same leading fields, those with empty fields come after those with non-empty fields. The second sorts by the first field only (-k1,1) and removes the extra lines with the same first field (-u = unique), but otherwise keeps the order set by the first sort (-s = stable).
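To see the two stages separately, here is a minimal sketch on a made-up three-line sample (the sample data and the variable names are hypothetical, not from the question). LC_ALL=C is my addition so that lines are compared byte by byte regardless of the locale; the original command does not set it.

```shell
# Hypothetical sample: two lines share the first field "Aladdin (Europe)",
# one sparse and one filled in.
sample=$(printf '%s\n' \
  'Aladdin (Europe);;Sega Master System;;;;' \
  'Aladdin (Europe);Aladdin (Europe);Sega Master System;;1994;Sega;' \
  'After Burner (World);After Burner (World);Sega Master System;;1988;Sega;')

# Stage 1: reverse sort on the whole line. Within a group of lines with the
# same title, the line with more non-empty fields ends up first.
stage1=$(printf '%s\n' "$sample" | LC_ALL=C sort -rt';')

# Stage 2: stable, unique sort on field 1 only. For each title, only the
# first line left by stage 1 is kept.
result=$(printf '%s\n' "$stage1" | LC_ALL=C sort -t';' -usk1,1)

printf 'after stage 1:\n%s\n\nafter stage 2:\n%s\n' "$stage1" "$result"
```

With this input, stage 1 puts the fuller Aladdin line ahead of the sparse one, and stage 2 keeps only that fuller line, after the After Burner line.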

This assumes that, instead of the "longest" line as the title says, you actually want the "most complete" one, i.e. that between two lines with the same first field, the shorter one always has a subset of the fields of the longer one (which is the only case where discarding the shorter lines can make any sense IMHO). It also assumes that your sort implementation has a -s (stable) option: both GNU (Linux) and BSD sort do.
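As an aside, if that subset assumption does not hold and you literally want the longest line per first field, something like the following awk sketch may do. It is not the sort-based method described here; the function name keep_longest is made up for illustration, and ties are resolved in favor of the line seen first.

```shell
# keep_longest: hypothetical helper, not part of the sort pipeline above.
# For each distinct first field, print the longest line seen, in the order
# in which the first fields first appeared in the input.
keep_longest() {
  awk -F';' '
    # First time this key is seen: remember its position in the input.
    !($1 in best) { order[++n] = $1 }
    # Store the line if the key is new or the line is strictly longer.
    !($1 in best) || length($0) > length(best[$1]) { best[$1] = $0 }
    END { for (i = 1; i <= n; i++) print best[order[i]] }
  ' "$@"
}

# Usage: read from a file argument or from standard input.
printf '%s\n' 'a;1' 'b;22;33' 'a;1;2;3' 'b;2' | keep_longest
```

Unlike the two sorts, this keeps the input order of the titles and holds one line per distinct first field in memory.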

If you want to run the sort pipeline on a batch of files, you should use find:

    find dir -type f -name '*.txt' \
      -exec sh -c 'for f; do sort -rt";" "$f" | sort -t";" -usk1,1 > "$f.new" && echo mv "$f.new" "$f"; done' sh {} +

Adjust find's predicates (-name, etc.) and remove the echo before mv only when you're ready to clobber your existing files.
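One way to try the dry run safely is in a throwaway directory. The sketch below assumes mktemp -d is available; the directory, the file name one.txt and its contents are made up, and LC_ALL=C is again my addition for a predictable byte order.

```shell
# Scratch directory with one hypothetical input file.
dir=$(mktemp -d)
printf '%s\n' 'a;;' 'a;x;y' 'b;z;' > "$dir/one.txt"

# Dry run: the echo is still in place, so mv is only printed, never executed.
dry_run=$(LC_ALL=C find "$dir" -type f -name '*.txt' \
  -exec sh -c 'for f; do sort -rt";" "$f" | sort -t";" -usk1,1 > "$f.new" && echo mv "$f.new" "$f"; done' sh {} +)

# The deduplicated output sits next to the original, which is left untouched.
new_content=$(cat "$dir/one.txt.new")
orig_content=$(cat "$dir/one.txt")

printf 'would run: %s\n' "$dry_run"
printf 'deduplicated:\n%s\n' "$new_content"

# Clean up the scratch directory.
rm -rf "$dir"
```

Note that even the dry run writes the .new files, so you can inspect or diff them before letting mv replace anything. Because they end in .new, the -name '*.txt' predicate does not pick them up on a later run.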

deleted 25 characters in body

edited Apr 4, 2020 at 21:21

user313992

Try with sort(1):

    sort -rt';' filename | sort -t';' -usk1,1

    Aerial Assault (USA);Aerial Assault (USA);Sega Master System;;1990;Sega;Shooter;;;;;0;;;;;
    After Burner (World);After Burner (World);Sega Master System;;1988;Sega;Flying;;;;;0;;;;;
    Air Rescue (Europe);Air Rescue (Europe);Sega Master System;;1992;Sega;Shooter;;;;;0;;;;;
    Aladdin (Europe);Aladdin (Europe);Sega Master System;;1994;Sega;Platform;;;;;0;;;;;

Both sorts use ; as the field delimiter (-t';'). The first sorts whole lines in reverse (-r), so that among lines sharing the same leading fields, those with empty fields come after those with non-empty fields. The second sorts by the first field only (-k1,1) and removes lines with the same first field (-u = unique), but otherwise keeps the order set by the first sort (-s = stable).

This assumes that, instead of the "longest" line as the title says, you actually want the "most complete" one, i.e. that between two lines with the same first field, the shorter one always has a subset of the fields of the longer one (which is the only case where discarding the shorter lines can make any sense IMHO). It also assumes that your sort implementation has a -s (stable) option: both GNU (Linux) and BSD sort do.

If you want to do it on a batch of files, you should use find:

    find dir -type f -name '*.txt' \
      -exec sh -c 'for f; do sort -rt";" "$f" | sort -t";" -usk1,1 > "$f.new" && echo mv "$f.new" "$f"; done' sh {} +

Adjust find's predicates (-name, etc.) and remove the echo before mv only when you're ready to clobber your existing files.

clarify

edited Apr 4, 2020 at 19:08

user313992

Try with sort(1):

    sort -rt';' filename | sort -t';' -usk1,1

    Aerial Assault (USA);Aerial Assault (USA);Sega Master System;;1990;Sega;Shooter;;;;;0;;;;;
    After Burner (World);After Burner (World);Sega Master System;;1988;Sega;Flying;;;;;0;;;;;
    Air Rescue (Europe);Air Rescue (Europe);Sega Master System;;1992;Sega;Shooter;;;;;0;;;;;
    Aladdin (Europe);Aladdin (Europe);Sega Master System;;1994;Sega;Platform;;;;;0;;;;;

Both sorts use ; as the field delimiter (-t';'). The first sorts whole lines in reverse (-r), so that among lines sharing the same leading fields, those with empty fields come after those with non-empty fields. The second sorts by the first field only (-k1,1) and removes lines with the same first field (-u = unique), but otherwise keeps the order set by the first sort (-s = stable).

This assumes that, instead of the "longest" line as the title says, you actually want the "most complete" one, i.e. it assumes that, between two lines with the same first field, the shorter one always has a subset of the fields of the longer one (which is the only case where discarding the shorter lines can make any sense IMHO). It also assumes that your sort implementation has a -s (stable) option: both GNU (Linux) and BSD sort do have it.

If you want to do it on a batch of files, you should use find:

    find dir -type f -name '*.txt' \
      -exec sh -c 'for f; do sort -rt";" "$f" | sort -t";" -usk1,1 > "$f.new" && echo mv "$f.new" "$f"; done' sh {} +

Adjust find's predicates (-name, etc.) and remove the echo before mv only when you're ready to clobber your existing files.

deleted 18 characters in body

edited Apr 2, 2020 at 14:02

user313992

Post Undeleted by CommunityBot

occurred Apr 2, 2020 at 13:56

added 7 characters in body

edited Apr 2, 2020 at 13:54

user313992

added 7 characters in body

edited Apr 2, 2020 at 13:48

user313992

Post Deleted by CommunityBot

occurred Apr 2, 2020 at 13:43

Post Undeleted by CommunityBot

occurred Apr 2, 2020 at 13:31

added 382 characters in body

edited Apr 2, 2020 at 13:31

user313992

Post Deleted by CommunityBot

occurred Apr 2, 2020 at 13:11

answered Apr 2, 2020 at 13:09

user313992
