Revisions to Fast elimination of duplicate lines across multiple files

deleted 4 characters in body

edited Aug 19, 2019 at 15:54

I'm not sure I understand your question, but your code can be optimised to:

awk '!x{a[$0];next}; !($0 in a)' foo/file x=1 bar/file > tmp

(I think yours had issues with empty lines and with lines that evaluate to "0")
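
To make the difference concrete, here is a minimal sketch with made-up sample data (the temp files stand in for foo/file and bar/file). The second awk call is a hypothetical variant that tests the stored value rather than key membership; it is only my guess at the kind of test that trips on empty lines, not necessarily what the question used.

```shell
# Hypothetical sample data standing in for foo/file and bar/file.
dir=$(mktemp -d)
printf '%s\n' apple '' 0 banana > "$dir/foo"
printf '%s\n' apple cherry '' 0 date > "$dir/bar"

# While x is unset (first file), record each line as an array key and skip it.
# The x=1 operand is processed once foo has been read, so for bar only the
# second rule runs and prints the lines that were never recorded.
result=$(awk '!x{a[$0];next}; !($0 in a)' "$dir/foo" x=1 "$dir/bar")

# Hypothetical variant testing the stored value instead of key membership:
# the empty line stores "", which is false, so it is printed even though
# foo contains it. A line that is just 0 is likely to behave the same way.
naive=$(awk 'NR==FNR{a[$0]=$0;next} !a[$0]' "$dir/foo" "$dir/bar")

printf 'membership test:\n%s\n' "$result"
printf 'value test:\n%s\n' "$naive"
rm -rf "$dir"
```

As far as I can tell, the x=1 form also copes with an empty first file, where NR==FNR would stay true for the second file as well.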

If the files are sorted, you could do:

comm -13 foo/file bar/file > tmp

If they're not (ksh93, zsh or bash syntax):

comm -13 <(sort foo/file) <(sort bar/file) > tmp

(not necessarily faster than the awk solution)
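
A small sketch of the comm approach on made-up data. It uses temporary sorted copies instead of process substitution so that it also runs in a plain sh; the file names are hypothetical, and running sort and comm under the same locale is my own precaution so that both agree on the ordering.

```shell
# Hypothetical unsorted sample data standing in for foo/file and bar/file.
dir=$(mktemp -d)
printf '%s\n' pear apple banana > "$dir/foo"
printf '%s\n' date banana cherry apple > "$dir/bar"

# Sort both inputs under the same locale that comm will run in.
LC_ALL=C sort "$dir/foo" > "$dir/foo.sorted"
LC_ALL=C sort "$dir/bar" > "$dir/bar.sorted"

# -1 drops lines found only in the first file, -3 drops lines common to both,
# which leaves the lines found only in the second file.
only_in_bar=$(LC_ALL=C comm -13 "$dir/foo.sorted" "$dir/bar.sorted")
printf '%s\n' "$only_in_bar"
rm -rf "$dir"
```

Note that the output comes out in sorted order, whereas the awk solution keeps the order of the second file.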

Also, especially with GNU awk, you may get better performance by setting the locale to C/POSIX:

LC_ALL=C awk ...
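
Spelled out with the command from above, that could look like the sketch below. The sample data is made up, and I'm assuming the lines only need to match byte for byte, in which case the C locale should not change which lines are printed.

```shell
# Hypothetical sample data with some non-ASCII lines.
dir=$(mktemp -d)
printf '%s\n' 'naïve' apple > "$dir/foo"
printf '%s\n' apple 'naïve' 'café' > "$dir/bar"

# The LC_ALL=C prefix applies to this awk process only;
# the locale of the calling shell is left unchanged.
out=$(LC_ALL=C awk '!x{a[$0];next}; !($0 in a)' "$dir/foo" x=1 "$dir/bar")
printf '%s\n' "$out"
rm -rf "$dir"
```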

made POSIX

edited Sep 11, 2012 at 12:47

Stéphane Chazelas

I'm not sure I understand your question, but your code can be optimised to:

awk 'NR==FNR{a[$0]="";next}; !($0 in a)' foo/file bar/file > tmp

(I think yours had issues with empty lines and with lines that evaluate to "0")

If the files are sorted, you could do:

comm -13 foo/file bar/file > tmp

If they're not (ksh93, zsh or bash syntax):

comm -13 <(sort foo/file) <(sort bar/file) > tmp

(not necessarily faster than the awk solution)

Also, especially with GNU awk, you may get better performance by setting the locale to C/POSIX:

LC_ALL=C awk ...

typo

edited Sep 11, 2012 at 12:15

Stéphane Chazelas

I'm not sure I understand your question, but your code can be optimised to:

awk 'NR==FNR{a[$0]="";next}!($0 in a)' foo/file bar/file > tmp

(I think yours had issues with empty lines and with lines that evaluate to "0")

If the files are sorted, you could do:

comm -13 foo/file bar/file > tmp

If they're not (ksh, zsh or bash syntax):

comm -13 <(sort foo/file) <(sort bar/file) > tmp

(not necessarily faster than the awk solution)

Also, especially with GNU awk, you may get better performance by setting the locale to C/POSIX:

LC_ALL=C awk ...

answered Sep 11, 2012 at 12:09

Stéphane Chazelas
