How to remove duplicate lines in a large multi-GB text file?

My question is similar to this question, but with a couple of different constraints:

  • I have a large \n-delimited wordlist -- one word per line. File sizes range from 2GB up to 10GB.
  • I need to remove any duplicate lines.
  • The process may sort the list while removing duplicates, but it is not required to.
  • There is enough space on the partition to hold the resulting unique wordlist.

I have tried both of the following methods, but both fail with out-of-memory errors.

sort -u wordlist.lst > wordlist_unique.lst

awk '!seen[$0]++' wordlist.lst > wordlist_unique.lst
awk: (FILENAME=wordlist.lst FNR=43601815) fatal: assoc_lookup: bucket->ahname_str: can't allocate 10 bytes of memory (Cannot allocate memory)
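
One note on the first attempt: GNU sort already spills to disk rather than holding the whole file in memory, and both its buffer size and its temporary directory can be set explicitly (-S and -T are GNU coreutils options; the 1G cap and the /var/tmp path below are only illustrative assumptions):

# Hedged sketch: cap sort's in-memory buffer and point its spill
# files at a partition known to have enough free space.
sort -u -S 1G -T /var/tmp wordlist.lst > wordlist_unique.lst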

What other approaches can I try?
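
For reference, here is a minimal sketch of the kind of external-memory approach being asked about, assuming GNU coreutils (split and sort) and a POSIX shell; the chunk size and the chunk_ prefix are placeholder choices, not from the original post:

# Split the wordlist into 10-million-line chunks.
split -l 10000000 wordlist.lst chunk_

# Deduplicate each chunk on its own; each chunk fits in memory individually.
for f in chunk_*; do
    sort -u "$f" -o "$f"
done

# Merge the already-sorted chunks, dropping duplicates across chunks (-m -u),
# then clean up. The merge step only needs a small buffer per input chunk.
sort -mu chunk_* -o wordlist_unique.lst
rm chunk_*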
