How to remove duplicate lines in a large multi-GB text file?

My question is similar to this question, but with a couple of different constraints:

  • I have a large \n-delimited wordlist -- one word per line. File sizes range from 2GB up to 10GB.
  • I need to remove any duplicate lines.
  • The process may sort the list while removing duplicates, but it is not required to.
  • There is enough space on the partition to hold the resulting unique wordlist.

I have tried both of the following methods, but both fail with out-of-memory errors.

sort -u wordlist.lst > wordlist_unique.lst

awk '!seen[$0]++' wordlist.lst > wordlist_unique.lst
awk: (FILENAME=wordlist.lst FNR=43601815) fatal: assoc_lookup: bucket->ahname_str: can't allocate 10 bytes of memory (Cannot allocate memory)
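
One note on the first attempt: GNU sort already spills to disk rather than holding the whole file in memory, and both its buffer size and its temporary directory can be set explicitly (-S and -T are GNU coreutils options; the 1G cap and the /var/tmp path below are only illustrative assumptions):

# Hedged sketch: cap sort's in-memory buffer and point its spill
# files at a partition known to have enough free space.
sort -u -S 1G -T /var/tmp wordlist.lst > wordlist_unique.lst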

What other approaches can I try?
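
For reference, here is a minimal sketch of the kind of external-memory approach being asked about, assuming GNU coreutils (split and sort) and a POSIX shell; the chunk size and the chunk_ prefix are placeholder choices, not from the original post:

# Split the wordlist into 10-million-line chunks.
split -l 10000000 wordlist.lst chunk_

# Deduplicate each chunk on its own; each chunk fits in memory individually.
for f in chunk_*; do
    sort -u "$f" -o "$f"
done

# Merge the already-sorted chunks, dropping duplicates across chunks (-m -u),
# then clean up. The merge step only needs a small buffer per input chunk.
sort -mu chunk_* -o wordlist_unique.lst
rm chunk_*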
