
The size of the file is 962,120,335 bytes.

HP-UX ******B.11.31 U ia64 ****** unlimited-user license

hostname> what /usr/bin/awk
/usr/bin/awk:
     main.c $Date: 2009/02/17 15:25:17 $Revision: r11.31/1 PATCH_11.31 (PHCO_36132)
     run.c $Date: 2009/02/17 15:25:20 $Revision: r11.31/1 PATCH_11.31 (PHCO_36132)
     $Revision: @(#) awk R11.31_BL2010_0503_1 PATCH_11.31 PHCO_40052

hostname> what /usr/bin/sed
/usr/bin/sed:
     sed0.c $Date: 2008/04/23 11:11:11 $Revision: r11.31/1 PATCH_11.31 (PHCO_38263)
     $Revision: @(#) sed R11.31_BL2008_1022_2 PATCH_11.31 PHCO_38263

hostname> perl -v
This is perl, v5.8.8 built for IA64.ARCHREV_0-thread-multi

hostname:> $ file /usr/bin/perl
/usr/bin/perl: ELF-32 executable object file - IA64
hostname:> $ file /usr/bin/awk
/usr/bin/awk: ELF-32 executable object file - IA64
hostname:> $ file /usr/bin/sed
/usr/bin/sed: ELF-32 executable object file - IA64

There are no GNU tools here.
What are my options?

I have already looked at How to remove duplicate lines in a large multi-GB textfile?

and

http://en.wikipedia.org/wiki/External_sorting#External_merge_sort

perl -ne 'print unless $seen{$_}++;' < file.merge > file.unique 

throws

Out of Memory! 

The resulting 960 MB file was merged from files of the following sizes (the average is about 50 MB): 22900038, 24313871, 25609082, 18059622, 23678631, 32136363, 49294631, 61348150, 85237944, 70492586, 79842339, 72655093, 73474145, 82539534, 65101428, 57240031, 79481673, 539293, 38175881

Question: how can I perform an external sort-merge to deduplicate this data? Or, more generally, how else can I deduplicate it?

  • The normal pattern for dedup'ing is sort ... | uniq. If sort is failing due to lack of memory, then you could try breaking the file apart into many pieces (for example using split), dedup'ing each piece individually, cat-ing them back together (which hopefully results in a smaller file than the original), and then dedup'ing that; a rough sketch follows these comments. Commented Mar 19, 2015 at 7:53
  • Doesn't a simple sort -u work? I know that the old sort from System V R4 used temporary files in /var/tmp if sorting in memory wasn't possible, so large files shouldn't be a problem. Commented Mar 19, 2015 at 7:59
  • Do you need to preserve the order of the first occurrences of each line? If not sort -u is the clear solution. Commented Mar 19, 2015 at 22:57
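
The split-and-dedup approach from the first comment might look roughly like this. It is only a sketch, assuming a POSIX split and sort; chunk. is a hypothetical prefix for the pieces, while file.merge and file.unique are the names from the question. Note that the last step still sorts the whole concatenated, hopefully much smaller, result:

    # Split the big file into pieces of one million lines each (chunk.aa, chunk.ab, ...)
    split -l 1000000 file.merge chunk.

    # Dedup each piece on its own; sort allows -o to name one of its input files
    for f in chunk.*; do
        sort -u -o "$f" "$f"
    done

    # Concatenate the smaller deduped pieces and dedup across them
    cat chunk.* | sort -u > file.unique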

3 Answers


It seems to me that the process you're following at the moment is this, which fails with your out of memory error:

  1. Create several data files
  2. Concatenate them together
  3. Sort the result, discarding duplicate records (rows)

I think you should be able to perform the following process instead (a minimal sketch follows the list):

  1. Create several data files
  2. Sort each one independently, discarding its duplicates (sort -u)
  3. Merge the resulting set of sorted data files, discarding duplicates (sort -m -u)
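
A minimal sketch of steps 2 and 3, assuming the data files are named part01.txt, part02.txt, ... (hypothetical names) and that this sort accepts -m and -u together, as suggested above:

    # Step 2: sort each data file on its own, discarding its internal duplicates
    for f in part*.txt; do
        sort -u -o "$f.sorted" "$f"
    done

    # Step 3: merge the already-sorted files, discarding duplicates across them;
    # -m only merges, so it never needs to hold more than a few lines in memory
    sort -m -u -o merged.unique part*.txt.sorted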
  • How to merge? To be able to merge in reasonable time, we need some lookup logic, e.g. a hash table, but then we again face the same problem: not enough memory to store a huge hash table. Commented Dec 21, 2017 at 19:15
  • @Borna why would you want a hash table when merging multiple pre-sorted files? These external merge-sort algorithms have been around since the days of magnetic tape - at least 50 years ago. Commented Dec 21, 2017 at 19:35
  • That is exactly what I was looking for, thank you sir! One question: I was wondering how efficient it would be to create n files in a single directory (under Linux), where each file name is a row from the 'non-unique-lines' file (let's say there are no illegal characters for file names), thus eliminating duplicate rows. Commented Dec 22, 2017 at 20:42
  • @Borna that sounds an interesting question in its own right. When you've asked it I'd appreciate a ping back here with the reference and I'll take a look Commented Dec 22, 2017 at 23:17

Of course there are no GNU/Linux tools: the what command is part of the Source Code Control System (SCCS), which I do not believe exists at all in Linux.

So, presumably, you are on Unix. There the sort implementation is capable of dealing with these problems: as Algorithmic details of UNIX Sort command explains, an input of size M with memory of size N is subdivided into M/N chunks that fit into memory; each chunk is sorted on its own, and the sorted chunks are then merged.

It should fit the bill.
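
For example, a minimal invocation along those lines; it simply relies on sort spilling its sorted runs to temporary files (traditionally under /var/tmp or /tmp, as one of the comments on the question notes), so that filesystem needs a few GB of free space:

    # Let sort do the chunking and merging internally, keeping only unique lines
    sort -u file.merge > file.unique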

  • The question states that the OP is on HP-UX. SCCS is proprietary, but you can use what on Linux if you install GNU CSSC. Commented Mar 19, 2015 at 8:23
  • Sun open-sourced SCCS as part of the Heirloom project. The README states that it has been successfully built on Linux. Commented Mar 19, 2015 at 8:30
  • @Anthon and WarrenYoung Thanks, I did not know this. Commented Mar 19, 2015 at 8:41
  • SCCS has been open source, at my request, since December 2006. However, be careful with Heirloom SCCS, since it only received attention for a few months and has been dead since spring 2007. It still has bugs on Linux. The sccs project on SourceForge (sccs.sf.net) is actively maintained and has no known bugs. Commented Jun 3, 2020 at 17:38
% perl -ne 'if ( $seen{$_}++ ) {
      $count++ ;
      if ($count > 1000000) {
          %seen = () ;
          $count = 0 ;
      }
  } else {
      print ;
  }' <<eof
a
a
a
b
c
a
a
a
b
c
eof
a
b
c
%
  • That only works if the file is already sorted in some order. And in that case it can be done faster with uniq. Commented Mar 19, 2015 at 22:56
  • Right! I replaced my slow uniq-style solution with a better approach, based on the hypothesis that the repetitions aren't too sparse. Commented Mar 20, 2015 at 0:23
