
The size of the file is 962,120,335 bytes.

HP-UX ******B.11.31 U ia64 ****** unlimited-user license

hostname> what /usr/bin/awk
/usr/bin/awk:
     main.c $Date: 2009/02/17 15:25:17 $Revision: r11.31/1 PATCH_11.31 (PHCO_36132)
     run.c $Date: 2009/02/17 15:25:20 $Revision: r11.31/1 PATCH_11.31 (PHCO_36132)
     $Revision: @(#) awk R11.31_BL2010_0503_1 PATCH_11.31 PHCO_40052

hostname> what /usr/bin/sed
/usr/bin/sed:
     sed0.c $Date: 2008/04/23 11:11:11 $Revision: r11.31/1 PATCH_11.31 (PHCO_38263)
     $Revision: @(#) sed R11.31_BL2008_1022_2 PATCH_11.31 PHCO_38263

hostname> perl -v
This is perl, v5.8.8 built for IA64.ARCHREV_0-thread-multi

hostname:> $ file /usr/bin/perl
/usr/bin/perl: ELF-32 executable object file - IA64
hostname:> $ file /usr/bin/awk
/usr/bin/awk: ELF-32 executable object file - IA64
hostname:> $ file /usr/bin/sed
/usr/bin/sed: ELF-32 executable object file - IA64

There are no GNU tools here.
What are my options?

I have already looked at How to remove duplicate lines in a large multi-GB textfile?

and

http://en.wikipedia.org/wiki/External_sorting#External_merge_sort

perl -ne 'print unless $seen{$_}++;' < file.merge > file.unique 

throws

Out of Memory! 

The resulting 960 MB file was merged from files of the following sizes (the average is about 50 MB): 22900038, 24313871, 25609082, 18059622, 23678631, 32136363, 49294631, 61348150, 85237944, 70492586, 79842339, 72655093, 73474145, 82539534, 65101428, 57240031, 79481673, 539293, 38175881

Question: how can I perform an external sort-merge to deduplicate this data? Or, more generally, how else can I deduplicate it?

  • The normal pattern for dedup'ing is sort ... | uniq. If sort is failing due to lack of memory, then you could try breaking the file apart into many pieces (for example using split), dedup'ing each piece individually, cat-ing them back together (which hopefully results in a smaller file than the original), and then dedup'ing that; a rough sketch follows these comments. Commented Mar 19, 2015 at 7:53
  • Doesn't a simple sort -u work? I know that the old sort from System V R4 used temporary files in /var/tmp if sorting in memory wasn't possible, so large files shouldn't be a problem. Commented Mar 19, 2015 at 7:59
  • Do you need to preserve the order of the first occurrences of each line? If not sort -u is the clear solution. Commented Mar 19, 2015 at 22:57
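
The split-and-dedup approach from the first comment might look roughly like this. It is only a sketch, assuming a POSIX split and sort; chunk. is a hypothetical prefix for the pieces, while file.merge and file.unique are the names from the question. Note that the last step still sorts the whole concatenated, hopefully much smaller, result:

    # Split the big file into pieces of one million lines each (chunk.aa, chunk.ab, ...)
    split -l 1000000 file.merge chunk.

    # Dedup each piece on its own; sort allows -o to name one of its input files
    for f in chunk.*; do
        sort -u -o "$f" "$f"
    done

    # Concatenate the smaller deduped pieces and dedup across them
    cat chunk.* | sort -u > file.unique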

3 Answers


It seems to me that the process you're following at the moment is this, which fails with your out of memory error:

  1. Create several data files
  2. Concatenate them together
  3. Sort the result, discarding duplicate records (rows)

I think you should be able to perform the following process instead (a minimal sketch follows the list):

  1. Create several data files
  2. Sort each one independently, discarding its duplicates (sort -u)
  3. Merge the resulting set of sorted data files, discarding duplicates (sort -m -u)
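
A minimal sketch of steps 2 and 3, assuming the data files are named part01.txt, part02.txt, ... (hypothetical names) and that this sort accepts -m and -u together, as suggested above:

    # Step 2: sort each data file on its own, discarding its internal duplicates
    for f in part*.txt; do
        sort -u -o "$f.sorted" "$f"
    done

    # Step 3: merge the already-sorted files, discarding duplicates across them;
    # -m only merges, so it never needs to hold more than a few lines in memory
    sort -m -u -o merged.unique part*.txt.sorted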
  • How to merge? To be able to merge in reasonable time, we need some lookup logic, e.g. a hash table, but then we again face the same problem: not enough memory to store a huge hash table. Commented Dec 21, 2017 at 19:15
  • @Borna why would you want a hash table when merging multiple pre-sorted files? These external merge-sort algorithms have been around since the days of magnetic tape - at least 50 years ago. Commented Dec 21, 2017 at 19:35
  • That is exactly what I was looking for, thank you sir! One question: I was wondering how efficient it would be to create n files in a single directory (under Linux), where each file name is a row from the 'non-unique-lines' file (let's say there are no illegal characters for file names), thus eliminating duplicate rows. Commented Dec 22, 2017 at 20:42
  • @Borna that sounds an interesting question in its own right. When you've asked it I'd appreciate a ping back here with the reference and I'll take a look Commented Dec 22, 2017 at 23:17

Of course there are no GNU/Linux tools: the what command is part of the Source Code Control System (SCCS), which I do not believe exists at all in Linux.

So, presumably, you are on Unix. There the sort implementation is capable of dealing with these problems: as Algorithmic details of UNIX Sort command explains, an input of size M with memory of size N is subdivided into M/N chunks that fit into memory; each chunk is sorted on its own, and the sorted chunks are then merged.

It should fit the bill.
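
For example, a minimal invocation along those lines; it simply relies on sort spilling its sorted runs to temporary files (traditionally under /var/tmp or /tmp, as one of the comments on the question notes), so that filesystem needs a few GB of free space:

    # Let sort do the chunking and merging internally, keeping only unique lines
    sort -u file.merge > file.unique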

  • The question states that the OP is on HP-UX. SCCS is proprietary, but you can use what on Linux if you install GNU CSSC. Commented Mar 19, 2015 at 8:23
  • Sun open-sourced SCCS as part of the Heirloom project. The README states that it has been successfully built on Linux. Commented Mar 19, 2015 at 8:30
  • @Anthon and WarrenYoung Thanks, I did not know this. Commented Mar 19, 2015 at 8:41
  • SCCS has been open source, at my request, since December 2006. However, be careful with Heirloom SCCS, since it only received attention for a few months and has been dead since spring 2007. It still has bugs on Linux. The sccs project on SourceForge (sccs.sf.net) is actively maintained and has no known bugs. Commented Jun 3, 2020 at 17:38
% perl -ne 'if ( $seen{$_}++ ) {
      $count++ ;
      if ($count > 1000000) {
          %seen = () ;
          $count = 0 ;
      }
  } else {
      print ;
  }' <<eof
a
a
a
b
c
a
a
a
b
c
eof
a
b
c
%
  • That only works if the file is already sorted in some order. And in that case it can be done faster with uniq. Commented Mar 19, 2015 at 22:56
  • Right! I replaced my slow uniq-style solution with a better approach, based on the hypothesis that the repetitions aren't too sparse. Commented Mar 20, 2015 at 0:23
