The size of the file is 962,120,335 bytes.
HP-UX ******B.11.31 U ia64 ****** unlimited-user license
hostname> what /usr/bin/awk
/usr/bin/awk:
        main.c $Date: 2009/02/17 15:25:17 $Revision: r11.31/1 PATCH_11.31 (PHCO_36132)
        run.c $Date: 2009/02/17 15:25:20 $Revision: r11.31/1 PATCH_11.31 (PHCO_36132)
        $Revision: @(#) awk R11.31_BL2010_0503_1 PATCH_11.31 PHCO_40052

hostname> what /usr/bin/sed
/usr/bin/sed:
        sed0.c $Date: 2008/04/23 11:11:11 $Revision: r11.31/1 PATCH_11.31 (PHCO_38263)
        $Revision: @(#) sed R11.31_BL2008_1022_2 PATCH_11.31 PHCO_38263

hostname> perl -v
This is perl, v5.8.8 built for IA64.ARCHREV_0-thread-multi

hostname:> $ file /usr/bin/perl
/usr/bin/perl:  ELF-32 executable object file - IA64

hostname:> $ file /usr/bin/awk
/usr/bin/awk:   ELF-32 executable object file - IA64

hostname:> $ file /usr/bin/sed
/usr/bin/sed:   ELF-32 executable object file - IA64

There are no GNU tools here.
What are my options?
I have already seen
How to remove duplicate lines in a large multi-GB textfile?
and
http://en.wikipedia.org/wiki/External_sorting#External_merge_sort
perl -ne 'print unless $seen{$_}++;' < file.merge > file.unique

throws "Out of Memory!".

The resulting 960 MB file was produced by merging files of the sizes listed below (in bytes), averaging about 50 MB each:

22900038, 24313871, 25609082, 18059622, 23678631, 32136363, 49294631, 61348150, 85237944, 70492586, 79842339, 72655093, 73474145, 82539534, 65101428, 57240031, 79481673, 539293, 38175881
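For reference, this is roughly the external merge-sort idea from the Wikipedia article applied to this situation, sketched in plain POSIX shell (untested; part.* is a placeholder for the original ~50 MB pieces, and the sort options should be checked against the HP-UX man page):

# Sort each original piece on its own; each one is small enough that
# this should not run out of memory.
for f in part.*; do                     # part.* is a placeholder name
    sort -o "$f.sorted" "$f"
done

# Merge the already-sorted pieces (-m, no full re-sort) and keep only
# the first of each group of identical lines (-u).
sort -m -u -o file.unique part.*.sorted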
Question: how do I perform an external merge sort and deduplicate this data? Or, put more simply, how do I deduplicate this data?
sort .... | uniq

If sort is failing due to lack of memory, you could try breaking the file apart into many pieces (for example using split), dedup'ing each part individually, cat'ing them back together (which hopefully results in a smaller file than the original), and then dedup'ing that result.

Does sort -u work? I know that the old sort from System V R4 used temporary files in /var/tmp if sorting in memory wasn't possible, so large files shouldn't be a problem. sort -u is the clear solution.
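A rough sketch of the two suggestions above in plain shell; the line count given to split and the scratch-directory behaviour of sort are assumptions to verify against the HP-UX 11.31 man pages:

# Simplest attempt: let sort spill to temporary files on disk, as the
# old SysV sort did (typically under /var/tmp). Whether TMPDIR or a -T
# option controls the scratch directory on HP-UX is an assumption to check.
sort -u file.merge > file.unique

# Fallback from the first comment: split on line boundaries (so no line
# is cut in half), dedup each piece, concatenate, then dedup the
# (hopefully smaller) result.
split -l 500000 file.merge piece.       # 500000 lines per piece is a guess
for f in piece.??; do
    sort -u "$f" > "$f.dedup"
done
cat piece.??.dedup | sort -u > file.unique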