Cost efficiently pair each line with lines of another file

Question

I have a very large file (~10 GB of data) in the following format:

'1','1'
'2','2'
'3','3'
'4','4'
'5','5'
'6','6'
'7','7'
'8','8'
'9','9'
'10','10'

and another file (300 KB in size) in this format:

1,2
1,3
1,4
1,5
1,6
1,7
1,8
1,9
1,10
2,1
2,3
2,4
2,5
2,6
2,7
2,8
2,9

Desired output:

'1','1','1,2',
'2','2','1,3',
'3','3','1,4',
'4','4','1,5',
'5','5','1,6',
'6','6','1,7',
'7','7','1,8',
'8','8','1,9',
'9','9','1,10',
'10','10','2,1',

The input file contains more than 10 million records, so doing this with a loop would be very costly.

paste will work if both files have the same number of lines. That is not the case here: the larger file has 1 billion rows while the smaller one has only 90. Every line in the larger file should get a value from the smaller file.
– anurag, Commented Dec 28, 2015 at 7:05
What criteria select the matches? How many lines are in each of the files (a solution that is valid when both have a thousand lines isn't necessarily useful when they have millions, and if one has millions and the other a few dozen, that suggests yet another approach)? Are the files sorted (or does the order matter), for input and output? What tools are available (i.e., can you write a specific C program, does it have to be done in shell, is a scripting language like Python or Perl acceptable, ...)?
– vonbrand, Commented Dec 28, 2015 at 16:23
I don't understand how the lines are matched. You seem to match line N of file 1 with line N of file 2, but what happens when you go beyond the end of the shortest file?
– Gilles 'SO- stop being evil', Commented Dec 28, 2015 at 22:58

iruvar · Accepted Answer · 2015-12-28 16:11:08Z

I did this as follows:

awk 'FNR==NR{a[i++]=$0; max=i; next} {if ((NR % max) == 0) {i=max-1} else {i=(NR%max) - 1}; printf "%s,%s\n",$0,a[i]}' smaller_file larger_file

If someone knows a faster way than this, please suggest it.

iruvar · Accepted Answer · 2015-12-28 16:38:59Z

It would appear that you're looking to cycle through the contents of the smaller file.

With awk

awk 'NR == FNR{a[++i]=$0; next}; {print $0, a[FNR % i? FNR % i: i]}' smaller_file larger_file
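The index expression in the awk one-liner maps the 1-based line number FNR onto the range 1..i, wrapping around. A small sketch of that arithmetic, in case it helps to see it in isolation (the function names are hypothetical, not part of the answer):

```python
def cycle_index(line_no, size):
    """Map a 1-based line number onto 1..size, wrapping around."""
    r = line_no % size
    # Same rule as the awk expression: FNR % i ? FNR % i : i
    return r if r else size


def cycle_index_closed(line_no, size):
    """Equivalent closed form without the conditional."""
    return (line_no - 1) % size + 1
```

Both forms give the same index for every positive line number; the conditional is only needed because awk arrays are filled here starting at 1.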

And Python

from itertools import cycle, izip
with open('larger_file') as f1, open('smaller_file') as f2:
    z = izip(f1, cycle(f2))
    for l, m in z:
        print l.rstrip('\n'), m.rstrip('\n')
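The snippet above is Python 2 (izip and the print statement). A rough Python 3 sketch of the same idea follows. The function name pair_cycled is hypothetical, and quoting the appended field to match the desired output in the question is an assumption rather than something the answer does.

```python
from itertools import cycle


def pair_cycled(larger_lines, smaller_lines):
    """Yield each larger line followed by the next smaller line, cycling."""
    # cycle() keeps a copy of the smaller input in memory and repeats it;
    # zip() stops as soon as the larger input is exhausted.
    for big, small in zip(larger_lines, cycle(smaller_lines)):
        # Assumed output layout, taken from the question: big,'small',
        yield "{},'{}',".format(big.rstrip("\n"), small.rstrip("\n"))
```

Both arguments can be open file objects, so the larger file is streamed line by line and only the smaller one is held in memory.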

mikeserv · Accepted Answer · 2015-12-28 16:42:52Z

paste -d",''," ./file1 - ./file2 - - </dev/null >out

...given your example data that writes to output:

'1','1','1,2',
'2','2','1,3',
'3','3','1,4',
'4','4','1,5',
'5','5','1,6',
'6','6','1,7',
'7','7','1,8',
'8','8','1,9',
'9','9','1,10',
'10','10','2,1',
,'2,3',
,'2,4',
,'2,5',
,'2,6',
,'2,7',
,'2,8',
,'2,9',
,'',

It's a little difficult for me to tell exactly what the criteria are for stopping the output, but to write output identical to your example output:

{ paste -d",''," ./file1 - ./file2 - - |
  sed -ne's/,/&/4p;t' -eq
} </dev/null

'1','1','1,2',
'2','2','1,3',
'3','3','1,4',
'4','4','1,5',
'5','5','1,6',
'6','6','1,7',
'7','7','1,8',
'8','8','1,9',
'9','9','1,10',
'10','10','2,1',

Gilles 'SO- stop being evil' · Accepted Answer · 2015-12-28 23:31:18Z

As many have already pointed out, paste is the right tool here.

paste -d ,\'\' file1 /dev/null file2 /dev/null

If file2 is shorter than file1, then paste will act as if file2 had as many empty lines at the end as needed to match file1.
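To make the padding rule concrete, here is a minimal Python sketch using itertools.zip_longest. It only illustrates the pairing behaviour for two inputs and is not a description of how paste is implemented; the function name paste_padded is hypothetical.

```python
from itertools import zip_longest


def paste_padded(lines1, lines2):
    """Pair lines positionally; the shorter input is padded with empty strings."""
    for a, b in zip_longest(lines1, lines2, fillvalue=""):
        # Same layout as the paste command above: line1,'line2'
        yield "{},'{}'".format(a.rstrip("\n"), b.rstrip("\n"))
```

With a three-line first input and a one-line second input, the second and third output lines end in an empty quoted field, which is why plain paste does not give the cycling behaviour the question asks for.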

If you want to act as if file2 repeated over and over, make it repeat over and over until you reach the line count of file1.

while true; do cat file2; done | head -n "$(wc -l <file1)" | paste -d ,\'\' file1 /dev/null - /dev/null

This requires going over file1 twice. Depending on the relative speed of your CPU and I/O, it may be faster to eschew paste and instead use a tool that can process multiple files in a more flexible way, such as awk. Here's an awk solution that doesn't require loading either file in memory entirely (if file2 is small, the disk cache will take care of this anyway).

awk -v file2=file2 '
  !getline s <file2 {close(file2); getline s <file2}
  {print $0 ",\047" s "\047"}
' file1

Explanation: getline s <file2 reads the next line from file2, opening it if necessary. If this fails (because the end of the file has been reached), close the file and start again.
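The same read-and-restart idea can be sketched in Python. This version rewinds the smaller file with seek(0) instead of closing and reopening it, which is a substitution made for brevity and not what the awk code does; the function name pair_rewinding is hypothetical.

```python
def pair_rewinding(larger_path, smaller_path):
    """Yield larger-file lines with the next smaller-file line, rewinding at EOF."""
    with open(larger_path) as big, open(smaller_path) as small:
        for line in big:
            s = small.readline()
            if not s:
                # End of the smaller file reached: start again from the top
                small.seek(0)
                s = small.readline()
            # Same layout as the awk print: line,'s'
            yield "{},'{}'".format(line.rstrip("\n"), s.rstrip("\n"))
```

Neither file is loaded into memory as a whole; each is read one line at a time, as in the awk version.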
