1

I have a very large file (~10 GB) which contains data in the following format:

'1','1'
'2','2'
'3','3'
'4','4'
'5','5'
'6','6'
'7','7'
'8','8'
'9','9'
'10','10'

The other file (300 KB in size) has this format:

1,2
1,3
1,4
1,5
1,6
1,7
1,8
1,9
1,10
2,1
2,3
2,4
2,5
2,6
2,7
2,8
2,9

Desired output:

'1','1','1,2',
'2','2','1,3',
'3','3','1,4',
'4','4','1,5',
'5','5','1,6',
'6','6','1,7',
'7','7','1,8',
'8','8','1,9',
'9','9','1,10',
'10','10','2,1',

Since the input file contains more than 10 million records, doing this with a loop would be a very costly operation.

4
  • paste -d",''" 1 - 2 - - </dev/null Commented Dec 28, 2015 at 7:00
  • paste will work if the number of lines in both files is the same. This is not the case here. The larger file has 1 billion rows while the smaller one has only 90 rows. Every line in the larger file should get a value from the smaller file. Commented Dec 28, 2015 at 7:05
  • What criteria select the matches? How many lines are in each of the files (a solution valid if both have a thousand lines isn't necessarily useful if they have millions of lines, and if one has millions and the other one a few dozen will suggest yet another approach)? Are the files sorted (or does the order matter), for input and output? What tools are available (i.e., can write a specific C program, has to be done in shell, a scripting language like Python or Perl is acceptable, ...)? Commented Dec 28, 2015 at 16:23
  • I don't understand how the lines are matched. You seem to match line N of file 1 with line N of file 2, but what happens when you go beyond the end of the shortest file? Commented Dec 28, 2015 at 22:58

4 Answers

0

I did this with the following:

awk 'FNR==NR{a[i++]=$0; max=i; next} {if ((NR % max) == 0) {i=max-1} else {i=(NR%max) - 1}; printf "%s,%s\n",$0,a[i]}' smaller_file larger_file 

But if someone knows a faster way than this, please suggest it.
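The index arithmetic in the awk one-liner can be checked in isolation. Below is a minimal Python sketch, not part of the original answer; the names `awk_index` and `simple_index` are made up for illustration. It assumes a 0-based array holding `max_lines` lines from the smaller file and treats `nr` as awk's record counter. Since `NR` for the larger file equals `max + FNR`, the remainder is the same whether `NR` or `FNR` is used.

```python
def awk_index(nr, max_lines):
    """Mirror the if/else from the awk one-liner (0-based array index)."""
    if nr % max_lines == 0:
        return max_lines - 1
    return nr % max_lines - 1


def simple_index(nr, max_lines):
    """Equivalent single-expression form of the same mapping."""
    return (nr - 1) % max_lines


# Compare both forms over several cycles of a hypothetical 90-line smaller file.
mismatches = [nr for nr in range(1, 1000)
              if awk_index(nr, 90) != simple_index(nr, 90)]
print(mismatches)
```

If the two forms agree, as this check suggests, the branch could probably be collapsed to `a[(NR-1) % max]` in the awk program, though that mainly tidies the code and is unlikely to change the running time much.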

0

It would appear that you're looking to cycle through the contents of the smaller file.

With awk:

awk 'NR == FNR{a[++i]=$0; next}; {print $0, a[FNR % i? FNR % i: i]}' smaller_file larger_file 

And Python:

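from itertools import cycle

with open('larger_file') as f1, open('smaller_file') as f2:
    # cycle() replays the smaller file's lines whenever they run out;
    # zip() stops as soon as the larger file is exhausted (Python 3).
    for l, m in zip(f1, cycle(f2)):
        print(l.rstrip('\n'), m.rstrip('\n'))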
0
paste -d",''," ./file1 - ./file2 - - </dev/null >out 

Given your example data, that writes the following to out:

'1','1','1,2',
'2','2','1,3',
'3','3','1,4',
'4','4','1,5',
'5','5','1,6',
'6','6','1,7',
'7','7','1,8',
'8','8','1,9',
'9','9','1,10',
'10','10','2,1',
,'2,3',
,'2,4',
,'2,5',
,'2,6',
,'2,7',
,'2,8',
,'2,9',
,'',

It's a little difficult for me to tell exactly what the criteria are for stopping the output, but to write output identical to your example output:

{ paste -d",''," ./file1 - ./file2 - - | sed -ne's/,/&/4p;t' -eq } </dev/null 

'1','1','1,2',
'2','2','1,3',
'3','3','1,4',
'4','4','1,5',
'5','5','1,6',
'6','6','1,7',
'7','7','1,8',
'8','8','1,9',
'9','9','1,10',
'10','10','2,1',
0

As many have already pointed out, paste is the right tool here.

paste -d ,\'\' file1 /dev/null file2 /dev/null 

If file2 is shorter than file1, then paste will act as if file2 had enough empty lines at the end to match file1.
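To make the padding behaviour concrete, here is a small Python sketch that imitates it with `itertools.zip_longest`. It is only an illustration of the semantics described above, not how paste is implemented; the helper name `paste_like` and the sample lines are made up.

```python
from itertools import zip_longest

# Made-up samples standing in for the larger and the smaller file.
file1_lines = ["'1','1'", "'2','2'", "'3','3'"]
file2_lines = ["1,2", "1,3"]


def paste_like(left, right):
    """Join lines pairwise, padding the shorter input with empty strings."""
    return [f"{a},'{b}'" for a, b in zip_longest(left, right, fillvalue="")]


for line in paste_like(file1_lines, file2_lines):
    print(line)
```

The third output line ends in an empty pair of quotes, which is the effect of the shorter input being treated as if it had trailing empty lines.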

If you want to act as if file2 repeated over and over, make it repeat over and over until you reach the line count of file1.

while true; do cat file2; done | head -n "$(wc -l <file1)" | paste -d ,\'\' file1 /dev/null - /dev/null

This requires going over file1 twice. Depending on the relative speed of your CPU and I/O, it may be faster to eschew paste and instead use a tool that can process multiple files in a more flexible way, such as awk. Here's an awk solution that doesn't require loading either file in memory entirely (if file2 is small, the disk cache will take care of this anyway).

awk -v file2=file2 '
  !(getline s <file2) {close(file2); getline s <file2}
  {print $0 ",\047" s "\047"}
' file1

Explanation: getline s <file2 reads the next line from file2, opening it if necessary. If this fails (because the end of the file has been reached), close the file and start again.
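For comparison, the same read-and-restart idea can be sketched in Python. This is a rough equivalent under stated assumptions rather than a tuned implementation: the function name `cycle_join` is made up, the smaller file is assumed to be non-empty, and the demonstration uses small temporary files with invented contents.

```python
import os
import tempfile


def cycle_join(big_path, small_path):
    """Yield each line of big_path followed by the next small_path line in quotes.

    When small_path reaches end of file it is rewound, which plays the role
    of close() followed by getline in the awk version.
    Assumes small_path is not empty.
    """
    with open(big_path) as big, open(small_path) as small:
        for line in big:
            extra = small.readline()
            if not extra:          # end of the smaller file: start over
                small.seek(0)
                extra = small.readline()
            yield line.rstrip("\n") + ",'" + extra.rstrip("\n") + "'"


# Tiny demonstration with temporary files (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    big_path = os.path.join(tmp, "file1")
    small_path = os.path.join(tmp, "file2")
    with open(big_path, "w") as f:
        f.write("'1','1'\n'2','2'\n'3','3'\n")
    with open(small_path, "w") as f:
        f.write("1,2\n1,3\n")
    result = list(cycle_join(big_path, small_path))

print(result)
```

Like the awk program, this holds only one line of each file in memory at a time, so memory use should not depend on the size of the larger file.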
