
I am trying this command:

join -a1 -11 file1 file2 > file3 

file1 looks like:

1
2
3
4
5
6
7
8
9
10
11

file2:

1 lkj klj lkj
2 lkj lkj lkj
3
7 lkj lkj lkj
8
9
11 lkk kll lkk

The output skips the row numbered 11.

While Googling I saw that join only understands alphabetical (lexical) sort order, but there must be a way to do this. My aim is to join five 60,000,000-line files for a genetics project.
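For illustration, the difference between the two orderings can be seen on a handful of sample keys. This is only a sketch: the keys are made up, and LC_ALL=C is pinned (an assumption about the environment) so that the lexical order is plain byte order.

```shell
# Lexical (byte-wise) order: the order join expects its input to be in.
lexical=$(printf '%s\n' 1 2 9 10 11 | LC_ALL=C sort | paste -sd' ' -)

# Numeric order: the order the files in this question are actually in.
numeric=$(printf '%s\n' 1 2 9 10 11 | LC_ALL=C sort -n | paste -sd' ' -)

echo "lexical: $lexical"   # prints: lexical: 1 10 11 2 9
echo "numeric: $numeric"   # prints: numeric: 1 2 9 10 11
```

In lexical order 10 and 11 come before 2, so a numerically sorted file looks unsorted to join once the keys reach two digits.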

How can I do this? Are there other tools or options to join to make it work?

2 Answers


I assume your large files are already sorted numerically on the key. The following method requires no further sorting.

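You can simply add leading zeros to the keys, using sed ... With every key padded to the same width, lexical order matches numeric order, which is what join needs. Because the process is pipelined, there are no temporary files to deal with. The sed overhead is trivial.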


# Make the key 9 digits wide: add 9 leading 0's, then remove the excess 0's
join -a1 -11 <(sed -r 's/^([0-9]+)/000000000\1/; s/^0+([0-9]{9})/\1/' file1) \
             <(sed -r 's/^([0-9]+)/000000000\1/; s/^0+([0-9]{9})/\1/' file2)

Output is:

000000001 lkj klj lkj
000000002 lkj lkj lkj
000000003
000000004
000000005
000000006
000000007 lkj lkj lkj
000000008
000000009
000000010
000000011 lkk kll lkk

If you don't want the leading zeros in the output, use this command instead.
The extra sed -r 's/^0+//' removes leading zeros.

join -a1 -11 <(sed -r 's/^([0-9]+)/000000000\1/;s/^0+([0-9]{9})/\1/' file1) \
             <(sed -r 's/^([0-9]+)/000000000\1/;s/^0+([0-9]{9})/\1/' file2) | sed -r 's/^0+//'

Output:

1 lkj klj lkj
2 lkj lkj lkj
3
4
5
6
7 lkj lkj lkj
8
9
10
11 lkk kll lkk
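As a variation on the same padding idea, the key can also be padded with awk's sprintf instead of the two sed substitutions. This is a sketch rather than the method above. It assumes whitespace-separated fields and non-negative integer keys of at most 9 digits, and it uses temporary files (the names under $tmp are made up for the example) instead of process substitution so that it runs in a plain POSIX shell.

```shell
# Recreate the sample data from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 > "$tmp/file1"
printf '%s\n' '1 lkj klj lkj' '2 lkj lkj lkj' 3 '7 lkj lkj lkj' 8 9 '11 lkk kll lkk' > "$tmp/file2"

# Hypothetical helper: zero-pad field 1 to 9 digits.
# Assigning to $1 makes awk rebuild the line with single spaces between fields.
pad() { awk '{ $1 = sprintf("%09d", $1); print }' "$1"; }

pad "$tmp/file1" > "$tmp/p1"
pad "$tmp/file2" > "$tmp/p2"

# -a 1 -1 1 is the spelled-out form of -a1 -11; the final sed strips the padding.
result=$(LC_ALL=C join -a 1 -1 1 "$tmp/p1" "$tmp/p2" | sed 's/^0*//')
printf '%s\n' "$result"

rm -rf "$tmp"
```

On the sample data this should print the same eleven rows as the sed version. Note that awk normalizes runs of spaces or tabs to a single space, which may or may not matter for the real files.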

You can sort the input files lexically on the key, then sort the output numerically:

join -a1 -11 <(sort -k1,1 file1) <(sort -k1,1 file2) | sort -k1,1n 
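A minimal sketch of this approach on the sample data from the question is given here. It writes the sorted inputs to temporary files rather than using process substitution, and it pins LC_ALL=C on every command so that sort and join agree on collation order. Both choices are assumptions made for the example, not part of the command above.

```shell
# Recreate the sample data from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 > "$tmp/file1"
printf '%s\n' '1 lkj klj lkj' '2 lkj lkj lkj' 3 '7 lkj lkj lkj' 8 9 '11 lkk kll lkk' > "$tmp/file2"

# Step 1: sort both inputs lexically on field 1, the order join expects.
LC_ALL=C sort -k1,1 "$tmp/file1" > "$tmp/s1"
LC_ALL=C sort -k1,1 "$tmp/file2" > "$tmp/s2"

# Step 2: join, keeping unpaired lines from the first file (-a 1),
# then restore numeric order on the key.
joined=$(LC_ALL=C join -a 1 -1 1 "$tmp/s1" "$tmp/s2" | LC_ALL=C sort -k1,1n)
printf '%s\n' "$joined"

rm -rf "$tmp"
```

Unlike the zero-padding approach, this costs a full sort of each input plus a sort of the output, which may be significant for 60,000,000-line files.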