join files with numbered index

Question

I am trying this command:

join -a1 -11 file1 file2 > file3

file1 looks like:

1
2
3
4
5
6
7
8
9
10
11

file2:

1 lkj klj lkj
2 lkj lkj lkj
3
7 lkj lkj lkj
8
9
11 lkk kll lkk

The output skips the row numbered 11.

While Googling, I read that join only understands alphabetically sorted input, but there must be a way to do this. My aim is to join five 60,000,000-line files for a genetics project.

How can I do this? Are there other tools, or options to join, that would make it work?

Peter.O · Accepted Answer · 2011-11-15 06:00:52Z

I assume your large files are already sorted numerically by key. The following method requires no further sorting.

You can simply add leading zeros to the keys using sed. Once every key has the same width, numeric order and alphabetical order coincide, which is what join needs. Because the process is pipelined, there are no temporary files to deal with. The sed overhead is trivial.

# Make the key 9 digits wide: add 9 leading 0's, then remove the excess 0's
join -a1 -11 <(sed -r 's/^([0-9]+)/000000000\1/; s/^0+([0-9]{9})/\1/' file1) \
             <(sed -r 's/^([0-9]+)/000000000\1/; s/^0+([0-9]{9})/\1/' file2)

Output is:

000000001 lkj klj lkj
000000002 lkj lkj lkj
000000003
000000004
000000005
000000006
000000007 lkj lkj lkj
000000008
000000009
000000010
000000011 lkk kll lkk
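As an illustration of the padding step only, here is a minimal sketch that swaps the two sed expressions for awk's sprintf. The helper name pad_keys is hypothetical, and the sketch assumes the keys are non-negative integers of at most 9 digits. Unlike sed, assigning to $1 makes awk rebuild each line with single spaces between fields.

```shell
# pad_keys: hypothetical helper, not part of the answer above.
# Zero-pads the first field of every line to 9 digits.
# Assumption: keys are non-negative integers of at most 9 digits.
pad_keys() {
    awk '{ $1 = sprintf("%09d", $1); print }' "$@"
}

# Keys 9, 10, 11 are in numeric order but not in alphabetical order.
printf '%s\n' '9' '10' '11 lkk kll lkk' | pad_keys

# After padding, sort -c confirms the keys are also in alphabetical order.
printf '%s\n' '9' '10' '11 lkk kll lkk' | pad_keys |
    LC_ALL=C sort -c -k1,1 && echo "padded keys are in alphabetical order"
```

In bash, pad_keys file1 and pad_keys file2 could stand in for the two sed commands inside the process substitutions.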

If you don't want the leading zeros in the output, use this command instead.
The extra sed -r 's/^0+//' removes leading zeros.

join -a1 -11 <(sed -r 's/^([0-9]+)/000000000\1/;s/^0+([0-9]{9})/\1/' file1) \
             <(sed -r 's/^([0-9]+)/000000000\1/;s/^0+([0-9]{9})/\1/' file2) |
  sed -r 's/^0+//'

Output is:

1 lkj klj lkj
2 lkj lkj lkj
3
4
5
6
7 lkj lkj lkj
8
9
10
11 lkk kll lkk
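One edge case: s/^0+// would erase a key that is exactly 0. The sample data has no such key, so whether this matters for the real files is an assumption. A minimal sketch of a variant that always leaves at least one digit (strip_zeros is a hypothetical helper name):

```shell
# strip_zeros: hypothetical helper, not part of the answer above.
# Removes leading zeros from the start of each line but keeps the last
# digit, so a key of 000000000 becomes 0 instead of an empty string.
strip_zeros() {
    sed -E 's/^0+([0-9])/\1/' "$@"
}

printf '%s\n' '000000000 x' '000000011 lkk kll lkk' | strip_zeros
```

For the two sample lines this prints "0 x" and "11 lkk kll lkk".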

enzotib · Accepted Answer · 2011-11-14 12:58:46Z

You can sort the input files alphabetically, as join expects, then sort the output numerically:

join -a1 -11 <(sort -k1,1 file1) <(sort -k1,1 file2) | sort -k1,1n
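A minimal sketch of the same pipeline on tiny sample data, with two assumptions that are not in the answer: LC_ALL=C is set so that sort and join agree on byte-wise ordering, and a temporary file replaces one of the process substitutions so the sketch also runs under plain sh. The function name join_numeric and the sample contents are hypothetical.

```shell
# join_numeric FILE1 FILE2: hypothetical wrapper around the pipeline above.
join_numeric() {
    sorted1=$(mktemp) || return 1
    # Sort both inputs byte-wise on field 1, the order join expects.
    LC_ALL=C sort -k1,1 "$1" > "$sorted1"
    # "-" makes join read the second input from standard input.
    LC_ALL=C sort -k1,1 "$2" |
        LC_ALL=C join -a1 -1 1 "$sorted1" - |
        LC_ALL=C sort -k1,1n    # put the joined rows back in numeric order
    rm -f "$sorted1"
}

# Sample data: file1 holds the keys, file2 holds values for some of them.
tmp=$(mktemp -d)
printf '%s\n' 1 2 9 10 11 > "$tmp/file1"
printf '%s\n' '1 a' '9 b' '11 c' > "$tmp/file2"
join_numeric "$tmp/file1" "$tmp/file2"
rm -rf "$tmp"
```

On the sample data this prints the rows 1 a, 2, 9 b, 10, 11 c, one per line. Note that this approach sorts every input in full, which may be costly on 60,000,000-line files.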
