Join two files, matching on a column, with repetitions

Question

How can I take two files, A and B, and output a result like this:

File A:

001 Apple, CA
020 Banana, CN
023 Apple, LA
045 Orange, TT
101 Orange, OS
200 Kiwi, AA

File B:

01-Dec-2013 01.664 001 AAA CAC 1083
01-Dec-2013 01.664 020 AAA CAC 0513
01-Dec-2013 01.668 023 AAA CAC 1091
01-Dec-2013 01.668 101 AAA CAC 0183
01-Dec-2013 01.674 200 AAA CAC 0918
01-Dec-2013 01.674 045 AAA CAC 0918
01-Dec-2013 01.664 001 AAA CAC 2573
01-Dec-2013 01.668 101 AAA CAC 1091
01-Dec-2013 01.668 020 AAA CAC 6571
01-Dec-2013 01.668 023 AAA CAC 2148
01-Dec-2013 01.674 200 AAA CAC 0918
01-Dec-2013 01.668 045 AAA CAC 5135

Result:

01-Dec-2013 01.664 001 AAA CAC 1083 Apple, CA
01-Dec-2013 01.664 020 AAA CAC 0513 Banana, CN
01-Dec-2013 01.668 023 AAA CAC 1091 Apple, LA
01-Dec-2013 01.668 101 AAA CAC 0183 Orange, OS
01-Dec-2013 01.674 200 AAA CAC 0918 Kiwi, AA
01-Dec-2013 01.674 045 AAA CAC 0918 Orange, TT
01-Dec-2013 01.664 001 AAA CAC 2573 Apple, CA
01-Dec-2013 01.668 101 AAA CAC 1091 Orange, OS
01-Dec-2013 01.668 020 AAA CAC 6571 Banana, CN
01-Dec-2013 01.668 023 AAA CAC 2148 Apple, LA
01-Dec-2013 01.674 200 AAA CAC 0918 Kiwi, AA
01-Dec-2013 01.668 045 AAA CAC 5135 Orange, TT

(The number in file A should match the middle number, i.e. the third field, in file B.)

Is there any way to do this?

user20877 · Accepted Answer · 2014-01-25 13:14:15Z

A simple solution with awk:

awk -v FILE_A="file-A" -v OFS="\t" 'BEGIN { while ( ( getline < FILE_A ) > 0 ) { VAL = $0 ; sub( /^[^ ]+ /, "", VAL ) ; DICT[ $1 ] = VAL } } { print $0, DICT[ $3 ] }' file-B

Here is a commented version:

awk -v FILE_A="file-A" -v OFS="\t" '
BEGIN {
    # Loop on the content of file-A
    # to put the values in a table
    while ( ( getline < FILE_A ) > 0 ) {
        # Remove the index from the value
        VAL = $0
        sub( /^[^ ]+ /, "", VAL )
        # Fill the table
        DICT[ $1 ] = VAL
    }
}
{
    # Print the line followed by the
    # corresponding value
    print $0, DICT[ $3 ]
}' file-B
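The same lookup-table idea can also be written with the common two-file awk idiom `NR == FNR`, which avoids `getline`. This is only a sketch, not the answer's own code: the function name `join_by_key` is made up for illustration, it separates the two parts with a space rather than a tab, and it assumes file A is not empty (otherwise `NR == FNR` would also be true for file B).

```shell
# Hypothetical helper: append file A's text to each line of file B,
# matching A's first field against B's third field.
join_by_key() {
    # $1 = lookup file (file A), $2 = data file (file B)
    awk '
        NR == FNR {              # true only while the first file is being read
            key = $1
            sub(/^[^ ]+ +/, "")  # strip the key and the spaces after it from $0
            dict[key] = $0
            next
        }
        { print $0, dict[$3] }   # lines of the second file
    ' "$1" "$2"
}
```

Calling `join_by_key file-A file-B` prints every line of file-B in its original order, so repeated keys in file-B are handled naturally.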

@Jean, thank you for your answer. :) I got the best result with your help.
– JOSS, Commented Jan 25, 2014 at 13:31
@JOSS, if you're accepting an awk answer, you should remove the bash-script tag.
– Ricky, Commented Jan 25, 2014 at 21:29

slm · Accepted Answer · 2014-01-25 06:05:33Z

Here's a Bash script that does what you're looking for. The script's called mergeAB.bash.

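#!/bin/bash

# Read fileA.txt into array A, one line per element (-t strips the newlines)
readarray -t A < fileA.txt

i=0
while read -r B; do
    # Cycle through A: the index wraps back to 0 after the last element
    idx=$(( $i % ${#A[@]} ))
    printf "%s %s\n" "$B" "${A[$idx]}"
    #echo "i: $i | A#: ${#A[@]} | IDX: $idx"
    let i=i+1
done < fileB.txt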

When you run it:

$ ./mergeAB.bash
01-Dec-2013 01.664 001 AAA CAC 1083 001 Apple, CA
01-Dec-2013 01.664 020 AAA CAC 0513 020 Banana, CN
01-Dec-2013 01.668 023 AAA CAC 1091 023 Apple, LA
01-Dec-2013 01.668 101 AAA CAC 0183 045 Orange, TT
01-Dec-2013 01.674 200 AAA CAC 0918 101 Orange, OS
01-Dec-2013 01.674 045 AAA CAC 0918 200 Kiwi, AA
01-Dec-2013 01.664 001 AAA CAC 2573 001 Apple, CA
01-Dec-2013 01.668 101 AAA CAC 1091 020 Banana, CN
01-Dec-2013 01.668 020 AAA CAC 6571 023 Apple, LA
01-Dec-2013 01.668 023 AAA CAC 2148 045 Orange, TT
01-Dec-2013 01.674 200 AAA CAC 0918 101 Orange, OS
01-Dec-2013 01.668 045 AAA CAC 5135 200 Kiwi, AA

Details

The very first thing we do is use the command readarray to read the contents of fileA.txt into an array. This is a newer feature of Bash 4.x, so if you're using an older version of Bash you can use something like this instead:

$ IFS=$'\n' read -d '' -r -a A < fileA.txt

The rest of this script's a bit complex but I've left a verbose echo in the middle that you can un-comment to see what's going on.

$ ./mergeAB.bash | grep i:
i: 0 | A#: 6 | IDX: 0
i: 1 | A#: 6 | IDX: 1
i: 2 | A#: 6 | IDX: 2
i: 3 | A#: 6 | IDX: 3
i: 4 | A#: 6 | IDX: 4
i: 5 | A#: 6 | IDX: 5
i: 6 | A#: 6 | IDX: 0
i: 7 | A#: 6 | IDX: 1
i: 8 | A#: 6 | IDX: 2
i: 9 | A#: 6 | IDX: 3
i: 10 | A#: 6 | IDX: 4
i: 11 | A#: 6 | IDX: 5

What's going on here? There's a counter, $i, that counts each line of fileB.txt as we loop through it. We then calculate $idx as $i modulo the number of lines in fileA.txt.

NOTE: ${#A[@]} is the length of the array A. By calculating $idx this way we make it wrap around from 0 to 5, then 0 to 5 again, and so on. You can see this in the IDX: column of the debug output above.
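The wrap-around can be tried in isolation. A minimal sketch, where the function name `wrap_index` is hypothetical and the array length 6 matches the six lines of fileA.txt:

```shell
# Hypothetical helper: map a 0-based line counter ($1) onto an
# array of length $2 by taking the remainder of the division.
wrap_index() {
    echo $(( $1 % $2 ))
}

# The counter passes the end of a 6-element array and starts over
for i in 0 5 6 11; do
    printf 'i=%d idx=%d\n' "$i" "$(wrap_index "$i" 6)"
done
```

The loop prints idx values 0, 5, 0 and 5, matching the IDX: column for the same values of i in the debug output.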

The rest of the script is pretty standard, using printf to print the concatenated lines from fileB.txt with the corresponding line from fileA.txt.

Thank you!! slm, that is what I need. I also want to learn more about it.
– JOSS, Commented Jan 25, 2014 at 6:10
In file A, the number should match the middle number from file B. Do you know how to fix it?
– JOSS, Commented Jan 25, 2014 at 7:03

ckujau · Accepted Answer · 2014-01-25 03:58:40Z

$ cat b | while read b; do key=$(echo $b | awk '{print $3}'); /bin/echo -n "$b "; grep -w $key a | cut -d' ' -f2-; done
01-Dec-2013 01.664 001 AAA CAC 1083 Apple, CA
01-Dec-2013 01.664 020 AAA CAC 0513 Banana, CN
01-Dec-2013 01.668 023 AAA CAC 1091 Apple, LA
01-Dec-2013 01.668 101 AAA CAC 0183 Orange, OS
01-Dec-2013 01.674 200 AAA CAC 0918 Kiwi, AA
01-Dec-2013 01.674 045 AAA CAC 0918 Orange, TT
01-Dec-2013 01.664 001 AAA CAC 2573 Apple, CA
01-Dec-2013 01.668 101 AAA CAC 1091 Orange, OS
01-Dec-2013 01.668 020 AAA CAC 6571 Banana, CN
01-Dec-2013 01.668 023 AAA CAC 2148 Apple, LA
01-Dec-2013 01.674 200 AAA CAC 0918 Kiwi, AA
01-Dec-2013 01.668 045 AAA CAC 5135 Orange, TT

I suspect the awk construct can be done in a more elegant way, but it seems to work.
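One possible tidier variant lets `read` split the fields, so no awk process is started per line. This is a sketch under assumptions, not the answer's own code: the function name `append_lookup` and the variable names are invented, the key is assumed to contain no regex metacharacters, and fields are assumed to be separated by single spaces (runs of spaces between the first fields would be collapsed).

```shell
# Hypothetical helper: for each line of the data file, look up its
# third field in the lookup file and append the rest of that line.
append_lookup() {
    # $1 = lookup file (file a), $2 = data file (file b)
    while read -r day stamp key rest; do
        # Anchor the key at the start of the line and require a space
        # after it, so 020 cannot match inside another number
        match=$(grep "^$key " "$1" | cut -d' ' -f2-)
        printf '%s %s %s %s %s\n' "$day" "$stamp" "$key" "$rest" "$match"
    done < "$2"
}
```

It still runs one grep per line of the data file, so for large inputs a single awk pass is likely to be faster.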

Thanks, this works. But is it possible to use an array?
– JOSS, Commented Jan 25, 2014 at 7:08

bahamat · Accepted Answer · 2014-01-25 08:37:22Z

The join utility performs an "equality join" on the specified files and writes the result to the standard output. The "join field" is the field in each file by which the files are compared.

In other words, you have two files that share a column. You can join the lines of those files where the column is equal.

So let's try:

$ join -1 1 -2 3 a b
001 Apple, CA 01-Dec-2013 01.664 AAA CAC 1083
020 Banana, CN 01-Dec-2013 01.664 AAA CAC 0513
023 Apple, LA 01-Dec-2013 01.668 AAA CAC 1091
101 Orange, OS 01-Dec-2013 01.668 AAA CAC 0183
200 Kiwi, AA 01-Dec-2013 01.674 AAA CAC 0918

Yep, works. But not in the format you specified. So let's swap the files:

$ join -1 3 -2 1 b a
001 01-Dec-2013 01.664 AAA CAC 1083 Apple, CA
020 01-Dec-2013 01.664 AAA CAC 0513 Banana, CN
023 01-Dec-2013 01.668 AAA CAC 1091 Apple, LA
101 01-Dec-2013 01.668 AAA CAC 0183 Orange, OS
200 01-Dec-2013 01.674 AAA CAC 0918 Kiwi, AA

Much better. Still not quite right, since the joined field shows up first. Awk can fix that up:

$ join -1 3 -2 1 b a | awk '{print $2,$3,$1,$4,$5,$6,$7,$8}'
01-Dec-2013 01.664 001 AAA CAC 1083 Apple, CA
01-Dec-2013 01.664 020 AAA CAC 0513 Banana, CN
01-Dec-2013 01.668 023 AAA CAC 1091 Apple, LA
01-Dec-2013 01.668 101 AAA CAC 0183 Orange, OS
01-Dec-2013 01.674 200 AAA CAC 0918 Kiwi, AA

So there you go. The fields are in the same order. In awk you can use printf or insert some tabs if you want to get the spacing exact, but I think you'll get the idea.
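Because join expects both inputs sorted on the join field, one way to keep every line of b in its original order is to number the lines, sort, join, and then sort back by line number. The sketch below is not the answer's own command: the function name `join_keep_order` is hypothetical, and it assumes single-space separators and exactly two words of text after the key in file a, as in the sample data.

```shell
# Hypothetical helper: join on a's field 1 and b's field 3 while
# preserving the line order of b.
join_keep_order() (
    # $1 = file a, $2 = file b; the body runs in a subshell so the
    # locale change does not leak into the caller
    export LC_ALL=C                  # byte-wise collation for sort and join
    sorted_a=$(mktemp) || exit 1
    sort -k1,1 "$1" > "$sorted_a"

    # Prefix each line of b with its line number (key becomes field 4),
    # sort on the key, join, restore the order, drop the line number
    awk '{ print NR, $0 }' "$2" |
        sort -k4,4 |
        join -1 4 -2 1 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,2.2,2.3 - "$sorted_a" |
        sort -n -k1,1 |
        cut -d' ' -f2-

    rm -f "$sorted_a"
)
```

Lines of b whose key is missing from a are dropped here, since join only prints matched pairs by default.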

Note that you need to sort the input files on the join field for join to work properly.
– Stéphane Chazelas, Commented Jan 25, 2014 at 8:49
join isn't quite right for the question, either. There are more lines of B than A; join won't output all the lines. And it destroys the fixed field widths (i.e. eats spaces).
– Ricky, Commented Jan 25, 2014 at 9:20
@RickyBeam - Wrong. join is definitely the right tool for this job: join -1 1 -2 3 -o 2.1 2.2 2.3 2.4 2.5 2.6 1.2 1.3 fileA <(sort -k3 fileB) . You could even preserve the order of lines in fileB and the spacing if you so wished, search my posts under the join tag if you're curious to see how.
– don_crissti, Commented Sep 26, 2015 at 15:56
Important: FILE1 and FILE2 must be sorted on the join fields. That rather solidly kills the ordering. join is a poor tool for the task; you have not proven otherwise.
– Ricky, Commented Sep 26, 2015 at 19:56
@RickyBeam - in the OP example file1 is already sorted though in general both have to be sorted so I completely agree here. The fact that sort "solidly kills the ordering" is irrelevant because you can re-sort the output back to the initial order. That is, if you're smart enough. I don't feel the need to prove you anything but here are a few examples for you to read 1,2,3.
– don_crissti, Commented Sep 26, 2015 at 20:24

Ricky · Accepted Answer · 2014-01-25 09:06:14Z

With an array, as requested (entirely in bash)...

while read num loc; do A[0x$num]=$loc; done < A
while read B; do set -- $B; echo "${B} ${A[0x$3]}"; done < B

(works in bash v2)

The first line loads the array "A" from file A. The 0x$num prefix makes bash read every index as hexadecimal, which keeps all the numbers in the same base; without it, the leading zeros would make bash treat them as octal. The second line reads each line of file B (preserving the spaces inside the line), sets the positional arguments from that line, and finally prints the line plus the indexed entry from "A".
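On bash 4 or later, an associative array is an alternative that sidesteps the number-base question, because the keys are kept as plain strings. This is a sketch rather than the answer's method: the function name `lookup_join` and the variable names are hypothetical, and it assumes neither file contains blank lines.

```shell
#!/usr/bin/env bash
# Hypothetical helper (needs bash 4+ for associative arrays):
# keys such as 001 or 020 are stored as strings, not numbers.
lookup_join() {
    # $1 = lookup file (file A), $2 = data file (file B)
    local -A dict
    local key rest line
    local -a fields

    # First field is the key, the remainder of the line is the value
    while read -r key rest; do
        dict[$key]=$rest
    done < "$1"

    # Split a copy of each line into words to find the third field,
    # but print the original line untouched
    while IFS= read -r line; do
        read -r -a fields <<< "$line"
        printf '%s %s\n' "$line" "${dict[${fields[2]}]}"
    done < "$2"
}
```

Called as `lookup_join A B`, it prints each line of B followed by the matching text from A.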
