2

I have two input.txt files on a Linux machine containing different columns (, as separator). I have a script to join these files using the ID reported in each's first column. This script preserves files all IDs of the first file in the output and only the matched ID of the second one. I need to implement this script adding an option to also preserve the ID of the second files not matched with the ID of the first one.

Example:

2931,C,-9.750,-2.550,57.910,-0.3,C 2932,C,-5.470,-0.200,51.550,0.9,C 2940,C,-10.860,-3.400,54.000,0.7,C 2941,S,-11.820,-13.550,55.070,2.1,S 2944,H,-3.770,-4.180,60.300,0.7,H 

input2.txt

4304,N,-9.700,-7.680,58.330,-2.3,N 2940,S,-10.440,-3.450,54.270,2.2,S 2900,C,-13.655,-13.730,59.405,-1.5,C 2931,C,-9.910,-2.420,57.610,0.2,C 

cmd:

join -t, -a1 -o auto <(sort input1.txt) <(sort input2.txt) > output.txt.txt 

output.txt

2931,C,-9.750,-2.550,57.910,-0.3,C,2931,C,-9.910,-2.420,57.610,0.2,C 2932,C,-5.470,-0.200,51.550,0.9,C,,,,,,, 2940,C,-10.860,-3.400,54.000,0.7,C,2940,S,-10.440,-3.450,54.270,2.2,S 2941,S,-11.820,-13.550,55.070,2.1,S,,,,,,, 2944,H,-3.770,-4.180,60.300,0.7,H,,,,,,, 

I'd like to modify the command to obtain two output files. The first should be similar to what I get now, but it should also have the IDs that were not matched:

output_final.txt

2931,C,-9.750,-2.550,57.910,-0.3,C,2931,C,-9.910,-2.420,57.610,0.2,C 2932,C,-5.470,-0.200,51.550,0.9,C,,,,,,, 2940,C,-10.860,-3.400,54.000,0.7,C,2940,S,-10.440,-3.450,54.270,2.2,S 2941,S,-11.820,-13.550,55.070,2.1,S,,,,,,, 2944,H,-3.770,-4.180,60.300,0.7,H,,,,,,, ,,,,,,,2900,C,-13.655,-13.730,59.405,-1.5,C ,,,,,,,4304,N,-9.700,-7.680,58.330,-2.3,N 

The other output file should contain only the non-matching rows of input2.txt:

output2.txt

2900,C,-13.655,-13.730,59.405,-1.5,C 4304,N,-9.700,-7.680,58.330,-2.3,N 

Moreover, if in input2.txt I would like to replace the element of the last column only with the string "P" for rows that have an ID equal or grated than 4000, how can I do?

i.e I would like to replace for only the first row (ID = 4304) the last "C" with "P"

output.txt

4304,N,-9.700,-7.680,58.330,-2.3,P 2940,S,-10.440,-3.450,54.270,2.2,S 2900,C,-13.655,-13.730,59.405,-1.5,C 2931,C,-9.910,-2.420,57.610,0.2,C 

2 Answers 2

3

Job one :

Assuming that the ID is unique inside the file you could use awk as follows :

awk -F, -v OFS=, ' NR == FNR { m[$1] = $0 while (i++ <= NF) empty = OFS empty next } !m[$1]{$0 = $0 OFS empty} m[$1]{$0 = $0 OFS m[$1];delete m[$1]} 1 END{ for ( i in m ) if(m[i]) print empty, m[i] } ' file2 file1 

Note that you don't need to sort the file. Any time a common fields is encountered remove it from the array. At the end the array will hold only that are just in file2


Job 2 :

awk -F, 'NR == FNR {m[$1];next} !($1 in m)' file1 file2 

Put the previous two inside a shell script with output redirection :

#!/bin/bash # first awk cmd ... > output1.txt # Second awk cmd ... > output2.txt 
1
  • Thanks, it works! Ad if in input1.txt I would like to replace the element of the last only for rows that have an ID equal or grated than 4000? Commented Mar 14, 2022 at 9:32
3

You can get the first output file you want, the one with all IDs from both files, by telling join to include all fields:

$ join -t, -a1 -a2 -o 1.1,1.2,1.2,1.4,1.5,1.6,1.7,2.1,2.2,2.2,2.4,2.5,2.6,2.7 \ <(sort input1.txt) <(sort input2.txt) ,,,,,,,2900,C,C,-13.730,59.405,-1.5,C 2931,C,C,-2.550,57.910,-0.3,C,2931,C,C,-2.420,57.610,0.2,C 2932,C,C,-0.200,51.550,0.9,C,,,,,,, 2940,C,C,-3.400,54.000,0.7,C,2940,S,S,-3.450,54.270,2.2,S 2941,S,S,-13.550,55.070,2.1,S,,,,,,, 2944,H,H,-4.180,60.300,0.7,H,,,,,,, ,,,,,,,4304,N,N,-7.680,58.330,-2.3,N 

Note that the order is different than what you show because this is the order found in the files (,,,,,,,2900... appears first in sort input2.txt).

Then, you can get the second output file by parsing the first and looking or lines starting with one or more , characters:

$ join -t, -a1 -a2 -o 1.1,1.2,1.2,1.4,1.5,1.6,1.7,2.1,2.2,2.2,2.4,2.5,2.6,2.7 \ <(sort input1.txt) <(sort input2.txt) | grep -oP '^,+\K.*' 2900,C,C,-13.730,59.405,-1.5,C 4304,N,N,-7.680,58.330,-2.3,N 

The -o option tells grep to only print the matching part of the line and -P enables Perl Compatible Regular Expressoins. PCREs then give us \K which means "ignore anything matched up to here" and this lets us print only the part after the stretch of ,.

You can combine these into a single command making two files using tee to both write the first output to a file and also write it to stdout which lets you then run grep as shown above:

join -t, -a1 -a2 -o 1.1,1.2,1.2,1.4,1.5,1.6,1.7,2.1,2.2,2.2,2.4,2.5,2.6,2.7 \ <(sort input1.txt) <(sort input2.txt) | tee output1.txt | grep -oP '^,+\K.*' > output2.txt 

The final output is:

$ cat output1.txt ,,,,,,,2900,C,C,-13.730,59.405,-1.5,C 2931,C,C,-2.550,57.910,-0.3,C,2931,C,C,-2.420,57.610,0.2,C 2932,C,C,-0.200,51.550,0.9,C,,,,,,, 2940,C,C,-3.400,54.000,0.7,C,2940,S,S,-3.450,54.270,2.2,S 2941,S,S,-13.550,55.070,2.1,S,,,,,,, 2944,H,H,-4.180,60.300,0.7,H,,,,,,, ,,,,,,,4304,N,N,-7.680,58.330,-2.3,N $ cat output2.txt 2900,C,C,-13.730,59.405,-1.5,C 4304,N,N,-7.680,58.330,-2.3,N 

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.