matched and not matched ID first column

Question

I have two input.txt files on a Linux machine containing different columns (, as separator). I have a script to join these files using the ID reported in each's first column. This script preserves files all IDs of the first file in the output and only the matched ID of the second one. I need to implement this script adding an option to also preserve the ID of the second files not matched with the ID of the first one.

Example:

2931,C,-9.750,-2.550,57.910,-0.3,C 2932,C,-5.470,-0.200,51.550,0.9,C 2940,C,-10.860,-3.400,54.000,0.7,C 2941,S,-11.820,-13.550,55.070,2.1,S 2944,H,-3.770,-4.180,60.300,0.7,H

input2.txt

4304,N,-9.700,-7.680,58.330,-2.3,N 2940,S,-10.440,-3.450,54.270,2.2,S 2900,C,-13.655,-13.730,59.405,-1.5,C 2931,C,-9.910,-2.420,57.610,0.2,C

cmd:

join -t, -a1 -o auto <(sort input1.txt) <(sort input2.txt) > output.txt.txt

output.txt

2931,C,-9.750,-2.550,57.910,-0.3,C,2931,C,-9.910,-2.420,57.610,0.2,C 2932,C,-5.470,-0.200,51.550,0.9,C,,,,,,, 2940,C,-10.860,-3.400,54.000,0.7,C,2940,S,-10.440,-3.450,54.270,2.2,S 2941,S,-11.820,-13.550,55.070,2.1,S,,,,,,, 2944,H,-3.770,-4.180,60.300,0.7,H,,,,,,,

I'd like to modify the command to obtain two output files. The first should be similar to what I get now, but it should also have the IDs that were not matched:

output_final.txt

2931,C,-9.750,-2.550,57.910,-0.3,C,2931,C,-9.910,-2.420,57.610,0.2,C 2932,C,-5.470,-0.200,51.550,0.9,C,,,,,,, 2940,C,-10.860,-3.400,54.000,0.7,C,2940,S,-10.440,-3.450,54.270,2.2,S 2941,S,-11.820,-13.550,55.070,2.1,S,,,,,,, 2944,H,-3.770,-4.180,60.300,0.7,H,,,,,,, ,,,,,,,2900,C,-13.655,-13.730,59.405,-1.5,C ,,,,,,,4304,N,-9.700,-7.680,58.330,-2.3,N

The other output file should contain only the non-matching rows of input2.txt:

output2.txt

2900,C,-13.655,-13.730,59.405,-1.5,C 4304,N,-9.700,-7.680,58.330,-2.3,N

Moreover, if in input2.txt I would like to replace the element of the last column only with the string "P" for rows that have an ID equal or grated than 4000, how can I do?

i.e I would like to replace for only the first row (ID = 4304) the last "C" with "P"

output.txt

4304,N,-9.700,-7.680,58.330,-2.3,P 2940,S,-10.440,-3.450,54.270,2.2,S 2900,C,-13.655,-13.730,59.405,-1.5,C 2931,C,-9.910,-2.420,57.610,0.2,C

DanieleGrassini · Accepted Answer · 2022-03-13 14:25:41Z

Job one :

Assuming that the ID is unique inside the file you could use awk as follows :

awk -F, -v OFS=, ' NR == FNR { m[$1] = $0 while (i++ <= NF) empty = OFS empty next } !m[$1]{$0 = $0 OFS empty} m[$1]{$0 = $0 OFS m[$1];delete m[$1]} 1 END{ for ( i in m ) if(m[i]) print empty, m[i] } ' file2 file1

Note that you don't need to sort the file. Any time a common fields is encountered remove it from the array. At the end the array will hold only that are just in file2

Job 2 :

awk -F, 'NR == FNR {m[$1];next} !($1 in m)' file1 file2

Put the previous two inside a shell script with output redirection :

#!/bin/bash # first awk cmd ... > output1.txt # Second awk cmd ... > output2.txt

Thanks, it works! Ad if in input1.txt I would like to replace the element of the last only for rows that have an ID equal or grated than 4000? — Tommaso
– Tommaso, Commented Mar 14, 2022 at 9:32

terdon · Accepted Answer · 2022-03-13 13:30:50Z

You can get the first output file you want, the one with all IDs from both files, by telling join to include all fields:

$ join -t, -a1 -a2 -o 1.1,1.2,1.2,1.4,1.5,1.6,1.7,2.1,2.2,2.2,2.4,2.5,2.6,2.7 \ <(sort input1.txt) <(sort input2.txt) ,,,,,,,2900,C,C,-13.730,59.405,-1.5,C 2931,C,C,-2.550,57.910,-0.3,C,2931,C,C,-2.420,57.610,0.2,C 2932,C,C,-0.200,51.550,0.9,C,,,,,,, 2940,C,C,-3.400,54.000,0.7,C,2940,S,S,-3.450,54.270,2.2,S 2941,S,S,-13.550,55.070,2.1,S,,,,,,, 2944,H,H,-4.180,60.300,0.7,H,,,,,,, ,,,,,,,4304,N,N,-7.680,58.330,-2.3,N

Note that the order is different than what you show because this is the order found in the files (,,,,,,,2900... appears first in sort input2.txt).

Then, you can get the second output file by parsing the first and looking or lines starting with one or more , characters:

$ join -t, -a1 -a2 -o 1.1,1.2,1.2,1.4,1.5,1.6,1.7,2.1,2.2,2.2,2.4,2.5,2.6,2.7 \ <(sort input1.txt) <(sort input2.txt) | grep -oP '^,+\K.*' 2900,C,C,-13.730,59.405,-1.5,C 4304,N,N,-7.680,58.330,-2.3,N

The -o option tells grep to only print the matching part of the line and -P enables Perl Compatible Regular Expressoins. PCREs then give us \K which means "ignore anything matched up to here" and this lets us print only the part after the stretch of ,.

You can combine these into a single command making two files using tee to both write the first output to a file and also write it to stdout which lets you then run grep as shown above:

join -t, -a1 -a2 -o 1.1,1.2,1.2,1.4,1.5,1.6,1.7,2.1,2.2,2.2,2.4,2.5,2.6,2.7 \ <(sort input1.txt) <(sort input2.txt) | tee output1.txt | grep -oP '^,+\K.*' > output2.txt

The final output is:

$ cat output1.txt ,,,,,,,2900,C,C,-13.730,59.405,-1.5,C 2931,C,C,-2.550,57.910,-0.3,C,2931,C,C,-2.420,57.610,0.2,C 2932,C,C,-0.200,51.550,0.9,C,,,,,,, 2940,C,C,-3.400,54.000,0.7,C,2940,S,S,-3.450,54.270,2.2,S 2941,S,S,-13.550,55.070,2.1,S,,,,,,, 2944,H,H,-4.180,60.300,0.7,H,,,,,,, ,,,,,,,4304,N,N,-7.680,58.330,-2.3,N $ cat output2.txt 2900,C,C,-13.730,59.405,-1.5,C 4304,N,N,-7.680,58.330,-2.3,N

Stack Exchange Network

matched and not matched ID first column

2 Answers 2

You must log in to answer this question.

Hot Network Questions

matched and not matched ID first column

2 Answers 2

You must log in to answer this question.

Related

Hot Network Questions