adding columns to the appropriate rows by matching the first two columns

Question

I want to use the first two columns of each file to add additional columns from fil2, file3, and file4 to file1 at the appropriate rows (where the first two columns match). File2 has three columns to add to file1, but all other files have just one column to add, the last one.

The entries NW_456 44 and NW_987 75 are not annotated in file3 and thus missing. I would like to keep this empty in the output file for that particular column (without it actually saying 'empty').

Example:

file1

NW_1234 23 NW_1234 29 NW_1234 778 NW_456 44 NW_987 75 NW_987 98 NW_5000 105 NW_5500 37 NW_5500 900

file2

NW_1234 23 C 0:0:32:0:0:0 42:0:0:0:0:0 NW_1234 29 C 0:0:28:0:0:0 0:28:0:0:0:0 NW_1234 778 C 0:54:0:0:0:0 0:0:53:0:0:0 NW_456 44 G 0:0:0:45:0:0 59:0:0:0:0:0 NW_987 75 G 0:0:0:60:0:0 55:0:0:0:0:0 NW_987 98 C 0:0:63:0:0:0 0:42:0:0:0:0 NW_5000 105 G 0:0:71:0:0:0 0:50:0:0:0:0 NW_5500 37 G 0:0:0:54:0:0 55:0:0:0:0:0 NW_5500 900 A 43:0:0:0:0:0 0:0:0:37:0:0

file3

NW_1234 23 DOCK NW_1234 29 DOCK NW_1234 778 DOCK NW_987 98 TFEC NW_5000 105 MIN NW_5500 37 LIPG NW_5500 900 MYC

FILE4

NW_1234 23 intron_region NW_1234 29 intron_region NW_1234 778 intron_region NW_456 44 intergenic NW_987 75 intergenic NW_987 98 intron_region NW_5000 105 intron_region NW_5500 37 intron_region NW_5500 900 intron_region

outputfile

NW_1234 23 C 0:0:32:0:0:0 42:0:0:0:0:0 DOCK intron_region NW_1234 29 C 0:0:28:0:0:0 0:28:0:0:0:0 DOCK intron_region NW_1234 778 C 0:54:0:0:0:0 0:0:53:0:0:0 DOCK intron_region NW_456 44 G 0:0:0:45:0:0 59:0:0:0:0:0 (empty) intergenic NW_987 75 G 0:0:0:60:0:0 55:0:0:0:0:0 (empty) intergenic NW_987 98 C 0:0:63:0:0:0 0:42:0:0:0:0 TFEC intron_region NW_5000 105 G 0:0:71:0:0:0 0:50:0:0:0:0 MIN intron_region NW_5500 37 G 0:0:0:54:0:0 55:0:0:0:0:0 LIPG intron_region NW_5500 900 A 43:0:0:0:0:0 0:0:0:37:0:0 MYC intron_region

Similar to this question: Adding column based on matching of second column

Any help is appreciated!

What are your files' delimiters? What exactly do you want to happen for the missing entries? should there be an empty (but delimited) field, or is it acceptable for such cases to have fewer columns? Is perl available? — steeldriver
– steeldriver, Commented Nov 21, 2019 at 0:44

Ed Morton · Accepted Answer · 2019-11-21 01:36:14Z

Using tabs as the output field separators this will work using any awk in any shell on every UNIX box:

$ cat tst.awk BEGIN { OFS="\t"; } FNR==1 { fileNr++ } { key = $1 OFS $2 if (NR == FNR) { keys[++numKeys] = key } else { sub(/([^[:space:]]+[[:space:]]+){2}/,"") $1 = $1 vals[key,fileNr] = $0 } } END { for (keyNr=1; keyNr<=numKeys; keyNr++) { key = keys[keyNr] printf "%s", key for (fileNr=2; fileNr<ARGC; fileNr++) { printf "%s%s", OFS, vals[key,fileNr] } print "" } }

.

$ awk -f tst.awk file1 file2 file3 file4 NW_1234 23 C 0:0:32:0:0:0 42:0:0:0:0:0 DOCK intron_region NW_1234 29 C 0:0:28:0:0:0 0:28:0:0:0:0 DOCK intron_region NW_1234 778 C 0:54:0:0:0:0 0:0:53:0:0:0 DOCK intron_region NW_456 44 G 0:0:0:45:0:0 59:0:0:0:0:0 intergenic NW_987 75 G 0:0:0:60:0:0 55:0:0:0:0:0 intergenic NW_987 98 C 0:0:63:0:0:0 0:42:0:0:0:0 TFEC intron_region NW_5000 105 G 0:0:71:0:0:0 0:50:0:0:0:0 MIN intron_region NW_5500 37 G 0:0:0:54:0:0 55:0:0:0:0:0 LIPG intron_region NW_5500 900 A 43:0:0:0:0:0 0:0:0:37:0:0 MYC intron_region

To have the spaces be blanks instead (far less useful for further tool parsing) just pipe to column:

$ awk -f tst.awk file1 file2 file3 file4 | column -s$'\t' -t NW_1234 23 C 0:0:32:0:0:0 42:0:0:0:0:0 DOCK intron_region NW_1234 29 C 0:0:28:0:0:0 0:28:0:0:0:0 DOCK intron_region NW_1234 778 C 0:54:0:0:0:0 0:0:53:0:0:0 DOCK intron_region NW_456 44 G 0:0:0:45:0:0 59:0:0:0:0:0 intergenic NW_987 75 G 0:0:0:60:0:0 55:0:0:0:0:0 intergenic NW_987 98 C 0:0:63:0:0:0 0:42:0:0:0:0 TFEC intron_region NW_5000 105 G 0:0:71:0:0:0 0:50:0:0:0:0 MIN intron_region NW_5500 37 G 0:0:0:54:0:0 55:0:0:0:0:0 LIPG intron_region NW_5500 900 A 43:0:0:0:0:0 0:0:0:37:0:0 MYC intron_region

Stack Exchange Network

adding columns to the appropriate rows by matching the first two columns

1 Answer 1

You must log in to answer this question.

Linked

Hot Network Questions

adding columns to the appropriate rows by matching the first two columns

1 Answer 1

You must log in to answer this question.

Linked

Related

Hot Network Questions