0

I want to use the first two columns of each file to add additional columns from fil2, file3, and file4 to file1 at the appropriate rows (where the first two columns match). File2 has three columns to add to file1, but all other files have just one column to add, the last one.

The entries NW_456 44 and NW_987 75 are not annotated in file3 and thus missing. I would like to keep this empty in the output file for that particular column (without it actually saying 'empty').

Example:

file1

NW_1234 23 NW_1234 29 NW_1234 778 NW_456 44 NW_987 75 NW_987 98 NW_5000 105 NW_5500 37 NW_5500 900 

file2

NW_1234 23 C 0:0:32:0:0:0 42:0:0:0:0:0 NW_1234 29 C 0:0:28:0:0:0 0:28:0:0:0:0 NW_1234 778 C 0:54:0:0:0:0 0:0:53:0:0:0 NW_456 44 G 0:0:0:45:0:0 59:0:0:0:0:0 NW_987 75 G 0:0:0:60:0:0 55:0:0:0:0:0 NW_987 98 C 0:0:63:0:0:0 0:42:0:0:0:0 NW_5000 105 G 0:0:71:0:0:0 0:50:0:0:0:0 NW_5500 37 G 0:0:0:54:0:0 55:0:0:0:0:0 NW_5500 900 A 43:0:0:0:0:0 0:0:0:37:0:0 

file3

NW_1234 23 DOCK NW_1234 29 DOCK NW_1234 778 DOCK NW_987 98 TFEC NW_5000 105 MIN NW_5500 37 LIPG NW_5500 900 MYC 

FILE4

NW_1234 23 intron_region NW_1234 29 intron_region NW_1234 778 intron_region NW_456 44 intergenic NW_987 75 intergenic NW_987 98 intron_region NW_5000 105 intron_region NW_5500 37 intron_region NW_5500 900 intron_region 

outputfile

NW_1234 23 C 0:0:32:0:0:0 42:0:0:0:0:0 DOCK intron_region NW_1234 29 C 0:0:28:0:0:0 0:28:0:0:0:0 DOCK intron_region NW_1234 778 C 0:54:0:0:0:0 0:0:53:0:0:0 DOCK intron_region NW_456 44 G 0:0:0:45:0:0 59:0:0:0:0:0 (empty) intergenic NW_987 75 G 0:0:0:60:0:0 55:0:0:0:0:0 (empty) intergenic NW_987 98 C 0:0:63:0:0:0 0:42:0:0:0:0 TFEC intron_region NW_5000 105 G 0:0:71:0:0:0 0:50:0:0:0:0 MIN intron_region NW_5500 37 G 0:0:0:54:0:0 55:0:0:0:0:0 LIPG intron_region NW_5500 900 A 43:0:0:0:0:0 0:0:0:37:0:0 MYC intron_region 

Similar to this question: Adding column based on matching of second column

Any help is appreciated!

1
  • What are your files' delimiters? What exactly do you want to happen for the missing entries? should there be an empty (but delimited) field, or is it acceptable for such cases to have fewer columns? Is perl available? Commented Nov 21, 2019 at 0:44

1 Answer 1

1

Using tabs as the output field separators this will work using any awk in any shell on every UNIX box:

$ cat tst.awk BEGIN { OFS="\t"; } FNR==1 { fileNr++ } { key = $1 OFS $2 if (NR == FNR) { keys[++numKeys] = key } else { sub(/([^[:space:]]+[[:space:]]+){2}/,"") $1 = $1 vals[key,fileNr] = $0 } } END { for (keyNr=1; keyNr<=numKeys; keyNr++) { key = keys[keyNr] printf "%s", key for (fileNr=2; fileNr<ARGC; fileNr++) { printf "%s%s", OFS, vals[key,fileNr] } print "" } } 

.

$ awk -f tst.awk file1 file2 file3 file4 NW_1234 23 C 0:0:32:0:0:0 42:0:0:0:0:0 DOCK intron_region NW_1234 29 C 0:0:28:0:0:0 0:28:0:0:0:0 DOCK intron_region NW_1234 778 C 0:54:0:0:0:0 0:0:53:0:0:0 DOCK intron_region NW_456 44 G 0:0:0:45:0:0 59:0:0:0:0:0 intergenic NW_987 75 G 0:0:0:60:0:0 55:0:0:0:0:0 intergenic NW_987 98 C 0:0:63:0:0:0 0:42:0:0:0:0 TFEC intron_region NW_5000 105 G 0:0:71:0:0:0 0:50:0:0:0:0 MIN intron_region NW_5500 37 G 0:0:0:54:0:0 55:0:0:0:0:0 LIPG intron_region NW_5500 900 A 43:0:0:0:0:0 0:0:0:37:0:0 MYC intron_region 

To have the spaces be blanks instead (far less useful for further tool parsing) just pipe to column:

$ awk -f tst.awk file1 file2 file3 file4 | column -s$'\t' -t NW_1234 23 C 0:0:32:0:0:0 42:0:0:0:0:0 DOCK intron_region NW_1234 29 C 0:0:28:0:0:0 0:28:0:0:0:0 DOCK intron_region NW_1234 778 C 0:54:0:0:0:0 0:0:53:0:0:0 DOCK intron_region NW_456 44 G 0:0:0:45:0:0 59:0:0:0:0:0 intergenic NW_987 75 G 0:0:0:60:0:0 55:0:0:0:0:0 intergenic NW_987 98 C 0:0:63:0:0:0 0:42:0:0:0:0 TFEC intron_region NW_5000 105 G 0:0:71:0:0:0 0:50:0:0:0:0 MIN intron_region NW_5500 37 G 0:0:0:54:0:0 55:0:0:0:0:0 LIPG intron_region NW_5500 900 A 43:0:0:0:0:0 0:0:0:37:0:0 MYC intron_region 

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.