1

File:

chr1_156186369 chr1_156186369_A_C,T A C,T 33150.29 1/2:0,4,6:10:88:272
chr19_27732257 chr19_27732257_G_C G C 262.29 1/2:1,10,7:18:99:414,167
chrM_2619 chrM_2619_A_G,T A G,T 33023.29 1/2:0,5,5:10:99:293,144,129
chr9_119375271 chr9_119375271_T_A,G T A,G 248.29 1/2:1,11,5:17:99:359,107,113

I need to split the comma-separated values in columns 2 and 4 only, and print the entire row once for each value, including the values that come after the comma.

Expected output is:

chr1_156186369 chr1_156186369_A_C A C 33150.29 1/2:0,4,6:10:88:272
chr1_156186369 chr1_156186369_A_T A T 33150.29 1/2:0,4,6:10:88:272
chr19_27732257 chr19_27732257_G_C G C 262.29 1/2:1,10,7:18:99:414,167
chrM_2619 chrM_2619_A_G A G 33023.29 1/2:0,5,5:10:99:293,144,129
chrM_2619 chrM_2619_A_T A T 33023.29 1/2:0,5,5:10:99:293,144,129
chr9_119375271 chr9_119375271_T_A T A 248.29 1/2:1,11,5:17:99:359,107,113
chr9_119375271 chr9_119375271_T_G T G 248.29 1/2:1,11,5:17:99:359,107,113

I tried awk but did not get any result. I also read a similar question here: How to extract line from the file on specific condition

1
  • Would be nice to see what you tried. Commented Nov 7, 2016 at 10:21

4 Answers

2

Using awk:

awk '{ split($2, w2, ","); n = split($4, w4, ","); for (i = 1; i <= n; i++) print $1, substr(w2[1], 1, length(w2[1]) - length(w4[1])) w4[i], $3, w4[i], $5, $6 }' file

Note that there is no error handling for the case where the values after the comma differ between columns 2 and 4.
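The missing check mentioned in the note could be added along the following lines. This is only a sketch, not part of the original answer: the function name split_checked is made up for illustration, and it assumes that column 2 always ends with the same comma-separated list that appears in column 4. Rows that violate this are reported on stderr and printed unchanged.

```shell
#!/bin/sh
# split_checked is a hypothetical helper name used only for this sketch.
split_checked() {
  awk '{
    n2 = split($2, w2, ",")    # pieces of column 2; w2[1] carries the ID prefix
    n4 = split($4, w4, ",")    # alternative values listed in column 4
    # Assumption: column 2 ends with the first value of column 4.
    prefix = substr(w2[1], 1, length(w2[1]) - length(w4[1]))
    ok = (n2 == n4 && (prefix w4[1]) == w2[1])
    for (i = 2; ok && i <= n4; i++)
      if (w2[i] != w4[i]) ok = 0
    if (!ok) {
      # Mismatch: warn on stderr and pass the row through untouched.
      print "line " NR ": columns 2 and 4 disagree, row left unchanged" > "/dev/stderr"
      print
      next
    }
    for (i = 1; i <= n4; i++)
      print $1, prefix w4[i], $3, w4[i], $5, $6
  }' "$@"
}
```

It can be called as split_checked file, or used at the end of a pipeline, since awk reads standard input when no file name is given.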

1

With sed, assuming the single-character comma-separated values such as C,T occur twice on the line (once in each column):

$ sed -E 's/^(.*)([A-Z]),([A-Z])(.*)\2,\3(.*)/\1\2\4\2\5\n\1\3\4\3\5/' ip.txt
chr1_156186369 chr1_156186369_A_C A C 33150.29 1/2:0,4,6:10:88:272
chr1_156186369 chr1_156186369_A_T A T 33150.29 1/2:0,4,6:10:88:272
chr19_27732257 chr19_27732257_G_C G C 262.29 1/2:1,10,7:18:99:414,167
chrM_2619 chrM_2619_A_G A G 33023.29 1/2:0,5,5:10:99:293,144,129
chrM_2619 chrM_2619_A_T A T 33023.29 1/2:0,5,5:10:99:293,144,129
chr9_119375271 chr9_119375271_T_A T A 248.29 1/2:1,11,5:17:99:359,107,113
chr9_119375271 chr9_119375271_T_G T G 248.29 1/2:1,11,5:17:99:359,107,113
  • ^(.*) starting text
  • ([A-Z]),([A-Z]) comma separated single characters
  • (.*) text in between the repetition
  • \2,\3 match the comma separated single characters again
  • (.*) rest of line
  • \1\2\4\2\5\n\1\3\4\3\5 required output format
  • Note that spacing doesn't exactly match with expected output
1
  • your one line code is awesome, thanks for your help Commented Nov 8, 2016 at 8:08
0

I don't know how to do it with a single command, but it works with this loop in bash:

while IFS= read -r line
do
    if echo "${line}" | grep -q '[[:alpha:]],[[:alpha:]]'
    then
        # first comma-separated pair of letters on the line, e.g. C,T
        letters=$(echo "${line}" | grep -o '[[:alpha:]],[[:alpha:]]' | head -n 1)
        for letter in $(echo "${letters}" | sed 's/,/ /g')
        do
            # print one copy of the line per letter, replacing every C,T with it
            echo "${line}" | sed 's/'"${letters}"'/'"${letter}"' /g'
        done
    else
        echo "${line}"
    fi
done < data.dat
0

Split the 4th field on the comma and print one row per slice, using the slice in column 4 and also replacing the trailing X,Y of column 2 with it, if there is one:

awk '{
    n=split($4,slices,",")
    for(i=1;i<=n;i++) {
        res=$2
        sub(/.,.*/,slices[i],res)
        print $1, res, $3, slices[i], $5, $6
    }
}' file

I don't much like how I print the fields, since I list them explicitly from the 1st to the 6th, so hopefully the number of columns is fixed.

$ awk '{n=split($4,slices,","); for(i=1;i<=n;i++) {res=$2; sub(/.,.*/,slices[i],res); print $1, res, $3, slices[i], $5, $6}}' a
chr1_156186369 chr1_156186369_A_C A C 33150.29 1/2:0,4,6:10:88:272
chr1_156186369 chr1_156186369_A_T A T 33150.29 1/2:0,4,6:10:88:272
chr19_27732257 chr19_27732257_G_C G C 262.29 1/2:1,10,7:18:99:414,167
chrM_2619 chrM_2619_A_G A G 33023.29 1/2:0,5,5:10:99:293,144,129
chrM_2619 chrM_2619_A_T A T 33023.29 1/2:0,5,5:10:99:293,144,129
chr9_119375271 chr9_119375271_T_A T A 248.29 1/2:1,11,5:17:99:359,107,113
chr9_119375271 chr9_119375271_T_G T G 248.29 1/2:1,11,5:17:99:359,107,113
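If the number of columns is not fixed, one possible variation is to overwrite fields 2 and 4 in place and print the whole record, instead of listing fields 1 to 6. The sketch below is an assumption about how that could look, not part of the answer above; the function name split_any_width is hypothetical. Note that assigning to a field makes awk rebuild the line with single spaces between fields, so any original alignment is lost.

```shell
#!/bin/sh
# split_any_width is a hypothetical helper name used only for this sketch.
split_any_width() {
  awk '{
    n = split($4, slices, ",")   # alternative values from column 4
    id = $2                      # keep the original column 2 for every pass
    for (i = 1; i <= n; i++) {
      res = id
      sub(/.,.*/, slices[i], res)   # replace the trailing X,Y with this slice
      $2 = res
      $4 = slices[i]
      print                         # whole record, however many columns it has
    }
  }' "$@"
}
```

Called as split_any_width file, this should give the same rows as the command above for six-column input, while also carrying along any extra trailing columns.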
