1

I have a file (test.bed) that looks like this (it might not be tab-separated):

chr1 10002 10116 id=1;frame=0;strand=+; 0 +
chr1 10116 10122 id=2;frame=0;strand=+; 0 +
chr1 10122 10128 id=3;frame=0;strand=+; 0 +
chr1 10128 10134 id=4;frame=0;strand=+; 0 +
chr1 10134 10140 id=5;frame=0;strand=+; 0 +
chr1 10140 10146 id=6;frame=0;strand=+; 0 +
chr1 10146 10182 id=7;frame=0;strand=+; 0 +
chr1 10182 10188 id=8;frame=0;strand=+; 0 +
chr1 10188 10194 id=9;frame=0;strand=+; 0 +
chr1 10194 10200 id=10;frame=0;strand=+; 0 +

I want to produce the following output (which should be tab-separated):

chr1 10002 10116 id=1 0 +
chr1 10116 10122 id=2 0 +
chr1 10122 10128 id=3 0 +
chr1 10128 10134 id=4 0 +
chr1 10134 10140 id=5 0 +
chr1 10140 10146 id=6 0 +
chr1 10146 10182 id=7 0 +
chr1 10182 10188 id=8 0 +
chr1 10188 10194 id=9 0 +
chr1 10194 10200 id=10 0 +

I have tried with the following code:

awk 'OFS="\t" split ($0, a, ";"){print a[1],$5,$6}' test.bed 

But then I get:

chr1 10002 10116 id=1 40 4+
chr1 10116 10122 id=2 40 4+
chr1 10122 10128 id=3 40 4+
chr1 10128 10134 id=4 40 4+
chr1 10134 10140 id=5 40 4+
chr1 10140 10146 id=6 40 4+
chr1 10146 10182 id=7 40 4+
chr1 10182 10188 id=8 40 4+
chr1 10188 10194 id=9 40 4+
chr1 10194 10200 id=10 40 4+

What am I doing wrong? Somehow the number '4' is added to the last two fields. I thought the number '4' somehow might have something to do with splitting in the 4th field, however, I tried producing a similar file where it was the 3rd field that was split, and still got the number '4' added to the last two fields. I am rather new to 'awk' so I guess it is an error in the syntax. Any help would be appreciated.

1
  • 1
    try sed 's/;frame=0;strand=+;//' Commented May 14, 2013 at 9:21

2 Answers

1

If you set the field separator to whitespace or a semicolon, you won't have to handle the splitting yourself:

$ awk '{print $1,$2,$3,$4,$8,$9}' FS='[[:space:]]+|;' OFS='\t' file
chr1 10002 10116 id=1 0 +
chr1 10116 10122 id=2 0 +
chr1 10122 10128 id=3 0 +
chr1 10128 10134 id=4 0 +
chr1 10134 10140 id=5 0 +
chr1 10140 10146 id=6 0 +
chr1 10146 10182 id=7 0 +
chr1 10182 10188 id=8 0 +
chr1 10188 10194 id=9 0 +
chr1 10194 10200 id=10 0 +

As for what you are doing wrong in:

awk 'OFS="\t" split ($0, a, ";"){print a[1],$5,$6}' 
  • An awk program is made of condition{block} rules. Setting the value of OFS and calling split are not conditions; they are statements that belong inside the block.
  • There is also no need to set the value of OFS on every line, so initialize it only once. You can do this with the -v option, in a BEGIN block, or with an assignment after the script.
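As for where the 4 comes from, here is a minimal sketch on one record from the question. In the condition position, awk parses the expression as OFS = "\t" concatenated with the return value of split, and split returns the number of pieces it produced. The variable names line, n and ofs are only for illustration.

```shell
# Sample record copied from test.bed in the question
line='chr1 10002 10116 id=1;frame=0;strand=+; 0 +'

# split() returns the number of pieces it produced (4 for this record)
n=$(printf '%s\n' "$line" | awk '{ print split($0, a, ";") }')

# In condition position the expression parses as OFS = ("\t" split(...)),
# so OFS ends up as a tab followed by "4"
ofs=$(printf '%s\n' "$line" | awk 'OFS="\t" split($0, a, ";") { print OFS }')

printf 'pieces: %s\n' "$n"
printf 'OFS: [%s]\n' "$ofs"
```

Because that OFS is printed between a[1], $5 and $6, the 4 appears in front of the last two fields regardless of which field contains the semicolons.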

Valid alternatives:

$ awk -v OFS='\t' '{split($0,a,";");print a[1],$5,$6}' file
$ awk 'BEGIN{OFS="\t"}{split($0,a,";");print a[1],$5,$6}' file
$ awk '{split ($0,a,";");print a[1],$5,$6}' OFS='\t' file
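Note that these split-based variants print a[1] as a single string, so the first four columns keep whatever separator the input had. If every column needs to be tab-separated, one possible approach (a sketch, assuming the default whitespace field splitting and that the unwanted part always starts at the first semicolon of field 4) is to edit the field in place and let awk rebuild the record:

```shell
# Sample record copied from test.bed; the input separator is a single space here
line='chr1 10002 10116 id=1;frame=0;strand=+; 0 +'

# Strip everything from the first ';' onward in field 4. Assigning to a field
# makes awk rebuild $0 with OFS, so all six columns come out tab-separated.
out=$(printf '%s\n' "$line" | awk -v OFS='\t' '{ sub(/;.*/, "", $4); print }')

printf '%s\n' "$out"
```

On the real file the same program would be run as awk -v OFS='\t' '{ sub(/;.*/, "", $4); print }' test.bed.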

3 Comments

Thank you, that does the job. Any idea what happens in my code to produce the number 4?
It's the return value from split. Your awk program was not in the proper form: all your actions should be inside {..}. I just changed your awk like this: awk 'OFS="\t" {split ($0, a, ";");print a[1],$5,$6}'. Notice that the { moved before split, and it worked properly.
Thank you for the explanation, that was very helpful. However, I guess this is not quite the way to do it after all, as this only tab-separates the last fields.
1

Try this:

awk -F\; '{print $1,$4}' test.bed 

2 Comments

This won't allow the output to be separated as required.
And this works as well, but I guess I will have to specify the output separator if the input isn't tab-separated.
