3
#CHROM POS REF ALT ../S101_sorted.bam ../S102_sorted.bam ../S105_sorted.bam ../S107_sorted.bam ../S113_sorted.bam ../S114_sorted.bam ../S115_sorted.bam ../S
Aradu.A01 296611 T C T T T T T T T T T T T T T T T/C T T/C T T T T
Aradu.A01 326689 T C T/C T T T T/C T T T T/C T/C T T T T T T T T/C T/C T T
Aradu.A01 615910 T G T T T T T T T T T T T T T T T T T T T T T
Aradu.A01 661394 T A T T T T T T/A T T T T T T T T T T T T T T T
Aradu.A01 941674 C T C C/T C C C/T C C C C C C C C C C C C C C C C
Aradu.A01 942064 C T C/T C/T C/T C/T C/T C C C/T C C/T C/T C C C/T C/T C C C C C/T C/T
Aradu.A01 954858 G A G/A G G G G G G G G G G G G G G G G/A G G G G
Aradu.A01 1196780 C A C/A C C C C C C C C C C C/A C C C/A C C C C C C

I have a file in the above format, and I am trying to print the first two columns joined by _ and the rest of the columns as they are. I tried the following awk script, but it does not return the output I expect.

awk '{if (NR>1) print $1"_"$2; for(i=3;i<NF;i++) printf "\t", $i}' input_file > out_file. 

Can anyone please suggest what I am doing wrong here?

0

4 Answers

7

To change the whitespace between the first two columns to an underscore, I suggest sed:

 sed -e 's/[\t ]\+/_/' 

And if you were to need to ignore the header line:

sed -e '/^#/! s/[\t ]\+/_/' 

Or, for the more general case (the header might start with any character, and \t works only with GNU sed):

sed -E '1! s/[[:blank:]]+/_/' 

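As for your awk code, the first print should probably be a printf, so that it does not emit an ill-timed newline. In addition, printf "\t", $i has no %s in its format string, so the field values themselves are never printed, and the loop condition i<NF stops before the last column.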
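
For comparison, the same "replace only the first run of blanks" idea can be sketched in Python. This is an illustration only, not part of the commands above: the function name join_first_two_columns is made up for this example, and it assumes that header lines are the ones starting with #, as in the second sed command.

```python
import re


def join_first_two_columns(line):
    """Replace the first run of spaces/tabs with "_" (hypothetical helper)."""
    # Assumption: header lines start with "#" and are passed through untouched.
    if line.startswith("#"):
        return line
    # count=1 limits the substitution to the first match, like sed's s///
    # without the g flag.
    return re.sub(r"[ \t]+", "_", line, count=1)


print(join_first_two_columns("Aradu.A01   296611 T C T T"))
```

As with the sed commands, the remaining separators are left exactly as they were in the input.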

1
  • 2
    Note that all of \t, \+, -E are non-standard extensions. POSIXly: tab=$(printf '\t') for the tab character. sed "s/[[:blank:]]\{1,\}/_/". Commented Jan 27, 2017 at 15:35
3

Starting from your code, this should give you the desired output:

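awk '
NR == 1 { print }                                 # header: print unchanged
NR > 1 {
    printf("%s_%s", $1, $2)                       # join the first two columns
    for (i = 3; i <= NF; i++) printf("\t%s", $i)  # <= so the last column is kept
    printf("\n")
}' input > output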
1

This appears to work:

awk '{ if(NR>1) { printf "%s_%s", $1, $2; for(i=3;i<=NF;i++) { printf "\t%s", $i } } print "" }' input
1

Here's a small Python 3 script that does the job. The underlying idea is to read each line character by character and use two variables: one that tracks whether the underscores between the first and second columns have been written, and another that tracks whether we are still permitted to replace a space with an underscore.

I've noticed from the OP's input file format that the second column is entirely numeric. Thus, we can start by allowing spaces to be replaced with underscores, but once we've written underscores and encountered a numeric character (both conditions being true), we turn off the write_ok variable, and the remaining spaces are printed as usual.

#!/usr/bin/env python3
import sys
import os

def count_first_spaces(string):
    write_ok = True
    underscores_ok = False
    for char in string:
        if char == " " and write_ok:
            print("_",end="")
            underscores_ok = True
            continue
        if underscores_ok and char.isdigit():
            write_ok = False
        print(char,end="")
    print("") # add newline

def main():
    if not os.path.isfile(sys.argv[1]):
        sys.exit(1)
    with open(sys.argv[1]) as fd:
        for line in fd:
            if line.startswith('#'):
                print(line.strip())
            else:
                count_first_spaces(line.strip())

if __name__ == '__main__':
    main()

And here's the test run:

$ ./add_underscore.py input.txt
#CHROM POS REF ALT ../S101_sorted.bam ../S102_sorted.bam ../S105_sorted.bam ../S107_sorted.bam ../S113_sorted.bam ../S114_sorted.bam ../S115_sorted.bam ../S
Aradu.A01_______296611 T C T T T T T T T T T T T T T T T/C T T/C T T T T
Aradu.A01_______326689 T C T/C T T T T/C T T T T/C T/C T T T T T T T T/C T/C T T
Aradu.A01_______615910 T G T T T T T T T T T T T T T T T T T T T T T
Aradu.A01_______661394 T A T T T T T T/A T T T T T T T T T T T T T T T

If you want that data to be saved to a different file, run it as ./add_underscore.py input.txt > output.txt
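
If a run of spaces should collapse into a single underscore (the test run above writes one underscore per space), a field-based variant is one option. This is a minimal sketch under stated assumptions, not a replacement for the script above: it assumes the columns are separated by arbitrary whitespace, that tab-separated output is acceptable (as in the awk answers), and that header lines start with #. The name merge_first_two_fields is hypothetical.

```python
def merge_first_two_fields(line):
    """Join the first two whitespace-separated fields with "_" (hypothetical helper)."""
    fields = line.split()  # splits on any run of spaces or tabs
    # Assumption: "#" marks a header; headers and short lines are returned as-is.
    if line.startswith("#") or len(fields) < 2:
        return line.rstrip("\n")
    # Rebuild the line: merged key first, remaining columns tab-separated.
    return "\t".join([fields[0] + "_" + fields[1]] + fields[2:])


for sample in ["#CHROM POS REF ALT", "Aradu.A01       296611 T C T/C T"]:
    print(merge_first_two_fields(sample))
```

Splitting into fields means the original spacing between the remaining columns is not preserved; if that spacing matters, the character-by-character approach above keeps it.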
