I am trying to print each set of two lines (a header and its sequence) that does not have a corresponding pair. I ultimately want to remove these lines.

Example:

NM00123_rn5_0_1_2
XXXXXXXXXXXXXXXXXXXXXXXXXXX
NM00123_mm10_0_1_2
XXXXXXXXXXXXXXXXXXXXXXXXXXXX
NM00124_rn5_0_1_3
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
NM00124_mm10_0_1_3
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
NM00125_rn5_0_1_4
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
NM00126_rn5_0_1_5
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRr
NM00126_mm10_0_1_5
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR

The lines starting with NM are headers, and each is followed by a line containing a sequence of letters. The two headers of a pair match in every position except for rn5 versus mm10. I want to retain only sets of four lines where the NM header digits before and after rn5 and mm10 match. In the example above, the rn5 header on line 1 matches the mm10 header on line 3, so those four lines should be kept. The rn5 header on line 9 has no corresponding mm10 entry, so both that header and the sequence line after it should be printed. In the end I want a file with an equal number of rn5 and mm10 entries.

I am very new to using Unix and would really appreciate help to do this. Thank you.

Expected outcome:

All of the above entries except the one without a corresponding pair. In this case, the entry to remove is:

NM00125_rn5_0_1_4
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
  • Please add expected output... and read the editing help. Commented Jan 13, 2017 at 21:48

2 Answers

Here is a somewhat involved version for awk. Some differences from the sed version from steeldriver:

  1. It makes no assumptions about the ordering of the mm10 or rn5 records
  2. It can deal with a missing rn5 record
  3. It will output the unmatched records to stderr.
  4. It is a lot more code :-)

It can be run with:

awk -f my_program.awk infile 

The code:

# find and store a header
/^NM.*/ { header = $0; next }

# we found an rn5 line
header ~ /_rn5/ {
    # get the mm10 header that matches this rn5
    mm_match = header
    sub("_rn5", "_mm10", mm_match)

    # if we have a previous mm10, then print the pair
    if (mm_match in headers) {
        print header
        print
        print mm_match
        print headers[mm_match]
        delete headers[mm_match]
    } else {
        headers[header] = $0
    }
    next
}

# we found an mm10 line
header ~ /_mm10/ {
    # get the rn5 header that matches this mm10
    mm_match = header
    sub("_mm10", "_rn5", mm_match)

    # if we have a previous rn5, then print the pair
    if (mm_match in headers) {
        print mm_match
        print headers[mm_match]
        print header
        print
        delete headers[mm_match]
    } else {
        headers[header] = $0
    }
    next
}

Additionally this code can be added to the end of the file to output any unmatched lines to standard error:

# The END block is here just to output anything that was unmatched
END {
    # dump the unmatched to stderr
    for (header in headers) {
        print header > "/dev/stderr"
        print headers[header] > "/dev/stderr"
    }
}

With the END block added, it can be run with:

awk -f my_program.awk infile > outfile 2> unmatched 

This writes the requested output (standard output) to outfile and the leftover, unmatched input (standard error) to unmatched. For the details of I/O redirection in all its variety, see the chapter on Redirections in the Bash reference manual.
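As a quick check of the two redirections, here is a rough sketch that rebuilds the sample from the question in a scratch directory and runs a condensed paraphrase of the same store-and-look-up logic. The names work, sample.txt, paired.txt, unmatched.txt, partner and waiting are made up for illustration and are not part of the program above. One assumed simplification: the condensed version prints whichever record of a pair arrived first, rather than always printing rn5 first.

```shell
#!/bin/sh
# Scratch directory and file names are illustrative assumptions.
work=$(mktemp -d)

# Sample input from the question: a header line, then its sequence line.
cat > "$work/sample.txt" <<'EOF'
NM00123_rn5_0_1_2
XXXXXXXXXXXXXXXXXXXXXXXXXXX
NM00123_mm10_0_1_2
XXXXXXXXXXXXXXXXXXXXXXXXXXXX
NM00124_rn5_0_1_3
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
NM00124_mm10_0_1_3
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
NM00125_rn5_0_1_4
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
NM00126_rn5_0_1_5
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRr
NM00126_mm10_0_1_5
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
EOF

awk '
/^NM/ { header = $0; next }            # remember the header, wait for its sequence
{
    partner = header                   # header the other species would carry
    if (header ~ /_rn5/) sub("_rn5", "_mm10", partner)
    else                 sub("_mm10", "_rn5", partner)

    if (partner in waiting) {          # partner was seen earlier: emit both records
        print partner; print waiting[partner]
        print header;  print $0
        delete waiting[partner]
    } else {
        waiting[header] = $0           # park this record until its partner shows up
    }
}
END {
    for (h in waiting) {               # still parked at the end: no partner was found
        print h          > "/dev/stderr"
        print waiting[h] > "/dev/stderr"
    }
}
' "$work/sample.txt" > "$work/paired.txt" 2> "$work/unmatched.txt"

# Equal counts mean every kept rn5 record has its mm10 partner.
echo "rn5 kept:  $(grep -c '_rn5_'  "$work/paired.txt")"
echo "mm10 kept: $(grep -c '_mm10_' "$work/paired.txt")"
echo "unmatched:"
cat "$work/unmatched.txt"
```

On the sample, paired.txt should hold three rn5 and three mm10 records, and unmatched.txt should hold only the NM00125_rn5_0_1_4 header and its sequence. If outfile comes back empty on real data, comparing these counts on a small slice of the input is a reasonable first check.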

  • Hi, thank you for the response. However, there are a few things I don't understand. The output file is the "unmatched" file, right? And that is supposed to have the entries that do not have a matching pair ? Currently, Unmatched contains all of the entries from infile but the order is messed up. Outfile 2 exactly does what? The outfile generated is empty. Sorry, if my questions are too basic. Again, thanks for the help. Commented Jan 16, 2017 at 16:03
  • I restructured the answer to put the complication of the unmatched case as optional. See if that helps with your understanding. If not, let me know. Commented Jan 16, 2017 at 16:36

I think what you're asking for is something that

  • maintains a 4-line buffer; and
  • prints the buffer and starts over if the text following rn5 (up to the next newline) matches the text following mm10 (up to the next-but-two newline)

It's probably an ugly way to do it, but to illustrate with GNU sed:

$ sed -n -e :a \
    -e '$!N; /rn5_\(.*\)\n.*\n.*mm10_\1\n/ {p;b}' \
    -e '/.*\n.*\n.*\n/ D' \
    -e ba infile > outfile
$ diff outfile infile
8a9,10
> NM00125_rn5_0_1_4
> zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
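For anyone without GNU sed, roughly the same sliding comparison can be sketched in awk. This is only an illustration under the same assumption as the sed version, namely that the mm10 record of a pair immediately follows its rn5 record. The function name keep_adjacent_pairs and the variables prev_header, prev_seq and want are hypothetical, and unlike the sed pattern it compares the whole header rather than only the part after rn5_ and mm10_.

```shell
#!/bin/sh
# keep_adjacent_pairs is a hypothetical helper name, not an existing tool.
# It reads header/sequence records from the named files (or from stdin)
# and prints only rn5 records that are immediately followed by their mm10 partner.
keep_adjacent_pairs() {
    awk '
    /^NM/ { header = $0; next }        # remember the header, wait for its sequence
    {
        if (prev_header ~ /_rn5_/) {   # a held rn5 record is waiting for its partner
            want = prev_header
            sub(/_rn5_/, "_mm10_", want)
            if (header == want) {      # current record is the partner: print all four lines
                print prev_header; print prev_seq
                print header;      print $0
                prev_header = ""       # start over with an empty buffer
                next
            }
        }
        # No match: the held record (if any) is dropped, the current one is held instead.
        prev_header = header
        prev_seq = $0
    }
    ' "$@"
}

# Small made-up input: NM2 has no mm10 partner and should disappear.
printf '%s\n' NM1_rn5_0 AAA NM1_mm10_0 AAA \
              NM2_rn5_0 CCC \
              NM3_rn5_0 GGG NM3_mm10_0 GGG | keep_adjacent_pairs
```

It can also be called on a file, as in keep_adjacent_pairs infile > outfile. Because it holds only one record at a time, it will not pair records that are out of order.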
