Print sets of lines that do not have a corresponding pair

Question

I am trying to print each set of two lines (a header and its sequence) that does not have a corresponding pair. Ultimately, I want to remove these lines.

Example:

NM00123_rn5_0_1_2
XXXXXXXXXXXXXXXXXXXXXXXXXXX
NM00123_mm10_0_1_2
XXXXXXXXXXXXXXXXXXXXXXXXXXXX
NM00124_rn5_0_1_3
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
NM00124_mm10_0_1_3
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
NM00125_rn5_0_1_4
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
NM00126_rn5_0_1_5
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRr
NM00126_mm10_0_1_5
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR

The lines starting with NM are headers, and each following line is a sequence of letters. The header lines of a pair match in all positions except for rn5 and mm10. I want to retain only sets of four lines where the NM header digits before and after rn5 and mm10 match for a pair. So, from the above example: the rn5 header on line 1 matches the mm10 header on line 3, so keep that set. But the rn5 header on line 9 does not have a corresponding pair, so print both that header and the next line with the sequence. In the end, I want a file with an equal number of rn5 and mm10 entries.

I am very new to using Unix and would really appreciate help to do this. Thank you.

Expected outcome:

All of the above entries, minus the entry without a corresponding pair. In this case, that entry is:

NM00125_rn5_0_1_4
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz

Please add expected output... and read the editing help

– don_crissti, Commented Jan 13, 2017 at 21:48

Stephen Rauch · Accepted Answer · 2017-01-16 16:34:32Z

Here is a somewhat involved version for awk. Some differences from the sed version from steeldriver:

It makes no assumptions about the ordering of the mm10 or rn5 records
It can deal with a missing rn5 record
It will output the unmatched records to stderr.
It is a lot more code :-)

It can be run with:

awk -f my_program.awk infile

The code:

# find and store a header
/^NM.*/ { header = $0; next }

# we found an rn5 line
header ~ /_rn5/ {
    # get the mm10 header that matches this rn5
    mm_match = header
    sub("_rn5", "_mm10", mm_match)

    # if we have a previous mm10, then print the pair
    if (mm_match in headers) {
        print header
        print
        print mm_match
        print headers[mm_match]
        delete headers[mm_match]
    } else {
        # no partner seen yet: remember this sequence under its header
        headers[header] = $0
    }
    next
}

# we found an mm10 line
header ~ /_mm10/ {
    # get the rn5 header that matches this mm10
    mm_match = header
    sub("_mm10", "_rn5", mm_match)

    # if we have a previous rn5, then print the pair
    if (mm_match in headers) {
        print mm_match
        print headers[mm_match]
        print header
        print
        delete headers[mm_match]
    } else {
        # no partner seen yet: remember this sequence under its header
        headers[header] = $0
    }
    next
}
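The pairing step in the program above depends on turning one header into the header its partner would have. As a minimal sketch of just that step, assuming headers always contain either `_rn5` or `_mm10` (the `partner_header` function name is hypothetical and not part of the answer's program):

```shell
# partner_header is a hypothetical helper for illustration only.
# It swaps the _rn5 / _mm10 tag in a header to get the partner's header.
partner_header() {
    printf '%s\n' "$1" | awk '
        /_rn5/  { sub("_rn5", "_mm10"); print; next }
        /_mm10/ { sub("_mm10", "_rn5"); print; next }
    '
}

partner_header NM00123_rn5_0_1_2    # prints NM00123_mm10_0_1_2
partner_header NM00126_mm10_0_1_5   # prints NM00126_rn5_0_1_5
```

Because the lookup key is the full swapped header, two records pair up only when everything apart from the rn5/mm10 tag is identical.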

Additionally, the following code can be added to the end of the program file to write any unmatched records to standard error:

# The END block is here just to output anything that was unmatched
END {
    # dump the unmatched to stderr
    for (header in headers) {
        print header > "/dev/stderr"
        print headers[header] > "/dev/stderr"
    }
}

With the END block added, it can be run with:

awk -f my_program.awk infile > outfile 2> unmatched

This writes the requested output (standard output) to outfile and the leftover input (standard error) to unmatched. For the details of I/O redirection in all its variety, see the chapter on Redirections in the Bash reference manual.
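To make the two redirections concrete, here is a small sketch that does not use the awk program at all. The `demo_streams` function is a hypothetical stand-in that writes one line to each stream, and the temporary directory is only there so the demo does not overwrite real files:

```shell
# demo_streams is a hypothetical stand-in for the awk program.
demo_streams() {
    echo "matched pair"          # goes to standard output (file descriptor 1)
    echo "unmatched record" >&2  # goes to standard error (file descriptor 2)
}

workdir=$(mktemp -d)

# "> file" redirects standard output; "2> file" redirects standard error.
# The 2 is a file descriptor number, not part of the output file's name.
demo_streams > "$workdir/outfile" 2> "$workdir/unmatched"

cat "$workdir/outfile"     # prints: matched pair
cat "$workdir/unmatched"   # prints: unmatched record
```

So the command produces two separate files: outfile for the paired records and unmatched for everything that had no partner.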

Hi, thank you for the response. However, there are a few things I don't understand. The output file is the "unmatched" file, right? And that is supposed to have the entries that do not have a matching pair? Currently, unmatched contains all of the entries from infile, but the order is messed up. What exactly does "outfile 2" do? The outfile generated is empty. Sorry if my questions are too basic. Again, thanks for the help.
– user210349, Commented Jan 16, 2017 at 16:03
I restructured the answer to make the complication of the unmatched case optional. See if that helps with your understanding. If not, let me know.
– Stephen Rauch, Commented Jan 16, 2017 at 16:36

steeldriver · Accepted Answer · 2017-01-14 01:09:55Z

I think what you're asking for is something that

maintains a 4-line buffer; and
if the thing following rn5 (up to the next newline) matches the thing following mm10 (up to the next-but-two newline), prints it and starts over

It's probably an ugly way to do it, but to illustrate with GNU sed:

$ sed -n -e :a \
    -e '$!N; /rn5_\(.*\)\n.*\n.*mm10_\1\n/ {p;b}' \
    -e '/.*\n.*\n.*\n/ D' \
    -e ba infile > outfile
$ diff outfile infile
8a9,10
> NM00125_rn5_0_1_4
> zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
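Since the goal in the question is a file with equal numbers of rn5 and mm10 entries, a quick count of each kind of header is a cheap sanity check on the result. The sketch below is an assumption-laden illustration: `count_tag` is a hypothetical helper, and the sample file is a shortened stand-in for a real outfile. Equal counts are necessary but do not by themselves prove that every record is correctly paired.

```shell
# count_tag is a hypothetical helper: count header lines carrying a given tag.
count_tag() {
    grep -c "_${1}_" "$2"
}

# Small stand-in for a filtered outfile (sequences shortened for brevity).
sample=$(mktemp)
printf '%s\n' \
    NM00123_rn5_0_1_2 XXXX \
    NM00123_mm10_0_1_2 XXXX \
    NM00126_rn5_0_1_5 RRRR \
    NM00126_mm10_0_1_5 RRRR > "$sample"

rn5_count=$(count_tag rn5 "$sample")
mm10_count=$(count_tag mm10 "$sample")

if [ "$rn5_count" -eq "$mm10_count" ]; then
    echo "balanced: $rn5_count rn5, $mm10_count mm10"
else
    echo "unbalanced: $rn5_count rn5, $mm10_count mm10"
fi
```

To check a real result, pass the path of the actual outfile to `count_tag` in place of the sample file.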
