I have a file in which I want to search for the strings "16S" and "23S" and extract the sections containing them into two separate files.

Input file:

    start
    description Human 16S rRNA
    **some text**
    **some text**
    //
    start
    description Mouse 18S rRNA
    some text
    some text
    //
    start
    description Mouse 23S rRNA
    some text
    some text
    //

Expected output:

File1 for 16S:

    start
    description Human 16S rRNA
    some text
    some text
    //

File2 for 23S:

    start
    description Mouse 23S rRNA
    some text
    some text
    //

The code I used:

    #! /usr/bin/perl

    # default output file is /dev/null - i.e. dump any input before
    # the first [ entryN ] line.

    $outfile='FullrRNA.gb';
    open(OUTFILE,">",$outfile) || die "couldn't open $outfile: $!";

    while(<>) {
      # uncomment next two lines to optionally remove comments (startin with
      # '#') and skip blank lines. Also removes leading and trailing
      # whitespace from each line.
      # s/#.*|^\s*|\s*$//g;
      # next if (/^$/)

      # if line begins with 'start', extract the filename
      if (m/^\start/) {
        (undef,$outfile,undef) = split ;
        close(OUTFILE);
        open(OUTFILE,">","$outfile.txt") || die "couldn't open $outfile.txt: $!";
      } else {
        print OUTFILE;
      }
    }
    close(OUTFILE);

2 Answers

1

I'd solve this with awk rather than Perl, sorry.

    /^\/\// && file { file = file ".out"; print section ORS $0 >file; file = "" }
    /^description/ && match($0, p) && file = substr($0,RSTART,RLENGTH) {}
    /^start/ { section = $0; next }
    { section = section ORS $0 }

Running it on your data (you use p='expression' to pick out the sections that you want):

    $ awk -f script.awk p='16S|23S' file.in
    $ ls -l
    total 16
    -rw-r--r--  1 kk  wheel   64 Aug 28 12:10 16S.out
    -rw-r--r--  1 kk  wheel   56 Aug 28 12:10 23S.out
    -rw-r--r--  1 kk  wheel  176 Aug 28 11:51 file.in
    -rw-r--r--  1 kk  wheel  276 Aug 28 12:09 script.awk
    $ cat 16S.out
    start
    description Human 16S rRNA
    **some text**
    **some text**
    //
    $ cat 23S.out
    start
    description Mouse 23S rRNA
    some text
    some text
    //

The first block in the script executes if we find an end-of-section marker (a line starting with //) and if the output filename (file) is non-empty. It appends .out to the current filename and writes the saved section, followed by the current input line, to that file. It then empties the file variable.
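One detail of the `>` redirection used there is worth a quick check: awk truncates the file the first time it opens it, but keeps it open for the rest of the run, so later prints to the same name are added after the earlier ones. If two sections matched the same name, both would therefore end up in the same output file. A minimal sketch, assuming a scratch directory from `mktemp` and a made-up file name `demo.out`:

```shell
# Scratch directory and file name are illustrative, not part of the answer.
tmpdir=$(mktemp -d)
printf 'old contents\n' > "$tmpdir/demo.out"

# Two input lines, both printed with ">" to the same file name.
printf 'first\nsecond\n' | awk -v out="$tmpdir/demo.out" '{ print > out }'

# The old contents are gone, but both new lines are present.
result=$(cat "$tmpdir/demo.out")
echo "$result"
```

The file name is passed in with `-v` here only to keep the sketch independent of the current directory.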

The second block is empty; all the work happens in its pattern. The pattern matches lines starting with description, then matches the line against the regular expression given on the command line (p). If that matches too, the part of the line that matched is picked out and stored in file as the filename.
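To see in isolation how `match()` together with `RSTART` and `RLENGTH` picks out the matching text, here is a small sketch. The sample line is taken from the question's data; the shell variables `line` and `name` are only for illustration, and `p` is set with `-v` instead of as a trailing assignment.

```shell
# One description line from the question's input.
line='description Human 16S rRNA'

# match() sets RSTART (start position) and RLENGTH (length) of the match,
# and substr() cuts exactly that piece out of the line.
name=$(printf '%s\n' "$line" |
    awk -v p='16S|23S' '/^description/ && match($0, p) { print substr($0, RSTART, RLENGTH) }')

echo "$name"
```

For a description line that matches neither alternative, such as the 18S record, `match()` returns 0 and nothing is printed.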

The third block executes if we find a line starting with the word start and it just sets the saved section text to the current line, discarding any old text that was saved therein. It then skips to the beginning of the script and considers the next input line.

The last block is executed for all other lines in the file and it appends the current line to the currently saved section.

0

If you can rely on <LF>//<LF> as a record separator, then with GNU awk, that could be just:

gawk -v 'RS=\n//\n' ' {ORS=RT}; / 16S /{print > "file1"}; / 23S /{print > "file2"}' < file 
