Perl code for splitting a file, if 16s and 23s is present. and copy into a single file

Question

I have a file in which I want to search for the strings "16S" and "23S" and extract each section containing one of them into its own file, giving two separate files.

Input file:

    start
    description Human 16S rRNA
    **some text**
    **some text**
    //
    start
    description Mouse 18S rRNA
    some text
    some text
    //
    start
    description Mouse 23S rRNA
    some text
    some text
    //

Expected output:

File1 for 16S:

    start
    description Human 16S rRNA
    some text
    some text
    //

File2 for 23S:

    start
    description Mouse 23S rRNA
    some text
    some text
    //

My code used:

    #! /usr/bin/perl

    # default output file is /dev/null - i.e. dump any input before
    # the first [ entryN ] line.

    $outfile='FullrRNA.gb';

    open(OUTFILE,">",$outfile) || die "couldn't open $outfile: $!";

    while(<>) {
      # uncomment next two lines to optionally remove comments (startin with
      # '#') and skip blank lines. Also removes leading and trailing
      # whitespace from each line.
      # s/#.*|^\s*|\s*$//g;
      # next if (/^$/)

      # if line begins with 'start', extract the filename
      if (m/^\start/) {
          (undef,$outfile,undef) = split ;
          close(OUTFILE);
          open(OUTFILE,">","$outfile.txt") || die "couldn't open $outfile.txt: $!";
      } else {
          print OUTFILE;
      }
    }
    close(OUTFILE);

Kusalananda · Accepted Answer · 2017-08-28 10:53:03Z

I'd solve this with awk rather than with Perl, sorry.

    /^\/\// && file { file = file ".out"; print section ORS $0 >file; file = "" }

    /^description/ && match($0, p) && file = substr($0,RSTART,RLENGTH) {}

    /^start/ { section = $0; next }
             { section = section ORS $0 }

Running it on your data (you use p='expression' to pick out the sections that you want):

    $ awk -f script.awk p='16S|23S' file.in
    $ ls -l
    total 16
    -rw-r--r--  1 kk  wheel   64 Aug 28 12:10 16S.out
    -rw-r--r--  1 kk  wheel   56 Aug 28 12:10 23S.out
    -rw-r--r--  1 kk  wheel  176 Aug 28 11:51 file.in
    -rw-r--r--  1 kk  wheel  276 Aug 28 12:09 script.awk
    $ cat 16S.out
    start
    description Human 16S rRNA
    **some text**
    **some text**
    //
    $ cat 23S.out
    start
    description Mouse 23S rRNA
    some text
    some text
    //

The first block in the script executes if we find an end-of-section marker (a line starting with //) and if the output filename (file) is non-empty. It appends .out to the current filename and writes the saved section, followed by the current input line, to that file. It then empties the file variable.

The second block is empty, but the pattern will match lines starting with description and will go on to match the line against the regular expression given on the command line (p). If it matches, the part that matches will be picked out and used as the filename.

The third block executes if we find a line starting with the word start and it just sets the saved section text to the current line, discarding any old text that was saved therein. It then skips to the beginning of the script and considers the next input line.

The last block is executed for all other lines in the file and it appends the current line to the currently saved section.
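For readers who would rather not use awk, the four blocks described above can be sketched in Python roughly as follows. This is only an illustrative translation, not part of the original answer: the function name `split_sections` and the in-memory `sample` are made up for the example, and the sketch returns the sections in a dictionary instead of writing the `.out` files directly.

```python
import re


def split_sections(lines, pattern):
    """Collect sections whose description line matches `pattern`.

    Returns a dict mapping the matched text (e.g. "16S") to the full
    section text, from the "start" line through the "//" line.
    """
    regex = re.compile(pattern)
    sections = {}
    current = []   # lines of the section collected so far
    name = None    # output name picked from the description line

    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("start"):
            # New section: discard whatever was saved before.
            current = [line]
            continue
        current.append(line)
        if line.startswith("description"):
            match = regex.search(line)
            if match:
                name = match.group(0)
        elif line.startswith("//") and name:
            # End of a wanted section: store it and reset the name.
            text = "\n".join(current) + "\n"
            sections[name] = sections.get(name, "") + text
            name = None
    return sections


# Hypothetical sample data mirroring the input shown in the question.
sample = """start
description Human 16S rRNA
**some text**
**some text**
//
start
description Mouse 18S rRNA
some text
some text
//
start
description Mouse 23S rRNA
some text
some text
//
"""

result = split_sections(sample.splitlines(), "16S|23S")
print(sorted(result))  # ['16S', '23S']
```

To get files like the awk script produces, each entry could be written out with something like `open(name + ".out", "w").write(text)` while looping over `result.items()`.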

Stéphane Chazelas · Accepted Answer · 2017-08-28 10:33:17Z

If you can rely on <LF>//<LF> as a record separator, then with GNU awk, that could be just:

    gawk -v 'RS=\n//\n' '
      {ORS=RT}; / 16S /{print > "file1"}; / 23S /{print > "file2"}' < file
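The record-separator idea carries over to other languages too. Below is a rough Python sketch of the same approach, under the same assumption that every section ends with a line containing only `//`. The helper names `split_records` and `pick` and the `sample` text are hypothetical and exist only for this illustration; the sketch builds the output strings instead of writing `file1` and `file2`.

```python
def split_records(text, sep="\n//\n"):
    """Split text into records, each keeping its trailing separator.

    Re-attaching the separator plays the role of ORS=RT in the gawk
    one-liner: every record is emitted with the terminator it had.
    """
    parts = text.split(sep)
    records = [part + sep for part in parts[:-1]]
    if parts[-1]:
        # Trailing text with no terminating separator is kept as-is.
        records.append(parts[-1])
    return records


def pick(records, needle):
    """Concatenate all records that contain the literal string `needle`."""
    return "".join(record for record in records if needle in record)


# Hypothetical sample data mirroring the input shown in the question.
sample = (
    "start\ndescription Human 16S rRNA\n**some text**\n**some text**\n//\n"
    "start\ndescription Mouse 18S rRNA\nsome text\nsome text\n//\n"
    "start\ndescription Mouse 23S rRNA\nsome text\nsome text\n//\n"
)

records = split_records(sample)
file1_text = pick(records, " 16S ")  # what would go to file1
file2_text = pick(records, " 23S ")  # what would go to file2
```

As in the gawk version, the search strings include the surrounding spaces, so a token such as `16S` only matches when it stands as a separate word in the section.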
