Split a text file using awk

Question

The sample text file looks like this:

ID Z4WTH3_9ACTN Unreviewed; 182 AA.
AC Z4WTH3; A0SD0SDF;
AC Z12SDFG3; ADFFGDF;
DT 11-JUN-2014, integrated into UniProtKB/TrEMBL.
SQ SEQUENCE 182 AA; 20675 MW; B85D18AC3B1F0E75 CRC64;
MNFLEYNKDE KLHFNYKKSC GLWLIVVALI IFAATVIGGK QIINMSVFSF GYVAAFLSIN
//
ID Z4WXU8_9ACTN Unreviewed; 203 AA.
AC Z4WXU8;
AC QWERDFV1;
DT 11-JUN-2014, integrated into UniProtKB/TrEMBL.
SQ SEQUENCE 203 AA; 23224 MW; 35F1AE4342F6B3AC CRC64;
MDCKSIRSEV LWQVVRLREK LMNFLEYNKD EKLCFNYKKS CGLWLIVVAL IIFAATVIGG
//
ID Z9JHX1_9GAMM Unreviewed; 132 AA.
AC Z9JHX1;
SQ SEQUENCE 132 AA; 13880 MW; 0E09988C0F3ED155 CRC64;
MKISVDTNVL ARAVLQDDAN QGRSASTLLK DASLIAVSLP CLCELVWILS RGAKLSKEDV
//

The actual file is about 100 GB. Each record contains exactly one "ID" line, always starts with that "ID" line, and ends with "//".

A record may contain multiple "AC" lines. The first element of the first "AC" line should be used as the filename.

I need to split this file into multiple files at each "//". Each file should be named after the text in the line beginning with AC.

So the output files will look like

Z4WTH3.txt

ID Z4WTH3_9ACTN Unreviewed; 182 AA.
AC Z4WTH3; A0SD0SDF;
AC Z12SDFG3; ADFFGDF;
DT 11-JUN-2014, integrated into UniProtKB/TrEMBL.
SQ SEQUENCE 182 AA; 20675 MW; B85D18AC3B1F0E75 CRC64;
MNFLEYNKDE KLHFNYKKSC GLWLIVVALI IFAATVIGGK QIINMSVFSF GYVAAFLSIN
//

Z4WXU8.txt

ID Z4WXU8_9ACTN Unreviewed; 203 AA.
AC Z4WXU8;
AC QWERDFV1;
DT 11-JUN-2014, integrated into UniProtKB/TrEMBL.
SQ SEQUENCE 203 AA; 23224 MW; 35F1AE4342F6B3AC CRC64;
MDCKSIRSEV LWQVVRLREK LMNFLEYNKD EKLCFNYKKS CGLWLIVVAL IIFAATVIGG
//

Z9JHX1.txt

ID Z9JHX1_9GAMM Unreviewed; 132 AA.
AC Z9JHX1;
SQ SEQUENCE 132 AA; 13880 MW; 0E09988C0F3ED155 CRC64;
MKISVDTNVL ARAVLQDDAN QGRSASTLLK DASLIAVSLP CLCELVWILS RGAKLSKEDV
//

Please add the code you tried... this Q&A is close to what you need: stackoverflow.com/questions/48984857/…
– Sundeep, Commented Feb 27, 2018 at 6:22

RavinderSingh13 · Accepted Answer · 2018-02-27 09:14:54Z

The following awk may help with this.

awk '/^ID/{close(filename);val=$2;sub(/_.*/,"",val);filename=val".txt"} {print > filename}' Input_file

Second solution: per the OP, the filename should come from the AC line, so I am adding the following solution as well.

awk '/^ID/{close(filename);first=$0 ORS;next} /^AC/{val=$2;sub(";","",val);filename=val".txt";print first $0 > filename;next} {print > filename}' Input_file

Or, in case Input_file does NOT have ID lines in all sections, we could move the close() call into the AC block as follows:

awk '/^ID/{first=$0 ORS;next} /^AC/{close(filename);val=$2;sub(";","",val);filename=val".txt";print first $0 > filename;next} {print > filename}' Input_file

Explanation of the solution:

awk '
/^ID/{                        ##If a line starts with the string ID, do the following:
  first=$0 ORS;               ##Create a variable named first whose value is the current line followed by ORS (output record separator).
  next}                       ##next is an awk keyword which will skip all further statements.
/^AC/{                        ##If a line starts with the string AC, do the following:
  close(filename);            ##Close the file that was previously written, so that we will NOT get a too-many-open-files error.
  val=$2;                     ##Create a variable named val whose value is the 2nd field of the current line.
  sub(";","",val);            ##Use the sub function of awk to substitute the semicolon with an empty string in variable val.
  filename=val".txt";         ##Create a variable named filename whose value is variable val plus .txt (the output file name).
  print first $0 > filename;  ##Print variable first and the current line to the output file.
  next                        ##next will skip all further statements now.
}
{
  print > filename            ##Print the current line to the output file when it does NOT satisfy the above 2 conditions.
}
' Input_file                  ##Mentioning the Input_file name here.
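
One thing to watch with the sample from the question: its first record has two AC lines, and a block that runs on every /^AC/ line would switch to a new output file at the second one. The following is only a sketch of one possible variant, not the answerer's code. It assumes each record starts with an ID line, and it names the file after the first AC line only by clearing the variable first once it has been used. The file name sample.dat is made up for the demonstration.

```shell
# Small sample taken from the question; the first record has two AC lines.
cat > sample.dat <<'EOF'
ID Z4WTH3_9ACTN Unreviewed; 182 AA.
AC Z4WTH3; A0SD0SDF;
AC Z12SDFG3; ADFFGDF;
DT 11-JUN-2014, integrated into UniProtKB/TrEMBL.
SQ SEQUENCE 182 AA; 20675 MW; B85D18AC3B1F0E75 CRC64;
MNFLEYNKDE KLHFNYKKSC GLWLIVVALI IFAATVIGGK QIINMSVFSF GYVAAFLSIN
//
ID Z9JHX1_9GAMM Unreviewed; 132 AA.
AC Z9JHX1;
SQ SEQUENCE 132 AA; 13880 MW; 0E09988C0F3ED155 CRC64;
MKISVDTNVL ARAVLQDDAN QGRSASTLLK DASLIAVSLP CLCELVWILS RGAKLSKEDV
//
EOF

awk '
/^ID/ {                      # start of a record: remember the ID line
    first = $0 ORS
    next
}
/^AC/ && first != "" {       # only the first AC line after an ID line picks the name
    close(filename)          # close the previous output file
    val = $2
    sub(";", "", val)        # strip the trailing semicolon
    filename = val ".txt"
    print first $0 > filename
    first = ""               # later AC lines of this record fall through to the last block
    next
}
{ print > filename }         # every other line goes to the current output file
' sample.dat
```

With this input the sketch should leave Z4WTH3.txt and Z9JHX1.txt in the current directory, with both AC lines of the first record kept inside Z4WTH3.txt.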

This works perfectly. But I need to get the filename from the line starting with "AC".
@SiyaDiya, please check my 2nd solution too now and let me know if this helps you.
This works perfectly. Thank you. I would like to know one more thing. If the line starting with AC contains multiple IDs separated by ";", like "AC Z4WXU8; E9PWJ4; Q6ZQB3; Q8BWI6;", then I have to create a file for each ID, all with the same content, like Z4WXU8.txt, E9PWJ4.txt, Q6ZQB3.txt, Q8BWI6.txt etc.
Would moving the close to the /^AC/ block make any difference? If the order of ID and AC varies, files might be left open.
@JamesBrown, yes, right, James sir. That is why I was asking the OP: if the actual file does not have a /^ID/ line, then we could definitely put close(filename) in the /^AC/ block.
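
For the follow-up request in the comments above (one output file per ID on the AC line, all with the same content), a rough sketch might buffer each record and write it once per accession when "//" is reached. This is an illustration under assumptions rather than a tested solution for the real data: it collects accessions from every AC line of a record, it uses the field separator [ \t;]+ so that it runs in awks without POSIX character classes, and the file name multi.dat and its single record (the question's Z4WXU8 entry combined with the AC line quoted in the comment) are made up for the demonstration.

```shell
# One record whose AC line carries four accessions (AC line quoted from the comment).
cat > multi.dat <<'EOF'
ID Z4WXU8_9ACTN Unreviewed; 203 AA.
AC Z4WXU8; E9PWJ4; Q6ZQB3; Q8BWI6;
SQ SEQUENCE 203 AA; 23224 MW; 35F1AE4342F6B3AC CRC64;
MDCKSIRSEV LWQVVRLREK LMNFLEYNKD EKLCFNYKKS CGLWLIVVAL IIFAATVIGG
//
EOF

awk -F'[ \t;]+' '
$1 == "AC" {                       # collect every accession on this AC line
    for (i = 2; i <= NF; i++)
        if ($i != "") ids[++n] = $i   # skip the empty field after the last semicolon
}
{ rec = rec $0 ORS }               # buffer the whole record
$0 == "//" {                       # end of record: write one copy per accession
    for (i = 1; i <= n; i++) {
        out = ids[i] ".txt"
        printf "%s", rec > out
        close(out)                 # keep the number of open files small
    }
    rec = ""
    n = 0
}
' multi.dat
```

Each of the four resulting files should hold the complete record. As with the other approaches, a later record that shares an accession would overwrite the earlier file.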

James Brown · Accepted Answer · 2018-02-27 13:02:30Z

Another using RS (GNU awk due to multichar RS) to separate records:

$ gawk '
BEGIN {
    RS=ORS="\n//\n"             # record separators
}
{
    for(i=1;i<=NF;i++)          # go thru each field in record
        if($i=="AC") {          # once AC found
            f=$(i+1) "TXT"      # next one is the filename
            sub(/;/,".",f)      # replace ; with .
            print > f           # print to file (multiple AC:s lead to multiple files)
            close(f)            # close to avoid problem with too many open files
                                # overwrites files when files with same name
        }
}' file

Files:

$ ls -l Z*
-rw-r--r-- 1 james james 254 Feb 27 09:23 Z4WTH3.TXT
-rw-r--r-- 1 james james 254 Feb 27 09:23 Z4WXU8.TXT
-rw-r--r-- 1 james james 202 Feb 27 09:23 Z9JHX1.TXT

Inside a file:

$ cat Z9JHX1.TXT
ID Z9JHX1_9GAMM Unreviewed; 132 AA.
AC Z9JHX1;
SQ SEQUENCE 132 AA; 13880 MW; 0E09988C0F3ED155 CRC64;
MKISVDTNVL ARAVLQDDAN QGRSASTLLK DASLIAVSLP CLCELVWILS RGAKLSKEDV
//
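
Since the script above overwrites a file when a later record yields the same name, it may be worth checking for repeated accessions before splitting. The following is a rough sketch, assuming that the first element of each AC line is what names a file. The input dup.dat is a trimmed, made-up test file that reuses IDs from the question, and duplicates.txt is a hypothetical output name. On a very large input the seen array holds every accession, so memory use could become a concern.

```shell
# Trimmed, made-up input: the accession Z4WTH3 names two different records.
cat > dup.dat <<'EOF'
ID Z4WTH3_9ACTN Unreviewed; 182 AA.
AC Z4WTH3; A0SD0SDF;
//
ID Z9JHX1_9GAMM Unreviewed; 132 AA.
AC Z9JHX1;
//
ID Z4WXU8_9ACTN Unreviewed; 203 AA.
AC Z4WTH3;
//
EOF

awk '
$1 == "AC" {
    id = $2
    sub(/;/, "", id)                # strip the trailing semicolon
    if (seen[id]++ == 1) print id   # report each repeated accession once
}
' dup.dat > duplicates.txt
```

An empty duplicates.txt would suggest that no output file gets overwritten under that naming assumption.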

Getting an error while inputting a 3 GB file: "awk: program limit exceeded: maximum number of fields size=32767 FILENAME="uniprot_sprot.dat" FNR=289522 NR=289522"
Sounds like your data is not like you described it. At some point there are more fields than you presented. Also, it sounds like you're not using GNU awk which, to my understanding, does not have a field limit. Good luck.

Ed Morton · Accepted Answer · 2018-02-27 13:20:38Z

With GNU awk for multi-char RS and RT:

awk -v RS='\n//\n' -v ORS= -F'[[:space:];]+' '{print $0 RT > ($7".txt")}' file

With any awk:

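awk -F'[[:space:];]+' '
$1 == "AC" && out == "" { out = $2".txt" }   # name the file after the first AC line only
{ rec = rec $0 ORS }                          # buffer the current record
$0 == "//" {                                  # end of record: write it out
    printf "%s", rec > out
    close(out)
    rec = ""
    out = ""
}
' file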
