The sample text file looks like this:

ID Z4WTH3_9ACTN Unreviewed; 182 AA.
AC Z4WTH3; A0SD0SDF;
AC Z12SDFG3; ADFFGDF;
DT 11-JUN-2014, integrated into UniProtKB/TrEMBL.
SQ SEQUENCE 182 AA; 20675 MW; B85D18AC3B1F0E75 CRC64;
MNFLEYNKDE KLHFNYKKSC GLWLIVVALI IFAATVIGGK QIINMSVFSF GYVAAFLSIN
//
ID Z4WXU8_9ACTN Unreviewed; 203 AA.
AC Z4WXU8;
AC QWERDFV1;
DT 11-JUN-2014, integrated into UniProtKB/TrEMBL.
SQ SEQUENCE 203 AA; 23224 MW; 35F1AE4342F6B3AC CRC64;
MDCKSIRSEV LWQVVRLREK LMNFLEYNKD EKLCFNYKKS CGLWLIVVAL IIFAATVIGG
//
ID Z9JHX1_9GAMM Unreviewed; 132 AA.
AC Z9JHX1;
SQ SEQUENCE 132 AA; 13880 MW; 0E09988C0F3ED155 CRC64;
MKISVDTNVL ARAVLQDDAN QGRSASTLLK DASLIAVSLP CLCELVWILS RGAKLSKEDV
//

The actual file is about 100 GB. Each record contains exactly one "ID" line, always starts with that "ID" line, and ends with "//".

A record may contain multiple "AC" lines. The first element of the first "AC" line should be used as the filename.

I need to split this file into multiple files at each "//". Each file should be named after the text in the line beginning with AC.

So the output files will look like this:

Z4WTH3.txt

ID Z4WTH3_9ACTN Unreviewed; 182 AA.
AC Z4WTH3; A0SD0SDF;
AC Z12SDFG3; ADFFGDF;
DT 11-JUN-2014, integrated into UniProtKB/TrEMBL.
SQ SEQUENCE 182 AA; 20675 MW; B85D18AC3B1F0E75 CRC64;
MNFLEYNKDE KLHFNYKKSC GLWLIVVALI IFAATVIGGK QIINMSVFSF GYVAAFLSIN
//

Z4WXU8.txt

ID Z4WXU8_9ACTN Unreviewed; 203 AA.
AC Z4WXU8;
AC QWERDFV1;
DT 11-JUN-2014, integrated into UniProtKB/TrEMBL.
SQ SEQUENCE 203 AA; 23224 MW; 35F1AE4342F6B3AC CRC64;
MDCKSIRSEV LWQVVRLREK LMNFLEYNKD EKLCFNYKKS CGLWLIVVAL IIFAATVIGG
//

Z9JHX1.txt

ID Z9JHX1_9GAMM Unreviewed; 132 AA.
AC Z9JHX1;
SQ SEQUENCE 132 AA; 13880 MW; 0E09988C0F3ED155 CRC64;
MKISVDTNVL ARAVLQDDAN QGRSASTLLK DASLIAVSLP CLCELVWILS RGAKLSKEDV
//

3 Answers

The following awk may help you with this.

awk '/^ID/{close(filename);val=$2;sub(/_.*/,"",val);filename=val".txt"} {print > filename}' Input_file 

Second solution: per the OP, the filename should come from the AC line, so here is another solution:

awk '/^ID/{close(filename);first=$0 ORS;next} /^AC/{val=$2;sub(";","",val);filename=val".txt";print first $0 > filename;next} {print > filename}' Input_file 

Or, in case Input_file does NOT have ID lines in all sections, the close call can go in the AC block instead:

awk '/^ID/{first=$0 ORS;next} /^AC/{close(filename);val=$2;sub(";","",val);filename=val".txt";print first $0 > filename;next} {print > filename}' Input_file 

Explanation of the last command:

awk '
/^ID/{                        ##If a line starts with ID, do the following:
  first=$0 ORS;               ##Create variable first whose value is the current line plus ORS (output record separator).
  next}                       ##next is an awk keyword which skips all further statements for this line.
/^AC/{                        ##If a line starts with AC, do the following:
  close(filename);            ##Close the previously written file so that we do NOT run into too many open files.
  val=$2;                     ##Create variable val and set it to the 2nd field of the current line.
  sub(";","",val);            ##Use the awk sub function to replace the semicolon in val with an empty string.
  filename=val".txt";         ##Create variable filename whose value is val followed by .txt (the output file name).
  print first $0 > filename;  ##Print variable first and the current line into the output file.
  next                        ##next skips all further statements for this line.
}
{
  print > filename            ##Print every line that does NOT match the two conditions above into the output file.
}
' Input_file                  ##The Input_file name goes here.
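One caveat: the question says a record may contain several AC lines. With the commands above, every AC line starts a new output file, so such a record would be split across files. A minimal sketch of a variant that names the file after the first AC line only, assuming every record starts with an ID line as the question states. The `seen` flag and the `sample.dat` file name are illustrative and not part of the original commands, and the sample data is abbreviated from the question:

```shell
# Abbreviated sample: the first record has two AC lines.
cat > sample.dat <<'EOF'
ID Z4WTH3_9ACTN Unreviewed; 182 AA.
AC Z4WTH3; A0SD0SDF;
AC Z12SDFG3; ADFFGDF;
DT 11-JUN-2014, integrated into UniProtKB/TrEMBL.
//
ID Z9JHX1_9GAMM Unreviewed; 132 AA.
AC Z9JHX1;
//
EOF

awk '
/^ID/ {                          # new record: close the old file, keep the ID line, reset the flag
    close(filename)
    first = $0 ORS
    seen = 0
    next
}
/^AC/ && !seen {                 # only the first AC line of a record names the file
    val = $2
    sub(";", "", val)
    filename = val ".txt"
    print first $0 > filename
    seen = 1
    next
}
{ print > filename }             # all other lines, including later AC lines
' sample.dat
```

Later AC lines fall through to the last rule, so they stay in the same output file as the rest of the record.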
11 Comments

This works perfectly. But I need to get the filename from the line starting with "AC".
@SiyaDiya, please check my 2nd solution too now and let me know if this helps you.
This works perfectly. Thank you. I would like to know one more thing. If the line starting with AC contains multiple IDs separated by ";", like "AC Z4WXU8; E9PWJ4; Q6ZQB3; Q8BWI6;", then I have to create a file for each ID, all with the same content, like Z4WXU8.txt, E9PWJ4.txt, Q6ZQB3.txt, Q8BWI6.txt etc.
Would moving the close to /^AC/ block make any difference? If the order of ID and AC varies, files might be left open.
@JamesBrown, yes right James sir, that is why I was asking OP if actual file is not having /^ID/ line then we could definitely put close(filename) in /^AC/ tag then.
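The follow-up request in the comments (one file per ID on the AC line, each with the same content) is not covered by the commands in this answer. A rough sketch of one way it might be done, assuming only the first AC line of each record counts and that a whole record fits in memory. The `ids` array and the `multi.dat` file name are hypothetical names for illustration:

```shell
# One abbreviated record with four IDs on its AC line (IDs taken from the comment).
cat > multi.dat <<'EOF'
ID Z4WXU8_9ACTN Unreviewed; 203 AA.
AC Z4WXU8; E9PWJ4; Q6ZQB3; Q8BWI6;
DT 11-JUN-2014, integrated into UniProtKB/TrEMBL.
//
EOF

awk -F'[[:space:];]+' '
$1 == "AC" && n == 0 {           # first AC line of the record: collect its IDs
    for (i = 2; i <= NF; i++)
        if ($i != "") ids[++n] = $i   # skip the empty field after the trailing ;
}
{ rec = rec $0 ORS }             # buffer the whole record
$0 == "//" {                     # end of record: write one copy per collected ID
    for (i = 1; i <= n; i++) {
        out = ids[i] ".txt"
        printf "%s", rec > out
        close(out)               # avoid running out of open file handles
    }
    rec = ""
    n = 0
}
' multi.dat
```

Writing the same record several times multiplies the output size, which matters for a 100 GB input.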

Another using RS (GNU awk due to multichar RS) to separate records:

$ gawk '
BEGIN {
    RS=ORS="\n//\n"              # record separators
}
{
    for(i=1;i<=NF;i++)           # go thru each field in record
        if($i=="AC") {           # once AC found
            f=$(i+1) "TXT"       # next one is the filename
            sub(/;/,".",f)       # replace ; with .
            print > f            # print to file (multiple AC:s lead to multiple files)
            close(f)             # close to avoid problem with too many open files
                                 # overwrites files when files with same name
        }
}' file

Files:

$ ls -l Z*
-rw-r--r-- 1 james james 254 Feb 27 09:23 Z4WTH3.TXT
-rw-r--r-- 1 james james 254 Feb 27 09:23 Z4WXU8.TXT
-rw-r--r-- 1 james james 202 Feb 27 09:23 Z9JHX1.TXT

Inside a file:

$ cat Z9JHX1.TXT
ID Z9JHX1_9GAMM Unreviewed; 132 AA.
AC Z9JHX1;
SQ SEQUENCE 132 AA; 13880 MW; 0E09988C0F3ED155 CRC64;
MKISVDTNVL ARAVLQDDAN QGRSASTLLK DASLIAVSLP CLCELVWILS RGAKLSKEDV
//

2 Comments

Getting an error while inputting a 3 GB file: "awk: program limit exceeded: maximum number of fields size=32767 FILENAME="uniprot_sprot.dat" FNR=289522 NR=289522"
Sounds like your data is not like you described it. At some point there are more fields than you presented. Also, it sounds like you're not using GNU awk which, to my understanding, does not have a field limit. Good luck.

With GNU awk for multi-char RS and RT:

awk -v RS='\n//\n' -v ORS= -F'[[:space:];]+' '{print $0 RT > ($7".txt")}' file 

With any awk:

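awk -F'[[:space:];]+' '
$1 == "AC" && out == "" { out = $2".txt" }   # name the file after the first AC line only
{ rec = rec $0 ORS }                         # buffer the current record
$0 == "//" {                                 # end of record: write it out and reset
    printf "%s", rec > out
    close(out)
    rec = out = ""
}
' file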
