Split one file into multiple files based on pattern

Question

I have a binary file which I convert into a regular file using hexdump and a few awk and sed commands. The output file looks something like this:

$ cat temp
3d3d01f87347545002f1d5b2be4ee4d700010100018000cc57e5820000000000000000000
000000087d3f513000000000000000000000000000000000001001001010f000000000026
58783100b354c52658783100b43d3d0000ad6413400103231665f301010b9130194899f2f
fffffffffff02007c00dc015800a040402802f1d5b2b8ca5674504f433031000000000004
6363070000000000000000000000000065450000b4fb6b4000393d3d1116cdcc57e58287d
3f55285a1084b

The temp file has a few eye catchers (3d3d) which don't repeat that often. They kind of denote the start of a new binary record. I need to split the file based on those eye catchers.

My desired output is to have multiple files (based on the number of eyecatchers in my temp file).

So my output would look something like this -

$ cat temp1
3d3d01f87347545002f1d5b2be4ee4d700010100018000cc57e582000000000000000
0000000000087d3f513000000000000000000000000000000000001001001010f00000000
002658783100b354c52658783100b4
$ cat temp2
3d3d0000ad6413400103231665f301010b9130194899f2ffffffffffff02007c00dc0
15800a040402802f1d5b2b8ca5674504f4330310000000000046363070000000000000000
000000000065450000b4fb6b400039
$ cat temp3
3d3d1116cdcc57e58287d3f55285a1084b

Michael J. Barber · Accepted Answer · 2011-11-09 09:06:26Z

21

The RS variable in awk is nice for this, allowing you to define the record separator. Thus, you just need to capture each record in its own temp file. The simplest version is:

cat temp | awk -v RS="3d3d" '{ print $0 > "temp" NR }'

The sample text starts with the eye-catcher 3d3d, so temp1 will be an empty file. Further, the eye-catcher itself won't be at the start of the temp files, as was shown for the temp files in the question. Finally, if there are a lot of records, you could run into the system limit on open files. Some minor complications will bring it closer to what you want and make it safer:

cat temp | awk -v RS="3d3d" 'NR > 1 { print RS $0 > "temp" (NR-1); close("temp" (NR-1)) }'
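A multi-character RS is treated as a regular expression by gawk and mawk, but POSIX leaves the behaviour unspecified, so other awk implementations may look only at the first character. Below is a minimal sketch of a variant that avoids RS altogether and finds the marker with index() and substr(). It assumes the whole input fits in memory and that newlines in the input can be dropped; demo.txt and the rec prefix are made-up names for the example, not anything from the question.

```shell
#!/bin/sh
# Sketch: split on a fixed marker without a multi-character RS.
# demo.txt and the "rec" prefix are hypothetical names for this example.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '3d3dAAAA\nBB3d3dCC3d3dDD\n' > demo.txt

awk -v marker="3d3d" -v prefix="rec" '
    { buf = buf $0 }                      # join the input lines, dropping newlines
    END {
        mlen = length(marker)
        pos = index(buf, marker)
        if (pos == 0) exit                # no marker at all: write nothing
        buf = substr(buf, pos)            # discard anything before the first marker
        while (buf != "") {
            rest = substr(buf, mlen + 1)  # text after the leading marker
            pos = index(rest, marker)     # start of the next record, if any
            if (pos == 0) {
                chunk = buf
                buf = ""
            } else {
                chunk = substr(buf, 1, mlen + pos - 1)
                buf = substr(rest, pos)
            }
            out = prefix (++n)
            print chunk > out
            close(out)                    # avoid running out of file descriptors
        }
    }
' demo.txt
```

With the sample input this writes rec1, rec2 and rec3, each starting with 3d3d. The buffer is copied once per record, so for very large inputs the RS version (where it is supported) is likely to be faster.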

edited Nov 9, 2011 at 9:06

answered Nov 9, 2011 at 8:53


7 Comments

Zsolt Botykai Over a year ago

Khm, you don't need cat for that. And if it's a single line input you will only get the first record. And the output will miss the original RS as well. echo '3d3dsomething3d3danything' | awk 'BEGIN {RS="3d3d"} {print}' will only output something.

Zsolt Botykai Over a year ago

Or I was wrong. The only problem with your solution is missing the RS in the output. (And the useless use of cat.)

Michael J. Barber Over a year ago

@ZsoltBotykai RS is in the output, as discussed. And cat is not useless: it provides a logical separation between generation of data and processing. Thus, cat temp stands in for whatever transformations go on before the awk stage, while avoiding adding even more to the already long line with awk.

Zsolt Botykai Over a year ago

You are right, sorry about the RS part. Regarding cat, you might want to read this: partmaps.org/era/unix/award.html and this: en.wikipedia.org/wiki/Cat_(Unix)#Useless_use_of_cat

Michael J. Barber Over a year ago

@ZsoltBotykai Well aware of it, although not convinced that it says anything relevant about the appropriate rhetorical exposition. You might want to read the other view in, e.g., Classic Shell Scripting (Robbins and Beebe, 2005).


rob mayoff · Accepted Answer · 2011-11-09 07:18:26Z

16

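#!/usr/bin/perl
undef $/;     # slurp mode: read the whole input in one go
$_ = <>;
$n = 0;
# The lookahead splits just before each 3d3d, so the marker stays in its chunk.
for $match (split(/(?=3d3d)/)) {
    open(O, '>temp' . ++$n) or die "cannot open temp$n: $!";
    print O $match;
    close(O);
}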

answered Nov 9, 2011 at 7:18


4 Comments

jaypal singh Over a year ago

Thanks this works great and I can call this script within my parser script before running the parser code so that it runs on all the temp files.

jaypal singh Over a year ago

Any suggestions on which book I should pick up for learning Perl? I am new to UNIX and have recently started learning bash, sed and awk.

Nicolas Raoul Over a year ago

Usage: Copy into new file split.pl then make it executable and run: ./split.pl yourdata.txt

Newbie Over a year ago

@rob-mayoff can you help me with this: stackoverflow.com/questions/42671047/…

potong · Accepted Answer · 2011-11-09 10:58:22Z

This might work:

# sed 's/3d3d/\n&/2g' temp | split -dl1 - temp
# ls
temp temp00 temp01 temp02
# cat temp00
3d3d01f87347545002f1d5b2be4ee4d700010100018000cc57e5820000000000000000000000000087d3f513000000000000000000000000000000000001001001010f000000000026 58783100b354c52658783100b4
# cat temp01
3d3d0000ad6413400103231665f301010b9130194899f2ffffffffffff02007c00dc015800a040402802f1d5b2b8ca5674504f4330310000000000046363070000000000000000000000000065450000b4fb6b400039
# cat temp02
3d3d1116cdcc57e58287d3f55285a1084b

EDIT:

If there are newlines in the source file you can remove them first by using tr -d '\n' <temp and then pipe the output through the above sed command. If however you wish to preserve them then:

 sed 's/3d3d/\n&/g;s/^\n\(3d3d\)/\1/' temp |csplit -zf temp - '/^3d3d/' {*}

Should do the trick

Note that the effect of combining the flags 2 and g in sed's s command is not standardized. The author relies on GNU sed's behaviour, which its documentation defines as: ignore matches before the NUMBERth, and then match and replace all matches from the NUMBERth on.
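Where GNU sed is not available, the same "one record per line" step can be done with awk, whose gsub() and sub() are standard. This is only a sketch under the assumption that the input is a single line starting with the marker; demo.txt and the part_ prefix are invented for the example. Unlike 2g, it also breaks before the first marker when the line has text in front of it. It uses plain `split -l 1`, so the output files get the default alphabetic suffixes rather than the numeric ones from `-d`.

```shell
#!/bin/sh
# Sketch: one record per line without GNU sed's 2g flag, then one file per line.
# demo.txt and the part_ prefix are hypothetical names for this example.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '3d3dAA3d3dBB3d3dCC\n' > demo.txt

# gsub puts a newline before every marker; sub then removes the newline
# that would otherwise leave an empty first line.
awk '{ gsub(/3d3d/, "\n&"); sub(/^\n/, ""); print }' demo.txt |
    split -l 1 - part_
```

For the sample line this produces part_aa, part_ab and part_ac, one record each.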

mLuby · Accepted Answer · 2019-06-05 23:05:08Z

Mac OS X answer

Where that nice awk -v RS="pattern" trick doesn't work. Here's what I got working:

Given this example concatted.txt

filename=foo bar
foo bar line1
foo bar line2
filename=baz qux
baz qux line1
baz qux line2

use this command (remove comments to prevent it from failing)

# cat: useless use of cat ^__^;
# tr: replace all newlines with delimiter1 (which must not be in concatted.txt) so we have one line of all the text
# sed: replace file start pattern with delimiter2 (which must not be in concatted.txt) so we know where to split out each file
# tr: replace delimiter2 with NULL character since sed can't do it
# xargs: split giant single-line input on NULL character and pass 1 line (= 1 file) at a time to echo into the pipe
# sed: get all but last line (same as head -n -1) because there's an extra since concatted-file.txt ends in a NULL character.
# awk: does a bunch of stuff as the final command. Remember it's getting a single line to work with.
#   {replace all delimiter1s in file with newlines (in place)}
#   {match regex (sets RSTART and RLENGTH) then set filename to regex match (might end at delimiter1). Note in this case the number 9 is the length of "filename=" and the 2 removes the "§" }
#   {write file to filename and close the file (to avoid "too many files open" error)}
cat ../concatted-file.txt \
  | tr '\n' '§' \
  | sed 's/filename=/∂filename=/g' \
  | tr '∂' '\0' \
  | xargs -t -0 -n1 echo \
  | sed \$d \
  | awk '{match($0, /filename=[^§]+§/)} {filename=substr($0, RSTART+9, RLENGTH-9-2)".txt"} {gsub(/§/, "\n", $0)} {print $0 > filename; close(filename)}'

results in these two files named foo bar.txt and baz qux.txt respectively:

filename=foo bar
foo bar line1
foo bar line2

filename=baz qux
baz qux line1
baz qux line2

Hope this helps!

Zsolt Botykai · Accepted Answer · 2011-11-09 07:23:30Z

It depends on whether your temp file is a single line or not. Assuming it is a single line, you can go with:

sed 's/\(.\)\(3d3d\)/\1#\2/g' FILE | awk -F "#" '{ for (i=1; i<=NF; i++) { print $i > ("temp" i) } }'

The first sed inserts a # as a field/record separator, then awk splits on # and prints every "field" to its own file.

If the input file is already split so that each 3d3d record starts on its own line, then you can go with:

awk '/^3d3d/ { i++ } { print > "temp" i }' temp
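In the one-liner above, any lines before the first 3d3d go to a file named just temp (because i is still empty), and every output file stays open until awk exits. A hedged sketch of the same idea that skips such leading lines and closes each file before starting the next one; demo.txt and the chunk prefix are made-up names for the example:

```shell
#!/bin/sh
# Sketch: line-oriented split where each record starts on a line beginning with 3d3d.
# demo.txt and the "chunk" prefix are hypothetical names for this example.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf 'preamble\n3d3dAA\nBB\n3d3dCC\n' > demo.txt

awk '
    /^3d3d/ {
        if (out != "") close(out)   # finish the previous record file
        out = "chunk" (++i)         # start a new one
    }
    out != "" { print > out }       # lines before the first marker are skipped
' demo.txt
```

With the sample input, chunk1 holds the two lines 3d3dAA and BB, and chunk2 holds 3d3dCC.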

HTH
