I have a very large XML file (1.25 GB) that I need to split into smaller files so that I can process them. The file contains linguistic data, and each block is opened and closed by the tags:

<text id="www.example.com">

and

</text>

I would like to split the large file at these tags so that, for example,

<text id="www.example.com">

Hello

</text>

<text id="www.example.com">

This is

</text>

<text id="www.example.com">

An Example

</text>

would become three separate files, each beginning and ending with the "text" tags. For example:

File 1

<text id="www.example.com">

Hello

</text>

File 2

<text id="www.example.com">

This is

</text>

File 3

<text id="www.example.com">

An Example

</text>

I suppose this could be done with a Perl script, for instance, but I'm wondering whether there is a "one stop shop" way to split this file using standard Unix tools.

I know that the split command can divide a large file into smaller files by line count or file size. Is there a similar command that splits by XML tag?

Thanks in advance for any help!

3 Answers

2

The following Perl program, found in the answer to "Split one file into multiple files based on delimiter":

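#!/usr/bin/perl
use strict;
use warnings;

# Copy file.txt into res.0.txt, res.1.txt, ..., starting a new output
# file after each line that begins with </text>.
open(my $in, '<', 'file.txt') or die "Cannot open file.txt: $!";
my $cur = 0;
my $out;
while (my $line = <$in>) {
    # Open the next output file only when there is a line to write,
    # so no empty file is left behind after the last </text>.
    if (!$out) {
        open($out, '>', "res.$cur.txt") or die "Cannot open res.$cur.txt: $!";
    }
    print $out $line;
    if ($line =~ /^<\/text>/) {    # Added \ (the slash must be escaped)
        close($out);
        undef $out;
        $cur++;
    }
}
close($out) if $out;
close($in);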

It also seems to do the trick, with no cap on the number of output files.

Cheers.

1

The following awk command solves the problem, but unfortunately it fails after about 1000 output files, because it never closes the files it has written:

awk '{print $0 "</text>" > "file" NR}' RS='</text>' input-file
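The "Too many open files" error reported in the comments is what awk gives when output files are never closed. Here is a minimal sketch that calls close() as soon as each closing tag is seen, so only one output file is open at a time. It assumes that every opening and closing tag starts at the beginning of its own line, as the Perl answer's regex also assumes. The names split_texts, input.xml and part<N>.xml are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Sketch only: split_texts and the part<N>.xml naming are illustrative choices.
split_texts() {
    # $1 = input file; writes part1.xml, part2.xml, ... into the current directory.
    awk '
        # Opening tag: choose the next output file name.
        /^<text[ >]/ { n++; out = "part" n ".xml" }

        # Inside a block: copy the line, tags included, to the current file.
        out != ""    { print > out }

        # Closing tag: close the file so awk does not run out of file handles.
        /^<\/text>/  { if (out != "") close(out); out = "" }
    ' "$1"
}
```

It would be called as split_texts input.xml. Lines that fall between a closing tag and the next opening tag are skipped, which may or may not be what you want for your data.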

1

It's a lot more complicated than a simple awk command, and I don't know whether the file would be too big, but you could try an XSLT 2.0 stylesheet that uses xsl:result-document to produce all of your files.

One advantage of XSLT over a regex is that it copes better if the file format changes slightly or if the nodes you want to split on carry attributes.

1 Comment

Thanks for the tip. I will definitely check out the XSLT 2.0 stylesheet. Also, just as a point of reference, I agree with you about the awk. The exact error I was getting is: awk: cannot open "F1021" for output (Too many open files)
