Format of the XML:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE >
    <root>
      <node>
        <element1></element1>
        <element2></element2>
        <element3></element3>
        <element4></element4>
      </node>
    </root>
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE >
    <root>
      <node>
        <element1></element1>
        <element2></element2>
        <element3></element3>
        <element4></element4>
      </node>
    </root>

Several more XML documents, each with its own XML declaration, follow in the same file. The file size is 500MB. How can I parse this file with PHP without breaking it up into separate files?

Any help would be appreciated. Thank you.

  • Your document is not valid XML (stackoverflow.com/questions/5479533/…). You can remove the extra declarations using str_replace (stackoverflow.com/questions/2159059/…) and then work from a valid XML document. Commented May 28, 2012 at 7:32
  • Readers here generally like to see some prior research before asking questions, just so you know. But FWIW, you may wish to use a 'stream reader' such as XMLReader, rather than one that loads the document fully into memory, such as SimpleXML. Commented May 28, 2012 at 8:45
  • I already have the parsing code. The problem is just that the script will not parse the next root node. Thanks anyway for the feedback. Commented May 28, 2012 at 11:54

1 Answer

If you do not want to split the file, you will have to work with it in memory. Given your 500MB file size, this could turn out to be problematic. Anyway, one option would be to remove the XML prolog and DocType from all documents and then load the whole thing like this:

    $dom = new DOMDocument;
    $dom->loadXML(
        sprintf(
            // One prolog, one DocType, one wrapping root element
            '<?xml version="1.0" encoding="UTF-8"?>%s' .
            '<!DOCTYPE >%s' .
            '<roots>%s</roots>',
            PHP_EOL,
            PHP_EOL,
            // Strip the prolog and DocType of every embedded document
            str_replace(
                array(
                    '<?xml version="1.0" encoding="UTF-8"?>',
                    '<!DOCTYPE >'
                ),
                '',
                file_get_contents('/path/to/your/file.xml')
            )
        )
    );

This would make it one huge XML file with just one XML prolog and one DocType (note I am assuming the DocType is the same for all documents in the file). You could then process the file by iterating over the individual root elements.
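As a rough illustration of that last step, here is a minimal sketch of iterating over the individual root elements once the merged document is loaded. The helper name `collectElement1` is hypothetical, and the element names are taken from the sample in the question; the tiny inline document only stands in for the real file.

```php
<?php
// Hypothetical helper: collects the text of every <element1>, one <root> at a time.
function collectElement1(DOMDocument $dom): array
{
    $values = [];
    // The children of <roots> are the original documents' <root> elements.
    foreach ($dom->documentElement->childNodes as $root) {
        if ($root->nodeType !== XML_ELEMENT_NODE) {
            continue; // skip whitespace text nodes left between the documents
        }
        foreach ($root->getElementsByTagName('element1') as $element) {
            $values[] = $element->textContent;
        }
    }
    return $values;
}

// Small stand-in for the merged file (assumed content, not real data).
$dom = new DOMDocument;
$dom->loadXML(
    '<roots>' .
    '<root><node><element1>a</element1></node></root>' . "\n" .
    '<root><node><element1>b</element1></node></root>' .
    '</roots>'
);
print_r(collectElement1($dom));
```

Each former document is now just one child of `<roots>`, so whatever per-document processing you already have can run inside the outer loop.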

3 Comments

I am using XMLReader because I am parsing a large XML file, which I read as a stream. Can you help me with equivalent code that works with XMLReader? Thanks.
Thanks for the idea. I just removed the XML declaration and DocType while streaming through the file and added a main root element. It works now.
This works for me with a 100MB file and the code runs in about 5 seconds. Note that you'll have to allocate more memory to PHP using something like: ini_set('memory_limit', '768M');
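For readers who want the streaming variant mentioned in the comments above, the following is one possible sketch, not the commenter's actual code. It registers a user stream filter that drops every XML declaration and DocType and wraps the stream in a single `<roots>` element, then lets XMLReader read through it. The names `StripPrologFilter`, `strip_prolog`, and `readNodes` are hypothetical. It assumes that declarations and DocTypes contain no `>` before their end (no internal DTD subset), that the content is UTF-8, and that XMLReader on your PHP build can open `php://filter` URLs.

```php
<?php
// Hypothetical names throughout; none of this comes from the original post.
class StripPrologFilter extends php_user_filter
{
    // Assumption: no ">" inside a declaration or DocType before its end.
    private const PATTERN = '~<\?xml\s[^>]*\?>|<!DOCTYPE\b[^>]*>~';

    private $carry = '';     // unfinished tag held back until the next chunk
    private $opened = false; // true once the wrapper start tag was emitted
    private $closed = false; // true once the wrapper end tag was emitted

    public function filter($in, $out, &$consumed, $closing): int
    {
        $data = $this->carry;
        $this->carry = '';
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;
            $data .= $bucket->data;
        }

        // Hold back a trailing "<..." that has no ">" yet, so a declaration
        // split across two chunks is still matched in one piece.
        $lt = strrpos($data, '<');
        if (!$closing && $lt !== false && strpos($data, '>', $lt) === false) {
            $this->carry = substr($data, $lt);
            $data = substr($data, 0, $lt);
        }

        $data = preg_replace(self::PATTERN, '', $data);
        if (!$this->opened) {
            $data = '<roots>' . $data;
            $this->opened = true;
        }
        if ($closing && !$this->closed) {
            $data .= '</roots>';
            $this->closed = true;
        }
        if ($data === '') {
            return PSFS_FEED_ME; // nothing to pass on yet
        }
        stream_bucket_append($out, stream_bucket_new($this->stream, $data));
        return PSFS_PASS_ON;
    }
}

stream_filter_register('strip_prolog', StripPrologFilter::class);

// Yields the outer XML of each <node> element without loading the whole file.
function readNodes(string $path): Generator
{
    $reader = new XMLReader();
    if (!$reader->open('php://filter/read=strip_prolog/resource=' . $path)) {
        throw new RuntimeException("Cannot open $path");
    }
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'node') {
            yield $reader->readOuterXml();
        }
    }
    $reader->close();
}
```

A call such as `foreach (readNodes('/path/to/your/file.xml') as $xml) { ... }` would then process one `<node>` at a time. Because only the current node is held in memory, this approach should not need the raised `memory_limit` that the in-memory DOM version does.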
