4

I want to check the well-formedness of a big XML file (about 4 GB).

However, when I try xmlwf, all it tells me is

filename.xml: Value too large for defined data type 

What can I do about this? Is there any other way to check the file?

(I am using Debian Linux and Gentoo Linux.)

2
  • I'd guess any XML parser would work, since they are all required to reject documents that aren't well-formed. A quick search suggests checking whether xmlstarlet does what you want. Commented Feb 22, 2013 at 18:34
  • 1
    From man xmlwf: "-r Normally xmlwf memory-maps the XML file before parsing; this can result in faster parsing on many platforms. -r turns off memory-mapping and uses normal file IO calls instead. Of course, memory-mapping is automatically turned off when reading from standard input." By the way, I assume you are using a 64-bit setup... Commented Feb 22, 2013 at 19:22

4 Answers

2

You might like to try dtdgen, a program I wrote many years ago to generate a DTD for a document. It not only tells you whether a large file is well-formed, it also tells you what's in it (I wrote it because I wanted to know both).

2
xmllint --noout 4GB.xml 

That sort of works.

It runs out of memory too, but at least it checks part of the file before it dies.

2
  • Does xmllint ship with most distros, or do you have to install it separately? Commented Feb 22, 2013 at 20:39
  • 1
    @amphibient If it's not installed, you have to install libxml2 (the libxml2-utils package on Debian); expat provides xmlwf, not xmllint. Commented Jan 14, 2015 at 14:40
0

I haven't tried it myself, but try this:

xmllint --valid 4GB.xml 
4
  • I don't want to check whether it's valid. I want to check whether it's well-formed. Commented Feb 22, 2013 at 18:10
  • Can you point out the difference? Commented Feb 22, 2013 at 18:15
  • xmlblueprint.com/help/html/topic_118.htm Commented Feb 22, 2013 at 18:16
  • 1
    Basically, it can't be valid if it doesn't have a DTD. And I haven't written the DTD yet :) Commented Feb 22, 2013 at 18:18
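To illustrate the distinction discussed in these comments, here is a minimal sketch using Python's standard `xml.parsers.expat` module. The helper name `is_well_formed` is made up for this example. Expat is a non-validating parser, so it only reports well-formedness errors; neither sample document has a DTD, yet the first one still passes.

```python
import xml.parsers.expat


def is_well_formed(text):
    """Return True if `text` parses as well-formed XML, False otherwise."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final (here: only) chunk.
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


# No DTD anywhere, so validity cannot even be checked,
# but well-formedness can.
print(is_well_formed("<a><b/></a>"))  # True: every tag is closed and nested
print(is_well_formed("<a><b></a>"))   # False: <b> is never closed
```

Checking validity would additionally need a DTD (or schema) and a validating parser, which is what `xmllint --valid` asks for.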
0

It's an older question, but as I haven't seen it suggested yet:

Perl with XML::Twig can handle large XML files thanks to its 'purge' method, which discards in-memory data as you go.

use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => { _all_ => sub { $_->purge } }
)->parsefile('my_xml_file.xml');

The _all_ handler is triggered for each element of the twig and discards the in-memory data. That matters for a 4 GB file, because the in-memory footprint of parsed XML is about 10x the file size. The parser will still report an error and abort if the XML is not well-formed:

mismatched tag at line 12, column 27, byte 274 at C:/Perl/lib/XML/Parser.pm line 187. 

(Bear in mind that because it aborts, it will only show you the first error it encounters.)

Works on my (much smaller than 4G) sample data anyway.
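The same "stream it and throw the data away" idea can be sketched without Perl. The following is a minimal example, assuming Python 3 and its standard `xml.parsers.expat` module; the function name `check_well_formed` and the 1 MiB chunk size are arbitrary choices for illustration. No handlers are registered, so nothing is kept in memory beyond the current chunk and the parser's own state. Like the Perl version, I have only tried this kind of thing on small inputs, not on a 4 GB file.

```python
import xml.parsers.expat


def check_well_formed(path, chunk_size=1 << 20):
    """Stream `path` through expat in fixed-size chunks.

    Returns None if the file is well-formed, otherwise a string
    describing the first error (expat stops at the first one).
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                # False: more data may follow this chunk.
                parser.Parse(chunk, False)
            # True: signal end of input so unclosed elements are reported.
            parser.Parse(b"", True)
    except xml.parsers.expat.ExpatError as err:
        return f"{path}: {err}"
    return None
```

Usage would be something like `print(check_well_formed("my_xml_file.xml") or "well-formed")`. Because the file is read with ordinary buffered I/O rather than being memory-mapped, the script's memory use should stay roughly constant regardless of file size.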
