Parsing XML file with perl - regex

Question

I'm just a beginner in Perl and urgently need to prepare a small script that takes the first 3 items from an XML file and puts them in a new one. Here's an example of the XML file:

    <article> {lot of other stuff here} </article>
    <article> {lot of other stuff here} </article>
    <article> {lot of other stuff here} </article>
    <article> {lot of other stuff here} </article>

What I'd like to do is get the first 3 items, along with all the tags in between, and put them into another file. Thanks in advance for all the help. Regards, Peter

Possible duplicate of How can I use Perl regular expressions to parse XML data?
– Quentin, Commented Jun 3, 2010 at 9:28
@SMark: Even if. -- Perl6 regular expressions are still the wrong tool for that. ;-)
– Tomalak, Commented Jun 3, 2010 at 9:52

Community · Accepted Answer · 2017-05-23 12:30:45Z

Never ever use Regex to handle markup languages.

The original version of this answer (see below) used XML::XPath. Grant McLean said in the comments:

XML::XPath is an old and unmaintained module. XML::LibXML is a modern, maintained module with an almost identical API and it's faster too.

so I made a new version that uses XML::LibXML (thanks, Grant):

    use warnings;
    use strict;
    use XML::LibXML;

    my $doc = XML::LibXML->load_xml(location => 'articles.xml');
    my $xp  = XML::LibXML::XPathContext->new($doc->documentElement);
    my $xpath = '/articles/article[position() < 4]';

    foreach my $article ( $xp->findnodes($xpath) ) {
        # now do something with $article
        print $article.": ".$article->getName."\n";
    }

For me this prints:

    XML::LibXML::Element=SCALAR(0x346ef90): article
    XML::LibXML::Element=SCALAR(0x346ef30): article
    XML::LibXML::Element=SCALAR(0x346efa8): article

Links to the relevant documentation:

The type of $doc will be XML::LibXML::Document.
The type of $xp is XML::LibXML::XPathContext.
The return type of $xp->findnodes() is XML::LibXML::NodeList.
The type of $article is XML::LibXML::Element.
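
The loop above only prints each node. To actually write the first three articles to a new file, as the question asks, something along these lines might work. This is a minimal sketch, not part of the original answer: the inline sample data, the `<articles>` root element (inferred from the XPath above), and the output file name `first-three.xml` are all assumptions.

```perl
use strict;
use warnings;
use XML::LibXML;

# Inline sample input; the <articles> root element is an assumption
# taken from the XPath used in the answer.
my $xml = <<'XML';
<articles>
  <article><title>one</title></article>
  <article><title>two</title></article>
  <article><title>three</title></article>
  <article><title>four</title></article>
</articles>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Build a new document with an empty root of the same name.
my $out  = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root = $out->createElement($doc->documentElement->nodeName);
$out->setDocumentElement($root);

# Deep-copy the first three <article> elements into the new root.
foreach my $article ($doc->findnodes('/articles/article[position() < 4]')) {
    $root->appendChild($out->importNode($article));
}

# 'first-three.xml' is a hypothetical output name; 1 asks for indented output.
$out->toFile('first-three.xml', 1);
```

For a real file, `load_xml(location => 'articles.xml')` would replace the inline string, as in the answer's code. `importNode` copies each element, so the source document is left unchanged.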

Original version of the answer, based on the XML::XPath package:

    use warnings;
    use strict;
    use XML::XPath;

    my $xp = XML::XPath->new(filename => 'articles.xml');
    my $xpath = '/articles/article[position() < 4]';

    foreach my $article ( $xp->findnodes($xpath)->get_nodelist ) {
        # now do something with $article
        print $article.": ".$article->getName ."\n";
    }

which prints this for me:

    XML::XPath::Node::Element=REF(0x38067b8): article
    XML::XPath::Node::Element=REF(0x38097e8): article
    XML::XPath::Node::Element=REF(0x3809ae8): article

The type of $xp is XML::XPath, obviously.
The return type of $xp->findnodes() is XML::XPath::NodeSet.
The type of $article will be XML::XPath::Node::Element in this case.

Have a look at the docs to find out what you can do with them.

This is one case where a regex could easily do the job though.
@Snake Plissken: No, it isn't. Regex is never the right tool for that kind of job, no matter how "easy" it seems. XPath+Programming Language X (Perl in this case) is, or XSLT is. Regex is not.
You're being silly. In this case a regex can easily do the job. What are you going to do in the case that someone asks you to copy a non-XML file until something has been seen three times?
@Snake Plissken: I'm not being silly. I'm just trying to avoid being smart about when to use a proper parser. There is a nice XML parser built into Perl, there is absolutely no reason not to use it. (It's not "oh damn, I have to use a parser because this is too complex for regex", it's "oh damn, I can't use a parser because the language I use does not supply one". And the latter is almost never true.)
FYI, XML::XPath is an old and unmaintained module. XML::LibXML is a modern, maintained module with an almost identical API and it's faster too.

Snake Plissken · Accepted Answer · 2010-06-03 11:24:06Z

Here:

    open my $input, "<", "file.xml" or die $!;
    open my $output, ">", "truncated-file.xml" or die $!;
    my $n_articles = 0;
    # Copy line by line; stop after the line containing the third </article>.
    while (<$input>) {
        print $output $_;
        if (m:</article>:) {
            $n_articles++;
            if ($n_articles >= 3) {
                last;
            }
        }
    }
    close $input or die $!;
    close $output or die $!;

You really don't need an XML parser to do such a simple job.
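
One caveat with the loop above: it reads line by line, so if the input has no line breaks between articles, the whole file arrives as a single "line" and is copied in full. A possible workaround, sketched here under assumptions, is to set Perl's input record separator `$/` to the closing tag so that each read ends at `</article>`. The inline sample string and the in-memory filehandle are for demonstration only, and the sketch assumes `</article>` never occurs inside comments or CDATA and that articles are not nested.

```perl
use strict;
use warnings;

# Sample input with no line breaks between articles (demo data only).
my $xml = '<article>a</article> <article>b</article> '
        . '<article>c</article> <article>d</article>';

# In-memory filehandle for the demo; a real script would open a file.
open my $input, '<', \$xml or die $!;

my $kept = '';
{
    # Read up to and including each closing tag instead of each newline.
    local $/ = '</article>';
    my $n_articles = 0;
    while (my $chunk = <$input>) {
        $kept .= $chunk;
        # Count only chunks that actually end with the closing tag.
        last if $chunk =~ m{</article>\z} && ++$n_articles >= 3;
    }
}
close $input or die $!;

print "$kept\n";
```

With this sample the script prints the first three `<article>` elements and drops the fourth. Like the original loop, it copies text verbatim and does not add or close any enclosing root element.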

3 Comments

dusker Over a year ago

What that script did is it copied all the contents of the file.xml into truncated-file.xml

Snake Plissken Over a year ago

Then it's debugging time for you. Anyway there is another answer you can use if this doesn't work.

Snake Plissken Over a year ago

I was referring to the other answer on this thread: stackoverflow.com/questions/2964637/…
