I'm just a beginner in Perl and urgently need to prepare a small script that takes the first 3 items from an XML file and puts them in a new one. Here's an example of the XML file:

    <article> {lot of other stuff here} </article>
    <article> {lot of other stuff here} </article>
    <article> {lot of other stuff here} </article>
    <article> {lot of other stuff here} </article>

What I'd like to do is get the first 3 items, along with all the tags in between, and put them into another file. Thanks in advance for all the help. Regards, Peter

2 Answers

12

Never ever use Regex to handle markup languages.

The original version of this answer (see below) used XML::XPath. Grant McLean said in the comments:

XML::XPath is an old and unmaintained module. XML::LibXML is a modern, maintained module with an almost identical API and it's faster too.

so I made a new version that uses XML::LibXML (thanks, Grant):

    use warnings;
    use strict;
    use XML::LibXML;

    my $doc   = XML::LibXML->load_xml(location => 'articles.xml');
    my $xp    = XML::LibXML::XPathContext->new($doc->documentElement);
    my $xpath = '/articles/article[position() < 4]';

    foreach my $article ( $xp->findnodes($xpath) ) {
        # now do something with $article
        print $article.": ".$article->getName."\n";
    }

For me this prints:

    XML::LibXML::Element=SCALAR(0x346ef90): article
    XML::LibXML::Element=SCALAR(0x346ef30): article
    XML::LibXML::Element=SCALAR(0x346efa8): article
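To get from that loop to what the question asks for (a new file holding only the first three articles), one option is to drop every article after the third and serialize what is left. The following is only a minimal sketch: the sample XML, the `keep_first_articles` helper, and the output name `top3.xml` are all made up for illustration, and it assumes the articles sit directly under an `<articles>` root, as the XPath above already does.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample input; the <articles> root is an assumption
# taken from the XPath expression used in the answer.
my $xml = <<'XML';
<articles>
  <article><title>one</title></article>
  <article><title>two</title></article>
  <article><title>three</title></article>
  <article><title>four</title></article>
</articles>
XML

# Hypothetical helper: detach every <article> after the first $keep
# and return the (modified) document.
sub keep_first_articles {
    my ($doc, $keep) = @_;
    for my $extra ($doc->findnodes("/articles/article[position() > $keep]")) {
        $extra->unbindNode;    # remove the node from the tree
    }
    return $doc;
}

my $doc = keep_first_articles(XML::LibXML->load_xml(string => $xml), 3);

# 'top3.xml' is a placeholder output file name.
$doc->toFile('top3.xml');
```

Because the document is edited in place and then written out, the root element and anything else outside the removed articles is kept, so the result is still well-formed XML. To read from a file instead, `load_xml(location => 'articles.xml')` can be used as in the answer's code.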

The relevant documentation is that of the XML::LibXML and XML::LibXML::XPathContext modules.


Original version of the answer, based on the XML::XPath package:

    use warnings;
    use strict;
    use XML::XPath;

    my $xp    = XML::XPath->new(filename => 'articles.xml');
    my $xpath = '/articles/article[position() < 4]';

    foreach my $article ( $xp->findnodes($xpath)->get_nodelist ) {
        # now do something with $article
        print $article.": ".$article->getName ."\n";
    }

which prints this for me:

    XML::XPath::Node::Element=REF(0x38067b8): article
    XML::XPath::Node::Element=REF(0x38097e8): article
    XML::XPath::Node::Element=REF(0x3809ae8): article

Have a look at the docs to find out what you can do with them.

10 Comments

This is one case where a regex could easily do the job though.
@Snake Plissken: No, it isn't. Regex is never the right tool for that kind of job, no matter how "easy" it seems. XPath+Programming Language X (Perl in this case) is, or XSLT is. Regex is not.
You're being silly. In this case a regex can easily do the job. What are you going to do in the case that someone asks you to copy a non-XML file until something has been seen three times?
@Snake Plissken: I'm not being silly. I'm just trying to avoid being smart about when to use a proper parser. There is a nice XML parser built into Perl, there is absolutely no reason not to use it. (It's not "oh damn, I have to use a parser because this is too complex for regex", it's "oh damn, I can't use a parser because the language I use does not supply one". And the latter is almost never true.)
FYI, XML::XPath is an old and unmaintained module. XML::LibXML is a modern, maintained module with an almost identical API and it's faster too.
0

Here:

    open my $input,  "<", "file.xml"           or die $!;
    open my $output, ">", "truncated-file.xml" or die $!;

    my $n_articles = 0;
    while (<$input>) {
        print $output $_;
        if (m:</article>:) {
            $n_articles++;
            if ($n_articles >= 3) {
                last;
            }
        }
    }

    close $input  or die $!;
    close $output or die $!;

You really don't need an XML parser to do such a simple job.
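Note that the loop above reads the input line by line, so it only stops early when the closing tags fall on different lines; if the whole file is a single line, everything gets copied. One possible variant, still without a parser, is to set the input record separator `$/` to the closing tag so that each read returns one article. This is just a sketch under the assumption that the closing tag always appears literally as `</article>`; the `copy_first_articles` helper and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: copy from $in to $out until $limit closing
# </article> tags have been written. Returns the number copied.
sub copy_first_articles {
    my ($in, $out, $limit) = @_;
    local $/ = '</article>';    # each read ends at a closing tag
    my $seen = 0;
    while (my $chunk = <$in>) {
        # Skip trailing text that follows the last closing tag.
        last unless $chunk =~ m{</article>\z};
        print {$out} $chunk;
        last if ++$seen >= $limit;
    }
    return $seen;
}

# Hypothetical single-line input, read through an in-memory filehandle.
my $xml = '<article>a</article> <article>b</article> '
        . '<article>c</article> <article>d</article>';
open my $in, '<', \$xml or die $!;

my $result = '';
open my $out, '>', \$result or die $!;

my $copied = copy_first_articles($in, $out, 3);

close $in  or die $!;
close $out or die $!;

print "$result\n";
```

Like the original loop, this copies raw text: if the articles are wrapped in a root element, the output will be missing its closing tag and will not be well-formed XML.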

3 Comments

What that script did was copy all the contents of file.xml into truncated-file.xml.
Then it's debugging time for you. Anyway there is another answer you can use if this doesn't work.
I was referring to the other answer on this thread: stackoverflow.com/questions/2964637/…
