filter text xml file

Question

I have a large XML file (a couple of hundred thousand records) from which I'd like to extract only specific fields. An example of the file's structure:

<A> <id>123</id> <B> <C>value1</C> <D>value2</D> .... <E></E> </B> <Z></Z> ... <Y></Y> <A>

I'd like to filter this XML file so that it contains only the id and the data enclosed in the C and D fields.

How can this be done?

If <C>...</C> is always on one line, try grep -o '<[CD]>[^<]*</[CD]>'
– Costas, Commented Aug 24, 2015 at 8:40
A, B, ... Z are just placeholders for the actual parameter names. What should be done in this case?
– user1977050, Commented Aug 24, 2015 at 8:42
I really wouldn't suggest using grep. XML is not easily greppable, thanks to whitespace reformatting, tag nesting and unary tags, not to mention handling broken XML appropriately (e.g. you should at least detect if tags aren't closed).
– Sobrique, Commented Aug 28, 2015 at 13:05
Well, as part of some troubleshooting, I need to understand some behaviour in records that contain a large amount of data, of which I need only part. I think the best option would be to get it into Excel so that I can see it and filter for the exact values I'm looking for. That is why I was thinking about running grep on the XML.
– user1977050, Commented Aug 29, 2015 at 13:46

cxw · Accepted Answer · 2015-08-28 13:35:35Z

The xmlstarlet tool will do this:

xmlstarlet sel -t -m /A -o ID, -v id -n -o C, -v //C -n -o D, -v //D -n test.xml

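For each A element matched at the document root (-m /A), it prints the string "ID," (-o ID,), the contents of id (-v id), a newline (-n), and likewise for C (-v //C) and D (-v //D) with their respective headers. The double slashes are the XPath for "at any depth". Strictly, //C searches from the document root, whereas .//C would limit the search to descendants of the matched node; with a single A element, as in your example, both select the same elements.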

The result, as tested on my system, using your test file, is the comma-separated output:

ID,123
C,value1
D,value2

If you don't want the headers, omit the -o <whatever> arguments.
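
For readers without xmlstarlet, the same selection can be sketched with Python's standard-library xml.etree.ElementTree. This is only an illustration, not part of the answer above: the sample record is the question's example with the closing </A> tag added so that it parses, and extract_fields is a hypothetical helper name. It uses the relative path .//C so that only descendants of the current record are matched.

```python
import xml.etree.ElementTree as ET

# Sample record modelled on the question (closing </A> added so it parses).
SAMPLE = """<A>
  <id>123</id>
  <B>
    <C>value1</C>
    <D>value2</D>
    <E></E>
  </B>
  <Z></Z>
  <Y></Y>
</A>"""


def extract_fields(record):
    """Return (header, text) pairs for id, C and D within one record element."""
    pairs = [("ID", record.findtext("id", default=""))]
    # ".//C" is relative to `record`, so only its own descendants match.
    for tag in ("C", "D"):
        for node in record.iterfind(f".//{tag}"):
            pairs.append((tag, node.text or ""))
    return pairs


root = ET.fromstring(SAMPLE)
lines = [f"{header},{text}" for header, text in extract_fields(root)]
print("\n".join(lines))
```

Dropping the header from each formatted line corresponds to omitting the -o arguments.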

Thanks to this article for explanation.

Sobrique · Accepted Answer · 2015-08-28 13:04:00Z

To answer this question properly, we'd ideally need a better example - some valid xml is a good start.

Also - an example of desired output. You don't, for example, indicate where you'd want the <C> and <D> elements to end up within your resultant XML. They're already children of <B> - do you want to preserve B or reparent C and D to the root?

However, generically reconstructing XML is quite easy using XML::Twig and Perl.

E.g. like so:

#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my @wanted = qw ( C D id );
my %wanted = map { $_ => 1 } @wanted;

sub delete_unwanted_tags {
    my ( $twig, $element ) = @_;
    my $tag = $element->tag;
    if ( not $wanted{$tag} ) {
        $element->delete;
    }
}

my $twig = XML::Twig->new( twig_handlers => { _all_ => \&delete_unwanted_tags } );
$twig->parse( \*DATA );
$twig->print;

__DATA__
<A>
  <id>123</id>
  <B>
    <C>value1</C>
    <D>value2</D>
    <E></E>
  </B>
  <Z></Z>
  <Y></Y>
</A>

Because we haven't said "keep <B>", B is deleted together with its children C and D, so the result is:

<A> <id>123</id> </A>

Adding <B> to the wanted list:

<A> <id>123</id> <B> <C>value1</C> <D>value2</D> </B> </A>
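
The same keep-list idea can be sketched in Python with the standard-library xml.etree.ElementTree, as an illustration rather than a replacement for the XML::Twig script. The names prune and filtered are hypothetical, and unlike the Perl handler this sketch always keeps the root element. It shows both cases above: B left out of the wanted set, and B included.

```python
import xml.etree.ElementTree as ET

# Compact copy of the example record, so no whitespace appears in the output.
SAMPLE = "<A><id>123</id><B><C>value1</C><D>value2</D><E></E></B><Z></Z><Y></Y></A>"


def prune(element, wanted):
    """Remove every descendant whose tag is not in `wanted`, in place."""
    for child in list(element):  # iterate over a copy, since we mutate
        if child.tag in wanted:
            prune(child, wanted)
        else:
            element.remove(child)  # drops the child and its whole subtree


def filtered(xml_text, wanted):
    """Parse `xml_text`, prune it, and return the serialised result."""
    root = ET.fromstring(xml_text)
    prune(root, wanted)
    return ET.tostring(root, encoding="unicode")


without_b = filtered(SAMPLE, {"id", "C", "D"})
with_b = filtered(SAMPLE, {"id", "B", "C", "D"})
print(without_b)
print(with_b)
```

As in the Perl version, C and D only survive when their parent B is also kept.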

If, however, what you want to do is reparent C and D into A:

#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my @wanted   = qw ( id );
my @reparent = qw ( C D );

#turn the above into hashes, so we can do "if $wanted{$tag}"
my %wanted   = map { $_ => 1 } @wanted;
my %reparent = map { $_ => 1 } @reparent;

sub delete_unwanted_tags {
    my ( $twig, $element ) = @_;
    my $tag = $element->tag;
    if ( not $wanted{$tag} ) {
        $element->delete;
    }
    if ( $reparent{$tag} ) {
        $element->move( 'last_child', $twig->root );
    }
}

my $twig = XML::Twig->new(
    pretty_print  => 'indented_a',
    twig_handlers => { _all_ => \&delete_unwanted_tags }
);
$twig->parse( \*DATA );
$twig->print;

__DATA__
<A>
  <id>123</id>
  <B>
    <C>value1</C>
    <D>value2</D>
    <E></E>
  </B>
  <Z></Z>
  <Y></Y>
</A>

Note: the "twig handler" is called at the end of each element (when its close tag is encountered), which is why this works. C and D are found and moved to the root before we finish processing (and deleting) B.

This produces:

<A> <id>123</id> <C>value1</C> <D>value2</D> </A>

In the above, I have used __DATA__, \*DATA and parse because that lets me show both the XML and the technique in one place. For a real file you should probably use parsefile('my_file.xml') instead of parse(\*DATA).
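
Since the question mentions a couple of hundred thousand records and a wish to inspect them in Excel, here is a minimal streaming sketch in Python (standard library only), not the XML::Twig approach itself. Like the twig handler, iterparse reports an element once its close tag has been read. Several things are assumptions for illustration: that the records are A elements inside a wrapper element (called records here), that each record has at most one C and one D, and the helper name records_to_csv.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical input: many <A> records under an assumed <records> wrapper.
SAMPLE = b"""<records>
  <A><id>123</id><B><C>value1</C><D>value2</D><E></E></B><Z></Z></A>
  <A><id>124</id><B><C>value3</C><D>value4</D><E></E></B><Y></Y></A>
</records>"""


def records_to_csv(source, out):
    """Stream <A> records from `source` and write id, C, D rows to `out`."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "C", "D"])
    # "end" events fire when a closing tag is seen, so the record is complete.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "A":
            continue
        writer.writerow([
            elem.findtext("id", default=""),
            elem.findtext(".//C", default=""),  # first C at any depth in the record
            elem.findtext(".//D", default=""),  # first D at any depth in the record
        ])
        elem.clear()  # drop the record's children once its row is written


buffer = io.StringIO()
records_to_csv(io.BytesIO(SAMPLE), buffer)
csv_text = buffer.getvalue()
print(csv_text, end="")
```

For a real file, a filename or an open binary file can be passed as source, and out can be a file opened with newline="" so the resulting CSV loads cleanly into a spreadsheet.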

Peter Flynn · Accepted Answer · 2016-07-11 22:26:48Z

Use lxgrep from the ltXML2 toolkit (Edinburgh University), eg

$ lxgrep -w A '(id|C|D)' test.xml
<A> <id>123</id> <C>value1</C> <D>value2</D> </A>

Using this kind of tool is far faster and more reliable than rolling your own.

XML FAQ: http://xml.silmaril.ie/
