2

I have a large XML file (a couple of hundred thousand records) from which I'd like to extract only specific fields. An example of the file's structure:

<A> <id>123</id> <B> <C>value1</C> <D>value2</D> .... <E></E> </B> <Z></Z> ... <Y></Y> <A> 

I'd like to filter this XML file so that it contains only the id and the data enclosed in the C and D fields.

How can this be done?

5
  • If <C>...</C> is always on one line, try grep -o '<[CD]>[^<]*</[CD]>' Commented Aug 24, 2015 at 8:40
  • A, B, ... Z are just placeholders for the actual parameter names. What should be done in that case? Commented Aug 24, 2015 at 8:42
  • grep -o '<\(parameterC\|parameterD\)>[^<]*</\1>' Commented Aug 24, 2015 at 8:46
  • I really wouldn't suggest using grep - XML is not easily greppable, thanks to whitespace reformatting, tag nesting and unary tags. Not to mention handling broken XML appropriately (e.g. you should at least detect if tags aren't closed). Commented Aug 28, 2015 at 13:05
  • Well, as part of some troubleshooting, I need to understand some behaviour in records that contain a large amount of data, of which I need only a part. I think the best option would be to get it into Excel so that I can view it and filter for the exact values I'm looking for. That is why I'm thinking about running grep on the XML. Commented Aug 29, 2015 at 13:46

3 Answers

4

The xmlstarlet tool will do this:

xmlstarlet sel -t -m /A -o ID, -v id -n -o C, -v //C -n -o D, -v //D -n test.xml 

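For the root element A (-m /A), it prints the string "ID," (-o ID,), the contents of id (-v id), a newline (-n), and likewise for C (-v //C) and D (-v //D) with their respective headers. Note that a leading double slash in XPath means "anywhere in the document", not "anywhere under the matched node"; the two coincide here because A is the root element. If the real file wraps many records in an outer element, match each record (for example -m //A) and use the relative forms .//C and .//D instead.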

The result, as tested on my system, using your test file, is the comma-separated output:

ID,123
C,value1
D,value2

If you don't want the headers, omit the -o <whatever> arguments.
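If xmlstarlet is not available, roughly the same extraction can be sketched with Python's standard library. This is an illustration under assumptions, not part of the tested command above: it assumes each record is an `<A>` element, and the function name `extract_rows` and the column order are made up for the example. It uses `iterparse` so that a file with hundreds of thousands of records does not have to be loaded in one go, and it writes CSV, which Excel can open directly.

```python
import csv
import io
import sys
import xml.etree.ElementTree as ET


def extract_rows(source, record_tag="A", fields=("id", "C", "D")):
    """Yield one list of field values per <record_tag> element.

    `source` is a filename or a binary file object. Each field is looked up
    anywhere below the record; a missing field becomes an empty string.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        row = []
        for name in fields:
            found = elem.find(f".//{name}")
            row.append((found.text or "") if found is not None else "")
        yield row
        elem.clear()  # drop the record's children once it has been read


if __name__ == "__main__":
    # Hypothetical sample data; replace with a real filename for actual use.
    sample = io.BytesIO(
        b"<A><id>123</id><B><C>value1</C><D>value2</D><E/></B><Z/><Y/></A>"
    )
    writer = csv.writer(sys.stdout)
    writer.writerow(["id", "C", "D"])
    writer.writerows(extract_rows(sample))
```

One row per record is usually easier to filter in a spreadsheet than the one-field-per-line output above, which is why the sketch differs in layout.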

Thanks to this article for explanation.

0

To answer this question properly, we'd ideally need a better example - some valid xml is a good start.

Also - an example of desired output. You don't, for example, indicate where you'd want the <C> and <D> elements to end up within your resultant XML. They're already children of <B> - do you want to preserve B or reparent C and D to the root?

However, generically reconstructing XML is quite easy using XML::Twig and Perl.

E.g. like so:

#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my @wanted = qw ( C D id );
my %wanted = map { $_ => 1 } @wanted;

sub delete_unwanted_tags {
    my ( $twig, $element ) = @_;
    my $tag = $element -> tag;
    if ( not $wanted{$tag} ) {
        $element -> delete;
    }
}

my $twig = XML::Twig -> new ( twig_handlers => { _all_ => \&delete_unwanted_tags } );
$twig -> parse ( \*DATA );
$twig -> print;

__DATA__
<A>
  <id>123</id>
  <B>
    <C>value1</C>
    <D>value2</D>
    <E></E>
  </B>
  <Z></Z>
  <Y></Y>
</A>

Because we haven't said "keep <B>" the result is:

<A> <id>123</id> </A> 

Adding <B> to the wanted list:

<A> <id>123</id> <B> <C>value1</C> <D>value2</D> </B> </A> 
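For readers without Perl, the same pruning idea can be sketched with Python's standard library. This is only an analogue of the approach, not XML::Twig's behaviour reproduced exactly: the helper name `prune` is made up for the example, and it assumes the root element is always kept.

```python
import xml.etree.ElementTree as ET


def prune(element, wanted):
    """Remove every descendant of `element` whose tag is not in `wanted`.

    An unwanted element is removed together with its whole subtree, so a
    wanted tag survives only if all of its ancestors below the root are
    wanted too.
    """
    for child in list(element):  # copy the child list: we modify it while looping
        if child.tag in wanted:
            prune(child, wanted)
        else:
            element.remove(child)


xml_text = "<A><id>123</id><B><C>value1</C><D>value2</D><E/></B><Z/><Y/></A>"

root = ET.fromstring(xml_text)
prune(root, {"id", "C", "D"})  # B is not wanted, so C and D go with it
print(ET.tostring(root, encoding="unicode"))
# <A><id>123</id></A>

root = ET.fromstring(xml_text)
prune(root, {"id", "B", "C", "D"})  # keeping B preserves C and D
print(ET.tostring(root, encoding="unicode"))
# <A><id>123</id><B><C>value1</C><D>value2</D></B></A>
```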

If however, what you want to do is reparent C and D into A:

#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my @wanted = qw ( id);
my @reparent = qw ( C D );

#turn the above into hashes, so we can do "if $wanted{$tag}"
my %wanted = map { $_ => 1 } @wanted;
my %reparent = map { $_ => 1 } @reparent;

sub delete_unwanted_tags {
    my ( $twig, $element ) = @_;
    my $tag = $element->tag;
    if ( not $wanted{$tag} ) {
        $element->delete;
    }
    if ( $reparent{$tag} ) {
        $element->move( 'last_child', $twig->root );
    }
}

my $twig = XML::Twig->new(
    pretty_print  => 'indented_a',
    twig_handlers => { _all_ => \&delete_unwanted_tags }
);
$twig->parse( \*DATA );
$twig->print;

__DATA__
<A>
  <id>123</id>
  <B>
    <C>value1</C>
    <D>value2</D>
    <E></E>
  </B>
  <Z></Z>
  <Y></Y>
</A>

Note - the "twig handler" is called at the end of each element (when a close tag is encountered) which is why this works - we recurse down to find C and D before we finish processing (and deleting) B.
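The "children are finished before their parent" ordering is not specific to XML::Twig. As a small illustration, here is the same effect seen through the "end" events of Python's `xml.etree.ElementTree.iterparse`; the sample document is a trimmed copy of the one above, and this is a sketch of the ordering only, not of the Perl module's internals.

```python
import io
import xml.etree.ElementTree as ET

xml_bytes = b"<A><id>123</id><B><C>value1</C><D>value2</D></B></A>"

# An "end" event fires when an element's closing tag has been parsed,
# so C and D are reported before B, and B before A.
closed_order = [
    elem.tag
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",))
]
print(closed_order)  # ['id', 'C', 'D', 'B', 'A']
```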

This produces:

<A> <id>123</id> <C>value1</C> <D>value2</D> </A> 
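A comparable flattening can be sketched in Python's standard library. Rather than moving elements as the Perl script does, this version builds a new record and copies the wanted elements into it in document order. The function name `flatten_record` is hypothetical, and the sketch assumes the wanted elements are leaves that carry only text (attributes and nested children are not copied).

```python
import xml.etree.ElementTree as ET


def flatten_record(record, keep=frozenset({"id", "C", "D"})):
    """Return a new element holding copies of the kept descendants of `record`.

    Only the tag and text of each kept element are copied, so this suits
    simple leaf fields such as <id>, <C> and <D>.
    """
    flat = ET.Element(record.tag)
    for elem in record.iter():  # document order, starting with the record itself
        if elem is not record and elem.tag in keep:
            copy = ET.SubElement(flat, elem.tag)
            copy.text = elem.text
    return flat


record = ET.fromstring(
    "<A><id>123</id><B><C>value1</C><D>value2</D><E/></B><Z/><Y/></A>"
)
print(ET.tostring(flatten_record(record), encoding="unicode"))
# <A><id>123</id><C>value1</C><D>value2</D></A>
```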

In the above, I have used __DATA__, \*DATA and parse because that lets me show both the XML and the technique in one script. For a real file you should probably use parsefile('my_file.xml') instead of parse(\*DATA).

0

Use lxgrep from the ltXML2 toolkit (Edinburgh University), e.g.:

$ lxgrep -w A '(id|C|D)' test.xml
<A> <id>123</id> <C>value1</C> <D>value2</D> </A>

Using this kind of tool is far faster and more reliable than rolling your own.


XML FAQ: http://xml.silmaril.ie/
