I have a huge XML file that I want to split into chunks based on the product type attribute.

I don't know how to use XSLT. I found xml_split, but I can't figure out how to use it with a regex or an XPath expression to split the document depending on the type attribute.

    <?xml version="1.0"?>
    <!DOCTYPE catalog SYSTEM "catalog.dtd">
    <catalog>
      <product type="cloths" product_image="cardigan.jpg">
        <catalog_item gender="Men's">
          <item_number>QWZ5671</item_number>
          <price>39.95</price>
          <size description="Medium">
            <color_swatch image="red_cardigan.jpg">Red</color_swatch>
            <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch>
          </size>
          <size description="Large">
            <color_swatch image="red_cardigan.jpg">Red</color_swatch>
            <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch>
          </size>
        </catalog_item>
        <catalog_item gender="Women's">
          <item_number>RRX9856</item_number>
          <price>42.50</price>
          <size description="Small">
            <color_swatch image="red_cardigan.jpg">Red</color_swatch>
            <color_swatch image="navy_cardigan.jpg">Navy</color_swatch>
            <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch>
          </size>
          <size description="Medium">
            <color_swatch image="red_cardigan.jpg">Red</color_swatch>
            <color_swatch image="navy_cardigan.jpg">Navy</color_swatch>
            <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch>
            <color_swatch image="black_cardigan.jpg">Black</color_swatch>
          </size>
          <size description="Large">
            <color_swatch image="navy_cardigan.jpg">Navy</color_swatch>
            <color_swatch image="black_cardigan.jpg">Black</color_swatch>
          </size>
          <size description="Extra Large">
            <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch>
            <color_swatch image="black_cardigan.jpg">Black</color_swatch>
          </size>
        </catalog_item>
      </product>
    </catalog>

I used this command:

    xml_split -c /catalog/product[@type='cloths'] products.xml

but it reproduces the complete XML document; the XPath filter is not applied.

  • And what result are you trying to get? Because that says 'everything under that product' and that's everything in your XML. Which elements are you trying to separate out? Commented Dec 31, 2015 at 20:59
  • The XML is 400k in size and contains <product type="X"> elements, where X is cloths, electronics, etc. The sample above is just a single product of one type; I want to split the 400k file into chunks based on the type attribute. Commented Dec 31, 2015 at 21:04
  • Do you just want the product element? And are they unique? (is there only one product element of each type?) Commented Dec 31, 2015 at 21:07
  • Also: 400k isn't huge. 400G is huge. A 400k XML file will take maybe 4MB of memory, which isn't an insufferable amount, meaning you can do some other tricks (like reparent nodes into new documents) Commented Dec 31, 2015 at 21:09
  • Sorry, each product type has many entries: grep -c '<product type="cloths"' products.xml outputs 8039, and the file is 400 MB :) Commented Dec 31, 2015 at 21:12

1 Answer

OK, so - if I read you right, you're looking at separating out your product types into separate files.

I'd probably do that like this, using XML::Twig:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use XML::Twig;

    # Append each <product> to a file named after its type attribute.
    sub split_product {
        my ( $twig, $product ) = @_;
        open( my $output, '>>', $product->att('type') . ".xml" ) or warn $!;
        print {$output} $product->sprint;
        $twig->purge;    # release what has been parsed so far
    }

    my $twig = XML::Twig->new(
        pretty_print  => 'indented_a',
        twig_handlers => { 'product' => \&split_product }
    );
    $twig->parsefile('source.xml');

This won't preserve the XML structure, though; it just writes the 'product' elements into a new file. (And the result won't be well-formed XML if there are multiple products of the same type, because the file ends up with more than one root element.)
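
To illustrate the point about multiple products, here is a minimal sketch that checks a few strings with XML::Twig's safe_parse, which returns the twig on success and a false value on failure. The helper name is_well_formed and the sample strings are made up for this example. Two sibling 'product' elements with no enclosing element should be rejected, while the same two elements wrapped in a 'catalog' root should parse.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: true if the string parses as a single XML document.
sub is_well_formed {
    my ($xml) = @_;
    return XML::Twig->new->safe_parse($xml) ? 1 : 0;
}

# Sample data invented for this sketch.
my $one     = '<product type="cloths"/>';
my $two     = $one . "\n" . $one;          # two root elements, no wrapper
my $wrapped = "<catalog>$two</catalog>";   # same products under one root

for my $case ( [ 'one product', $one ], [ 'two products', $two ], [ 'wrapped', $wrapped ] ) {
    my ( $label, $xml ) = @$case;
    printf "%-14s %s\n", $label, is_well_formed($xml) ? 'well-formed' : 'not well-formed';
}
```

This is why the later versions build a new root element for each output file.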

OK, so given multiple products of each type, it's necessary to traverse the file. This makes it more complicated, because you can't 'close off' your XML until you know what needs to be in it, which means you need to traverse your tree twice, potentially.

The simpler (memory intensive) way of tackling this problem would be:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use XML::Twig;

    my %products;    # one output document per product type

    sub split_product {
        my ( $twig, $product ) = @_;
        my $type = $product->att('type');

        # First product of this type: create a new document for it.
        # The root is named 'catalog' so that it matches the DOCTYPE.
        if ( not $products{$type} ) {
            my $new_product = XML::Twig->new;
            $new_product->set_root( XML::Twig::Elt->new('catalog') );
            $new_product->set_xml_version('1.0');
            $new_product->set_encoding('utf-8');
            $new_product->set_doctype('catalog SYSTEM "catalog.dtd"');
            $products{$type} = $new_product;
        }

        # Move the product out of the source tree and into its document.
        $product->cut;
        $product->paste( 'last_child', $products{$type}->root );
        $twig->purge;
    }

    my $twig = XML::Twig->new(
        pretty_print  => 'indented_a',
        twig_handlers => { 'product' => \&split_product }
    );
    $twig->parsefile('your_file.xml');

    # Write one file per product type once parsing has finished.
    foreach my $product_type ( keys %products ) {
        open( my $output, '>', "$product_type.xml" ) or warn $!;
        print {$output} $products{$product_type}->sprint;
    }

This will cut it up into separate valid documents, but be warned - it will consume about 10x the size of your XML in memory.

And last, but not least - a (hopefully!) less memory intensive version, that uses flush and purge to dump parsed XML.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use XML::Twig;

    my %products;         # one output document per product type
    my %product_files;    # one open filehandle per product type

    sub split_product {
        my ( $twig, $product ) = @_;
        my $type = $product->att('type');

        # First product of this type: create its document and open its file.
        # The root is named 'catalog' so that it matches the DOCTYPE.
        if ( not $products{$type} ) {
            my $new_product = XML::Twig->new;
            $new_product->set_root( XML::Twig::Elt->new('catalog') );
            $new_product->set_xml_version('1.0');
            $new_product->set_encoding('utf-8');
            $new_product->set_doctype('catalog SYSTEM "catalog.dtd"');
            $products{$type} = $new_product;
            open( $product_files{$type}, '>', "$type.xml" ) or warn $!;
        }

        $product->cut;
        $product->paste( 'last_child', $products{$type}->root );
        $twig->purge;

        # Write out what this document holds so far and release it.
        $products{$type}->flush( $product_files{$type} );
    }

    my $twig = XML::Twig->new(
        pretty_print  => 'indented_a',
        twig_handlers => { 'product' => \&split_product }
    );
    $twig->parsefile('your_file.xml');

    # Final flush closes off each document, then close the files.
    foreach my $product_type ( keys %products ) {
        $products{$product_type}->flush( $product_files{$product_type} );
        close( $product_files{$product_type} );
    }

If you want to select just one particular type, you can either set it within the script:

    my $target_type = 'cloths';

Or read it from @ARGV (command line args):

    my ( $target_type ) = @ARGV;

Then there are two ways to apply it. One is to set the key in 'twig_handlers' to:

    "product[\@type=\"$target_type\"]" => \&split_product

That means the handler only fires for matching products, though, so purging data from memory happens less often. The alternative is to keep the handler on every 'product' and skip the ones you don't want by adding this at the top of the handler:

    if ( $product->att('type') ne $target_type ) {
        $twig->purge;
        return;
    }
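
Putting the single-type idea together, here is a rough sketch of a complete script. It is a variation on the approach above, not a tested replacement for it: instead of building a second twig, it prints a 'catalog' wrapper by hand and streams each matching product straight to the output file, purging after every product. The function name extract_type, its argument order, the '>:encoding(UTF-8)' output layer and the omission of the DOCTYPE line are all assumptions made for this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: copy every <product> whose type attribute equals
# $target_type from $source into $dest, wrapped in a new <catalog> root.
# Returns the number of products written.
sub extract_type {
    my ( $target_type, $source, $dest ) = @_;

    open( my $output, '>:encoding(UTF-8)', $dest )
        or die "Cannot open $dest: $!";
    print {$output} qq{<?xml version="1.0" encoding="UTF-8"?>\n<catalog>\n};

    my $matched = 0;
    my $twig    = XML::Twig->new(
        pretty_print  => 'indented_a',
        twig_handlers => {
            'product' => sub {
                my ( $t, $product ) = @_;
                my $type = $product->att('type') // '';
                if ( $type eq $target_type ) {
                    print {$output} $product->sprint, "\n";
                    $matched++;
                }
                $t->purge;    # free the product whether or not it matched
            },
        },
    );
    $twig->parsefile($source);

    print {$output} "</catalog>\n";
    close($output) or die "Cannot close $dest: $!";
    return $matched;
}

# Runs only when called as: script.pl <type> <source.xml> <dest.xml>
if ( @ARGV == 3 ) {
    my $count = extract_type(@ARGV);
    print "Wrote $count product(s)\n";
}
```

Because nothing is kept between products, memory use should stay roughly proportional to the largest single product rather than to the whole file.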