
Trying to figure out the best way (using what I know of grep / sed / awk) to split up an XML file based on its individual string (key?). I have an XML file that is a SQL dump of all my current FAQ entries, so it contains an entry ID and then a rather large HTML-formatted document. I'm looking to split these entries up so I can easily pop them into an editor and clean up the formatting to import into a new KB / FAQ system. Here's an example of my data:

 <article id="3">
   <language>en</language>
   <category>Category Name</category>
   <keywords>Keywords, by, comma</keywords>
   <question>Question?</question>
   <answer>HTML Formatting</answer>
   <author>Author</author>
   <data>2010-05-13 09:32</data>
 </article>

The XML file contains every single KB article I have back to back in this format. I am comfortable with bash to figure it out, I just don't know how to split it into multiple files based on the search.

Cheers,

Clay

  • It'll probably be a lot easier to write a short PHP/Perl/Python script that parses your XML and writes it to different files. Commented Jul 11, 2012 at 19:17
  • You can find a short Perl solution to a similar problem here: stackoverflow.com/questions/8061475/… There are also some attempts with sed or awk that look like viable options. Commented Jul 11, 2012 at 19:18
  • You may be able to do something with a multi-line RS pattern in GNU awk, but I couldn't make it work in casual testing. A sed multiline pattern will be more trouble than it's worth. Your best bets will be Perl, Python, and Ruby unless you enjoy doing things like munging PyX just for the challenge of it all. Commented Jul 11, 2012 at 22:51

3 Answers


Use XPath to Extract Articles

If your file is valid XML, you can use a utility like xgrep or XMLStarlet to parse the file for an XPath expression. For example, using xgrep:

xgrep -x "//article[@id]" /tmp/foo 

This may be all you need. However, it won't split the articles; it just extracts the correct portions of your XML more reliably than with the use of regular expressions.

Split Article Nodes into Files with Pipeline

If you actually need to split the articles into separate files, you can do something like this:

xgrep -x "//article[@id]" /tmp/foo | ruby -ne '
  BEGIN { counter = 0 }
  counter += 1 if /<article/
  if /<article/ ... /<\/article/
    File.open("#{counter}.xml", "a") { |f| f.puts $_ }
  end
'

Obviously, you could do the whole thing with a Ruby XML library, but I prefer treating this sort of problem as a shell pipeline. Your mileage may vary.

Also, please note that the Ruby script above will number your articles sequentially instead of by article ID. This may be preferable if you have duplicate IDs in your XML.

Pure Ruby with XmlSimple

Okay, okay...I just couldn't leave this one alone. It seemed like a good idea at first to use the external shell utility in a pipeline as above, but if you're going to use Perl or Ruby anyway, you might as well just use the XmlSimple library.

The Ruby script below is a little longer than the pipeline version, but gives you much more control and flexibility. Consider all the possibilities you have with this as a starting point:

#!/usr/bin/env ruby
require 'xmlsimple'

node_name = 'article'
xml = XmlSimple.xml_in '/tmp/foo'

# Write each unique article node to a zero-padded, sequentially numbered file.
xml[node_name].uniq.each_with_index do |node, index|
  counter = sprintf('%03d', index + 1)
  XmlSimple.xml_out(node, RootName: node_name,
                          OutputFile: "/tmp/#{counter}.xml")
end


cat file.xml | \
  perl -p -e 'open(F, ">", $1 . ".xml") if /<article id="(\d+)"/; print F;'

This will split the XML file based on the articles' IDs; each article section is stored in its own file, with the ID number in the name. It works really fast, even on huge files (sed, awk, etc. solutions are really slow in this case).

1 Comment

cat isn't needed; perl can take file arguments. Also, when I try this locally I get cruft if there are other XML tags in the file (e.g. if the last line is </foo>).

Here's a simple idea for awk:

Whenever you hit a line with an article start tag, increment a counter variable by one. Then, for every line, make a system call like "echo $0 >> file$COUNTER". This should be very easy to implement.
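A minimal sketch of that idea, with one change: instead of a system() call per line, it uses awk's own output redirection, which is much faster. The sample data and file names are hypothetical, and it assumes each article's tags sit on their own lines:

```shell
# Hypothetical sample input, one tag per line.
cat > faq.xml <<'EOF'
<article id="1"><question>One?</question></article>
<article id="2"><question>Two?</question></article>
EOF

# Bump the counter on every <article> start tag; append each line to the
# file for the current article. Parentheses around the file name expression
# are required by awk's redirection syntax.
awk '/<article/ { counter++ }
     counter    { print > ("file" counter ".xml") }' faq.xml
```

Note that awk keeps each output file open; for thousands of articles you may need to close() the previous file before opening the next to stay under the open-file limit.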

