
Trying to figure out the best way (using what I know of grep / sed / awk) to split up an XML file based on its individual string (key?). I have an XML file that is a SQL dump of all my current FAQ entries, so it contains an entry ID and then a rather large HTML-formatted document. I'm looking to split these entries up so I can easily pop them into an editor and clean up the formatting to import into a new KB / FAQ system. Here's an example of my data:

 <article id="3">
   <language>en</language>
   <category>Category Name</category>
   <keywords>Keywords, by, comma</keywords>
   <question>Question?</question>
   <answer>HTML Formatting</answer>
   <author>Author</author>
   <data>2010-05-13 09:32</data>
 </article>

The XML file contains every single KB article I have back to back in this format. I am comfortable with bash to figure it out, I just don't know how to split it into multiple files based on the search.

Cheers,

Clay

  • It'll probably be a lot easier to write a short PHP/Perl/Python script that parses your XML and writes it to different files. Commented Jul 11, 2012 at 19:17
  • You can find a short Perl solution to a similar problem here: stackoverflow.com/questions/8061475/… There are also some attempts with sed or awk that look like viable options. Commented Jul 11, 2012 at 19:18
  • You may be able to do something with a multi-line RS pattern in GNU awk, but I couldn't make it work in casual testing. A sed multiline pattern will be more trouble than it's worth. Your best bets will be Perl, Python, and Ruby unless you enjoy doing things like munging PyX just for the challenge of it all. Commented Jul 11, 2012 at 22:51

3 Answers


Use XPath to Extract Articles

If your file is valid XML, you can use a utility like xgrep or XMLStarlet to parse the file for an XPath expression. For example, using xgrep:

xgrep -x "//article[@id]" /tmp/foo 

This may be all you need. However, it won't split the articles; it just extracts the correct portions of your XML more reliably than with the use of regular expressions.

Split Article Nodes into Files with Pipeline

If you actually need to split the articles into separate files, you can do something like this:

xgrep -x "//article[@id]" /tmp/foo | ruby -ne '
  BEGIN { counter = 0 }
  counter += 1 if /<article/
  if /<article/ ... /<\/article/
    File.open("#{counter}.xml", "a") { |f| f.puts $_ }
  end
'

Obviously, you could do the whole thing with a Ruby XML library, but I prefer treating this sort of problem as a shell pipeline. Your mileage may vary.

Also, please note that the Ruby script above will number your articles sequentially instead of by article ID. This may be preferable if you have duplicate IDs in your XML.

Pure Ruby with XmlSimple

Okay, okay...I just couldn't leave this one alone. It seemed like a good idea at first to use the external shell utility in a pipeline as above, but if you're going to use Perl or Ruby anyway, you might as well just use the XmlSimple library.

The Ruby script below is a little longer than the pipeline version, but gives you much more control and flexibility. Consider all the possibilities you have with this as a starting point:

#!/usr/bin/env ruby
require 'xmlsimple'

node_name = 'article'
xml = XmlSimple.xml_in '/tmp/foo'

# Write each unique article node to a zero-padded, sequentially numbered file.
xml[node_name].uniq.each_with_index do |node, index|
  counter = sprintf('%03d', index + 1)
  XmlSimple.xml_out(node, RootName: node_name,
                          OutputFile: "/tmp/#{counter}.xml")
end


cat file.xml | \
  perl -p -e 'open(F, ">", $1 . ".xml") if /<article id="(\d+)"/; print F;'

This will split the XML file based on the articles' IDs; each article section is stored in its own file, with the ID number in the name. It works really fast, even on huge files (sed, awk, etc. solutions are really slow in this case).

1 Comment

cat isn't needed; perl can take file arguments. Also, when I try this locally I get cruft if there are other XML tags in the file (e.g. if the last line is </foo>).

Here's a simple idea for awk:

Whenever you hit a line with an article start tag, increment a counter variable by one. Then, for every line, make a system call like "echo $0 >> file$COUNTER". This should be very easy to implement.
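A minimal sketch of that idea, with one change: instead of a system() call per line, it uses awk's own output redirection, which is much faster. The sample data and file names are hypothetical, and it assumes each article's tags sit on their own lines:

```shell
# Hypothetical sample input, one tag per line.
cat > faq.xml <<'EOF'
<article id="1"><question>One?</question></article>
<article id="2"><question>Two?</question></article>
EOF

# Bump the counter on every <article> start tag; append each line to the
# file for the current article. Parentheses around the file name expression
# are required by awk's redirection syntax.
awk '/<article/ { counter++ }
     counter    { print > ("file" counter ".xml") }' faq.xml
```

Note that awk keeps each output file open; for thousands of articles you may need to close() the previous file before opening the next to stay under the open-file limit.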

