I have a log which contains XML lines. Sample format below:

<head>
<body>
<line>
asdasd</line>
</body>
</head>

I want to scan the log file and append the lines that do not start with '<' to the previous line. Output would be like below:

<head>
<body>
<line>asdasd</line>
</body>
</head>

Thanks

4 Answers

3

I think I've said this before, but at the risk of sounding like a stuck record: DON'T use regular expressions to parse XML. It's brittle and prone to breaking. I would ask first, though: why are you trying to do this? Where the line breaks fall should be irrelevant when working with your XML.

Instead use a parser:

#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->parsefile('your_file.xml');
foreach my $elt ( $twig->get_xpath('//#PCDATA') ) {
    $elt->set_text( $elt->trimmed_text );
}
$twig->set_pretty_print('indented_a');
$twig->print;

This does what you want... but if you're actually working with the XML normally, that trimmed_text method probably removes the need for this processing anyway.
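For comparison, here is a rough Python sketch of the same parser-based idea, using the standard library's xml.etree.ElementTree instead of XML::Twig. The function name strip_text_nodes and the inline sample are made up for illustration. It only trims the text directly inside each element (not tail text in mixed content), so treat it as a starting point rather than a drop-in replacement.

```python
import xml.etree.ElementTree as ET


def strip_text_nodes(xml_source: str) -> str:
    """Parse XML and trim whitespace around the text of each element.

    Hypothetical helper: whitespace-only text (the indentation between
    tags) is left alone so the overall layout of the document survives.
    """
    root = ET.fromstring(xml_source)
    for elem in root.iter():
        if elem.text and elem.text.strip():
            elem.text = elem.text.strip()
    return ET.tostring(root, encoding="unicode")


# Sample shaped like the log excerpt in the question.
sample = "<head>\n<body>\n<line>\nasdasd</line>\n</body>\n</head>\n"
print(strip_text_nodes(sample))
```

Because the document is parsed rather than pattern-matched, it does not matter where the line breaks fall inside the line element.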

1
  • I can't vote today anymore, but I wanted to say that this is the correct answer. Commented Nov 17, 2015 at 14:55
2

Perl to the rescue!

perl -pe 'print "\n" if /^\s*+</; chomp;' input > output 

i.e. the newline is removed from each line, and a newline is printed before any line that starts with optional whitespace followed by a <.

To keep the final newline, change chomp to chomp unless eof, or add END { print "\n" }.
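The same joining logic can be sketched in Python, in case the control flow of the one-liner is hard to follow. This is not a translation of the Perl above, just the same rule spelled out: a line that does not start with optional whitespace plus < is glued onto the previous line. The name join_continuations is hypothetical.

```python
def join_continuations(lines):
    """Append every line that does not start with '<' to the previous line.

    `lines` is any iterable of strings (for example an open file).
    Leading whitespace before '<' is tolerated, as in the Perl regex.
    """
    joined = []
    for raw in lines:
        line = raw.rstrip("\n")
        if joined and not line.lstrip().startswith("<"):
            joined[-1] += line  # continuation: glue onto the previous line
        else:
            joined.append(line)  # a new logical line begins here
    return joined


sample = ["<head>\n", "<body>\n", "<line>\n", "asdasd</line>\n", "</body>\n", "</head>\n"]
print("\n".join(join_continuations(sample)))
```

To run it over a file, pass the open file handle instead of the sample list and write the joined lines back out, each followed by a newline.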

1
  • Listen to @Sobrique: use an xml parser for this. It will help you immensely in the future. Commented Nov 17, 2015 at 14:56
1

Almost standard sed procedure:

sed '$!N;s/\n\(\s*[^<[:blank:]]\)/\1/;P;D' log.xml 
0

Using the XPath function normalize-space to remove the initial newline of the /head/body/line node:

xmlstarlet edit --update '/head/body/line' --expr 'normalize-space(text())' file.xml 

Or, using abbreviated names:

xmlstarlet ed -u '/head/body/line' -x 'normalize-space(text())' file.xml 

The output, given the input in the question, would be

<?xml version="1.0"?>
<head>
<body>
<line>asdasd</line>
</body>
</head>

Use //line in place of the full path from the root node if you want to affect all line nodes in your input document.

Add -O or --omit-decl after edit or ed to discard the <?xml ...> declaration at the start of the resulting document.
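If xmlstarlet is not at hand, the difference between the full path and //line can be approximated in Python. This is only a sketch: normalize_space and update_lines are hypothetical helper names, the regex imitates what XPath's normalize-space() does (trim, then collapse runs of XML whitespace into one space), and elem.text only covers the text before an element's first child.

```python
import re
import xml.etree.ElementTree as ET

# XML whitespace: space, tab, carriage return, line feed.
_XML_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(text):
    """Approximate XPath normalize-space(): collapse whitespace runs, then trim."""
    return _XML_WS.sub(" ", text or "").strip(" ")


def update_lines(xml_source, everywhere=False):
    """Normalize the text of <line> elements.

    everywhere=False: only body/line under the root, like /head/body/line.
    everywhere=True:  every <line> in the document, like //line.
    """
    root = ET.fromstring(xml_source)
    targets = root.iter("line") if everywhere else root.findall("body/line")
    for elem in targets:
        elem.text = normalize_space(elem.text)
    return ET.tostring(root, encoding="unicode")


doc = "<head><body><line>\nasdasd</line></body><extra><line>  a   b </line></extra></head>"
print(update_lines(doc))
print(update_lines(doc, everywhere=True))
```

With the default, only the line element under body is changed; with everywhere=True the one under the made-up extra element is normalized too.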
