2

I have an XML-like text file, which cannot be parsed with an XML parser due to XML violations:

<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

I just want to extract the word that follows the n-th tag-open character < in the file. The file follows XML grammar, so the line layout can vary.

My expected output would be,

1 - note
2 - to
3 - /to
4 - from
5 - /from
6 - heading
7 - /heading
8 - body
9 - /body
10 - /note
7
  • This task is far from trivial: consider that your input xml structure has arbitrary nested levels: <note> <to><to_1><to_1_1>to_child1_1</to_1_1><to_1_2>to_child2_2</to_1_2></to_1><to_2>to_child2</to_2></to> <from>Jani</from><heading>Reminder</heading></note> What should be the 5th node in such case? Commented Feb 15, 2018 at 6:35
  • @RomanPerekhrest I updated my question to clarify the requirement. Commented Feb 15, 2018 at 6:43
  • @RomanPerekhrest I changed the title. I hope it looks better now. Commented Feb 15, 2018 at 7:16
  • Your requirements say you're only interested in opening tags but your example suggests you are also interested in closing tags. What about self-closing tags? And what is a "word" - do you mean any valid XML name, including colons? Commented Feb 15, 2018 at 9:31
  • Also, I don't understand all the solutions, but I think they might not behave the way you want if there are attributes in the start tags. (But since you haven't said what you want, perhaps you don't care.) Commented Feb 15, 2018 at 9:44

5 Answers

3
$ awk -F"[<>]" '{for(i=2;i<=NF;i+=2){print ++j" - "$i}}' input.xml
1 - note
2 - to
3 - /to
4 - from
5 - /from
6 - heading
7 - /heading
8 - body
9 - /body
10 - /note
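Since the question asks for the word after the n-th `<`, the same field splitting can be wrapped so that it prints only one tag. This is a minimal sketch, not part of the original answer: `nth_tag` is a hypothetical helper name, and it assumes every `<` is closed by a `>` on the same line and that the text between tags contains neither character.

```shell
# nth_tag N [FILE...]: print the N-th tag (1-based) found in the input.
# Hypothetical helper; assumes "<" and ">" are balanced on every line.
nth_tag() {
    n=$1
    shift
    awk -F'[<>]' -v n="$n" '
        {
            # With "<" and ">" as separators, the even-numbered fields
            # hold the text between each "<" and its ">".
            for (i = 2; i <= NF; i += 2)
                if (++count == n) { print $i; exit }
        }
    ' "$@"
}

# Example: the 7th tag of the sample note.
printf '%s\n' '<note>' '<to>Tove</to>' '<from>Jani</from>' \
    '<heading>Reminder</heading>' \
    "<body>Don't forget me this weekend!</body>" '</note>' | nth_tag 7
```

For the sample note this prints `/heading`. With a file, the call would be `nth_tag 7 input.xml`.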
3

Note: This answer was written before the user explained that the XML was not well formed. I'm leaving it here as it may possibly help others.


XMLStarlet is able to produce the element structure of XML documents:

$ xml el file.xml
note
note/to
note/from
note/heading
note/body

This is different from your expected output, but may be enough for what you want to achieve.

It is also able to convert the XML to PYX, which shows the opening and closing tags on separate lines:

$ xml pyx file.xml
(note
-\n
(to
-Tove
)to
-\n
(from
-Jani
)from
-\n
(heading
-Reminder
)heading
-\n
(body
-Don't forget me this weekend!
)body
-\n
)note

From this, it's easy to get exactly the output you are after:

$ xml pyx file.xml | sed -n -e 's/^(//p' -e 's/^)/\//p' | nl
     1  note
     2  to
     3  /to
     4  from
     5  /from
     6  heading
     7  /heading
     8  body
     9  /body
    10  /note

The sed instructions drop every line that does not start with ( or ). For the lines that remain, the leading ( of an opening tag is deleted and the leading ) of a closing tag is replaced with /, as specified in the question. The nl utility then numbers the lines.


XMLStarlet is sometimes installed as xmlstarlet rather than xml.
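The sed stage can be tried without XMLStarlet installed by feeding it PYX text directly. The sketch below is an illustration only: `pyx_tags` is a hypothetical helper name, and it swaps `nl` for a short awk command purely to get the `N - tag` layout shown in the question.

```shell
# pyx_tags: read PYX on stdin, print "N - tag" for each opening/closing tag.
# Hypothetical helper; awk stands in for nl to produce the "N - tag" layout.
pyx_tags() {
    # Keep only "(" and ")" lines; strip "(" and turn ")" into "/".
    sed -n -e 's/^(//p' -e 's/^)/\//p' | awk '{ print NR " - " $0 }'
}

# Example with a hand-written PYX fragment (no XMLStarlet needed).
printf '%s\n' '(note' '(to' '-Tove' ')to' ')note' | pyx_tags
```

For this fragment the output is `1 - note`, `2 - to`, `3 - /to`, `4 - /note`, one per line. With XMLStarlet available, the call would be `xml pyx file.xml | pyx_tags`.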

2

grep + awk solution:

grep -Eo '<[^<>]+>' input.xml | awk '{ gsub(/[<>]/,""); printf "%-3s - %s\n", NR, $0 }' 

The output:

1   - note
2   - to
3   - /to
4   - from
5   - /from
6   - heading
7   - /heading
8   - body
9   - /body
10  - /note

Or with single GNU awk command:

awk -v FPAT='</?[^<>]+>' '{ for(i=1;i<=NF;i++) printf "%-3s - %s\n", ++c, $i }' input.xml 
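If start tags can carry attributes, both commands above print the attributes along with the name. The following sketch is an assumption about what is wanted in that case, not part of the original answer: `tag_names` is a hypothetical helper that keeps only the name (with a leading `/` for closing tags). It reports a self-closing tag such as `<br/>` as `br`, and it does not treat comments or CDATA sections specially.

```shell
# tag_names [FILE]: print "N - name" for every tag, dropping attributes.
# Hypothetical helper; reads stdin when no file is given.
tag_names() {
    grep -Eo '<[^<>]+>' "$@" |
        # Keep an optional "/" plus the name; drop attributes, "/>" and ">".
        sed -E 's|^<(/?[^[:space:]/>]+).*$|\1|' |
        awk '{ print NR " - " $0 }'
}

# Example: tags with attributes and a self-closing element.
printf '%s\n' '<note id="1">' '<to lang="en">Tove</to>' '<br/>' '</note>' |
    tag_names
```

For this input the output is `1 - note`, `2 - to`, `3 - /to`, `4 - br`, `5 - /note`, one per line.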
3
  • @Uchiha_Itachi, you're welcome. You may also try my second GNU awk one-liner Commented Feb 15, 2018 at 7:39
  • @user32929 You are exactly right. I am stuck with a non-XML file that was meant to be an XML file. Somehow something went wrong when it was created, and I can't parse it. Thank you for your response. Commented Feb 15, 2018 at 9:38
  • Unfortunately there are a lot of people trying to write XML without using a proper library and getting it wrong, forcing the recipients to parse the bad XML without using a proper library either. If these people were plumbers or electricians they would lose their licence to practice. Commented Feb 15, 2018 at 11:52
2

Here is a fairly easy method to answer your question about extracting the opening tags. Your example also asks for the closing ones, which seems unnecessary, because a tag that is closed must also have been opened. Do you really need the closing ones too? If what you want is to check the XML format, use a tool like xmllint instead.

bash-4.4$ cat > toto
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
bash-4.4$ awk '{ match($0,/<\/.*>/); b=substr($0,RSTART,RLENGTH); if(b) {a[++i]=b} } END{ {for(k in a) {c[a[k]]=k} } {for(u in c) {gsub(/\//,X,u);print u} } }' toto | sed 's/</- /;s/>//' | cat -n
     1  - body
     2  - note
     3  - to
     4  - heading
     5  - from
bash-4.4$ rm toto

Or, to keep all tags, using only sed for fun:

bash-4.4$ sed -e 's/>\(.*\)</></;s/>/\n/g;s/</- /g' toto | sed '/^$/ d' | cat -n
     1  - note
     2  - to
     3  - /to
     4  - from
     5  - /from
     6  - heading
     7  - /heading
     8  - body
     9  - /body
    10  - /note
    11
bash-4.4$
1

Here's an XQuery solution just in case you want something that works on ANY XML, even awkward XML containing comments, DTDs, self-closing elements, etc.

declare function local:f($e) {
  $e / (name(), local:f(*), ('/' || name()))
};
for $tag at $p in local:f(/*)
return ($p || ' - ' || $tag || '&#xa;')
