How can I find the n-th '<' symbol containing word in an XML-like text file?

Question

I have an XML-like text file, which cannot be parsed with an XML parser due to XML violations:

<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

I just want to extract the word that follows the n-th opening < in the file. The file is laid out like XML, so line breaks can fall anywhere.

My expected output would be:

1 - note
2 - to
3 - /to
4 - from
5 - /from
6 - heading
7 - /heading
8 - body
9 - /body
10 - /note

This task is far from trivial: consider that your input XML structure has arbitrarily nested levels: <note> <to><to_1><to_1_1>to_child1_1</to_1_1><to_1_2>to_child2_2</to_1_2></to_1><to_2>to_child2</to_2></to> <from>Jani</from><heading>Reminder</heading></note> What should be the 5th node in such a case?
– RomanPerekhrest, Commented Feb 15, 2018 at 6:35
@RomanPerekhrest Updated my question to clarify the requirement.
– Bhanuchander Udhayakumar, Commented Feb 15, 2018 at 6:43
Your requirements say you're only interested in opening tags, but your example suggests you are also interested in closing tags. What about self-closing tags? And what is a "word": do you mean any valid XML name, including colons?
– Michael Kay, Commented Feb 15, 2018 at 9:31
Also, I don't understand all the solutions, but I think they might not behave the way you want if there are attributes in the start tags. (But since you haven't said what you want, perhaps you don't care.)
– Michael Kay, Commented Feb 15, 2018 at 9:44

Kamaraj · Accepted Answer · 2018-02-15 09:10:43Z

$ awk -F"[<>]" '{for(i=2;i<=NF;i+=2){print ++j" - "$i}}' input.xml
1 - note
2 - to
3 - /to
4 - from
5 - /from
6 - heading
7 - /heading
8 - body
9 - /body
10 - /note
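
The command above numbers every tag. If only the n-th one is wanted, the same field-splitting idea can stop early. The following is a minimal sketch, not part of the original answer: nth_tag is a hypothetical helper name, and it assumes that no tag is split across lines and that no stray < or > appears in the text content.

```shell
#!/bin/sh
# nth_tag N FILE: print the text between the N-th '<' and the following '>'.
# (nth_tag is a hypothetical helper name used only for illustration.)
nth_tag() {
    awk -F'[<>]' -v n="$1" '
        {
            # With "<" and ">" as field separators, the even-numbered
            # fields on each line are the tag bodies.
            for (i = 2; i <= NF; i += 2)
                if (++j == n) { print $i; exit }
        }' "$2"
}
```

For the sample file in the question, nth_tag 3 input.xml should print /to.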

Kusalananda · Accepted Answer · 2018-02-15 10:42:53Z

Note: This answer was written before the user explained that the XML was not well formed. I'm leaving it here as it may possibly help others.

XMLStarlet is able to produce the element structure of XML documents:

$ xml el file.xml
note
note/to
note/from
note/heading
note/body

This is different from your expected output, but may be enough for what you want to achieve.

It is also able to convert the XML to PYX, which shows the opening and closing tags on separate lines:

$ xml pyx file.xml
(note
-\n
(to
-Tove
)to
-\n
(from
-Jani
)from
-\n
(heading
-Reminder
)heading
-\n
(body
-Don't forget me this weekend!
)body
-\n
)note

From this, it's easy to get exactly the output you are after:

$ xml pyx file.xml | sed -n -e 's/^(//p' -e 's/^)/\//p' | nl
     1  note
     2  to
     3  /to
     4  from
     5  /from
     6  heading
     7  /heading
     8  body
     9  /body
    10  /note

The sed instructions discard every line that does not start with ( or ): a leading ( is deleted and a leading ) is replaced by /, matching the format you specified in the question. The nl utility then numbers the remaining lines.
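
The sed and nl stage can be tried without XMLStarlet by feeding it a few hand-written PYX lines. This is only a sketch: pyx_to_tags is a hypothetical wrapper name, and the three input lines are a shortened version of the PYX output above.

```shell
#!/bin/sh
# pyx_to_tags: read PYX on standard input and print numbered tag names.
# (pyx_to_tags is a hypothetical wrapper name used only for illustration.)
pyx_to_tags() {
    # -n suppresses default output; each s///p prints only the lines it changed,
    # so text lines (starting with "-") are dropped.
    sed -n -e 's/^(//p' -e 's/^)/\//p' | nl
}

# Three PYX lines: an opening tag, its text, and the closing tag.
printf '%s\n' '(to' '-Tove' ')to' | pyx_to_tags
```

Only the two tag lines survive, numbered 1 and 2, as to and /to.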

XMLStarlet is sometimes installed as xmlstarlet rather than xml.
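
A script that has to run on both kinds of installation could probe for the name first. A minimal sketch, assuming a POSIX shell; XML_BIN is a hypothetical variable name, not something XMLStarlet defines.

```shell
#!/bin/sh
# Pick whichever XMLStarlet command name is installed.
# XML_BIN is a hypothetical variable name used only for illustration.
if command -v xmlstarlet >/dev/null 2>&1; then
    XML_BIN=xmlstarlet
elif command -v xml >/dev/null 2>&1; then
    XML_BIN=xml
else
    XML_BIN=    # neither name found; the caller should report an error
fi
```

The later commands would then be written as "$XML_BIN" pyx file.xml and so on.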

RomanPerekhrest · Accepted Answer · 2018-02-15 07:39:08Z

grep + awk solution:

grep -Eo '<[^<>]+>' input.xml | awk '{ gsub(/[<>]/,""); printf "%-3s - %s\n", NR, $0 }'

The output:

1   - note
2   - to
3   - /to
4   - from
5   - /from
6   - heading
7   - /heading
8   - body
9   - /body
10  - /note

Or with a single GNU awk command:

awk -v FPAT='</?[^<>]+>' '{ for(i=1;i<=NF;i++) printf "%-3s - %s\n", ++c, $i }' input.xml
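
A comment on the question points out that start tags with attributes may not come out as wanted: both commands print everything between < and >, attributes included. One possible variation of the grep + awk pipeline keeps only the name. This is a sketch under stated assumptions, not the answer's own method: tag_names is a hypothetical helper name, the name is assumed to end at the first blank, a trailing / on a self-closing tag is dropped, and each tag is assumed to sit on a single line.

```shell
#!/bin/sh
# tag_names FILE: number each tag, keeping only its name.
# (tag_names is a hypothetical helper name used only for illustration.)
tag_names() {
    grep -Eo '<[^<>]+>' "$1" |
        awk '{
            gsub(/[<>]/, "")       # drop the angle brackets
            sub(/[ \t].*$/, "")    # assumption: the name ends at the first blank
            sub(/\/$/, "")         # assumption: trim the slash of a self-closing tag
            printf "%-3s - %s\n", NR, $0
        }'
}
```

With this, a start tag such as <to lang="en"> is listed as to, and <br/> as br.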

edited Feb 15, 2018 at 7:39

answered Feb 15, 2018 at 7:31

@Uchiha_Itachi, you're welcome. You may also try my second GNU awk one-liner.
– RomanPerekhrest, Commented Feb 15, 2018 at 7:39
@user32929 You are exactly right. Here I am stuck with a non-XML file that was meant to be an XML file. Somehow it came out broken and can't be parsed. Thank you for your response.
– Bhanuchander Udhayakumar, Commented Feb 15, 2018 at 9:38
Unfortunately there are a lot of people trying to write XML without using a proper library and getting it wrong, forcing the recipients to parse the bad XML without using a proper library either. If these people were plumbers or electricians they would lose their licence to practice.
– Michael Kay, Commented Feb 15, 2018 at 11:52

francois P · Accepted Answer · 2018-02-15 07:38:45Z

Here is a fairly easy way to answer your question about extracting opening tags. Your example also asks for the closing ones, which seems pointless, because a tag that is closed was necessarily opened. Do you really need the closing ones too? If the goal is to check the XML format, use a tool like xmllint instead.

bash-4.4$ cat > toto
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
bash-4.4$ awk '{ match($0,/<\/.*>/); b=substr($0,RSTART,RLENGTH); if(b) {a[++i]=b} } END{ {for(k in a) {c[a[k]]=k} } {for(u in c) {gsub(/\//,X,u);print u} } }' toto | sed 's/</- /;s/>//' | cat -n
     1  - body
     2  - note
     3  - to
     4  - heading
     5  - from
bash-4.4$ rm toto

Or, to keep all the tags, using only sed for fun:

bash-4.4$ sed -e 's/>\(.*\)</></;s/>/\n/g;s/</- /g' toto | sed '/^$/ d' | cat -n
     1  - note
     2  - to
     3  - /to
     4  - from
     5  - /from
     6  - heading
     7  - /heading
     8  - body
     9  - /body
    10  - /note
    11
bash-4.4$

Michael Kay · Accepted Answer · 2018-02-15 09:36:46Z

Here's an XQuery solution just in case you want something that works on ANY XML, even awkward XML containing comments, DTDs, self-closing elements, etc.

declare function local:f($e) {
  $e / (name(), local:f(*), ('/' || name()))
};
for $tag at $p in local:f(/*)
return ($p || ' - ' || $tag || '&#xa;')
