Parsing text of a given format in Python

Question

I want to parse a file which looks like this:

<item> <one-of> <item> deepa vats </item> <item> deepa <ruleref uri="#Dg-e_n_t41"/> </item> </one-of> <tag> out = "u-dvats"; </tag> </item> <item> <one-of> <item> maitha al owais </item> <item> doctor maitha </item> <item> maitha <ruleref uri="#Dg-clinical_nutrition24"/> </item> </one-of> <tag> out = "u-mal_owais"; </tag> </item>

The result should be "username : out" pairs, for example:

deepa vats : u-dvats and maitha al owais : u-mal_owais

To extract the username, I tried:

import re

list1 = [j for i, j in re.findall(r"(<item>)\s*(.*?)\s*(?!\1)(?:</item>)", line)]
print(list1)
if len(list1) != 0:
    print(list1[0].split("<item>")[-1])

What have you tried? Suggested reading: How to Ask, and minimal reproducible example.
– Mark Tolonen, Commented Sep 10, 2017 at 5:18

l4sh · Accepted Answer · 2017-09-10 06:39:57Z

You can parse the XML with objectify from lxml.

To parse an XML string you could use objectify.fromstring(). Then you can use dot notation or square bracket notation to navigate through the element and use the text property to get the text inside the element. Like so:

item = objectify.fromstring(item_str)
item_text = item.itemchild['anotherchild'].otherchild.text

From there you can manipulate the string and format it.

In this case I can see that you want the text inside item >> one-of >> item and the text inside item >> tag. In order to get it we could do something like this:

>>> from lxml import objectify
>>> item_str = '<item> <one-of> <item> maitha al owais </item> <item> doctor maitha </item> <item> maitha <ruleref uri="#Dg-clinical_nutrition24"/> </item> </one-of> <tag> out = "u-mal_owais"; </tag> </item>'
>>> item = objectify.fromstring(item_str)
>>> item_text = item['one-of'].item.text
>>> tag_text = item['tag'].text
>>> item_text
' maitha al owais '
>>> tag_text
' out = "u-mal_owais"; '

Since Python doesn't allow hyphens in attribute names (so item.one-of is not valid syntax), and since tag is already a property of the objectify element, you have to use bracket notation instead of dot notation for both of these children.
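The "manipulate the string and format it" step could look roughly like the sketch below. It only assumes the two raw strings shown in the session above; format_pair is a hypothetical helper name, not part of lxml, and the regular expression assumes the tag text always has the form out = "...";.

```python
import re


def format_pair(item_text, tag_text):
    """Build a 'username : out' line from the raw element texts (hypothetical helper)."""
    # Collapse leading, trailing and repeated inner whitespace in the username.
    username = " ".join(item_text.split())
    # Pull the quoted value out of text such as ' out = "u-mal_owais"; '.
    match = re.search(r'out\s*=\s*"([^"]*)"', tag_text)
    if match is None:
        raise ValueError(f"no out value found in {tag_text!r}")
    return f"{username} : {match.group(1)}"


print(format_pair(' maitha al owais ', ' out = "u-mal_owais"; '))
```

With the strings from the session above, this prints maitha al owais : u-mal_owais.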

Hi, I have a few lines like this: <item> avish hodarkar <tag> out = "u-ahodarkar"; </tag> </item>, where the one-of tag is not there. How do I handle these?
You could use a try/except, or check item.__dict__ to see if the child element exists, like: if 'one-of' in item.__dict__: # get child content
Hi, how do I extract all the usernames present inside <one-of>?
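The two follow-up cases above (an item with no <one-of>, and collecting every username inside <one-of>) can be sketched in one pass. Note that this swaps lxml.objectify for the standard-library xml.etree.ElementTree so it runs without extra packages. The names parse_rules, OUT_RE and sample are hypothetical, and wrapping the input in a synthetic <root> element is an assumption that the file is a plain sequence of top-level <item> fragments.

```python
import re
import xml.etree.ElementTree as ET

# Matches the quoted value in text such as ' out = "u-dvats"; '.
OUT_RE = re.compile(r'out\s*=\s*"([^"]*)"')


def parse_rules(text):
    """Return a list of (usernames, out_value) tuples, one per top-level <item>."""
    # Assumption: the fragments have no common root, so add one to make valid XML.
    root = ET.fromstring(f"<root>{text}</root>")
    results = []
    for item in root.findall("item"):
        tag = item.find("tag")
        match = OUT_RE.search(tag.text or "") if tag is not None else None
        if match is None:
            continue  # nothing to pair the usernames with
        one_of = item.find("one-of")
        if one_of is not None:
            # .text is the text before the first child element, so an
            # alternative like "deepa <ruleref .../>" contributes "deepa".
            raw = [alt.text or "" for alt in one_of.findall("item")]
        else:
            # No <one-of>: the username is the item's own leading text.
            raw = [item.text or ""]
        names = [" ".join(r.split()) for r in raw]
        results.append(([n for n in names if n], match.group(1)))
    return results


sample = (
    '<item> <one-of> <item> deepa vats </item> '
    '<item> deepa <ruleref uri="#Dg-e_n_t41"/> </item> </one-of> '
    '<tag> out = "u-dvats"; </tag> </item> '
    '<item> avish hodarkar <tag> out = "u-ahodarkar"; </tag> </item>'
)

for names, out in parse_rules(sample):
    print(f"{names[0]} : {out}")
```

Each tuple keeps all alternatives, so the first username is names[0] and the rest are still available if needed.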

DYZ · Accepted Answer · 2017-09-10 06:47:07Z

I suggest using BeautifulSoup:

import bs4
soup = bs4.BeautifulSoup(your_text, "lxml")
' '.join(x.strip() for x in soup.strings if x.strip())
#'deepa vats deepa out = "u-dvats"; maitha al owais doctor maitha maitha out = "u-mal_owais";'
