0

I want to parse a file which looks like this:

<item> <one-of> <item> deepa vats </item> <item> deepa <ruleref uri="#Dg-e_n_t41"/> </item> </one-of> <tag> out = "u-dvats"; </tag> </item> <item> <one-of> <item> maitha al owais </item> <item> doctor maitha </item> <item> maitha <ruleref uri="#Dg-clinical_nutrition24"/> </item> </one-of> <tag> out = "u-mal_owais"; </tag> </item>

The result should be username : out pairs, for example:

deepa vats : u-dvats
maitha al owais : u-mal_owais

To extract the username, I tried:

print([j for i, j in re.findall(r"(<item>)\s*(.*?)\s*(?!\1)(?:</item>)", line)])
if len(list1) != 0:
    print(list1[0].split("<item>")[-1])

2 Answers

1

You can parse the XML with objectify from lxml.

To parse an XML string you could use objectify.fromstring(). Then you can use dot notation or square bracket notation to navigate through the element and use the text property to get the text inside the element. Like so:

item = objectify.fromstring(item_str)
item_text = item.itemchild['anotherchild'].otherchild.text

From there you can manipulate the string and format it.

In this case I can see that you want the text inside item >> one-of >> item and the text inside item >> tag. In order to get it we could do something like this:

>>> from lxml import objectify
>>> item_str = '<item> <one-of> <item> maitha al owais </item> <item> doctor maitha </item> <item> maitha <ruleref uri="#Dg-clinical_nutrition24"/> </item> </one-of> <tag> out = "u-mal_owais"; </tag> </item>'
>>> item = objectify.fromstring(item_str)
>>> item_text = item['one-of'].item.text
>>> tag_text = item['tag'].text
>>> item_text
' maitha al owais '
>>> tag_text
' out = "u-mal_owais"; '

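Python doesn't allow hyphens in attribute names, so one-of can't be reached with dot notation. And tag is already a property of the objectify element (it holds the element's own tag name), so item.tag would not return the child. In both cases you have to use bracket notation instead of dot notation.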
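To go from a single item to the whole file, something like the sketch below might work. It is not the objectify approach above: I swapped in the standard library's xml.etree.ElementTree so it runs without installing anything. It assumes the file is a flat run of top-level item elements, so it wraps the text in a made-up root element first (the name root and the function username_out_pairs are my own, not from the question), and it takes the first entry inside one-of as the username.

```python
import re
import xml.etree.ElementTree as ET

# Sample copied from the question: two top-level <item> elements, no single root.
data = (
    '<item> <one-of> <item> deepa vats </item> '
    '<item> deepa <ruleref uri="#Dg-e_n_t41"/> </item> </one-of> '
    '<tag> out = "u-dvats"; </tag> </item> '
    '<item> <one-of> <item> maitha al owais </item> <item> doctor maitha </item> '
    '<item> maitha <ruleref uri="#Dg-clinical_nutrition24"/> </item> </one-of> '
    '<tag> out = "u-mal_owais"; </tag> </item>'
)


def username_out_pairs(xml_text):
    """Return (first username, out value) for each top-level <item>."""
    # An XML parser needs exactly one root; <root> is a hypothetical wrapper.
    root = ET.fromstring(f"<root>{xml_text}</root>")
    pairs = []
    for entry in root.findall("item"):
        first_name = entry.find("one-of/item")
        tag = entry.find("tag")
        if first_name is None or tag is None:
            continue  # entry does not follow the one-of/tag layout
        # Pull the quoted value out of text such as ' out = "u-dvats"; '.
        match = re.search(r'out\s*=\s*"([^"]*)"', tag.text or "")
        if match:
            pairs.append(((first_name.text or "").strip(), match.group(1)))
    return pairs


for username, out in username_out_pairs(data):
    print(f"{username} : {out}")
```

Entries without a one-of or a tag child are skipped here rather than raising, which may or may not be what you want for your data.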


3 Comments

Hi, I have a few lines like this: <item> avish hodarkar <tag> out = "u-ahodarkar"; </tag> </item>, where the one-of tag is missing. How do I handle these?
You could use a try/except, or check item.__dict__ to see whether the child element exists, like: if 'one-of' in item.__dict__: # get child content
Hi, how do I extract all the usernames listed inside <one-of>?
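Both follow-up cases (no one-of child, and collecting every name inside one-of) could be covered by one helper. This is only a sketch using the standard library's xml.etree.ElementTree instead of objectify, and the function name usernames_for is made up for illustration. It assumes each line holds exactly one top-level item.

```python
import xml.etree.ElementTree as ET


def usernames_for(item_xml):
    """Return every username spelling found in one top-level <item> string."""
    entry = ET.fromstring(item_xml)
    one_of = entry.find("one-of")
    if one_of is None:
        # No <one-of>: the username is the text directly inside <item>.
        return [(entry.text or "").strip()]
    # With <one-of>: take the leading text of each alternative <item>.
    return [(alt.text or "").strip() for alt in one_of.findall("item")]


print(usernames_for('<item> avish hodarkar <tag> out = "u-ahodarkar"; </tag> </item>'))
print(usernames_for(
    '<item> <one-of> <item> maitha al owais </item> <item> doctor maitha </item> '
    '<item> maitha <ruleref uri="#Dg-clinical_nutrition24"/> </item> </one-of> '
    '<tag> out = "u-mal_owais"; </tag> </item>'
))
```

Note that an alternative ending in a ruleref comes back as just its leading text (maitha in the second example), because the rule reference itself is not expanded.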
1

I suggest using BeautifulSoup:

import bs4
soup = bs4.BeautifulSoup(your_text, "lxml")
' '.join(x.strip() for x in soup.strings if x.strip())
# 'deepa vats deepa out = "u-dvats"; maitha al owais doctor maitha maitha out = "u-mal_owais";'
