How can I parse XML and get instances of a particular node attribute?

Question

I have many rows in XML and I'm trying to get instances of a particular node attribute.

    <foo>
       <bar>
          <type foobar="1"/>
          <type foobar="2"/>
       </bar>
    </foo>

How do I access the values of the attribute foobar? In this example, I want "1" and "2".

Related: Python xml ElementTree from a string source?

– Stevoisiak, Commented Nov 2, 2017 at 16:08

Mateen Ulhaq · Accepted Answer · 2022-04-09 10:39:38Z

916

I suggest ElementTree. There are other compatible implementations of the same API, such as lxml, and cElementTree in the Python standard library itself (deprecated since Python 3.3 and removed in 3.9, where ElementTree uses the C accelerator automatically); but in this context, what they chiefly add is even more speed. The ease of programming depends on the API, which ElementTree defines.

First build an Element instance root from the XML, e.g. with the XML function, or by parsing a file with something like:

    import xml.etree.ElementTree as ET

    root = ET.parse('thefile.xml').getroot()

Or any of the many other ways shown at ElementTree. Then do something like:

    for type_tag in root.findall('bar/type'):
        value = type_tag.get('foobar')
        print(value)

Output:

    1
    2
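If the XML is already in a string rather than a file, the same lookup can be done with ET.fromstring(). This is a minimal sketch using the XML from the question; xml_text and values are illustrative names.

```python
import xml.etree.ElementTree as ET

xml_text = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

# fromstring() returns the root Element directly, so no getroot() call is needed.
root = ET.fromstring(xml_text)

# Paths are relative to the element findall() is called on, so the root
# tag <foo> itself is not part of the path.
values = [type_tag.get("foobar") for type_tag in root.findall("bar/type")]
print(values)
```

Note that get() returns None for elements that lack the attribute, so the list may need filtering on less regular input.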

edited Apr 9, 2022 at 10:39

Mateen Ulhaq


answered Dec 16, 2009 at 5:21

Alex Martelli


11 Comments

John Machin Over a year ago

You seem to ignore xml.etree.cElementTree, which comes with Python and in some aspects is faster than lxml ("lxml's iterparse() is slightly slower than the one in cET" -- e-mail from lxml author).

Samuel Over a year ago

ElementTree works and is included with Python. There is limited XPath support though and you can't traverse up to the parent of an element, which can slow development down (especially if you don't know this). See python xml query get parent for details.

Saheel Godhane Over a year ago

lxml adds more than speed. It provides easy access to information such as parent node, line number in the XML source, etc. that can be very useful in several scenarios.

Cristik Over a year ago

Seems that ElementTree has some vulnerability issues, this is a quote from the docs:

Warning The xml.etree.ElementTree module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data see XML vulnerabilities.

gitaarik Over a year ago

@Cristik This seems to be the case with most xml parsers, see the XML vulnerabilities page.
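On the point raised above about not being able to walk up to a parent: one common workaround with the standard library is to build a child-to-parent map once. The following is only a sketch, using the XML from the question; parent_of is an illustrative name, not part of the ElementTree API.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<foo><bar><type foobar="1"/><type foobar="2"/></bar></foo>')

# Walk the tree once and record, for every child element, which element contains it.
parent_of = {child: parent for parent in root.iter() for child in parent}

for type_tag in root.iter("type"):
    print(type_tag.get("foobar"), "is inside", parent_of[type_tag].tag)
```

The map is a snapshot: it has to be rebuilt if the tree is modified afterwards.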


Mateen Ulhaq · Accepted Answer · 2022-04-09 10:52:01Z

477

minidom is the quickest and pretty straightforward.

XML:

    <data>
        <items>
            <item name="item1"></item>
            <item name="item2"></item>
            <item name="item3"></item>
            <item name="item4"></item>
        </items>
    </data>

Python:

    from xml.dom import minidom

    dom = minidom.parse('items.xml')
    elements = dom.getElementsByTagName('item')

    print(f"There are {len(elements)} items:")

    for element in elements:
        print(element.attributes['name'].value)

Output:

    There are 4 items:
    item1
    item2
    item3
    item4

edited Apr 9, 2022 at 10:52

Mateen Ulhaq


answered Dec 16, 2009 at 5:30

Ryan Christensen


9 Comments

swmcdonnell Over a year ago

How do you get the value of "item1"? For example: <item name="item1">Value1</item>

amphibient Over a year ago

where is the documentation for minidom ? I only found this but that doesn't do: docs.python.org/2/library/xml.dom.minidom.html

amphibient Over a year ago

I am also confused why it finds item straight from the top level of the document? wouldn't it be cleaner if you supplied it the path (data->items)? because, what if you also had data->secondSetOfItems that also had nodes named item and you wanted to list only one of the two sets of item?

amphibient Over a year ago

please see stackoverflow.com/questions/21124018/…

Alex Borsody Over a year ago

The syntax won't work here; you need to remove the parentheses: for s in itemlist: print(s.attributes['name'].value)
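Two of the questions in the comments above (how to read the text inside an item, and how to restrict the search to one parent) can be sketched with minidom as follows. The element text and the secondSetOfItems element are taken from the comments and are assumptions about what such a file might contain.

```python
from xml.dom import minidom

xml_text = """<data>
    <items>
        <item name="item1">Value1</item>
        <item name="item2">Value2</item>
    </items>
    <secondSetOfItems>
        <item name="other">Ignored</item>
    </secondSetOfItems>
</data>"""

dom = minidom.parseString(xml_text)

# getElementsByTagName() searches all descendants of the node it is called on,
# so calling it on the <items> element limits the search to that subtree.
items_node = dom.getElementsByTagName("items")[0]

results = []
for element in items_node.getElementsByTagName("item"):
    name = element.getAttribute("name")
    # The text between the tags is a child text node of the element.
    text = element.firstChild.data if element.firstChild else ""
    results.append((name, text))

print(results)
```

Reading firstChild.data is enough for elements that contain only text; mixed content would need a loop over childNodes.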


Peter Mortensen · Accepted Answer · 2024-12-02 17:31:42Z

273

You can use Beautiful Soup:

    from bs4 import BeautifulSoup

    x="""<foo>
       <bar>
          <type foobar="1"/>
          <type foobar="2"/>
       </bar>
    </foo>"""

    y=BeautifulSoup(x)
    >>> y.foo.bar.type["foobar"]
    u'1'

    >>> y.foo.bar.findAll("type")
    [<type foobar="1"></type>, <type foobar="2"></type>]

    >>> y.foo.bar.findAll("type")[0]["foobar"]
    u'1'
    >>> y.foo.bar.findAll("type")[1]["foobar"]
    u'2'

edited Dec 2, 2024 at 17:31

Peter Mortensen


answered Dec 16, 2009 at 5:12

YOU


11 Comments

cblab Over a year ago

three years later with bs4 this is a great solution, very flexible, especially if the source is not well formed

andilabs Over a year ago

@YOU BeautifulStoneSoup is DEPRECIATED. Just use BeautifulSoup(source_xml, features="xml")

ViKiG Over a year ago

Another 3 years later, I just tried to load XML using ElementTree, unfortunately it is unable to parse unless I adjust the source at places but BeautifulSoup worked just right away without any changes!

jpmc26 Over a year ago

@andi You mean "deprecated." "Depreciated" means it decreased in value, usually due to age or wear and tear from normal use.

Elvin Aghammadzada Over a year ago

another 3 years and now BS4 is not fast enough. Takes ages. Looking for any faster solutions


Benjamin Loison · Accepted Answer · 2024-04-10 13:31:29Z

There are many options out there. cElementTree looks excellent if speed and memory usage are an issue. It has very little overhead compared to simply reading in the file using readlines.

The relevant metrics can be found in the table below, copied from the cElementTree website:

    library                         time     space
    xml.dom.minidom (Python 2.1)    6.3 s    80000K
    gnosis.objectify                2.0 s    22000k
    xml.dom.minidom (Python 2.4)    1.4 s    53000k
    ElementTree 1.2                 1.6 s    14500k
    ElementTree 1.2.4/1.3           1.1 s    14500k
    cDomlette (C extension)         0.540 s  20500k
    PyRXPU (C extension)            0.175 s  10850k
    libxml2 (C extension)           0.098 s  16000k
    readlines (read as utf-8)       0.093 s  8850k
    cElementTree (C extension)  --> 0.047 s  4900K <--
    readlines (read as ascii)       0.032 s  5050k

As pointed out by @jfs, cElementTree comes bundled with Python:

Python 2: from xml.etree import cElementTree as ElementTree.
Python 3: from xml.etree import ElementTree (the accelerated C version is used automatically).

Are there any downsides to using cElementTree? It seems to be a no-brainer.
Apparently they don't want to use the library on OS X, as I have spent over 15 minutes trying to figure out where to download it from and no link works. Lack of documentation prevents good projects from thriving; I wish more people would realize that.
@Stunner: it is in stdlib i.e., you don't need to download anything. On Python 2: from xml.etree import cElementTree as ElementTree. On Python 3: from xml.etree import ElementTree (the accelerated C version is used automatically)
@mayhewsw It's more effort to figure out how to efficiently use ElementTree for a particular task. For documents that fit in memory, it's a lot easier to use minidom, and it works fine for smaller XML documents.
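The table above is from an old benchmark, so it may be worth measuring on your own data. The following is a rough sketch with timeit that compares ElementTree and minidom on a synthetic document. The document size and run count are arbitrary assumptions, and no timings are quoted here because they depend on the machine and Python version.

```python
import timeit
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Synthetic document: 5,000 <type> elements (size chosen arbitrarily).
xml_text = "<foo><bar>" + '<type foobar="1"/>' * 5000 + "</bar></foo>"

def count_with_elementtree():
    # Parse the string and count <type> elements under <bar>.
    return len(ET.fromstring(xml_text).findall("bar/type"))

def count_with_minidom():
    # Parse the same string with minidom and count <type> elements.
    return len(minidom.parseString(xml_text).getElementsByTagName("type"))

# Sanity check: both parsers should see the same number of elements.
et_count = count_with_elementtree()
dom_count = count_with_minidom()

for func in (count_with_elementtree, count_with_minidom):
    seconds = timeit.timeit(func, number=3)
    print(f"{func.__name__}: {seconds:.3f} s for 3 runs")
```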

Benjamin Loison · Accepted Answer · 2024-04-10 13:31:53Z

56

I suggest xmltodict for simplicity.

It parses your XML to an OrderedDict:

    >>> e = '<foo> <bar> <type foobar="1"/> <type foobar="2"/> </bar> </foo> '

    >>> import xmltodict
    >>> result = xmltodict.parse(e)
    >>> result

    OrderedDict([(u'foo', OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))]))])

    >>> result['foo']

    OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))])

    >>> result['foo']['bar']

    OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])])

edited Apr 10, 2024 at 13:31

Benjamin Loison


answered Jun 12, 2014 at 11:57

myildirim


5 Comments

Dan Passaro Over a year ago

Agreed. If you don't need XPath or anything complicated, this is much simpler to use (especially in the interpreter); handy for REST APIs that publish XML instead of JSON

TextGeek Over a year ago

Remember that OrderedDict does not support duplicate keys. Most XML is chock-full of multiple siblings of the same types (say, all the paragraphs in a section, or all the types in your bar). So this will only work for very limited special cases.

luator Over a year ago

@TextGeek In this case, result["foo"]["bar"]["type"] is a list of all <type> elements, so it is still working (even though the structure is maybe a bit unexpected).

kolypto Over a year ago

No updates since 2019

myildirim Over a year ago

I just realized that no updates since 2019. We need to find an active fork.

sandy · Accepted Answer · 2013-06-07 08:15:54Z

lxml.objectify is really simple.

Taking your sample text:

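    from collections import defaultdict

    from lxml import objectify

    # The sample XML from the question
    text = '<foo><bar><type foobar="1"/><type foobar="2"/></bar></foo>'

    count = defaultdict(int)

    root = objectify.fromstring(text)

    for item in root.bar.type:
        count[item.attrib.get("foobar")] += 1

    print(dict(count))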

Output:

{'1': 1, '2': 1}

count stores the number of occurrences of each attribute value in a defaultdict, so you don't have to check for membership before incrementing. You can also try looking at collections.Counter.
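The collections.Counter variant mentioned above could look like the sketch below. To keep it free of third-party dependencies it uses the standard library's ElementTree rather than lxml.objectify, and the sample XML has one extra element so the counts differ.

```python
from collections import Counter
import xml.etree.ElementTree as ET

text = '<foo><bar><type foobar="1"/><type foobar="2"/><type foobar="1"/></bar></foo>'
root = ET.fromstring(text)

# Counter tallies each distinct attribute value as the generator yields it.
count = Counter(type_tag.get("foobar") for type_tag in root.iter("type"))
print(dict(count))
```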

the Tin Man · Accepted Answer · 2019-12-12 23:20:35Z

Python has an interface to the expat XML parser.

xml.parsers.expat

It's a non-validating parser, so it will not check the document against a DTD or schema; XML that is not well formed still raises an error. If you know your file is correct, this is pretty good: you'll probably get the exact info you want and can discard the rest on the fly.

    from xml.parsers import expat

    stringofxml = """<foo>
        <bar>
            <type arg="value" />
            <type arg="value" />
            <type arg="value" />
        </bar>
        <bar>
            <type arg="value" />
        </bar>
    </foo>"""

    count = 0

    def start(name, attr):
        # Called by the parser for every opening tag
        global count
        if name == 'type':
            count += 1

    p = expat.ParserCreate()
    p.StartElementHandler = start
    p.Parse(stringofxml)

    print(count)  # prints 4
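The same handler approach can collect attribute values instead of counting tags. The sketch below is one way to do it for the foobar attribute from the question; collect_foobar is an illustrative helper, not part of the expat module. It also shows that input which is not well formed raises expat.ExpatError.

```python
from xml.parsers import expat

def collect_foobar(xml_text):
    """Return the foobar attribute of every <type> element, in document order."""
    values = []

    def start(name, attrs):
        # attrs is a dict mapping attribute names to their values.
        if name == "type" and "foobar" in attrs:
            values.append(attrs["foobar"])

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.Parse(xml_text, True)  # True marks this as the final chunk of input
    return values

print(collect_foobar('<foo><bar><type foobar="1"/><type foobar="2"/></bar></foo>'))

try:
    collect_foobar("<foo><bar></foo>")  # mismatched tags
except expat.ExpatError as error:
    print("not well formed:", error)
```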

the Tin Man · Accepted Answer · 2019-12-12 23:26:27Z

Just to add another possibility, you can use untangle, as it is a simple xml-to-python-object library. Here you have an example:

Installation:

pip install untangle

Usage:

Your XML file (a little bit changed):

    <foo>
       <bar name="bar_name">
          <type foobar="1"/>
       </bar>
    </foo>

Accessing the attributes with untangle:

    import untangle

    obj = untangle.parse('/path_to_xml_file/file.xml')

    print(obj.foo.bar['name'])
    print(obj.foo.bar.type['foobar'])

The output will be:

    bar_name
    1

More information about untangle can be found in "untangle".

Also, if you are curious, you can find a list of tools for working with XML and Python in "Python and XML". You will also see that the most common ones were mentioned by previous answers.

I cannot tell you the difference between those two as I have not worked with minidom.

gatkin · Accepted Answer · 2017-09-04 17:40:26Z

I might suggest declxml.

Full disclosure: I wrote this library because I was looking for a way to convert between XML and Python data structures without needing to write dozens of lines of imperative parsing/serialization code with ElementTree.

With declxml, you use processors to declaratively define the structure of your XML document and how to map between XML and Python data structures. Processors are used for both serialization and parsing, as well as for a basic level of validation.

Parsing into Python data structures is straightforward:

    import declxml as xml

    xml_string = """
    <foo>
       <bar>
          <type foobar="1"/>
          <type foobar="2"/>
       </bar>
    </foo>
    """

    processor = xml.dictionary('foo', [
        xml.dictionary('bar', [
            xml.array(xml.integer('type', attribute='foobar'))
        ])
    ])

    xml.parse_from_string(processor, xml_string)

Which produces the output:

{'bar': {'foobar': [1, 2]}}

You can also use the same processor to serialize data to XML

    data = {'bar': {
        'foobar': [7, 3, 21, 16, 11]
    }}

    xml.serialize_to_string(processor, data, indent='    ')

Which produces the following output

    <?xml version="1.0" ?>
    <foo>
        <bar>
            <type foobar="7"/>
            <type foobar="3"/>
            <type foobar="21"/>
            <type foobar="16"/>
            <type foobar="11"/>
        </bar>
    </foo>

If you want to work with objects instead of dictionaries, you can define processors to transform data to and from objects as well.

    import declxml as xml

    class Bar:

        def __init__(self):
            self.foobars = []

        def __repr__(self):
            return 'Bar(foobars={})'.format(self.foobars)


    xml_string = """
    <foo>
       <bar>
          <type foobar="1"/>
          <type foobar="2"/>
       </bar>
    </foo>
    """

    processor = xml.dictionary('foo', [
        xml.user_object('bar', Bar, [
            xml.array(xml.integer('type', attribute='foobar'), alias='foobars')
        ])
    ])

    xml.parse_from_string(processor, xml_string)

Which produces the following output

{'bar': Bar(foobars=[1, 2])}

the Tin Man · Accepted Answer · 2019-12-12 23:28:04Z

Here is some very simple but effective code using cElementTree.

    import sys

    try:
        import cElementTree as ET  # standalone package for very old Pythons
    except ImportError:
        try:
            # Python 2.5 to 3.8 ship it in the standard library
            import xml.etree.cElementTree as ET
        except ImportError:
            # Python 3.9+ removed cElementTree; ElementTree uses the C accelerator itself
            import xml.etree.ElementTree as ET

    def exit_err(message):
        # Helper assumed by the original snippet: report the error and stop
        sys.exit(message)

    def find_in_tree(tree, node):
        found = tree.find(node)
        if found is None:
            print("No %s in file" % node)
            found = []
        return found

    # Parse an XML file (specify the path)
    def_file = "xml_file_name.xml"
    try:
        dom = ET.parse(def_file)
        root = dom.getroot()
    except Exception:
        exit_err("Unable to open and parse input definition file: " + def_file)

    # Parse to find the child nodes list of node 'myNode'
    fwdefs = find_in_tree(root, "myNode")

This is from "python xml parse".

the Tin Man · Accepted Answer · 2019-12-12 23:58:32Z

XML:

    <foo>
       <bar>
          <type foobar="1"/>
          <type foobar="2"/>
       </bar>
    </foo>

Python code:

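    # xml.etree.cElementTree was removed in Python 3.9; ElementTree is the drop-in replacement
    import xml.etree.ElementTree as ET

    tree = ET.parse("foo.xml")
    root = tree.getroot()
    root_tag = root.tag
    print(root_tag)

    for form in root.findall("./bar/type"):
        x = form.attrib  # dictionary of the element's attributes
        z = list(x)      # attribute names
        for i in z:
            print(x[i])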

Output:

    foo
    1
    2

G M · Accepted Answer · 2020-06-03 16:08:15Z

xml.etree.ElementTree vs. lxml

Here are some advantages of the two most-used libraries that I would have benefited from knowing before choosing between them.

xml.etree.ElementTree:

From the standard library: no need to install any module

lxml

Easily write an XML declaration: for instance, do you need to add standalone="no"?
Pretty printing: you can get nicely indented XML without extra code.
Objectify functionality: it allows you to use XML as if you were dealing with a normal Python object hierarchy (.node).
.sourceline lets you easily get the line number of the XML element you are using.
You can also use a built-in XSD schema checker.
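For comparison, recent versions of the standard library cover part of this list. The sketch below assumes Python 3.9 or later, since ET.indent() was added in 3.9 and the xml_declaration argument of tostring() in 3.8. It does not cover standalone="no", sourceline, or schema validation.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<foo><bar><type foobar="1"/><type foobar="2"/></bar></foo>')

# ET.indent() modifies the tree in place, inserting whitespace for indentation.
ET.indent(root, space="  ")

# xml_declaration=True prepends the <?xml ...?> line to the serialized output.
pretty = ET.tostring(root, encoding="utf-8", xml_declaration=True).decode("utf-8")
print(pretty)
```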

Peter Mortensen · Accepted Answer · 2024-12-02 17:40:05Z

There's no need to use a library-specific API if you use python-benedict. Just initialize a new instance from your XML content and manage it easily since it is a dict subclass.

Installation is easy: pip install python-benedict

    from benedict import benedict as bdict

    # data_source can be a URL, a file path or a data string (as in this example)
    data_source = """
    <foo>
      <bar>
        <type foobar="1"/>
        <type foobar="2"/>
      </bar>
    </foo>"""

    data = bdict.from_xml(data_source)
    # Keypaths are supported; the list of <type> elements lives under 'foo.bar.type'
    t_list = data['foo.bar.type']
    for t in t_list:
        print(t['@foobar'])

It supports and normalizes I/O operations with many formats: Base64, CSV, JSON, TOML, XML, YAML and query string.

It is well tested and open source on GitHub. Disclosure: I am the author.

the Tin Man · Accepted Answer · 2019-12-12 23:57:42Z

    import xml.etree.ElementTree as ET

    data = '''<foo>
        <bar>
            <type foobar="1"/>
            <type foobar="2"/>
        </bar>
    </foo>'''

    tree = ET.fromstring(data)
    lst = tree.findall('bar/type')
    for item in lst:
        print(item.get('foobar'))

This will print the value of the foobar attribute.

Peter Mortensen · Accepted Answer · 2024-12-02 17:50:30Z

simplified_scrapy: a new library. I fell in love with it after I used it. I recommend it to you.

    from simplified_scrapy import SimplifiedDoc

    xml = '''
    <foo>
       <bar>
          <type foobar="1"/>
          <type foobar="2"/>
       </bar>
    </foo>
    '''

    doc = SimplifiedDoc(xml)
    types = doc.selects('bar>type')
    print (len(types)) # 2
    print (types.foobar) # ['1', '2']
    print (doc.selects('bar>type>foobar()')) # ['1', '2']

Here are more examples. This library is easy to use.

Peter Mortensen · Accepted Answer · 2024-12-02 17:52:21Z

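Use pandas. pandas has a function, read_xml(), which is well suited to such flat XML structures.

    from io import StringIO

    import pandas as pd

    xml = """<foo>
       <bar>
          <type foobar="1"/>
          <type foobar="2"/>
       </bar>
    </foo>"""

    # Wrap the string in StringIO: passing literal XML directly is deprecated since pandas 2.1
    df = pd.read_xml(StringIO(xml), xpath=".//type")
    print(df)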

Output:

       foobar
    0       1
    1       2

Siraj · Accepted Answer · 2020-02-20 12:56:52Z

    # If the XML is in the form of a byte string, as shown below:
    from lxml import etree

    # Sample XML as a byte string with the namespace http://xmlns.abc.com
    message = b'<?xml version="1.0" encoding="UTF-8"?>\r\n<pa:Process xmlns:pa="http://xmlns.abc.com">\r\n\t<pa:firsttag>SAMPLE</pa:firsttag></pa:Process>\r\n'

    print('************message conversion and parsing starts*************')

    message = message.decode('utf-8')
    # replace() strips the XML declaration and the trailing line break from the message
    message = message.replace('<?xml version="1.0" encoding="UTF-8"?>\r\n', '')
    message = message.replace('pa:Process>\r\n', 'pa:Process>')
    print(message)

    print('******Parsing starts*************')
    # remove_blank_text drops whitespace-only text nodes; it does not remove the namespace
    parser = etree.XMLParser(remove_blank_text=True)
    root = etree.fromstring(message, parser)  # parsing of the XML happens here
    print('******Parsing completed************')

    result = {}
    for child in root:  # iterate over the parsed XML and store values in a dictionary
        print(child.tag, child.text)
        print('****Deriving from xml tree*****')
        # The namespace stays in the tag name, in {uri}localname form
        if child.tag == "{http://xmlns.abc.com}firsttag":
            result["FIRST_TAG"] = child.text
    print(result)

    ### output
    '''************message conversion and parsing starts*************
    <pa:Process xmlns:pa="http://xmlns.abc.com">
        <pa:firsttag>SAMPLE</pa:firsttag></pa:Process>
    ******Parsing starts*************
    ******Parsing completed************
    {http://xmlns.abc.com}firsttag SAMPLE
    ****Deriving from xml tree*****
    {'FIRST_TAG': 'SAMPLE'}'''

Please also include some context explaining how your answer solves the issue. Code-only answers aren't encouraged.

Peter Mortensen · Accepted Answer · 2024-12-02 17:49:10Z

If you don't want to use any external libraries or third-party tools, please try the below code.

This will parse XML into a Python dictionary
This will parse XML attributes as well
This will also parse empty tags like <tag/> and tags with only attributes like <tag var=val/>

Code

    import re

    def getdict(content):
        # Match <tag attrs>value</tag> or <tag attrs/>; raw strings keep the regex escapes intact
        res = re.findall(r"<(?P<var>\S*)(?P<attr>[^/>]*)(?:(?:>(?P<val>.*?)</(?P=var)>)|(?:/>))", content)
        if len(res) >= 1:
            attreg = r"(?P<avr>\S+?)(?:(?:=(?P<quote>['\"])(?P<avl>.*?)(?P=quote))|(?:=(?P<avl1>.*?)(?:\s|$))|(?P<avl2>[\s]+)|$)"
            if len(res) > 1:
                return [{i[0]: [{"@attributes": [{j[0]: (j[2] or j[3] or j[4])} for j in re.findall(attreg, i[1].strip())]}, {"$values": getdict(i[2])}]} for i in res]
            else:
                # A single match: unpack the one (tag, attributes, value) tuple
                tag, attrs, value = res[0]
                return {tag: [{"@attributes": [{j[0]: (j[2] or j[3] or j[4])} for j in re.findall(attreg, attrs.strip())]}, {"$values": getdict(value)}]}
        else:
            # No nested tags left: the content is plain text
            return content

    with open("test.xml", "r") as f:
        print(getdict(f.read().replace('\n', '')))

Sample input

    <details class="4b" count=1 boy>
        <name type="firstname">John</name>
        <age>13</age>
        <hobby>Coin collection</hobby>
        <hobby>Stamp collection</hobby>
        <address>
            <country>USA</country>
            <state>CA</state>
        </address>
    </details>
    <details empty="True"/>
    <details/>
    <details class="4a" count=2 girl>
        <name type="firstname">Samantha</name>
        <age>13</age>
        <hobby>Fishing</hobby>
        <hobby>Chess</hobby>
        <address current="no">
            <country>Australia</country>
            <state>NSW</state>
        </address>
    </details>

Output (beautified)

    [
      {"details": [
        {"@attributes": [{"class": "4b"}, {"count": "1"}, {"boy": ""}]},
        {"$values": [
          {"name": [{"@attributes": [{"type": "firstname"}]}, {"$values": "John"}]},
          {"age": [{"@attributes": []}, {"$values": "13"}]},
          {"hobby": [{"@attributes": []}, {"$values": "Coin collection"}]},
          {"hobby": [{"@attributes": []}, {"$values": "Stamp collection"}]},
          {"address": [
            {"@attributes": []},
            {"$values": [
              {"country": [{"@attributes": []}, {"$values": "USA"}]},
              {"state": [{"@attributes": []}, {"$values": "CA"}]}
            ]}
          ]}
        ]}
      ]},
      {"details": [{"@attributes": [{"empty": "True"}]}, {"$values": ""}]},
      {"details": [{"@attributes": []}, {"$values": ""}]},
      {"details": [
        {"@attributes": [{"class": "4a"}, {"count": "2"}, {"girl": ""}]},
        {"$values": [
          {"name": [{"@attributes": [{"type": "firstname"}]}, {"$values": "Samantha"}]},
          {"age": [{"@attributes": []}, {"$values": "13"}]},
          {"hobby": [{"@attributes": []}, {"$values": "Fishing"}]},
          {"hobby": [{"@attributes": []}, {"$values": "Chess"}]},
          {"address": [
            {"@attributes": [{"current": "no"}]},
            {"$values": [
              {"country": [{"@attributes": []}, {"$values": "Australia"}]},
              {"state": [{"@attributes": []}, {"$values": "NSW"}]}
            ]}
          ]}
        ]}
      ]}
    ]

It's a good method, but the result it returns is not convenient to use.
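If a more convenient result shape matters and the input is well-formed XML, a standard-library alternative is to walk an ElementTree recursively. This is a sketch of a different technique, not a fix for the regex approach above: it will reject input such as count=1 or a bare boy attribute from the sample, so a well-formed variant is used here. The key names "tag", "attributes", "text" and "children" are illustrative choices.

```python
import xml.etree.ElementTree as ET

def element_to_dict(element):
    """Convert an Element into a plain dict with a fixed set of keys."""
    return {
        "tag": element.tag,
        "attributes": dict(element.attrib),
        "text": (element.text or "").strip(),
        "children": [element_to_dict(child) for child in element],
    }

xml_text = """<details class="4b">
    <name type="firstname">John</name>
    <age>13</age>
</details>"""

result = element_to_dict(ET.fromstring(xml_text))
print(result["children"][0])
```

Because every node has the same keys, callers do not have to distinguish between a single child and a list of children.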

Hermann12 · Accepted Answer · 2023-05-21 10:39:05Z

With iterparse() you can catch the tag attribute dictionary value:

    import xml.etree.ElementTree as ET
    from io import StringIO

    xml = """<foo>
       <bar>
          <type foobar="1"/>
          <type foobar="2"/>
       </bar>
    </foo>
    """

    file = StringIO(xml)

    for event, elem in ET.iterparse(file, ("end",)):
        if event == "end" and elem.tag == "type":
            print(elem.attrib["foobar"])
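iterparse() is mostly useful for files too large to hold comfortably in memory. In that case a common pattern is to clear each element once it has been processed. This is a minimal sketch on an in-memory byte stream; in practice, source would be a file path or an open file.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

source = BytesIO(b'<foo><bar><type foobar="1"/><type foobar="2"/></bar></foo>')

values = []
for event, elem in ET.iterparse(source, events=("end",)):
    if elem.tag == "type":
        values.append(elem.attrib["foobar"])
    # Release the element's attributes, text and children after it has been handled.
    elem.clear()

print(values)
```

The attribute is read before clear() is called, because clear() also empties elem.attrib.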

Hermann12 · Accepted Answer · 2024-02-06 07:24:31Z

The classic SAX parser solution could work like:

    from xml.sax import make_parser, handler
    from io import StringIO

    xml = """<foo>
       <bar>
          <type foobar="1"/>
          <type foobar="2"/>
       </bar>
    </foo>
    """

    file = StringIO(xml)

    class MyParser(handler.ContentHandler):
        def startElement(self, name, attrs):
            if name == "type":
                print(attrs['foobar'])

    parser = make_parser()
    b = MyParser()
    parser.setContentHandler(b)
    parser.parse(file)

Output:

    1
    2
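If the values are needed later rather than printed, the handler can store them on itself. This is a sketch of that variation using xml.sax.parseString(); FoobarCollector is an illustrative class name.

```python
import xml.sax

class FoobarCollector(xml.sax.ContentHandler):
    """Collects the foobar attribute of every <type> element."""

    def __init__(self):
        super().__init__()
        self.values = []

    def startElement(self, name, attrs):
        # attrs behaves like a read-only mapping of attribute names to values.
        if name == "type" and "foobar" in attrs:
            self.values.append(attrs["foobar"])

collector = FoobarCollector()
xml.sax.parseString(b'<foo><bar><type foobar="1"/><type foobar="2"/></bar></foo>', collector)
print(collector.values)
```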

Peter Mortensen · Accepted Answer · 2024-12-02 17:45:28Z

If the source is an XML file, say like this sample,

    <pa:Process xmlns:pa="http://sssss">
        <pa:firsttag>SAMPLE</pa:firsttag>
    </pa:Process>

you may try the following code:

    from lxml import etree

    # This is a sample XML file. Its contents are shown above
    metadata = 'C:\\Users\\PROCS.xml'

    # remove_blank_text drops whitespace-only text nodes while parsing
    parser = etree.XMLParser(remove_blank_text=True)

    # This line parses the XML file, which is PROCS.xml
    tree = etree.parse(metadata, parser)

    # Get the root of the XML content and strip the namespace
    # (in this sample, http://sssss) from every tag name
    root = tree.getroot()
    for elem in root.iter():
        if not hasattr(elem.tag, 'find'):
            continue  # (1) skip nodes whose tag is not a string, such as comments
        i = elem.tag.find('}')
        if i >= 0:
            elem.tag = elem.tag[i+1:]

    result = {}  # A Python dictionary is declared
    for elem in tree.iter():  # Iterate through the XML tree
        # If the tag name matches, store the text of the tag in the dictionary
        if elem.tag == "firsttag":
            result["FIRST_TAG"] = str(elem.text)

    print(result)

The output would be

{'FIRST_TAG': 'SAMPLE'}
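As an alternative to stripping the namespace, the standard library's ElementTree can look up namespaced tags directly when given a prefix-to-URI mapping. This is a sketch on the same sample, parsed from a string instead of a file; the prefix used in the mapping is arbitrary and only needs to map to the right URI.

```python
import xml.etree.ElementTree as ET

xml_text = """<pa:Process xmlns:pa="http://sssss">
    <pa:firsttag>SAMPLE</pa:firsttag>
</pa:Process>"""

root = ET.fromstring(xml_text)

# Map a prefix to the namespace URI so that find() can resolve "pa:firsttag".
namespaces = {"pa": "http://sssss"}
first = root.find("pa:firsttag", namespaces)

result = {"FIRST_TAG": first.text}
print(result)
print(first.tag)  # the tag keeps its namespace in {uri}localname form
```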
