How do I create and read XML with Python?

How can I create and read an XML file with the DOM module in Python?

And how can I read an XML file with cElementTree?

  • Are there any specific problems that you’re having? What have you been able to do? We need objective questions to provide more accurate and lasting answers.

3 answers

9

Python has two main built-in ways to handle XML files: xml.etree.ElementTree and xml.dom.minidom. In addition, there are external libraries that can greatly simplify working with XML, such as BeautifulSoup, pyquery and xmltodict (as well as native implementations with a compatible API, such as lxml). In other words, there is no shortage of options; the question is which one best fits your needs.

ElementTree

According to the documentation, ElementTree is "recommended for those who have no previous experience working with the DOM". It represents an XML file and its elements as Python objects with their own API, and lets you modify them and convert them back to XML. It also supports a subset of XPath, which you can use in queries.

Note: the cElementTree you mentioned in the question is simply a C implementation of the ElementTree API (that is, once installed, it is used in exactly the same way).

Pros: simple and "Pythonic" to use, XPath support. Cons: none. Example:

>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('country_data.xml')
>>> root = tree.getroot()

>>> [(x.tag, x.attrib) for x in root] # List the child elements: tag and attributes
[('country', {'name':'Liechtenstein'}), (...), (...)]

>>> root[0][1].text # Access a sub-element by index and get its text (<year> is the second child in the documentation's sample file)
'2008'

>>> [x.attrib for x in root.iter('neighbor')] # List descendant elements: attributes
[{'name': 'Austria', 'direction': 'E'}, {...}, {...}, ...]

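>>> atualizar = next(root.iter('rank')) # iter() returns an iterator; take the first <rank>
>>> atualizar.text = "1"
>>> atualizar.set('updated', 'yes')
>>> tree.write('output.xml') # write() is a method of the ElementTree, not of the root Element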

>>> a = ET.Element('a')
>>> b = ET.SubElement(a, 'b')
>>> c = ET.SubElement(a, 'c')
>>> d = ET.SubElement(c, 'd')
>>> ET.dump(a)
<a><b /><c><d /></c></a>
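
The XPath subset mentioned above is easiest to judge from a small example. This is only a sketch: the XML string is a made-up, trimmed-down stand-in for country_data.xml (an assumption, so the snippet runs on its own), and the variable names are mine.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, loosely modelled on country_data.xml.
SAMPLE = """
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>
"""

root = ET.fromstring(SAMPLE)

# ".//" searches all descendants; "[@attr='value']" filters on an attribute.
east_neighbors = [n.get("name") for n in root.findall(".//neighbor[@direction='E']")]

# "[tag='text']" keeps elements that have a child <tag> with exactly that text.
countries_2011 = [c.get("name") for c in root.findall("./country[year='2011']")]

# findtext() returns the text of the first match, or the default if there is none.
first_rank = root.findtext("./country/rank", default="")

print(east_neighbors)
print(countries_2011)
print(first_rank)
```

Anything beyond these simple predicates (functions, axes and so on) is outside what ElementTree supports; for full XPath you would need something like lxml.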

minidom

A minimal DOM implementation, with an API similar to that of other languages such as JavaScript. It suits those who are already familiar with manipulating the DOM in plain JavaScript (i.e. without external libraries) and want to manipulate XML in Python using similar code.

Pros: API similar to JavaScript. Cons: quite verbose. Example: see the answer from @utluiz.

lxml

A "Pythonic" binding for the C libraries libxml2 and libxslt. Efficient and feature-rich, with a simple API that is compatible with ElementTree.

Pros: performance. Cons: none. Example: mostly identical to ElementTree (the only change is replacing import xml... with import lxml...).

BeautifulSoup

Its main use is parsing and manipulating HTML, but it supports XML as well. Its main strength is robustness when the input files are not necessarily well-formed.

Pros: robustness. Cons: a little more verbose when modifying/creating. Example:

>>> from bs4 import BeautifulSoup
>>> root = BeautifulSoup(open('country_data.xml'), 'xml') # The 'xml' parser requires lxml

>>> [(x.name, x.attrs) for x in root.children] # List the child elements: name and attributes
[('country', {'name':'Liechtenstein'}), (...), (...)]

>>> root.contents[0].contents[7].string # Access a sub-element by index and get its text
'2008'

>>> [x.attrs for x in root.find_all('neighbor')] # List descendant elements: attributes
[{'name': 'Austria', 'direction': 'E'}, {...}, {...}, ...]
[{'name': 'Austria', 'direction': 'E'}, {...}, {...}, ...]

>>> atualizar = root.rank # Shortcut for root.find('rank')
>>> atualizar.string = "1"
>>> with open('output.xml', 'w') as f: # Open in write mode
...     f.write(str(root))

>>> soup = BeautifulSoup("<a />", "xml") # new_tag() needs a BeautifulSoup instance
>>> a = soup.a
>>> a.append(soup.new_tag("b"))
>>> c = soup.new_tag("c")
>>> a.append(c)
>>> c.append(soup.new_tag("d"))
>>> str(a)
'<a><b/><c><d/></c></a>'

pyquery

A library that tries to mimic jQuery in Python. It suits those who are already familiar with jQuery and want to manipulate XML in Python using similar code. It depends on lxml.

Pros: jQuery!!! Cons: weak documentation (for cases it does not cover, you have to fall back on the lxml documentation). Example:

>>> from pyquery import PyQuery as pq
>>> root = pq(filename='country_data.xml')

>>> root.children().map(lambda i, e: (e.tag, e.attrib)) # List the child elements: tag and attributes (map passes index, element)
[('country', {'name':'Liechtenstein'}), (...), (...)]

>>> root.children(":eq(0)").children(":eq(7)").text() # Access a sub-element by index and get its text
'2008'

>>> root.find('neighbor').map(lambda i, e: e.attrib) # List descendant elements: attributes

>>> atualizar = root.find('rank:eq(0)').text('1')
>>> with open('output.xml', 'w') as f: # Open in write mode
...     f.write(str(root))

>>> print(pq('<a/>')
...   .append('<b/>')
...   .append(pq('<c/>').append('<d/>')))
<a><b/><c><d/></c></a>

xmltodict

Converts an XML document into a plain dict, which can be accessed and manipulated simply through its keys and values. It can also be converted back into XML. It supports namespaces through an extra parameter when parsing.

Pros: very simple and uniform API. Cons: poor documentation. Example: see the answer by @Avelino.
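
To give a feel for the dict shape, here is a standard-library-only sketch of the mapping convention that xmltodict's output uses by default ("@" prefix for attributes, "#text" for an element's text). It is not how xmltodict is implemented; element_to_dict is a hypothetical helper written just for this illustration, and it ignores namespaces and mixed content.

```python
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert an Element into nested dicts (hypothetical helper, simplified)."""
    # Attributes become keys prefixed with "@".
    node = {"@" + key: value for key, value in elem.attrib.items()}

    # Children that share a tag are collected into a list.
    for child in elem:
        value = element_to_dict(child)
        if child.tag in node:
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value

    text = (elem.text or "").strip()
    if not node:
        # Leaf element without attributes: just its text (None if empty).
        return text or None
    if text:
        node["#text"] = text
    return node


root = ET.fromstring(
    '<mydocument has="an attribute">'
    "<and><many>elements</many><many>more elements</many></and>"
    '<plus a="complex">element as well</plus>'
    "</mydocument>"
)
doc = {root.tag: element_to_dict(root)}

print(doc["mydocument"]["@has"])
print(doc["mydocument"]["and"]["many"])
print(doc["mydocument"]["plus"]["#text"])
```

The trade-off is visible even in this toy version: access is trivial, but a tag that appears once gives a single value while a repeated tag gives a list, so calling code has to handle both.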

7

You can use the xml.dom.minidom library.

I wrote the following in Python 3.3 to read an XML document:

from xml.dom import minidom

xml = """<raiz>
    <itens>
        <item name="item1">Item 1</item>
        <item name="item2">Item 2</item>
        <item name="item3">Item 3</item>
    </itens>
</raiz>
"""

# read from a file
#xmldoc = minidom.parse('itens.xml')

# read from a string
xmldoc = minidom.parseString(xml)

itemlist = xmldoc.getElementsByTagName('item')
print('Number of items:', len(itemlist))
for s in itemlist:
    print(s.attributes['name'].value, '=', s.firstChild.nodeValue)

And to create an XML document:

# create the document
doc = minidom.Document()

# create the root element and add it to the document
raiz = doc.createElement('raiz')
doc.appendChild(raiz)

# create the <itens> container and add it to the root
itens = doc.createElement('itens')
raiz.appendChild(itens)

# create the items and their text
for i in range(3):
    item = doc.createElement('item')
    item.setAttribute('name', 'item' + str(i+1))
    itens.appendChild(item)
    item.appendChild(doc.createTextNode('Item ' + str(i + 1)))

print(raiz.toprettyxml())
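
As a quick sanity check of the two snippets above, here is a sketch that builds the same document, serializes it and parses it back. The function name build_items_document is made up for this illustration.

```python
from xml.dom import minidom


def build_items_document(count):
    """Build <raiz><itens><item name="itemN">Item N</item>...</itens></raiz>."""
    doc = minidom.Document()
    raiz = doc.createElement("raiz")
    doc.appendChild(raiz)  # attach the same node that receives the children
    itens = doc.createElement("itens")
    raiz.appendChild(itens)
    for i in range(1, count + 1):
        item = doc.createElement("item")
        item.setAttribute("name", "item%d" % i)
        item.appendChild(doc.createTextNode("Item %d" % i))
        itens.appendChild(item)
    return doc


doc = build_items_document(3)

# toxml() gives a compact string; toprettyxml() adds indentation and newlines,
# which show up as extra whitespace text nodes if the result is parsed again.
xml_text = doc.toxml()

parsed = minidom.parseString(xml_text)
items = {
    node.getAttribute("name"): node.firstChild.nodeValue
    for node in parsed.getElementsByTagName("item")
}
print(items)
```

Using toxml() for the round trip is deliberate: with toprettyxml() the text inside each item comes back surrounded by the added whitespace.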

Just note that the minidom documentation advises against using it to process XML from untrusted sources, because of some known vulnerabilities.


As for cElementTree, I haven’t installed it to test, but using it looks very straightforward, judging by the example in the documentation:

import cElementTree

for event, elem in cElementTree.iterparse(file):
    if elem.tag == "record":
        # ... process record element ...
        elem.clear()

Basically:

  • cElementTree.iterparse(file) reads the file incrementally
  • the loop body runs once per parse event (by default, each time an element's closing tag is reached)
  • the if checks whether the event was caused by a given tag, allowing you to process it as needed.
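
In current Python versions the same pattern is available from the standard library as xml.etree.ElementTree.iterparse (xml.etree.cElementTree was deprecated in Python 3.3 and removed in 3.9; the C accelerator is picked up automatically when available). A minimal sketch, where the record/value tags and the in-memory "file" are made up for illustration:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input; in practice this would be a (possibly very large) file.
source = io.BytesIO(
    b"<records>"
    b"<record id='1'><value>10</value></record>"
    b"<record id='2'><value>32</value></record>"
    b"</records>"
)

total = 0
seen_ids = []

# By default iterparse() only reports "end" events, i.e. it yields each
# element once its closing tag has been read and its children are complete.
for event, elem in ET.iterparse(source):
    if elem.tag == "record":
        seen_ids.append(elem.get("id"))
        total += int(elem.findtext("value"))
        # Free the element's children and text once it has been processed.
        elem.clear()

print(seen_ids, total)
```

The elem.clear() call is what keeps memory use low on big files; without it the whole tree is still built up as parsing goes on.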

There are several examples here.

  • cElementTree appears to be simply a C implementation of the ElementTree API, i.e. the usage should be exactly the same (or am I mistaken?). I wrote a comparison of the main libraries available for Python.

  • @mgibsonbr That’s exactly it. Another thing I hadn’t noticed is that this implementation apparently already ships with Python: "cElementTree is included with Python 2.5 and later, as xml.etree.cElementTree.".

  • @mgibsonbr I just don’t know whether it stays the same in Python 3. I’m still a little lost with the differences between versions 2 and 3. Some documentation simply doesn’t mention it or is outdated. But it may all be my own ignorance, because I started learning Python only 3 weeks ago... :)

  • I believe that CPython, whenever it defines a "generic" API and needs to provide a concrete implementation, uses a C implementation. So I don’t think Python 3 has "regressed" on this point... But it’s just a hunch, I’ve never actually tried Python 3...

3

There are several ways to read XML with Python. One of the simplest is xmltodict, which converts the XML structure into a dict (Python dictionary):

https://pypi.python.org/pypi/xmltodict

Here is an example:

```python
>>> import xmltodict
>>> doc = xmltodict.parse("""
... <mydocument has="an attribute">
...   <and>
...     <many>elements</many>
...     <many>more elements</many>
...   </and>
...   <plus a="complex">
...     element as well
...   </plus>
... </mydocument>
... """)
>>>
>>> doc['mydocument']['@has']
'an attribute'
>>> doc['mydocument']['and']['many']
['elements', 'more elements']
>>> doc['mydocument']['plus']['@a']
'complex'
>>> doc['mydocument']['plus']['#text']
'element as well'
```
