How to remove an Element from an XML with Python?

Question

How to remove an Element from an XML with Python?

Asked 10 years, 10 months ago

Viewed 989 times

8

The case is that I have a file produced by a Garmin (GPS exercise device) and I want to remove all fields related to the heartbeat to pass the file to an athlete who did the exercise with me. The file is in GPX format and is more or less like this:

<?xml version="1.0" encoding="UTF-8"?>
<gpx version="1.1" ...>
  <metadata>...</metadata>
  <trk>
    <trkseg>
      <trkpt lon="00" lat="00">
        <ele>000</ele>
        <time>2014-01-01T00:00:00.000Z</time>
        <extensions>
          <gpxtpx:TrackPointExtension>
            <gpxtpx:hr>99</gpxtpx:hr>
          </gpxtpx:TrackPointExtension>
        </extensions>
      </trkpt>
      ....
      <trkpt ...>
        ...
        <extensions>
          ...
        </extensions>
      </trkpt>
    </trkseg>
  </trk>
</gpx>

The system basically generates an element <trkpt> every reading (geographical + physiological + other devices). I need to remove all instances of the element <extensions> within the <trkpt> (i.e., all the contents of it). I tried using the library ElementTree with the following code:

import xml.etree.ElementTree as ET
tree = ET.parse('input.gpx')
root = tree.getroot()
for ext in root[1][2].iter('{http://www.topografix.com/GPX/1/1}trkpt'):
  ext = trkpt.find('{http://www.topografix.com/GPX/1/1}extensions')
  root.remove(ext)
tree.write('output.gpx')

The code even removes the elements, but I didn’t like 3 things here:

The first is that the library adds the XML schema Urls to the element names. I lost a lot of time without understanding why my algorithm couldn’t find the elements...

The second is this root[1][2] to have a pointer to the father of the elements I want to remove. I could access the elements directly by invoking root.iter('{...}extensions').

And finally, the most serious issue is that when writing the result in the file I realized that the library renames the tags breaks the original format. The result was so:

<?xml version='1.0' encoding='UTF-8'?>
<ns0:gpx ...>
  <ns0:metadata>...</ns0:metadata>
  <ns0:trk>...</ns0:trk>
</ns0:gpx>

As I have no experience with this library maybe I’m missing some configuration I didn’t see in my superficial reading of documentation. So I’m looking for a solution to my problem with this or another library.

2

Does it need to be in Python? It is possible to do this with sed: sed '/<Extensions>/,/</Extensions>/d' input.gpx

– user568459

2014/01/09 at 19:15
I appreciate the tip Francisco. I’ve even solved using something very similar inside Vim, but I’m trying my pythonese better. []

– Nigini

2014/01/10 at 14:02
Ah, right. It’s just that you said it was just to pass a file to a friend, I thought you just wanted a quick fix. :)

– user568459

2014/01/10 at 15:07
1

I have no time to write an example now, so I will comment and not create a response. But take a look at the lib Beautifulsoup, very used for this type of script.

– Thiago Silva

2014/01/10 at 17:24
Thanks @Thiago-silva. I used your recommendation and posted a reply here. Very cool to Beautifulsoup.

– Nigini

2014/01/10 at 18:16

5 answers

6

I followed the hint left in the comments of the question and solved the problem using the library Beautifulsoup 4 (Thank you @Thiago-silva)

from bs4 import BeautifulSoup
soup = BeautifulSoup(open('input.gpx'), 'xml')
for ext in soup.find_all('extensions'):
  removed = ext.extract()

output = open('output.gpx','w')
output.write(soup.prettify())
output.close()

Browser other questions tagged python xml

You are not signed in. Login or sign up in order to post.

by Paulo Freitas • **754** points · Answer 1 · 2014-01-12T04:19:41+00:00

I recommend using the library lxml for the performance and simplicity of the:

from lxml import etree

gpx = etree.parse(open('input.gpx'))

for node in gpx.xpath('//trkpt/extensions'):
    node.getparent().remove(node)

gpx.write(open('output.gpx', 'w'))

I used Xpath to simplify things.

by avelino • **352** points · Answer 2 · 2014-01-29T14:24:45+00:00

The easiest way to mess with XML that I found until today was using xmltodict, That doesn’t mean it’s performatic.

Follow the example of how to use:

doc = xmltodict.parse("""
<mydocument has="an attribute">
<and>
<many>elements</many>
<many>more elements</many>
</and>
<plus a="complex">
element as well
</plus>
</mydocument>
""")

print doc['mydocument']['@has']
del doc['mydocument']['and']
unparse(doc)

After deleting the node with del you make a unparse() and it generates XML!

by Luiz Vieira • **34,160** points · Answer 3 · 2014-01-09T19:57:43+00:00

I did a test here with your code and the elements 'Extensions' were not removed (maybe because they are not root? children). Anyway, the only difference I noticed is that your source file is encoded in utf8 and output you encode in ascii (second the documentation from Elementtree, the encoding pattern in the write method is asc). Try using the encoding in utf8 and see if the result is more matching.

The code that I used here (and that actually removed the desired items) is like this:

import xml.etree.ElementTree as ET
tree = ET.parse('input.gpx')

for node in tree.iter():
    for child in node:
        if child.tag == 'extensions':
            node.remove(child)

tree.write('output.gpx', encoding='UTF-8')

by avelino • **352** points · Answer 4 · 2021-04-28T20:47:06+00:00

Like the BeautifulSoup, usually it meets most of the XML reading problems well, but it is important you see the other types of parsers it has, if you need to know how to handle.

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#Installing-a-parser

For example, bs4 can use lxml which is super fast and performatic, but calls for reliance on the operating system.