How to remove an element from an XML file with Python?

8

I have a file produced by a Garmin (a GPS exercise device) and I want to remove all heart-rate fields before passing the file to an athlete who did the exercise with me. The file is in GPX format and looks roughly like this:

<?xml version="1.0" encoding="UTF-8"?>
<gpx version="1.1" ...>
  <metadata>...</metadata>
  <trk>
    <trkseg>
      <trkpt lon="00" lat="00">
        <ele>000</ele>
        <time>2014-01-01T00:00:00.000Z</time>
        <extensions>
          <gpxtpx:TrackPointExtension>
            <gpxtpx:hr>99</gpxtpx:hr>
          </gpxtpx:TrackPointExtension>
        </extensions>
      </trkpt>
      ....
      <trkpt ...>
        ...
        <extensions>
          ...
        </extensions>
      </trkpt>
    </trkseg>
  </trk>
</gpx>

The device generates one <trkpt> element per reading (geographical + physiological + other devices). I need to remove every <extensions> element inside each <trkpt>, along with all of its contents. I tried the ElementTree library with the following code:

import xml.etree.ElementTree as ET
tree = ET.parse('input.gpx')
root = tree.getroot()
for trkpt in root[1][2].iter('{http://www.topografix.com/GPX/1/1}trkpt'):
  ext = trkpt.find('{http://www.topografix.com/GPX/1/1}extensions')
  # Remove from the direct parent; find() returns None when there is no match.
  if ext is not None:
    trkpt.remove(ext)
tree.write('output.gpx')

The code does remove the elements, but three things bother me:

The first is that the library prepends the XML namespace URI to every element name. I lost a lot of time before I understood why my code couldn't find the elements...

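The second is the root[1][2] indexing needed to get a reference to the parent of the elements I want to remove. I could reach the elements directly with root.iter('{...}extensions'), but ElementTree elements keep no reference to their parent, so that alone is not enough to remove them.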

And finally, the most serious issue: when writing the result to the file, the library renames the tags and breaks the original format. The result looked like this:

<?xml version='1.0' encoding='UTF-8'?>
<ns0:gpx ...>
  <ns0:metadata>...</ns0:metadata>
  <ns0:trk>...</ns0:trk>
</ns0:gpx>

Since I have no experience with this library, I may be missing some configuration that I overlooked in my superficial reading of the documentation. So I'm looking for a solution to my problem with this or another library.
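For reference, here is a minimal sketch of how ElementTree itself might address the three points above, under some assumptions. It works on an in-memory sample instead of 'input.gpx', and the URI bound to the gpxtpx prefix is assumed (check the xmlns:gpxtpx attribute of the real file). ET.register_namespace('', ...) maps the GPX namespace to the empty prefix, so it is written back as the default namespace rather than ns0:. Note that the registration is global to the process.

```python
import xml.etree.ElementTree as ET

GPX_NS = 'http://www.topografix.com/GPX/1/1'
# Assumed URI for the gpxtpx prefix; verify it against your own file.
TPX_NS = 'http://www.garmin.com/xmlschemas/TrackPointExtension/v1'

SAMPLE = f'''<gpx version="1.1" xmlns="{GPX_NS}" xmlns:gpxtpx="{TPX_NS}">
  <trk>
    <trkseg>
      <trkpt lon="00" lat="00">
        <ele>000</ele>
        <extensions>
          <gpxtpx:TrackPointExtension>
            <gpxtpx:hr>99</gpxtpx:hr>
          </gpxtpx:TrackPointExtension>
        </extensions>
      </trkpt>
    </trkseg>
  </trk>
</gpx>'''

# Write the GPX namespace as the default namespace (<gpx>, <trk>, ...)
# instead of an auto-generated ns0: prefix.
ET.register_namespace('', GPX_NS)

root = ET.fromstring(SAMPLE)

# iter() visits every <trkpt> at any depth, so no root[1][2] indexing is
# needed, and each <extensions> is removed through the parent we hold.
for trkpt in root.iter(f'{{{GPX_NS}}}trkpt'):
    for ext in trkpt.findall(f'{{{GPX_NS}}}extensions'):
        trkpt.remove(ext)

result = ET.tostring(root, encoding='unicode')
print(result)
```

With a real file, the same loop would run on ET.parse('input.gpx').getroot(), followed by tree.write('output.gpx', encoding='UTF-8', xml_declaration=True).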

  • 2

    Does it need to be in Python? This can also be done with sed: sed '/<extensions>/,/<\/extensions>/d' input.gpx

  • Thanks for the tip, Francisco. I already solved it with something very similar in Vim, but I'm trying to improve my Python.

  • Ah, right. Since you said it was just to pass a file to a friend, I thought you only wanted a quick fix. :)

  • 1

    I have no time to write an example now, so I'll comment rather than post an answer. But take a look at the BeautifulSoup library, which is widely used for this kind of script.

  • Thanks @Thiago-silva. I followed your recommendation and posted an answer here. BeautifulSoup is very cool.

5 answers

6


I followed the hint left in the comments on the question and solved the problem using the BeautifulSoup 4 library (thank you, @Thiago-silva):

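from bs4 import BeautifulSoup

# The 'xml' parser requires lxml to be installed.
with open('input.gpx') as source:
  soup = BeautifulSoup(source, 'xml')

# extract() detaches each <extensions> element from the tree.
for ext in soup.find_all('extensions'):
  ext.extract()

with open('output.gpx', 'w') as output:
  output.write(soup.prettify())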

2

I recommend the lxml library for its performance and simplicity:

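from lxml import etree

gpx = etree.parse('input.gpx')

# GPX 1.1 elements live in a default namespace, so XPath needs a prefix for it.
ns = {'gpx': 'http://www.topografix.com/GPX/1/1'}

for node in gpx.xpath('//gpx:trkpt/gpx:extensions', namespaces=ns):
    node.getparent().remove(node)

gpx.write('output.gpx', xml_declaration=True, encoding='UTF-8')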

I used XPath to simplify things.
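If installing lxml is not an option, the standard library's ElementTree supports a small XPath subset that is probably enough for this case. The sketch below is only an illustration on a simplified in-memory sample (the <hr> child is a stand-in for the real Garmin extension content), and the prefix name "gpx" in the map is an arbitrary choice. ElementTree has no getparent(), so the loop searches from each <trkpt> and removes through it.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map used only inside the path expressions below;
# the prefix name "gpx" is an arbitrary choice.
NS = {'gpx': 'http://www.topografix.com/GPX/1/1'}

SAMPLE = '''<gpx version="1.1" xmlns="http://www.topografix.com/GPX/1/1">
  <trk>
    <trkseg>
      <trkpt lon="00" lat="00">
        <ele>000</ele>
        <extensions><hr>99</hr></extensions>
      </trkpt>
      <trkpt lon="01" lat="01">
        <ele>001</ele>
      </trkpt>
    </trkseg>
  </trk>
</gpx>'''

root = ET.fromstring(SAMPLE)

# './/gpx:trkpt' finds every track point at any depth.
track_points = root.findall('.//gpx:trkpt', NS)

removed = 0
for trkpt in track_points:
    # 'gpx:extensions' matches direct children only, so trkpt is the parent.
    for ext in trkpt.findall('gpx:extensions', NS):
        trkpt.remove(ext)
        removed += 1

print(len(track_points), removed)
```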

2

The easiest way to work with XML that I have found so far is xmltodict, although that doesn't mean it performs well.

Here is an example of how to use it:

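import xmltodict

doc = xmltodict.parse("""
<mydocument has="an attribute">
<and>
<many>elements</many>
<many>more elements</many>
</and>
<plus a="complex">
element as well
</plus>
</mydocument>
""")

# Attributes are exposed under keys prefixed with '@'.
print(doc['mydocument']['@has'])
# Deleting a key removes that element and everything inside it.
del doc['mydocument']['and']
print(xmltodict.unparse(doc))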

After deleting the node with del, you call unparse() and it generates the XML again!

1

I ran a test here with your code and the 'extensions' elements were not removed (maybe because they are not direct children of root?). Anyway, the only difference I noticed is that your source file is encoded in UTF-8 while your output is written in ASCII (according to the ElementTree documentation, the default encoding of the write method is US-ASCII). Try writing in UTF-8 and see whether the result matches the original more closely.
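To make the encoding point concrete, here is a small sketch using ET.tostring, which shares the US-ASCII default with write. The element and its text are hypothetical. Both outputs are valid XML for the same text, so the encoding affects how non-ASCII characters are written, not the tag names.

```python
import xml.etree.ElementTree as ET

# A hypothetical element with a non-ASCII character in its text.
elem = ET.fromstring('<name>São Paulo</name>')

# With no encoding argument the serializer uses US-ASCII and escapes
# anything outside that range as a numeric character reference.
ascii_bytes = ET.tostring(elem)

# Asking for UTF-8 keeps the character itself, encoded as UTF-8 bytes.
utf8_bytes = ET.tostring(elem, encoding='UTF-8')

print(ascii_bytes)
print(utf8_bytes)
```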

The code that I used here (and that actually removed the desired items) is like this:

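import xml.etree.ElementTree as ET
tree = ET.parse('input.gpx')

for node in tree.iter():
    # Iterate over a copy: removing children while iterating can skip siblings.
    for child in list(node):
        # Compare the local name so the match works with or without a namespace.
        if child.tag.rsplit('}', 1)[-1] == 'extensions':
            node.remove(child)

tree.write('output.gpx', encoding='UTF-8')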
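The nested loop above is needed because ElementTree elements hold no reference to their parent. Another common workaround is to build a child-to-parent lookup once and then remove each match through it. The sketch below assumes a simplified in-memory sample; the <hr> child and the name parent_of are illustrative only.

```python
import xml.etree.ElementTree as ET

GPX_NS = 'http://www.topografix.com/GPX/1/1'

SAMPLE = f'''<gpx version="1.1" xmlns="{GPX_NS}">
  <trk>
    <trkseg>
      <trkpt lon="00" lat="00">
        <ele>000</ele>
        <extensions><hr>99</hr></extensions>
      </trkpt>
      <trkpt lon="01" lat="01">
        <ele>001</ele>
        <extensions><hr>98</hr></extensions>
      </trkpt>
    </trkseg>
  </trk>
</gpx>'''

root = ET.fromstring(SAMPLE)

# Map every element to its parent in a single pass over the tree.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Materialize the matches first so the tree is not modified mid-iteration.
for ext in list(root.iter(f'{{{GPX_NS}}}extensions')):
    parent_of[ext].remove(ext)

# Local names of the elements that remain, in document order.
remaining = [el.tag.rsplit('}', 1)[-1] for el in root.iter()]
print(remaining)
```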
  • Hi Luiz. Thanks for the answer. I fixed the solution I originally posted (it really was broken), but the initial problem remains. I also took the opportunity to improve the description of the problems. Please edit your code to fix the indentation of the removal statement; there is a stray <TAB> in the way. I tried to edit it myself, but the system prohibits one-character edits.

  • Oops, you're welcome. I was going to go back and study your question, but I just saw that you already solved it. :)
