Printing specific attributes of XML tags in Python

Hello,

I'm starting to use Python for a research project, and as an exercise my teacher gave me an XML file that I need to parse, printing the contents of its tags and some specific attributes of those tags. I am a complete beginner in the language, and I have not managed to print these attributes. After reading some other forums I arrived at code that prints the tags of the file, but not their attributes. The file is an NLP example in which each tag represents a node with attributes such as id, word, text, lemma, etc. Below are the code, the current output, and part of my file, for a better understanding of the problem.

Code:

import xml.etree.ElementTree as ET
import requests


arquivo = "C1_Extrato_2_Palavras.xml"
tree = ET.parse(arquivo)

root = tree.getroot()

filtro = "*"
for child in root.iter(filtro):
    print(child.tag, child.text)

print("\n")

for child in root.findall("body"):
    for esse in child.findall("graph"):
        print(esse.text)

Output:

corpus 


body 

s 

graph 

terminals 

t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
t None
nonterminals 

nt 

edge None
nt 

edge None
edge None
edge None
edge None
nt 

Part of the XML file:

<?xml version="1.0" encoding="UTF-8"?>
<corpus>

    <body>
<s id="s1" ref="1" source="Running text" forest="1" text="Um acidente aéreo na localidade de Bukavu, no leste da República Democrática do Congo, matou 17 pessoas na quinta-feira à tarde, informou hoje um porta-voz das Nações Unidas.">
    <graph root="s1_500">
        <terminals>
            <t id="s1_1" word="Um" lemma="um" pos="art" morph="M S" extra="* "/>
            <t id="s1_2" word="acidente" lemma="acidente" pos="n" morph="M S" sem="event" extra="--"/>
            <t id="s1_3" word="aéreo" lemma="aéreo" pos="adj" morph="M S" extra="nh np-close"/>
            <t id="s1_4" word="em" lemma="em" pos="prp" morph="--" extra="sam- np-long"/>
            <t id="s1_5" word="a" lemma="o" pos="art" morph="F S" extra="-sam "/>
            <t id="s1_6" word="localidade" lemma="localidade" pos="n" morph="F S" sem="Labs Lciv" extra="--"/>
            <t id="s1_7" word="de" lemma="de" pos="prp" morph="--" extra="np-close"/>
            <t id="s1_8" word="Bukavu" lemma="Bukavu" pos="prop" morph="M/F S" extra="civ * heur"/>
            <t id="s1_9" word="," lemma="--" pos="pu" morph="--" extra="--"/>
            <t id="s1_10" word="em" lemma="em" pos="prp" morph="--" extra="sam-"/>
            <t id="s1_11" word="o" lemma="o" pos="art" morph="M S" extra="-sam "/>
            <t id="s1_12" word="leste" lemma="leste" pos="n" morph="M S" sem="dir" extra="--"/>
            <t id="s1_13" word="de" lemma="de" pos="prp" morph="--" extra="sam- np-close"/>
            <t id="s1_14" word="a" lemma="o" pos="art" morph="F S" extra="-sam "/>
            <t id="s1_15" word="República_Democrática_do_Congo" lemma="República_Democrática_do_Congo" pos="prop" morph="F S" extra="civ *"/>
            <t id="s1_16" word="," lemma="--" pos="pu" morph="--" extra="--"/>
            <t id="s1_17" word="matou" lemma="matar" pos="v-fin" morph="PS 3S IND VFIN" extra="cjt-head cjt-head-STA fmc mv"/>
            <t id="s1_18" word="17" lemma="17" pos="num" morph="F P" extra="card"/>
            <t id="s1_19" word="pessoas" lemma="pessoa" pos="n" morph="F P" sem="H" extra="--"/>
            <t id="s1_20" word="em" lemma="em" pos="prp" morph="--" extra="sam-"/>
            <t id="s1_21" word="a" lemma="o" pos="art" morph="F S" extra="-sam "/>
            <t id="s1_22" word="quinta-feira" lemma="quinta-feira" pos="n" morph="F S" sem="temp" extra="--"/>
            <t id="s1_23" word="a" lemma="a" pos="prp" morph="--" extra="sam-"/>
            <t id="s1_24" word="a" lemma="o" pos="art" morph="F S" extra="-sam "/>
            <t id="s1_25" word="tarde" lemma="tarde" pos="n" morph="F S" sem="per" extra="--"/>
            <t id="s1_26" word="," lemma="--" pos="pu" morph="--" extra="--"/>
            <t id="s1_27" word="informou" lemma="informar" pos="v-fin" morph="PS 3S IND VFIN" extra="nosubj nosubj cjt-STA vH fmc mv"/>
            <t id="s1_28" word="hoje" lemma="hoje" pos="adv" morph="--" extra="--"/>
            <t id="s1_29" word="um" lemma="um" pos="art" morph="M S" extra="--"/>
            <t id="s1_30" word="porta-voz" lemma="porta-voz" pos="n" morph="M S" sem="tool Hprof" extra="--"/>
            <t id="s1_31" word="de" lemma="de" pos="prp" morph="--" extra="sam-"/>
            <t id="s1_32" word="as" lemma="o" pos="art" morph="F P" extra="-sam "/>
            <t id="s1_33" word="Nações_Unidas" lemma="Nações_Unidas" pos="prop" morph="F P" extra="org * newlex"/>
            <t id="s1_34" word="." lemma="--" pos="pu" morph="--" extra="--"/>
        </terminals>

        <nonterminals>
            <nt id="s1_500" cat="s">
                <edge label="STA" idref="s1_501"/>
            </nt>
            <nt id="s1_501" cat="par">
                <edge label="CJT" idref="s1_502"/>
                <edge label="PU" idref="s1_26"/>
                <edge label="CJT" idref="s1_516"/>
                <edge label="PU" idref="s1_34"/>
            </nt>

Sorry for the length of the question and for its formatting; this is the first time I have asked a question here (tips and suggestions are welcome).

1 answer

I can't speak to the other methods, but whenever I have used this module, the function I used to build the tree was fromstring, which receives a string in XML format and returns the root element of the tree.
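
For comparison, here is a minimal sketch suggesting that fromstring and parse(...).getroot() lead to the same kind of root element. The XML string is a shortened, made-up stand-in for the file in the question, and io.StringIO is used only so the example runs without a file on disk; the variable names are illustrative.

```python
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for the question's file (an assumption for illustration).
xml_text = '<corpus><body><s id="s1"/></body></corpus>'

# fromstring: takes the XML text and returns the root Element directly.
root_from_string = ET.fromstring(xml_text)

# parse: takes a filename or file-like object and returns an ElementTree;
# getroot() then gives the root Element.
root_from_parse = ET.parse(io.StringIO(xml_text)).getroot()

print(root_from_string.tag, root_from_parse.tag)
```

Either route should work for the question's file; parse simply saves the step of reading the file into a string first.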

Your code didn't look very wrong to me; you just relied on the wrong sources for how to apply it. I always recommend reading the documentation first and asking questions later, as the police do.

To work with the tree, there are a few basic properties for reading its data. As in any XML or HTML-like text, elements can have text, children, attributes, and a tag name, and ElementTree exposes all of them. Once you have an element, whether returned by fromstring or obtained in any other way, you can read its text, children, attributes, and tag name through, respectively, text, list(element), attrib, and tag.

The two less obvious cases are children and attributes: list(element) converts the direct children of that element into a list, while attrib returns a dictionary that maps each attribute name to its value.
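
As a rough illustration of these four properties, the sketch below reads a small fragment modelled on the question's file (shortened here for illustration; the names terminals and first are just placeholders).

```python
import xml.etree.ElementTree as ET

# A small fragment modelled on the question's file (shortened for illustration).
xml_text = """
<terminals>
    <t id="s1_1" word="Um" lemma="um" pos="art"/>
    <t id="s1_2" word="acidente" lemma="acidente" pos="n"/>
</terminals>
"""

terminals = ET.fromstring(xml_text)
children = list(terminals)   # direct children as a list
first = children[0]

print(terminals.tag)         # tag name of the element
print(len(children))         # number of direct children
print(first.attrib)          # dictionary of attribute name -> value
print(first.attrib["word"])  # a single attribute value
print(first.text)            # None: <t .../> is self-closing and has no text
```

The last line is the reason the question's output shows "t None": the data of each t element lives in its attributes, not in its text.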

Your code, slightly modified (I only made it more readable and added the reading of the attributes), would look like this:

import xml.etree.ElementTree as ET

# Read the file as UTF-8, the encoding declared in its XML header.
with open("C1_Extrato_2_Palavras.xml", encoding="utf-8") as XMLFile:
    textoArquivo = XMLFile.read()

# fromstring returns the root element directly.
root = ET.fromstring(textoArquivo)

hasAttributes = []
doesntHaveAttributes = []

# iter() with no argument walks the root and all of its descendants.
for child in root.iter():
    if child.attrib:
        print(f'The tag {child.tag} has these attributes and values:')
        hasAttributes.append(child.tag)
    else:
        print(f'The tag {child.tag} has no attributes.')
        doesntHaveAttributes.append(child.tag)

    # attrib is a dictionary: attribute name -> value.
    for atributo, valor in child.attrib.items():
        print(f'\t{atributo}: {valor}')
    print("\n")

print('==' * 25)
# The totals count elements; set() lists each distinct tag name only once.
print(f'A total of {len(hasAttributes)} elements have attributes, with tags:',
      ', '.join(sorted(set(hasAttributes))))
print(f'And a total of {len(doesntHaveAttributes)} elements have NO attributes, with tags:',
      ', '.join(sorted(set(doesntHaveAttributes))))
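
If only a few specific attributes of the t elements are needed, as in the question, a shorter version is possible. The following is a minimal sketch under assumptions: the XML is a shortened stand-in for the real file, the choice of attributes (id, word, lemma, sem) and the name rows are illustrative, and "-" is an arbitrary placeholder for a missing attribute.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the question's file (an assumption for illustration).
xml_text = """
<corpus><body><s id="s1"><graph root="s1_500"><terminals>
    <t id="s1_1" word="Um" lemma="um" pos="art" morph="M S" extra="* "/>
    <t id="s1_2" word="acidente" lemma="acidente" pos="n" morph="M S" sem="event" extra="--"/>
</terminals></graph></s></body></corpus>
"""
root = ET.fromstring(xml_text)

rows = []
# iter("t") visits every <t> element, however deeply it is nested.
for t in root.iter("t"):
    # get() returns the attribute value, or the default when it is absent
    # (not every <t> has a "sem" attribute).
    rows.append((t.get("id"), t.get("word"), t.get("lemma"), t.get("sem", "-")))

for row in rows:
    print(*row)
```

Using get() rather than attrib[...] avoids a KeyError for optional attributes such as sem.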
  • Thanks for the tip. I used some functions I found in the documentation, and the solution ended up similar to this code of yours. My code turned out shorter, though, because what I needed was simple and I just wasn't using the right functions.
