Generate an XML file of more than 1 million records with xml.etree.ElementTree in Python

Asked

Viewed 282 times

1

Please, how can I generate an XML file with xml.etree.ElementTree in Python, but in such a way that it writes the file in batches of 5,000 records, without having to wait for the whole recordset to load?

  1. The data is extracted from a PostgreSQL database with Pandas and NumPy.
  2. What I’ve done so far is working, but without the partial generation described above.

Code:

import xml.etree.ElementTree as ET
import pandas as pd
import numpy as np
import sqlalchemy
from datetime import datetime

def main(query):
    df = np.array(pd.read_sql(query, engine))
    gerar_xml(df)

def gerar_xml(rs):
    add = ET.Element('add')

    for i in rs:
        doc = ET.Element('doc', abc='' + str(var) + '')
        ET.SubElement(doc, 'field', name='id').text = str(i[0])
        ET.SubElement(doc, 'field', name='cc').text = str(i[1])
        ET.SubElement(doc, 'field', name='db').text = str(i[2])

        ...
            # various fields and ifs controlling other elements
        ...

        add.append(doc)

    arquivo = ET.ElementTree(add)
    arquivo.write('Name.xml', encoding='UTF-8', xml_declaration=True)

if __name__ == '__main__':
    engine = sqlalchemy.create_engine('postgresql+psycopg2://user:pwd@Server:PORT/Base')

    query = "select * from table"

    main(query)
  • (now we’re talking :-) - it will just take a little work to weave in some considerations there - but we’ll find a way)

  • :-) thanks @jsbueno!

1 answer

0

So - the way you’re doing it, there are two things that require all the records to be in memory at the same time. One of them is Pandas: it will read all your database records and build a dataframe holding all the data in memory, in Python's own format.

If you were actually using any of Pandas' features, we would have to paginate the database reads (with a LIMIT clause in the query) and build partial dataframes - and if any Pandas result depended on all the records, we would have to think of strategies to aggregate that, etc. (but take a look at the Dask project if you need something like this).
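As an aside, `pandas.read_sql` can do that pagination for you: passing `chunksize` makes it return an iterator of partial dataframes instead of one big one. A minimal sketch - here a throwaway in-memory SQLite table (with a hypothetical `registros` schema) stands in for the Postgres database, so the example runs anywhere; with Postgres you would pass your SQLAlchemy engine instead:

```python
import sqlite3
import pandas as pd

# SQLite stands in for Postgres so the sketch is self-contained;
# the table name and columns are made up for the example.
conn = sqlite3.connect(':memory:')
conn.execute('create table registros (id integer, cc text, db text)')
conn.executemany('insert into registros values (?, ?, ?)',
                 [(i, f'cc{i}', f'db{i}') for i in range(12)])

# chunksize makes read_sql yield partial DataFrames instead of
# loading the whole result set into memory at once
batches = [chunk for chunk in
           pd.read_sql('select * from registros', conn, chunksize=5)]
print([len(b) for b in batches])  # 12 rows in chunks of 5 -> [5, 5, 2]
```

Each chunk could then be handed straight to the XML-writing function, so memory usage stays bounded by the chunk size.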

If you want to create a single XML file with all 1 million records, then we have to deal with the second "villain": since XML has one element that is the "parent" of all the others (the "add" element in this case), you would have to write the opening of that element to the output file by hand, write out the desired blocks as they are built, and write the closing </add> by hand at the end of the file. It's not hard to do - but it would take me some 10-15 minutes to write it all out here.

To generate several XML files instead, each with 5 thousand records, things are much simpler - just attach a counter to your output function and start a new file every time the limit is reached:

import xml.etree.ElementTree as ET
import pandas as pd
import numpy as np
import sqlalchemy
from datetime import datetime

RECORDS_PER_FILE = 5000

def main(query):
    # there is no need to load the whole query into a dataframe
    # df = np.array(pd.read_sql(query, engine))
    gerar_xml(query)

def write_xml(element, num):
    arquivo = ET.ElementTree(element)
    arquivo.write(f'Name{num:03d}.xml', encoding='UTF-8', xml_declaration=True)

def gerar_xml(records):
    add = None
    file_num = 0

    # you gain nothing by calling the loop variable "i" -
    # use a name that says what it is. "i" belongs to languages that
    # only have numeric for loops, where you need the index inside
    # the loop to fetch the object
    for index, record in enumerate(records):
        if index % RECORDS_PER_FILE == 0:
            # compare against None explicitly: an Element with no
            # children is falsy, so a plain "if add:" can misfire
            if add is not None:
                write_xml(add, file_num)
                file_num += 1
            add = ET.Element('add')
        # your code does not show where 'var' comes from. In any case,
        # "adding" strings to build a bigger one is a JavaScript habit,
        # from a language with few string methods. In Python, use
        # f-strings. Besides, concatenating empty strings is a no-op:
        # it does nothing.
        doc = ET.Element('doc', abc=str(var))
        ET.SubElement(doc, 'field', name='id').text = str(record[0])
        ET.SubElement(doc, 'field', name='cc').text = str(record[1])
        ET.SubElement(doc, 'field', name='db').text = str(record[2])

        ...
            # various fields and ifs controlling other elements
        ...

        add.append(doc)

    # write out the final, possibly partial, batch of records
    if add is not None:
        write_xml(add, file_num)


if __name__ == '__main__':
    engine = sqlalchemy.create_engine('postgresql+psycopg2://user:pwd@Server:PORT/Base')

    query = "select * from table"

    main(query)

But the good news is that, since you are not using anything from Pandas and are just iterating over the records, the Postgres query you already run can give you that iteration directly - you can pass the query result itself to `gerar_xml`, with no dataframe at all.
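A sketch of that idea with SQLAlchemy: the `stream_results` execution option asks the driver for a server-side cursor, so with Postgres/psycopg2 rows are fetched in batches rather than all at once. Here an in-memory SQLite database (with a made-up `registros` table) stands in for Postgres so the example is self-contained; SQLAlchemy simply ignores the option on backends without server-side cursors.

```python
import sqlalchemy
from sqlalchemy import text

# SQLite stands in for Postgres so the sketch runs anywhere;
# in production you would keep your postgresql+psycopg2 engine
engine = sqlalchemy.create_engine('sqlite://')

with engine.connect() as conn:
    conn.execute(text('create table registros (id integer, cc text)'))
    conn.execute(text("insert into registros values (1, 'a'), (2, 'b')"))

    # stream_results requests a server-side cursor where supported,
    # so iterating the result does not load every row into memory
    result = conn.execution_options(stream_results=True).execute(
        text('select id, cc from registros order by id'))

    # Row objects support positional indexing, like record[0] above
    rows = [(record[0], record[1]) for record in result]
```

The `result` object could be passed straight to `gerar_xml` in place of the NumPy array, since the function only ever indexes each record by position.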

  • Got it @jsbueno, I’ll make a few more attempts here. Thank you!
