So - the way you're doing it, there are two things that need all the records in memory at the same time. One of them is Pandas: it will read all your database records and build a dataframe holding all the data in memory, in Python's own format.
If you were really using any of Pandas' features, you would have to paginate the database reads (with a LIMIT clause in the query) and build partial dataframes - and if some Pandas result depended on all the records, you would have to think of strategies to group those partial results, etc. (but look at the Dask project if you need something like this).
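A minimal sketch of that pagination idea, using the standard-library SQLite driver just for illustration (the table name, page size, and `paginate` helper are mine, not from the original code; with Postgres the same LIMIT/OFFSET clause applies):

```python
import sqlite3

PAGE_SIZE = 2  # tiny for the demo; in practice use thousands


def paginate(conn, base_query, page_size=PAGE_SIZE):
    """Yield successive pages of rows using LIMIT/OFFSET."""
    offset = 0
    while True:
        rows = conn.execute(
            f"{base_query} LIMIT ? OFFSET ?", (page_size, offset)
        ).fetchall()
        if not rows:
            break
        yield rows
        offset += page_size


conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer, name text)")
conn.executemany("insert into t values (?, ?)", [(i, f"n{i}") for i in range(5)])

# 5 rows with page_size=2 -> pages of sizes 2, 2, 1
pages = list(paginate(conn, "select * from t order by id"))
```

Each page could then be fed to `pd.DataFrame` (or processed directly), so only one page of records lives in memory at a time.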
If you want to create a single XML file with all 1 million records, then we have to deal with the second "villain": since XML has one element that is the "parent" of all the others (the "add" element in this case), you would write the opening of that element to the output file by hand, write the desired blocks, and write the closing </add> by hand at the end of the file. It's not hard to do - but it would take some 10-15 minutes to write it all out here.
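The hand-written-wrapper approach can be sketched like this (the function name and the single `id` field are my placeholders; the real code would emit all the fields shown further below):

```python
import xml.etree.ElementTree as ET


def write_single_xml(records, path):
    """Stream all records into one XML file, never building the whole tree."""
    with open(path, "w", encoding="utf-8") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write("<add>\n")  # opening tag written by hand
        for record in records:
            # Only one small <doc> element exists in memory at a time
            doc = ET.Element("doc")
            ET.SubElement(doc, "field", name="id").text = str(record[0])
            out.write(ET.tostring(doc, encoding="unicode") + "\n")
        out.write("</add>\n")  # closing tag written by hand
```

Because each `<doc>` is serialized and discarded immediately, memory use stays constant no matter how many records flow through.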
To generate several XML files, each with 5 thousand records, things are much simpler: just keep a counter in your output function and start a new file every time the limit is reached:
import xml.etree.ElementTree as ET

import sqlalchemy

RECORDS_PER_FILE = 5000

def main(query):
    # There is no need to load the whole query into a dataframe:
    # df = np.array(pd.read_sql(query, engine))
    with engine.connect() as conn:
        gerar_xml(conn.execute(sqlalchemy.text(query)))

def write_xml(element, num):
    arquivo = ET.ElementTree(element)
    arquivo.write(f'Name{num:03d}.xml', encoding='UTF-8', xml_declaration=True)

def gerar_xml(records):
    add = None
    file_num = 0
    # You gain nothing by calling the loop variable "i" -
    # use a name that says what it is. "i" belongs to languages
    # that only have a numeric for, where you then need the index
    # to fetch the object inside the loop.
    for index, record in enumerate(records):
        if index % RECORDS_PER_FILE == 0:
            if add is not None:
                write_xml(add, file_num)
            file_num += 1
            add = ET.Element('add')
        # Your code does not show where 'var' comes from. In any case,
        # "adding" strings to build a bigger string is a JavaScript habit,
        # a language with few string methods. In Python, use f-strings -
        # and concatenating empty strings is a no-op anyway: it does nothing.
        doc = ET.Element('doc', abc=str(var))
        ET.SubElement(doc, 'field', name='id').text = str(record[0])
        ET.SubElement(doc, 'field', name='cc').text = str(record[1])
        ET.SubElement(doc, 'field', name='db').text = str(record[2])
        # ... several other fields and "if"s to control other elements ...
        add.append(doc)
    # Write the final batch of records
    if add is not None:
        write_xml(add, file_num)

if __name__ == '__main__':
    engine = sqlalchemy.create_engine('postgresql+psycopg2://user:pwd@Server:PORT/Base')
    query = "select * from table"
    main(query)
But the good news is that, since you are not using anything from Pandas - just iterating over the records - the Postgres query you run already streams them for you
(now yes :-) - it just takes weaving in a few considerations here and there, but we'll find a way)
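The point about iteration can be seen with the standard-library SQLite driver (a stand-in for illustration - the table and values are mine): iterating the cursor fetches rows on demand, instead of materializing everything with `fetchall()`. With psycopg2 and Postgres, the equivalent for truly keeping rows on the server is a named (server-side) cursor, created with `conn.cursor(name=...)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer)")
conn.executemany("insert into t values (?)", [(i,) for i in range(10)])

# The cursor is an iterator: rows are fetched as the loop consumes them,
# so there is never a big list of all rows in memory.
total = 0
for row in conn.execute("select id from t"):
    total += row[0]
```

This is exactly the access pattern `gerar_xml` above relies on: one record at a time, regardless of how many the query returns.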
– jsbueno
:-) thanks @jsbueno!
– Luciano