I have a file with about 3 million lines. I need to read it line by line, apply some changes to each line, store the results in a list, and then write them to another file.
The problem is performance: it is too slow.
My idea is to split the file into 10 parts (300,000 lines each) and process one part at a time. Once 300,000 lines have been processed, I write them to the output file, then read the next 300,000, and so on until the source file ends.
My question: given a file with 3 million lines, I would like to read only a range of lines at a time (in blocks of 300,000). Is this possible in Python?
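For reference, one way to read only a range of lines is `itertools.islice` from the standard library. The sketch below is a minimal illustration, not the code from this question; the helper names `read_line_range` and `iter_chunks` are hypothetical. Note that `islice` still reads the skipped lines sequentially, it just does not keep them in memory.

```python
from itertools import islice


def read_line_range(path, start, stop):
    """Return the lines whose 0-based index is in [start, stop)."""
    with open(path, "r") as handle:
        # islice consumes and discards the first `start` lines,
        # then yields lines until index `stop` is reached.
        return list(islice(handle, start, stop))


def iter_chunks(path, chunk_size):
    """Yield successive lists of at most `chunk_size` lines."""
    with open(path, "r") as handle:
        while True:
            # Each call continues from where the previous chunk ended.
            chunk = list(islice(handle, chunk_size))
            if not chunk:
                break
            yield chunk
```

With a 3-million-line file, `iter_chunks(path, 300000)` would yield ten lists, each of which could be processed and written out before the next one is read.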
Here is the method:
def processa_arquivos_rlt(arquivos_rlt, newFileName, sms):
try:
    for arquivo in arquivos_rlt:
        if Modulo.andamento_processo == 0:
            break
        Modulo.arquivo = 'Aguarde...'
        Modulo.arquivo = (arquivo[arquivo.rindex('/')+1:])
        contador = 1
        with open(arquivo, 'r') as linhas_rlt, open(newFileName, "at") as linhas_saida:
            for linha in linhas_rlt:
                if Modulo.andamento_processo == 0:
                    break
                item = [i.strip() for i in linha.split(";")]
                linha = Linha()
                linha.dddOrigem = item[2]
                linha.numeroOrigem = item[3]
                linha.valorBruto = item[15]
                if linha.valorBruto.find(",") > 0:
                    if len(''.join(linha.valorBruto[linha.valorBruto.rindex(",")+1:].split())) == 1:
                        linha.valorBruto = linha.valorBruto + '0'
                else:
                    if (len(linha.valorBruto)) <= 2:
                        linha.valorBruto = linha.valorBruto + '00'
                linha.valorBruto = re.sub(r'[^0-9]', '', linha.valorBruto)
                linha.dddDestino = item[7]
                linha.numeroDestino = item[8]
                linha.localidade = item[10]
                linha.codigoServico = item[17]
                linha.contrato = item[18]
                if 'claro' in arquivo.lower():
                    linha.operadora = '36'
                    # [Solved by removing this piece of code. Instead of running
                    # a query on every iteration, I now run the query only
                    # once, store the result in a list, and iterate over that list.
                    # The query is executed only once!]
                    """
                    cc = CelularesCorporativos.objects.filter(ddd=linha.dddOrigem, numero=linha.numeroOrigem)
                    if len(cc) > 0:
                        if 'vc1' in linha.localidade.lower() or 'sms' in linha.localidade.lower():
                                if linha.dddOrigem == linha.dddDestino:
                                    if int(linha.valorBruto) > 0:
                                        linha.valorBruto = '0'
                    """
                    # invalid calls
                    if len(linha.numeroDestino) < 8 and linha.numeroDestino != '100' \
                        and int(linha.valorBruto) > 0:
                        if item[0] == '3' and linha.dddDestino == '10' \
                            and linha.numeroDestino == '0' and 'secretaria claro' in linha.localidade.lower():
                    ...
Could you add the code you have written so far to the body of the question?
– Leonel Sanches da Silva
Boy... that code could use a little work, huh? =)
– elias
I solved it as follows: I was running a database query on every iteration, which consumed a lot of memory because the list was huge. Now I run the query once, store the result in a list, and check that list on each iteration. Performance improved and I no longer run out of memory!
– Cristiano Pires
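The fix described in the comment above (query once, then check in memory) can be sketched roughly as follows. This is an assumption-laden illustration, not the author's final code: it swaps the list for a `set` of `(ddd, numero)` pairs so each membership check is constant time, and the function names `build_corporate_lookup` and `adjusted_value` are hypothetical. In a Django project the rows could plausibly come from a single `CelularesCorporativos.objects.values_list('ddd', 'numero')` call made before the loop.

```python
def build_corporate_lookup(rows):
    """Build a set of (ddd, numero) pairs from rows fetched by ONE query."""
    return {(ddd, numero) for ddd, numero in rows}


def adjusted_value(lookup, ddd_origem, numero_origem, ddd_destino,
                   localidade, valor_bruto):
    """Mirror the commented-out rule from the question without hitting the DB.

    A corporate number making a 'vc1' or 'sms' call within the same DDD
    has a positive value zeroed; otherwise the value is returned unchanged.
    """
    local = localidade.lower()
    if ((ddd_origem, numero_origem) in lookup
            and ('vc1' in local or 'sms' in local)
            and ddd_origem == ddd_destino
            and int(valor_bruto) > 0):
        return '0'
    return valor_bruto
```

The lookup is built once before the file loop, and `adjusted_value` is then called per line in place of the per-iteration `filter(...)` query.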
Could you post your solution as an answer and accept it? That would keep the question better organized. Here we do not put "solved" in the title or in the body of the question (I will edit to remove it, OK?). Thanks.
– bfavaretto
Use the Apache Spark project.
– user28277