Python 3 - CSV to Excel conversion problem with list output

Good afternoon. I am having a problem converting a CSV file to an Excel file with openpyxl. The code is meant to extract the text of a PDF and paste it into a sheet of an existing, pre-formatted Excel workbook.

What I tried:

import PyPDF2
import pandas as pd
from openpyxl import Workbook, load_workbook
import string
import csv
    

pdfFileObj=open(r".\pasta_460\pdf_460.pdf",'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

paginas_pdf = []
for page in pdfReader.pages:
    ddd = page.extractText()
    paginas_pdf.append(ddd)

df = pd.DataFrame(paginas_pdf)
df.to_csv(r".\pasta_460\pdf_em_csv_460.csv",encoding='utf-8')

book = load_workbook(r".\teste_template_planilha.xlsx")
writer = pd.ExcelWriter(r".\teste_template_460_modelada.xlsx", engine='openpyxl')
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
df.to_excel(writer, 'Sheet12')
df.to_excel(writer, 'Sheet12', startrow=1, startcol=1, header=False, index=False)
writer.save()

It did run and generated the workbook with Sheet12 containing the data. However, all the data ends up in a single Excel row. I believe this happens because the text is stored inside a list (paginas_pdf), but I cannot find a solution to this problem.

I would like the data to be written to Sheet12 line by line, that is, with the delimiter ":" breaking the information up into separate rows.

Below is the output of df.head(10). The DataFrame has only 3 rows. [image]

  • To save to Excel, df.to_excel("output.xlsx") would be enough; you could delete every line from book = onward.

  • Even with df.to_excel("output.xlsx") all the data comes out in a single row. I also tried df.replace('\n', ' ') and it keeps the single-row structure, as if it were an array. The code from book = onward is there because I want to write into the same workbook that I use as a template.

  • Run df.head() and update the post.

  • @Paulomarques, see whether the images make the problem easier to identify. Thanks for your attention.

  • Before the line paginas_pdf.append(ddd), add items = ddd.split("\n"), then replace paginas_pdf.append(ddd) with paginas_pdf += items. That should solve it.

  • You’re a genius! Hahaha. Thanks for the help. The solution worked perfectly.
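
To see why the suggestion in the comments changes the result, here is a minimal sketch with made-up page text (the strings are illustrative, not taken from the PDF in the question). append adds each page as a single list item, so the DataFrame gets one row per page; splitting on "\n" and extending the list gives one item per line of text.

```python
# Hypothetical text for two PDF pages; real extractText() output will differ.
pages = ["Name: Ana\nCity: Recife", "Name: Bruno\nCity: Natal"]

# Original approach: one list item per page.
per_page = []
for text in pages:
    per_page.append(text)

# Suggested approach: one list item per line of text.
per_line = []
for text in pages:
    per_line += text.split("\n")

print(len(per_page))  # 2
print(len(per_line))  # 4
```

Passing per_line to pd.DataFrame would then give one row per line instead of one row per page.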

1 answer

Documenting the solution suggested in the comment

import PyPDF2
import pandas as pd
from openpyxl import Workbook, load_workbook
import string
import csv
    

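# PdfFileReader/extractText are the PyPDF2 names from before 3.0;
# later releases call them PdfReader/extract_text.
pdfFileObj = open(r".\pasta_460\pdf_460.pdf", 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)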

paginas_pdf = []
for page in pdfReader.pages:
    ddd = page.extractText()
    items = ddd.split("\n")   # added: split the page text into lines
    paginas_pdf += items      # added: one list item per line
    # paginas_pdf.append(ddd) <- removed

df = pd.DataFrame(paginas_pdf)
df.to_csv(r".\pasta_460\pdf_em_csv_460.csv",encoding='utf-8')

book = load_workbook(r".\teste_template_planilha.xlsx")
writer = pd.ExcelWriter(r".\teste_template_460_modelada.xlsx", engine='openpyxl')
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
df.to_excel(writer, 'Sheet12')
df.to_excel(writer, 'Sheet12', startrow=1, startcol=1, header=False, index=False)
writer.save()
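
The question also mentions the ":" delimiter. If each extracted line looks like "label: value" (an assumption, since the PDF's layout is not shown in the post), one possible follow-up is to split each line at the first ":" so that label and value land in separate columns. split_fields is a hypothetical helper name, not part of pandas or PyPDF2.

```python
def split_fields(lines, sep=":"):
    """Split each non-empty line at the first `sep` into [label, value]."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines left over from the split on "\n"
        label, found, value = line.partition(sep)
        # Lines without the separator keep their text in the first column.
        rows.append([label.strip(), value.strip() if found else ""])
    return rows


# Illustrative input; the real lines would come from paginas_pdf.
sample = ["Name: Ana", "", "Time: 10:30", "no separator here"]
rows = split_fields(sample)
print(rows)
```

The resulting rows could be passed to pd.DataFrame(rows, columns=["label", "value"]) in place of pd.DataFrame(paginas_pdf); the column names are placeholders.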
