Python 3 - CSV to Excel conversion problem with list output

Question

Asked 5 years, 2 months ago

Good afternoon. I am having a problem converting a CSV file to an Excel file with openpyxl. The code is meant to extract the text of a PDF and paste it into a sheet of an existing, pre-formatted Excel workbook.

What I tried:

import PyPDF2
import pandas as pd
from openpyxl import Workbook, load_workbook
import string
import csv
    

pdfFileObj=open(r".\pasta_460\pdf_460.pdf",'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

paginas_pdf = []
for page in pdfReader.pages:
    ddd = page.extractText()
    paginas_pdf.append(ddd)

df = pd.DataFrame(paginas_pdf)
df.to_csv(r".\pasta_460\pdf_em_csv_460.csv",encoding='utf-8')

book = load_workbook(r".\teste_template_planilha.xlsx")
writer = pd.ExcelWriter(r".\teste_template_460_modelada.xlsx", engine='openpyxl')
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
df.to_excel(writer, 'Sheet12')
df.to_excel(writer, 'Sheet12', startrow=1, startcol=1, header=False, index=False)
writer.save()

It worked and generated the workbook with Sheet12 containing the data, but all the data comes out in a single Excel row. I believe this is because the data is stored inside a list (paginas_pdf), but I can't find a solution to this problem.

I would like the data to be written to Sheet12 line by line, that is, with the ":" delimiter breaking up the information so that each piece goes on its own row.

Here is df.head(10) (the file only has 3 rows):

To save to Excel, df.to_excel("output.xlsx") would be enough; you could delete every line from book = onward.

– Paulo Marques

2021/02/03 at 21:35
Even with df.to_excel("output.xlsx"), all the data comes out on a single line. I also tried df.replace('\n', ' '), and it still keeps the single-line structure, as if it were an array. As for the code from book = onward, the intention is to create the sheet inside the same workbook I am using as a template.

– gabriel_santos

2021/02/03 at 21:48
Run df.head() and update the post

– Paulo Marques

2021/02/03 at 22:16
@Paulomarques, see if the images make the problem easier to identify. Thanks for your attention.

– gabriel_santos

2021/02/03 at 22:41
Before the line paginas_pdf.append(ddd), add items = ddd.split("\n"), then replace paginas_pdf.append(ddd) with paginas_pdf += items. That should solve it.

– Paulo Marques

2021/02/03 at 23:02
You’re a genius! Haha, thanks for the help. The solution worked perfectly.

– gabriel_santos

2021/02/04 at 00:28

1 answer

by Paulo Marques • **3,739** points · Answer 1 · 2021-02-04T19:11:56+00:00

Documenting the solution suggested in the comments:

import PyPDF2
import pandas as pd
from openpyxl import Workbook, load_workbook
import string
import csv
    

pdfFileObj=open(r".\pasta_460\pdf_460.pdf",'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

paginas_pdf = []
for page in pdfReader.pages:
    ddd = page.extractText()
    items = ddd.split("\n")   # line added
    paginas_pdf += items      # line added
    # paginas_pdf.append(ddd) <- line removed

df = pd.DataFrame(paginas_pdf)
df.to_csv(r".\pasta_460\pdf_em_csv_460.csv",encoding='utf-8')

book = load_workbook(r".\teste_template_planilha.xlsx")
writer = pd.ExcelWriter(r".\teste_template_460_modelada.xlsx", engine='openpyxl')
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
df.to_excel(writer, 'Sheet12')
df.to_excel(writer, 'Sheet12', startrow=1, startcol=1, header=False, index=False)
writer.save()
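
The fix works because pd.DataFrame(paginas_pdf) creates one row per list element: with append(ddd) the list holds one long string per page, while splitting on "\n" and extending the list gives one element, and so one row, per text line. A minimal sketch of just that step, without PyPDF2 or pandas; the helper name flatten_pages and the sample strings are hypothetical and only stand in for the output of page.extractText():

```python
def flatten_pages(pages):
    """Return one list entry per text line instead of one entry per page."""
    lines = []
    for page_text in pages:
        # split("\n") turns one page string into a list of its lines;
        # += extends the flat list, whereas append() would add the whole
        # page as a single item (and therefore a single DataFrame row).
        lines += page_text.split("\n")
    return lines


# Hypothetical sample text standing in for page.extractText() output.
sample_pages = ["Name: Ana\nCity: Recife", "Total: 42"]

rows = flatten_pages(sample_pages)
print(len(sample_pages))  # 2 (one entry per page)
print(rows)               # ['Name: Ana', 'City: Recife', 'Total: 42']
```

Passing rows instead of sample_pages to pd.DataFrame would then give three rows rather than two.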