Manipulation of Excel in python

Asked

Viewed 4,825 times

1

I would like to update excel every time the program runs, but the library only lets me write a new excel, and not update it. How can I do that?

import requests
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get("https://g1.globo.com/previsao-do-tempo/sp/sao-paulo.ghtml")



soup = BeautifulSoup(page.content, 'html.parser')


days = soup.find(class_ = "forecast-next-days__content")
today = soup.find(class_="columns medium-24 large-12 medium-centered")

nome = [today.find(class_="forecast-header__date").get_text()]

tempmax = today.find(class_="forecast-today__temperature forecast- 
today__temperature--max").get_text()
tempmin = today.find(class_="forecast-today__temperature forecast- 
today__temperature--min").get_text()

tempmax = tempmax[0:4]
tempmin = tempmin[0:4]
temp2 = [tempmax]
temp2 +=[tempmin]




nome += [n.get_text() for n in days.select(".forecast-table .forecast-next- 
days__item-label")]
temp2 += [d.get_text() for d in days.select(".forecast-table .forecast-next-days__item-value")]




temmax = temp2[0:18:2]
temmin = temp2[1:18:2]





tempo = pd.DataFrame({
   "nome": nome,
   "temperatura_max":temmax,
    "temperatura_min":temmin
   })



tempo.to_excel(r"C:\Users\nlsouza\Desktop\python5.xlsx")
  • Your question is about Pandas, could you edit your question and add the tag [tag:pandas]? (and maybe remove the tag request, your question doesn’t seem to be about that).. As for the question, you cannot create a dataframe from the existing worksheet, edit the dataframe and save it again using the same method you used?

  • This question from Soen maybe it’ll help you...

1 answer

5


Unable to "update an Excel file"

In fact, Excel itself only writes a new Excel file - it is impossible to update data on a file of the type used in Excel - xlsx - "in place". First, when talking about file manipulation, it is not possible to insert data "in the middle" of a file, as lines are inserted in a spreadsheet. And second because xlsx is a binary file, zipped - so even if you make an append in the file, it is even possible to generate back the XML file with the contents of your spreadsheet at the end of the ZIP file - only that this content will be duplicated: yours . xlsx would have twice the required size.

What Excel and almost all other programs do (the exception may be some PDF editors, and depending on the editing done), is to write a new file, and then delete the old file and rename the new file to have the same name as the old one.

It may seem like a "waste" and it "slows down", but at the tip of the pencil it’s not what happens - a modern computer can write data on disk at the rate of (at least) tens of megabytes per second - and a 100MB file would already be too big to have any practical use by Excel directly, for example.

If you’re talking about millions of records, which would start to get heavy in this model, you also have to go into storing data in a database, not into a monolithic file like one. xlsx - or even a CSV, which despite being incremental and allowing "Lazy" line processing, does not allow indexes and queries.

What can be done in this case is: read the data from the previous Excel file, and concatenate the new data at the end, then write the file again.

At the end of your show you can do this:


from pathlib import Path  # a pathlib facilita trabalhar com arquivos

import csv

caminho = Path(r"C:\Users\nlsouza\Desktop\python5.xlsx")

if not caminho.exists():  
   # primeira execução - gravar conteúdo todo:
   tempo.to_excel(caminho)
else:
   dados_anteriores = pd.read_excel(caminho)
   novos_dados = pd.concat((dados_anteriores, tempo), axis=0, ignore_index=True)
   novos_dados.to_excel(caminho)



Using CSV files

If you want to add data at the end from your table, each time the program runs, CSV files can help. But no sense for this in the data you extract - by the way you want to just replace the previous data, isn’t it? Hence it is better to rename the Excel file even.

CSV type files can be increased at the end without being rewritten: they are textual files, and have no "closing data": the syntax allows new lines to be added at the end.

The problem is that Pandas does not have support for this - the method .to_csv do Pandas will always write the entire file at once - and delete the previous file.

So you need to either use Python’s "csv" module, or, if it was a more complex dataframe, use Pandas to generate a temporary CSV and copy the data to the original CSV using some other way.

You can do that at the end of your show then:


from pathlib import Path
import csv

caminho = Path(r"C:\Users\nlsouza\Desktop\python5.xlsx")

if not caminho.exists():  
   # primeira execução - gravar cabeçalhos:
   with caminho.open("wt") as arq:
       writer = csv.writer(arq, delimiter=";")
       writer.writerow(tempo.columns)

with caminho.open("at") as arq:
   writer = csv.writer(arq, delimiter=";")
   for index, row in tempo.iterrows():
       writer.writerow(str(value) for value in row.values)

The great disadvantage of using a CSV in relation to an Excel file, is that in CSV all data are treated as strings, and it goes from the program that opens the CSV (either Excel or Pandas itself), "guess" the data type of each cell. If it’s just numbers and text strings, it usually works straight - if you have data like Dates, it can be more trouble. (And, if you really want to concatenate this data, including the date when the program was executed seems like a good idea).

Browser other questions tagged

You are not signed in. Login or sign up in order to post.