Join columns in Python?

1

Hello. I have this file where I need to merge the columns yyyy, mm, dd, hour (year, month, day and hour) into a single column in the format 20180101010000, that is: year + month (two digits) + day (two digits) + hour (two digits, followed by 0000), e.g. 2018 + 01 + 01 + 010000.

But all I got was 2018.01.01.01. I need to get rid of the dots and fix the rest of the format. The code so far:

import pandas as pd

arq_csv = pd.read_csv('arquivo.csv')
csv_date_list = []
for index, rows in arq_csv.iterrows():
    csv_date_list.append(str(rows[' yyyy'])+str(rows[' mm'])+str(rows[' dd'])+str(int(rows[' hour'])))
print(csv_date_list)

P.S.: In this file the column names have a leading space, so I have to write ' yyyy' rather than 'yyyy' inside the quotes, otherwise the column is not found.

Thank you.

3 answers

3

Good afternoon, Helena. You can build the string in whatever format you need with str.format(), a method built into the string type. In this case you can use it as follows:

csv_date_list.append("{}{}{}{}".format(rows[' yyyy'],rows[' mm'],rows[' dd'],rows[' hour']))

This way you can define the format that is most convenient for you.
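To get the zero-padded result asked for in the question, the placeholders can carry format specs. The following is only a minimal sketch: `sample_rows` is hypothetical data standing in for what `iterrows()` would yield, and it assumes the hour may have been read as a float.

```python
# Hypothetical sample rows standing in for what iterrows() yields;
# the leading spaces in the keys mirror the column names in the question.
sample_rows = [
    {" yyyy": 2018, " mm": 1, " dd": 1, " hour": 1.0},
    {" yyyy": 2018, " mm": 12, " dd": 31, " hour": 23.0},
]

csv_date_list = []
for rows in sample_rows:
    # {:04d} and {:02d} zero-pad integers to a fixed width; int() guards
    # against values that were read as floats (e.g. 1.0).
    csv_date_list.append(
        "{:04d}{:02d}{:02d}{:02d}0000".format(
            int(rows[" yyyy"]), int(rows[" mm"]), int(rows[" dd"]), int(rows[" hour"])
        )
    )

print(csv_date_list)
```

The trailing 0000 is written literally in the template, since the question only has data down to the hour.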

  • 1

    This creates the list, but it does not help add the resulting dates back to the dataframe for further analysis. The format method works, but it is rather impractical: from Python 3.6 onward, f-strings are much more convenient for this kind of expression.
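Following up on the comment, here is a rough sketch of the same formatting done with an f-string and stored as a new dataframe column. The sample data and the names `build_stamp` and `stamp` are made up for illustration; only the column names with leading spaces come from the question.

```python
import pandas as pd

# Hypothetical sample data; column names keep the leading space from the question.
df = pd.DataFrame(
    {" yyyy": [2018, 2018], " mm": [1, 12], " dd": [1, 31], " hour": [1.0, 23.0]}
)

def build_stamp(row):
    # The f-string format specs zero-pad each part; int() drops any float suffix.
    return (
        f"{int(row[' yyyy']):04d}{int(row[' mm']):02d}"
        f"{int(row[' dd']):02d}{int(row[' hour']):02d}0000"
    )

# apply(..., axis=1) calls build_stamp once per row and returns a Series
# aligned with df's index, so it can be assigned directly as a column.
df["stamp"] = df.apply(build_stamp, axis=1)
print(df["stamp"].tolist())
```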

1


Rather than simply concatenating the column data into a Python list, which leaves the timestamps separate from your dataframe, it is best to use the apply method. It can call a function row by row on your dataframe and collect the values returned for each row in a Series. This Series shares its index with your original dataframe and can be concatenated as an extra column. (You can then delete the columns holding the separate date elements.)

And while we are at it: a dataframe, unlike a CSV file where everything is text, can hold richer objects such as datetimes. A datetime stores a timestamp with date, hours and minutes. It can be sorted, can take daylight saving time and time zones into account, can be subtracted from other datetime values to find a duration, and so on.

If the applied function returns a datetime object, pandas automatically creates a Series with that content:

import pandas as pd
from datetime import datetime

def processa(linha):
    # Turn the desired columns into a list of integer values:
    valores = [int(val) for val in (linha[" yyyy"], linha[" mm"], linha[" dd"], linha[" hour"], linha[" min"])]
    # Create the datetime object:
    # Python's datetime constructor takes, in order, the values
    # for year, month, day, hours and minutes - the "*" operator
    # unpacks those arguments, which are in a list, into the call:
    return datetime(*valores)

# Read your dataframe:
df = pd.read_csv("B116353.csv")
# Create the series with the dates and times:
timestamps = df.apply(processa, axis=1)
timestamps.name = "timestamps"

# Create a new dataframe with the columns of interest -
# find the index of the " min" column (the remaining columns start right after it):
remainder_start = list(df.columns).index(" min")

new_df = pd.concat(
    (df[["id_argos", " id_wmo"]],
     timestamps,
     df[list(df.columns)[remainder_start + 1:]]
    ),
    axis=1
)

Done. Now you have a "timestamps" column holding datetime objects that combine the numbers from the 5 columns, and you can carry on processing your dataframe.

The second-to-last line of the concat call uses plain Python (that is, no pandas) to select the names of all columns after " min" without having to type them out. These names are passed as a list of strings to index the dataframe, which selects those columns. The concat call then combines the first two columns of the original frame, the timestamps series we created, and all the remaining columns into a new dataframe.
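As a possible alternative to a row-by-row function, pandas can also assemble datetimes from a set of columns in one call, and the resulting column can be rendered back into the compact string from the question. This is only a sketch under assumptions: the sample data is invented, and the `parts` and `stamp` names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame(
    {" yyyy": [2018, 2018], " mm": [1, 12], " dd": [1, 31],
     " hour": [1, 23], " min": [0, 30]}
)

# pd.to_datetime can assemble datetimes from a frame whose columns are
# named year, month, day, hour, minute, so rename the spaced columns first.
parts = df[[" yyyy", " mm", " dd", " hour", " min"]].rename(
    columns={" yyyy": "year", " mm": "month", " dd": "day",
             " hour": "hour", " min": "minute"}
)
df["timestamps"] = pd.to_datetime(parts)

# .dt.strftime renders each datetime as text; this pattern gives
# year, month, day, hour, minute, second with zero padding.
df["stamp"] = df["timestamps"].dt.strftime("%Y%m%d%H%M%S")
print(df[["timestamps", "stamp"]])
```

Keeping the datetime column alongside the string column means sorting and duration arithmetic stay available, while the string is there for export.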

0

I got it working:

csv_date_list.append(str(int(rows[' yyyy']))+str(int(rows[' mm'])).zfill(2)+str(int(rows[' dd'])).zfill(2)+str(int(rows[' hour'])).zfill(2)+'0000')
