Selecting different ranges in a giant data frame in RStudio

I have a very large CSV with dates and closing prices for multiple stocks; it is too large to open in Excel.

The stock name is in the same column as the dates and appears only at the beginning of each series, as shown below:

[image: enter image description here]

I have limited knowledge of R and need a function to help me read these ranges.

NOTE: The stock name is always in parentheses: (AÇÃO X)

  • What are the names of the columns in your csv, @Filipe?

  • In my original (as I said above, the stock name is in the "Date" column): Date ; Price

  • I answered below using generic column names, but you can change them; in your case, c0 would become the Date column: https://answall.com/a/348663/132077

1 answer

One way to do this would be (I don’t know if it’s the most efficient, but it’s possible and it works):

  1. Find where the separators are in your DataFrame, that is, which rows have empty text values in every column, and save their indices in a list

  2. Loop over each index in the list linhasVazias and split your series into sub-series at those indices (each sub-series containing one stock)

  3. Reshape each resulting sub-series into the new format

  4. Save the result into the final DataFrame (df_final), which will then hold the reorganized data

Here’s the code where I do these operations:

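import pandas as pd

# Assumes df is already loaded, with text columns 'c0' and 'c1' and a default 0..n-1 index
# Indices of the separator rows (both columns empty)
linhasVazias = df[(df['c0'] == "") & (df['c1'] == "")].index.tolist()

partes = []
anterior = -1

# len(df) is added as a final boundary so the last series is kept
# even when the data does not end with an empty row
for i in linhasVazias + [len(df)]:
    # Select the rows of the current series (copy avoids SettingWithCopyWarning)
    temp = df.iloc[anterior + 1 : i].copy()
    anterior = i
    if temp.empty:
        continue

    # Create the new column with the stock name (first cell of the series)
    temp['c2'] = temp.iloc[0, 0]

    # Drop the first row, which holds the stock name
    temp = temp.iloc[1:]

    # Keep the rows of this series (DataFrame.append was removed in pandas 2.0)
    partes.append(temp)

# Concatenate all series and reset the index, discarding the old values
df_final = pd.concat(partes, ignore_index=True)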
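To see what the steps above produce, here is a minimal sketch that wraps them in a function and runs it on a small made-up frame. The function name `split_by_empty_rows` and the sample values are my own assumptions for illustration; the layout (name in parentheses, empty row between series) follows the question.

```python
import pandas as pd

def split_by_empty_rows(df):
    """Tag each row with the stock name that heads its series."""
    # Positions of separator rows (assumes a default 0..n-1 index)
    empty_rows = df[(df["c0"] == "") & (df["c1"] == "")].index.tolist()
    parts = []
    previous = -1
    # len(df) closes the last series when no empty row follows it
    for i in empty_rows + [len(df)]:
        block = df.iloc[previous + 1 : i].copy()
        previous = i
        if block.empty:
            continue
        block["c2"] = block.iloc[0, 0]  # stock name is the first cell
        parts.append(block.iloc[1:])    # drop the name row
    return pd.concat(parts, ignore_index=True)

# Made-up sample in the layout described in the question
sample = pd.DataFrame({
    "c0": ["(AÇÃO X)", "2018-01-02", "2018-01-03", "", "(AÇÃO Y)", "2018-01-02"],
    "c1": ["", "10.5", "10.7", "", "", "20.1"],
})
result = split_by_empty_rows(sample)
print(result)
```

Each remaining row keeps its date and price and gains the stock name in c2; the name rows and the empty separator rows are gone.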

Notes:
- here I used "c0", "c1" and "c2" to name the columns
- for your case, with a very large DataFrame, I don't know whether this will be efficient, but it is worth testing
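If the loop turns out to be slow on a very large file, a loop-free variant might help. This is only a sketch of a different technique (mark the name rows, then forward-fill), not the method above, and it relies on the question's note that the stock name is always in parentheses. The function name `tag_stocks` and the sample data are hypothetical.

```python
import pandas as pd

def tag_stocks(df, date_col="c0"):
    # Rows whose first column looks like "(NAME)" start a new series
    is_name = df[date_col].str.match(r"^\(.*\)$", na=False)
    out = df.copy()
    # Keep the name only on its header row, then carry it down to the rows below
    out["c2"] = out[date_col].where(is_name).ffill()
    # Drop the header rows and the empty separator rows
    is_empty = out[date_col].fillna("").str.strip() == ""
    return out[~is_name & ~is_empty].reset_index(drop=True)

# Made-up sample in the layout described in the question
sample = pd.DataFrame({
    "c0": ["(AÇÃO X)", "2018-01-02", "2018-01-03", "", "(AÇÃO Y)", "2018-01-02"],
    "c1": ["", "10.5", "10.7", "", "", "20.1"],
})
tagged = tag_stocks(sample)
print(tagged)
```

Because this works on whole columns at once, it avoids building one small DataFrame per stock; whether it is actually faster on your data is something to measure.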

  • @Filipe, see if it works for you, and if so, please mark my answer as the correct one.
