Filter DataFrame rows in Python

Hello, I have a DataFrame with hundreds of thousands of rows and I want to filter it so that only the desired rows are selected, in this case the most recent ones. Here is an example of what I want:

The column DT_REFER represents the date the document was updated, the column VERSAO is the version of the current document, and DT_FIM_EXERC is the date the row refers to.

[image: example of the original DataFrame]

I want to filter by the most recent DT_REFER date together with the latest VERSAO, which would return this:

[image: expected filtered result]

I can't figure out how to start the query or how to correlate the columns.

Thank you

  • What filter expression are you using? Is the column DT_REFER already a date or is it a string? If it is a string, convert it using pd.to_datetime.

  • They were not in date format; I ended up converting them, thank you.

2 answers


In the example below I created a small DataFrame containing two columns and four rows.

The date 3/10/2020 appears twice, and it is the one that will be filtered.

Creating the DataFrame

>>> import pandas as pd

>>> df = pd.DataFrame({"DT_REFER": ["1/10/2020", "2/10/2020", "3/10/2020", "3/10/2020"], 
                       "OUTRA": ["banana", "laranja", "abacaxi", "goiaba"]})

>>> df
    DT_REFER    OUTRA
0  1/10/2020   banana
1  2/10/2020  laranja
2  3/10/2020  abacaxi
3  3/10/2020   goiaba

Converting the column DT_REFER from string to datetime

>>> df["DT_REFER"] = pd.to_datetime(df['DT_REFER'], format="%d/%m/%Y")

>>> df
    DT_REFER    OUTRA
0 2020-10-01   banana
1 2020-10-02  laranja
2 2020-10-03  abacaxi
3 2020-10-03   goiaba

Filtering by date and assigning the result to a new DataFrame

>>> new_df = df[(df["DT_REFER"] == "2020-10-03")]

>>> new_df
    DT_REFER    OUTRA
2 2020-10-03  abacaxi
3 2020-10-03   goiaba

UPDATE: By the way, to find the most recent date, just take the maximum value (max) of the DT_REFER column:

>>> recent_date = df['DT_REFER'].max()

>>> recent_date
Timestamp('2020-10-03 00:00:00')

UPDATE

Based on your comments, you can:

>>> df
    DT_REFER    OUTRA    VERSAO
0 2020-10-01   banana 2020-10-05
1 2020-10-02  laranja 2020-10-06
2 2020-10-03  abacaxi 2020-10-10
3 2020-10-03   goiaba 2020-10-03

>>> new_df = df[(df["DT_REFER"] == df["DT_REFER"].max()) & (df["VERSAO"] == df["VERSAO"].max())]

>>> new_df
    DT_REFER    OUTRA    VERSAO
2 2020-10-03  abacaxi 2020-10-10

Note that the operator & represents AND, and the operator | (not used in this query) represents OR.
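As a rough illustration, the same two conditions combined with | instead of & keep any row that satisfies at least one of them (reusing the toy DataFrame from the update above):

>>> df[(df["DT_REFER"] == df["DT_REFER"].max()) | (df["VERSAO"] == df["VERSAO"].max())]
    DT_REFER    OUTRA    VERSAO
2 2020-10-03  abacaxi 2020-10-10
3 2020-10-03   goiaba 2020-10-03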

Here is a reference if you want.

I hope it helps

  • Paulo, that is the difficulty: when using df['DT_REFER'].max() it returns the latest dates, but it would have to be a combination between the largest DT_REFER and the largest VERSAO columns.

  • See if the update answers your question.

  • Thank you Paulo, I’ll try it, thank you very much



You can use groupby + max

Data types in the Excel file:

df.info()
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   DT_REFER      6 non-null      datetime64[ns]
 1   VERSAO        6 non-null      int64         
 2   DENON_CIA     6 non-null      object        
 3   DT_FIM_EXERC  6 non-null      datetime64[ns]
 4   CD_CONTA      6 non-null      object        
 5   DS_CONTA      6 non-null      object        
 6   VL_CONTA      6 non-null      float64   

If your column DT_REFER is not already a date, you can do the conversion (in my case the file I used already had this conversion applied):

df['DT_REFER'] = pd.to_datetime(df['DT_REFER'], format="%d/%m/%Y")

DataFrame:

df
    DT_REFER    VERSAO    DENON_CIA           DT_FIM_EXERC    CD_CONTA    DS_CONTA                                    VL_CONTA
0   2011-12-31    2       BCO BRADESCO S.A    2011-12-31      2.03.05     provisões técnicas de seguros e previdência 99112321.0
1   2012-12-31    2       BCO BRADESCO S.A    2011-12-31      2.03.05     provisões técnicas de seguros e previdência 99112321.0
2   2017-12-31    2       BCO BRADESCO S.A    2017-12-31      2.06        Outros passivos                             97816824.0
3   2018-12-31    3       BCO BRADESCO S.A    2017-12-31      2.06        Outros passivos                             97816824.0
4   2017-12-31    2       BCO BRADESCO S.A    2017-12-31      2.06.01     Outros passivos                             97816824.0
5   2018-12-31    3       BCO BRADESCO S.A    2017-12-31      2.06.01     Outros passivos                             97816824.0

Grouping and showing the result:

df.groupby('CD_CONTA').max().reset_index()

Output:

    CD_CONTA    DT_REFER    VERSAO  DENON_CIA           DT_FIM_EXERC    DS_CONTA                                    VL_CONTA
0   2.03.05    2012-12-31      2    BCO BRADESCO S.A    2011-12-31      provisões técnicas de seguros e previdência 99112321.0
1   2.06       2018-12-31      3    BCO BRADESCO S.A    2017-12-31      Outros passivos                             97816824.0
2   2.06.01    2018-12-31      3    BCO BRADESCO S.A    2017-12-31      Outros passivos                             97816824.0

Edit

Grouping and showing the largest version and date:

df.groupby(['DENON_CIA','DT_FIM_EXERC','CD_CONTA','DS_CONTA','VL_CONTA'])[['VERSAO','DT_REFER']].max().reset_index()
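
If you prefer to keep whole rows instead of taking column-wise maxima, one possible alternative (just a sketch, assuming the same column names as the DataFrame above) is to sort by DT_REFER and VERSAO and keep the last occurrence of each account:

# sketch: for each company/fiscal period/account, keep the whole row of the most recent revision
df_recent = (df.sort_values(['DT_REFER', 'VERSAO'])
               .drop_duplicates(subset=['DENON_CIA', 'DT_FIM_EXERC', 'CD_CONTA', 'DS_CONTA'],
                                keep='last')
               .reset_index(drop=True))

Because entire rows are kept, VL_CONTA comes from the most recent revision instead of being aggregated separately.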

  • Actually, according to df.info() the date columns were object; I used parse_dates=[0, 3] when reading the CSV and the dtypes of the columns DT_REFER and DT_FIM_EXERC are now datetime64[ns]. In the query you mentioned you are grouping by the column CD_CONTA, but the query should not be defined by it; it should be defined by the columns DT_REFER and VERSAO, which together indicate the most current rows. What I want is the largest DT_REFER date with the largest VERSAO number, returning those rows. Thanks for the attention.

  • Does grouping by CD_CONTA change the result you expect? Because in the example you provided this solves your problem, since it takes the "maximum" value within the fields.

  • VL_CONTA can have different values, so the columns DT_REFER and VERSAO are the "main" ones in the query; I forgot to mention that in the question.

  • Add an example to your question showing this and the expected result, because the way you put it, VL_CONTA is directly linked to the account itself and could not be different.

  • So, the columns DT_REFER and VERSAO are an update/revision of the columns DENOM_CIA, DT_FIM_EXERC, CD_CONTA, DS_CONTA and VL_CONTA; the column VL_CONTA may or may not have changed. The DataFrame I am using has about 800 thousand rows; if the filter is done through the columns DT_REFER and VERSAO I will have the most current rows.

  • See if the edit in the answer below does what you had in mind. Hugs!

  • Thanks, apparently it worked. I did a test with a smaller DataFrame, thank you very much.

  • That's it, Saul, I'm glad you solved it! If the answer solved your problem, consider marking it as accepted (not required, but it is good practice for future users with the same problem). See how. Hug!

  • Hello friend, I came to ask you a question. I am using the line you recommended, but I am facing a problem. I am using the following line: df01.groupby(['DENOM_CIA','CD_CONTA','DS_CONTA','VL_CONTA','DT_REFER'])[['DT_INI_EXERC','DT_FIM_EXERC']].max().reset_index() - but it returns the full DataFrame. If I remove the column 'VL_CONTA' the returned DataFrame has exactly the logic of what I want, but I need the column 'VL_CONTA' to generate a new DataFrame. Do you know what might be happening?

  • Good morning, what problem are you facing?

  • I am using the following line: df01.groupby(['DENOM_CIA','CD_CONTA','DS_CONTA','VL_CONTA','DT_REFER'])[['DT_INI_EXERC','DT_FIM_EXERC']].max().reset_index() - it returns the full DataFrame. If I remove the column 'VL_CONTA' the returned DataFrame has exactly the logic of what I want, but I need the column 'VL_CONTA' to generate a new DataFrame. Do you know what might be happening?

  • What date is in this variable DT_INI_EXERC? Do the following: upload an xlsx file with the fields you use and include the result you expect, because that makes it easier to help you. Trying to guess what is going on is tricky. Before, you said it had to be grouped by VERSAO and DT_REFER. Upload the file to Google Drive and post the link here.

  • Right, I had said I needed those columns, but I am realizing that the query becomes clearer the way I want it now; I will upload the xlsx.

  • https://docs.google.com/spreadsheets/d/1t7p2y-obUPGqc2unFpUIN7-qUL3C5TLyaie8MeRoRcA/edit?usp=sharing

  • A DT_INI_EXERC.max() on this column alone would also work. The problem I see is with the column 'VL_CONTA': with it in the query it returns the whole DataFrame, whereas without it the query returns what I want.

  • See if this is what you need: df.groupby(['DENOM_CIA','CD_CONTA','DS_CONTA','DT_REFER'])[['DT_INI_EXERC','DT_FIM_EXERC','VL_CONTA']].max().reset_index(). I moved VL_CONTA to where the maximum values are taken.

  • As I understand it, groupby can only group equal values in the key columns, with the differences left to the aggregated columns. For example, the columns 'DENOM_CIA', 'CD_CONTA' and 'DS_CONTA' contain equal values among the rows, whereas the column 'VL_CONTA' contains different values; in the case of the dates, groupby can aggregate them even if they differ. I did a test putting random values in 'CD_CONTA' and groupby did not group them (see the sketch below). Thanks for your time; I am going to look into other grouping/filtering approaches.
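
A small toy demonstration of that behaviour (values invented only for illustration): any column passed to groupby as a key splits the groups whenever its values differ, so a column like VL_CONTA that varies between revisions belongs among the aggregated columns, not among the keys.

import pandas as pd

# toy data: the same account appears twice with different VL_CONTA values
toy = pd.DataFrame({"CD_CONTA": ["2.06", "2.06"],
                    "VL_CONTA": [100.0, 200.0],
                    "DT_REFER": pd.to_datetime(["2017-12-31", "2018-12-31"])})

# VL_CONTA used as a key -> two groups, nothing is collapsed
print(toy.groupby(["CD_CONTA", "VL_CONTA"]).max().reset_index())

# VL_CONTA aggregated instead -> one row per account with the maximum values
print(toy.groupby("CD_CONTA")[["VL_CONTA", "DT_REFER"]].max().reset_index())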

