From that your column timestamp
are dates. If they are not, you can transform them like this: df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d %H:%M:%S')
One solution is to create another Data Frame df_intervalos
, with the columns data_fim, data_inicio, valor
and go through the original dataframe line by line, seeing when the sequence changes and filling the df_intervalos
.
It can be filled so:
# crio o df_intervalos
df_intervalos = pd.DataFrame(data = {'data_inicio':[], 'data_fim':[], 'valor':[]})
data_inicio = df['timestamp'].values[0]
data_fim = df['timestamp'].values[0]
sequencia = df['valor'].values[0]
for index, linha in df[1:].iterrows(): # percorro linha a linha
if linha['valor'] != sequencia: # Se o valor for diferente, encerra a sequencia
df_intervalos = df_intervalos.append({'data_inicio' : data_inicio,
'data_fim' : data_fim,
'valor': sequencia} , ignore_index=True)
data_inicio = linha['timestamp']
sequencia = linha['valor']
data_fim = linha['timestamp']
# Adiciona a ultima sequência
df_intervalos = df_intervalos.append({'data_inicio' : data_inicio,
'data_fim' : data_fim,
'valor': sequencia} , ignore_index=True)
Then we take the time of each sequence by subtracting the start and end dates of each sequence:
df_intervalos['tempo_calculado'] = (df_intervalos['data_fim'] - df_intervalos['data_inicio'])
The result is something like:
data_fim data_inicio valor tempo_calculado
0 2019-04-23 16:15:34.142540 2019-04-23 16:14:34.142540 0.0 00:01:00
1 2019-04-23 16:16:34.142540 2019-04-23 16:16:34.142540 1.0 00:00:00
2 2019-04-23 16:17:34.142540 2019-04-23 16:17:34.142540 0.0 00:00:00
3 2019-04-23 16:19:34.142540 2019-04-23 16:18:34.142540 1.0 00:01:00
4 2019-04-23 16:20:34.142540 2019-04-23 16:20:34.142540 0.0 00:00:00
Now, to know the maximum, minimum and average sequence duration 1
, for example:
>>> df_intervalos[df_intervalos['valor']==1]['tempo_calculado'].max()
Timedelta('0 days 00:01:00')
>>> df_intervalos[df_intervalos['valor']==1]['tempo_calculado'].min()
Timedelta('0 days 00:00:00')
>>> df_intervalos[df_intervalos['valor']==1]['tempo_calculado'].mean()
Timedelta('0 days 00:00:30')
Or even if you want the whole line:
>>> df_intervalos[df_intervalos['valor']==1][df_intervalos['tempo_calculado'] ==df_intervalos['tempo_calculado'].max()]
data_fim data_inicio valor tempo_calculado
3 2019-04-23 16:19:34.142540 2019-04-23 16:18:34.142540 1.0 00:01:00
Man, thank you so much for your commitment to respond with such quality.. your approach worked very well for me, especially in order to generate another dataframe with the values I needed... in this way, through some adjustments and necessary modeling I solved my problem. Once again Thank you !
– Diego Nascimento