How to count how many NaNs at each hour in Python

Viewed 397 times

1

Good day. I have a DataFrame that contains NaN values, and I need to count how many NaNs occur at each hour. My df covers a 31-day period. I believe I should use groupby, but I can't find a way to do it.

import numpy as np
import pandas as pd

df = pd.read_hdf('./dados.hd5')

df.head()

Year_DoY_Hour          Temperatura
2016-01-01 00:00:00    NaN
2016-01-01 01:00:00    22.445700
2016-01-01 02:00:00    22.388300
2016-01-01 03:00:00    22.400000
2016-01-01 04:00:00    NaN
2016-01-01 05:00:00    22.133900
2016-01-01 06:00:00    21.948999
2016-01-01 07:00:00    21.787901
2016-01-01 08:00:00    21.610300
2016-01-01 09:00:00    NaN
.
.
.
2016-31-01 00:00:00    NaN
2016-31-01 01:00:00    21.310800
2016-31-01 02:00:00    22.910700
2016-31-01 03:00:00    23.810600
2016-31-01 04:00:00    NaN

We can see that at hour 00:00:00 we have 2 NaNs, at 04:00:00 we have 2 NaNs, and at 09:00:00 we have 1 NaN.

  • Could you put the file somewhere it can be downloaded for testing? I'm not familiar with hd5, but I think I can help you with groupby.

  • It could be .csv, that's no problem. The problem is applying this to the df.

  • Okay, I'll go with CSV and see if I can help you.

2 answers

4


The answer you already have is correct, but it doesn't use pandas (which is what you were using and asked about). So here's one that does use pandas. :)

Basically, the idea is to do the grouping with labels built from the date (year, month and day) and the hour only, ignoring the rest (minutes and seconds). That way each group will contain all the entries for the same hour, and then you just count the number of nulls (NaN) in it. An important point is that groupby accepts a function (or a lambda, as in the example) that receives the table index. Therefore, it makes sense to first convert the date/time column to DateTime and then turn it into the table index, so you can process that value directly in the function/lambda.

Here is an example of code (documented step-by-step):

import pandas as pd

# Just for reading the test data
df = pd.read_csv('teste.csv', sep=',')

# Convert the date/time column to DateTime format
df['Year_DoY_Hour'] = pd.to_datetime(df['Year_DoY_Hour'], format='%Y-%d-%m %H:%M:%S')

# Turn the date/time column into the table index
df = df.set_index('Year_DoY_Hour')

# Print the original data
print('Original:')
print('-' * 40)
print(df)

# Group the data by index using labels of the form "year-day-month hour:00:00"
grouped = df.groupby(
                     lambda index: '{:04d}-{:02d}-{:02d} {:02d}:00:00' \
                     .format(index.year, index.day, index.month, index.hour)
                    )

# Count the NaNs in each group and print
print('\n')
print('NaN count per hour:')
print('-' * 40)
for label, group in grouped:
    # Count the number of NaNs (nulls) in the `Temperatura` column of the current group
    numNaNs = len(group[group['Temperatura'].isnull()])
    # Print the group's count (with its label)
    print('{}: {}'.format(label, numNaNs))

Using as input a CSV file containing:

Year_DoY_Hour,Temperatura
2016-01-01 00:00:00,NaN
2016-01-01 00:01:00,NaN
2016-01-01 00:18:22,22.388300
2016-01-01 00:55:00,NaN
2016-01-01 01:00:00,22.445700
2016-01-01 02:00:00,22.388300
2016-01-01 03:00:00,22.400000
2016-01-01 04:00:00,NaN
2016-01-01 05:00:00,22.133900
2016-01-01 06:00:00,21.948999
2016-01-01 07:00:00,21.787901
2016-01-01 08:00:00,21.610300
2016-01-01 09:00:00,NaN
2016-01-01 09:07:18,NaN
2016-01-01 09:44:00,21.310800
2016-31-01 02:00:00,22.910700
2016-31-01 03:00:00,23.810600
2016-31-01 04:00:00,NaN

The result is:

Original:
----------------------------------------
                     Temperatura
Year_DoY_Hour
2016-01-01 00:00:00          NaN
2016-01-01 00:01:00          NaN
2016-01-01 00:18:22    22.388300
2016-01-01 00:55:00          NaN
2016-01-01 01:00:00    22.445700
2016-01-01 02:00:00    22.388300
2016-01-01 03:00:00    22.400000
2016-01-01 04:00:00          NaN
2016-01-01 05:00:00    22.133900
2016-01-01 06:00:00    21.948999
2016-01-01 07:00:00    21.787901
2016-01-01 08:00:00    21.610300
2016-01-01 09:00:00          NaN
2016-01-01 09:07:18          NaN
2016-01-01 09:44:00    21.310800
2016-01-31 02:00:00    22.910700
2016-01-31 03:00:00    23.810600
2016-01-31 04:00:00          NaN


NaN count per hour:
----------------------------------------
2016-01-01 00:00:00: 3
2016-01-01 01:00:00: 0
2016-01-01 02:00:00: 0
2016-01-01 03:00:00: 0
2016-01-01 04:00:00: 1
2016-01-01 05:00:00: 0
2016-01-01 06:00:00: 0
2016-01-01 07:00:00: 0
2016-01-01 08:00:00: 0
2016-01-01 09:00:00: 2
2016-31-01 02:00:00: 0
2016-31-01 03:00:00: 0
2016-31-01 04:00:00: 1
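For reference, the same per-hour count can also be obtained more compactly by flooring the DatetimeIndex to the hour and summing the null flags. This is a sketch, not the answer's code, assuming a pandas version that has `DatetimeIndex.floor`, and using a small inline sample instead of the `teste.csv` file:

```python
import io

import pandas as pd

# Small inline sample in the question's year-day-month format (stand-in for teste.csv)
csv = io.StringIO(
    "Year_DoY_Hour,Temperatura\n"
    "2016-01-01 00:00:00,NaN\n"
    "2016-01-01 00:30:00,NaN\n"
    "2016-01-01 01:00:00,22.4457\n"
    "2016-31-01 04:00:00,NaN\n"
)
df = pd.read_csv(csv)
df['Year_DoY_Hour'] = pd.to_datetime(df['Year_DoY_Hour'], format='%Y-%d-%m %H:%M:%S')
df = df.set_index('Year_DoY_Hour')

# Floor each timestamp to the hour and sum the True/False null flags per group
nans_per_hour = df['Temperatura'].isnull().groupby(df.index.floor('h')).sum()
print(nans_per_hour)
```

Note that this groups by the actual hourly timestamp rather than by a string label, which also keeps the result sortable as dates.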
  • 2

It's true, Luiz, it really would be more appropriate with pandas, but I don't have your experience with this module

  • 1

@Miguel No problem. Your solution is also nice and serves as a reference/alternative. :)

  • 1

A question, Luiz: what defines that the grouping is only by hour/min/sec, and not by year/day/month as well?

  • 1

@Miguel In the lambda function used to do the grouping. It receives the index of each row of the table and returns a single label indicating the group. Since the index is a DateTime, it builds the label as '{:04d}-{:02d}-{:02d} {:02d}:00:00'.format(index.year, index.day, index.month, index.hour) (note the ":00:00" at the end, which ignores minutes and seconds - it wasn't strictly necessary, but it also makes for a nicer final label). So not only is each hour considered individually, but different days/months/years are as well. :)

  • 2

Yes, thank you Luiz. I tested this later and realized the grouping was not done by hour only. Thank you
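To make the label discussion above concrete, here is a tiny sketch (with a made-up timestamp) showing the label the grouping lambda produces, minutes and seconds fixed to ":00:00":

```python
import pandas as pd

# Made-up timestamp in the question's year-day-month layout (31 Jan 2016, 02:17:45)
ts = pd.to_datetime('2016-31-01 02:17:45', format='%Y-%d-%m %H:%M:%S')

# Same expression the grouping lambda uses on each index entry
label = '{:04d}-{:02d}-{:02d} {:02d}:00:00'.format(ts.year, ts.day, ts.month, ts.hour)
print(label)  # 2016-31-01 02:00:00
```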

2

I'm not sure what the hd5 format really looks like (I researched but couldn't figure it out). If it's like what you posted, then instead of .split(',') as in the examples below, do .split('    ') (4 spaces). The CSV format I used for testing is:

2016-01-01 00:00:00,    NaN
2016-01-01 01:00:00,    22.445700
2016-01-01 02:00:00,    22.388300
2016-01-01 03:00:00,    22.400000
2016-01-01 04:00:00,    NaN
2016-01-01 05:00:00,    22.133900
2016-01-01 06:00:00,    21.948999
2016-01-01 07:00:00,    21.787901
...

With groupby you can do so:

from itertools import groupby

with open('tests.csv', 'r') as f:
    dados = [(l.split(',')[0], l.split(',')[1].strip()) for l in f]
print(dados) # [('2016-01-01 00:00:00', 'NaN'), ('2016-01-01 01:00:00', '22.445700'), ('2016-01-01 02:00:00', '22.388300'), ('2016-01-01 03:00:00', '22.400000'), ...]
dados_sort = sorted((k.split()[1], v) for k, v in dados) # important: groupby needs sorted input
for hora, group in groupby(dados_sort, key=lambda x: x[0]):
    # count only the NaN entries, not every row in the hour's group
    num_nans = sum(1 for k, v in group if v == 'NaN')
    if num_nans:
        print('There are {} NaN at hour {}'.format(num_nans, hora))

Program output for the data you gave:

There are 2 NaN at hour 00:00:00
There are 2 NaN at hour 04:00:00
There are 1 NaN at hour 09:00:00
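One note on the sorted(...) call marked as important above: itertools.groupby only merges *consecutive* equal keys, so without sorting, the same hour can show up as several separate groups. A minimal sketch with made-up data:

```python
from itertools import groupby

# Hypothetical (hour, value) pairs where the same hour is not contiguous
pairs = [('04:00:00', 'NaN'), ('00:00:00', 'NaN'), ('04:00:00', 'NaN')]

# Unsorted: groupby starts a new group every time the key changes
unsorted_groups = [k for k, _ in groupby(pairs, key=lambda x: x[0])]
print(unsorted_groups)  # ['04:00:00', '00:00:00', '04:00:00'] -> 3 groups

# Sorted first: equal hours become adjacent, so each hour is one group
sorted_groups = [k for k, _ in groupby(sorted(pairs), key=lambda x: x[0])]
print(sorted_groups)    # ['00:00:00', '04:00:00'] -> 2 groups
```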



But honestly, I wouldn't do it that way in this case (unless I really had to); I would do this instead:

from collections import Counter

dados = {}
with open('tests.csv', 'r') as f:
    for l in f:
        hora, val = l.split(',') # time and temperature; in your case you should already have this split per line
        dados.setdefault(val.strip(), []).append(hora.split(' ')[1])
print(dados) # {'22.388300': ['02:00:00'], '23.810600': ['03:00:00'], '21.610300': ['08:00:00'], '22.400000': ['03:00:00'], '21.948999': ['06:00:00'], 'NaN': ['00:00:00', '04:00:00', '09:00:00', '00:00:00', '04:00:00'], '22.910700': ['02:00:00'], '22.445700': ['01:00:00'], '21.787901': ['07:00:00'], '22.133900': ['05:00:00'], '21.310800': ['01:00:00']}
print(Counter(dados['NaN']))

{'00:00:00': 2, '04:00:00': 2, '09:00:00': 1}

Or, if you don't need to store all the values, you can just do:

from collections import Counter

list_NaN = []
with open('tests.csv', 'r') as f:
    for l in f:
        hora, val = l.split(',')
        if val.strip() == 'NaN':
            list_NaN.append(hora.split(' ')[1])
print(Counter(list_NaN))

{'00:00:00': 2, '04:00:00': 2, '09:00:00': 1}
