Filler

Good morning. I have a data frame with air temperature (Tair), global radiation (Rg) and CO2, but the CO2 column contains NaN values. I need to find other rows that are similar and use their CO2 data to fill the NaN entries.

import numpy as np
import pandas as pd

df = pd.read_hdf('./dados.hd5')

df.head()

Year_DoY_Hour          Tair        Rg       CO2
2016-01-01 00:00:00    22.651600   0.000    NaN
2016-01-01 00:30:00    22.445700   0.000    6.43
2016-01-01 01:00:00    22.388300   0.000    5.03
2016-01-01 01:30:00    22.400000   0.000    3.05
2016-01-01 02:00:00    22.257099   0.000    NaN
2016-01-01 02:30:00    22.133900   0.000    2.50
2016-01-01 03:00:00    21.948999   0.000    1.58
2016-01-01 03:30:00    21.787901   0.000    0.89
2016-01-01 04:00:00    21.610300   0.000    1.58
2016-01-01 04:30:00    21.619400   0.000    NaN
  • Which rows would count as "similar"? Rows whose air temperature and global radiation have close values?

  • I have a large df, so I need to find the row that most closely resembles each incomplete one and use its CO2 value to replace the NaN.

  • And what notion of similarity between rows do you have in mind?

  • Where there is no CO2 data, look for a row with similar Tair and Rg, so that the missing CO2 is filled with the value from that similar row.

  • But in your file on disk (.hd5), are the values literally the text "Nan"? Or is it just that Pandas could not read some of the values because of a different format?

  • What is the size of the data series? You can do it with a sort whose key function orders the rows by distance to a given value, but you would have to sort once for each value you want to recover. Another way is to use Tair as the index of a binary tree structure.

  • Yes, the values are the text "Nan".

  • My data series covers 1 year, with one record every half hour.

1 answer

From the description of your problem, it seems you are facing a prediction problem: more precisely, filling in the missing values of a data set using the information it already contains. This is a common, well-known problem in the data science literature, and the usual suggestion is to treat it as an ordinary classification or regression problem in which the target variables are the ones with missing values that you want to complete. Since CO2 is a continuous quantity, regression is the relevant case here.
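To make the regression idea concrete, here is a minimal sketch. It assumes the column names from the question (Tair, Rg, CO2) and uses scikit-learn's LinearRegression as just one possible model. The helper name fill_co2_by_regression is hypothetical, and the small frame only copies the first rows shown in the question. A linear model may well be too simple for real CO2 data, so treat this as an illustration of the workflow rather than a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Small illustrative frame; values copied from the df.head() output in the question.
df = pd.DataFrame(
    {
        "Tair": [22.6516, 22.4457, 22.3883, 22.4000, 22.257099, 22.1339],
        "Rg": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        "CO2": [np.nan, 6.43, 5.03, 3.05, np.nan, 2.50],
    }
)


def fill_co2_by_regression(frame):
    """Return a copy of frame with missing CO2 predicted from Tair and Rg.

    Hypothetical helper: rows with known CO2 are the training set,
    rows with NaN CO2 are the ones to predict.
    """
    features = ["Tair", "Rg"]
    missing = frame["CO2"].isna()
    filled = frame.copy()
    if missing.any() and (~missing).any():
        model = LinearRegression()
        model.fit(frame.loc[~missing, features], frame.loc[~missing, "CO2"])
        filled.loc[missing, "CO2"] = model.predict(frame.loc[missing, features])
    return filled


filled = fill_co2_by_regression(df)
print(filled)
```

Any other regressor with fit and predict methods could be swapped in for LinearRegression without changing the rest of the function.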

There are other ways recommended in the literature for handling missing values, for example the summary techniques here. However, since you have already decided to try to predict the missing values by similarity, this link gives an easy example that uses the machine learning library Scikit-Learn: it fills in the missing values and then evaluates a Linear Discriminant Analysis (LDA) model on the completed data. I transcribe the specific part of the code below:

from pandas import read_csv
import numpy
from sklearn.impute import SimpleImputer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.nan)
# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]
# fill missing values with mean column values
# (SimpleImputer replaces the Imputer class, which was removed from scikit-learn)
imputer = SimpleImputer(strategy='mean')
transformed_X = imputer.fit_transform(X)
# evaluate an LDA model on the dataset using k-fold cross validation
model = LinearDiscriminantAnalysis()
# random_state only has an effect when shuffle=True
kfold = KFold(n_splits=3, shuffle=True, random_state=7)
result = cross_val_score(model, transformed_X, y, cv=kfold, scoring='accuracy')
print(result.mean())
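
The code above fills the gaps with the column mean. For the similarity-based fill described in the question, a nearest-neighbour imputer is one option. The following is a minimal sketch, assuming the column names Tair, Rg and CO2 from the question and that Tair and Rg themselves have no gaps. The helper name fill_co2_by_nearest_rows and the sample values are made up for illustration. It uses scikit-learn's KNNImputer, with Tair and Rg standardized first so that neither dominates the distance.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Made-up sample values, only to illustrate the idea.
sample = pd.DataFrame(
    {
        "Tair": [20.0, 20.1, 25.0, 25.2, 30.0],
        "Rg": [0.0, 0.0, 300.0, 310.0, 600.0],
        "CO2": [1.0, np.nan, 5.0, np.nan, 9.0],
    }
)


def fill_co2_by_nearest_rows(frame, n_neighbors=1):
    """Return a copy of frame with missing CO2 taken from the most similar rows.

    Hypothetical helper: similarity is the Euclidean distance on standardized
    Tair and Rg; with n_neighbors > 1 the neighbours' CO2 values are averaged.
    """
    features = ["Tair", "Rg"]
    work = frame[features + ["CO2"]].copy()
    # Put Tair and Rg on a comparable scale; CO2 is left in its original units.
    work[features] = StandardScaler().fit_transform(work[features])
    imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(work)
    filled = frame.copy()
    filled["CO2"] = imputed[:, -1]  # last column of the imputed array is CO2
    return filled


filled = fill_co2_by_nearest_rows(sample)
print(filled)
```

With n_neighbors=1 each missing CO2 is copied from the single closest row; a larger value smooths the result by averaging several similar rows.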
