From the description of your problem, it seems to me that you are facing a prediction problem: filling in the missing values of a data set using the information it already contains. This is a common, well-known problem in the data science literature. The usual suggestion is to treat it as an ordinary classification or regression problem in which the target variables are the variables whose missing values you want to complete.
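As a minimal sketch of that idea, assuming a DataFrame with the Tair, Rg and CO2 columns from the question: the rows where CO2 is known train a model, which then predicts CO2 for the rows where it is missing. The values below are invented for illustration, and linear regression is only one possible choice of model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data; only the column names come from the question.
df = pd.DataFrame({
    "Tair": [20.0, 21.0, 22.0, 23.0, 24.0, 25.0],
    "Rg":   [100.0, 180.0, 150.0, 260.0, 240.0, 330.0],
    "CO2":  [350.0, 360.0, np.nan, 372.0, np.nan, 383.0],
})

features = ["Tair", "Rg"]
known = df["CO2"].notna()  # rows where the target is available

# Train only on the rows that have a CO2 value.
model = LinearRegression()
model.fit(df.loc[known, features], df.loc[known, "CO2"])

# Predict CO2 for the rows where it is missing and fill them in.
filled = df.copy()
filled.loc[~known, "CO2"] = model.predict(df.loc[~known, features])
print(filled)
```

On real data it would be worth holding out some rows with known CO2 to check how well the chosen model recovers them before trusting the filled values.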
The literature also recommends other ways of handling missing values, for example the summary techniques described here. However, since you have already decided to predict the missing values by similarity, this link gives a simple example that uses a Linear Discriminant Analysis (LDA) model on a data set with missing values, built with the scikit-learn machine learning library. I reproduce the relevant part of the code below:
from pandas import read_csv
import numpy
from sklearn.impute import SimpleImputer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# load the data (the CSV file has no header row)
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.nan)
# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]
# fill missing values with mean column values
# (SimpleImputer replaces the Imputer class removed in scikit-learn 0.22)
imputer = SimpleImputer(strategy='mean')
transformed_X = imputer.fit_transform(X)
# evaluate an LDA model on the dataset using k-fold cross validation
model = LinearDiscriminantAnalysis()
# shuffle=True is required when random_state is set
kfold = KFold(n_splits=3, shuffle=True, random_state=7)
result = cross_val_score(model, transformed_X, y, cv=kfold, scoring='accuracy')
print(result.mean())
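If the goal is specifically to fill CO2 from the most similar rows, a nearest-neighbour imputer is another option. The following is only a sketch under assumptions: the data are invented, the Tair, Rg and CO2 column names are taken from the question, and a single neighbour (n_neighbors=1) is an arbitrary choice. The columns are standardized first so that Rg, which has the larger numeric range, does not dominate the distance.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; only the column names come from the question.
df = pd.DataFrame({
    "Tair": [10.0, 10.5, 20.0, 20.5, 30.0, 30.5],
    "Rg":   [100.0, 110.0, 400.0, 410.0, 800.0, 790.0],
    "CO2":  [380.0, np.nan, 400.0, np.nan, 420.0, np.nan],
})

# StandardScaler ignores NaN when fitting and keeps it when transforming.
scaler = StandardScaler()
scaled = scaler.fit_transform(df)

# For each missing CO2, copy the value from the closest row (by Tair and Rg)
# that has CO2 available.
imputer = KNNImputer(n_neighbors=1)
imputed = imputer.fit_transform(scaled)

# Undo the scaling to get the values back in their original units.
filled = pd.DataFrame(
    scaler.inverse_transform(imputed), columns=df.columns, index=df.index
)
print(filled)
```

With a larger n_neighbors the imputer averages the CO2 of several similar rows, which usually gives smoother estimates than copying a single neighbour.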
What would these "similar" rows be? Rows whose temperature and global radiation have close values?
– Woss
I have a large df, so I need to find the row that most resembles the one where CO2 is NaN and use it to replace that value.
– Lucas Fagundes
And what notion of similarity between rows do you want?
– Woss
Where there is no CO2 data, look for rows with similar Tair and Rg values, so that the CO2 in that row is filled with the value from a similar row.
– Lucas Fagundes
But in your file on disk (.hd5), are the values literally the text "Nan"? Or did Pandas simply fail to read some of the values because of a different format?
– jsbueno
What is the size of the data series? You can do it with sort, using a key function that sorts by the distance to a given value, but you would have to sort once for each number you want to recover. Another way is to use Tair as the index in a binary tree structure.
– jsbueno
Yes, the text values are "Nan".
– Lucas Fagundes
The size of my data series is 1 year with data every half hour.
– Lucas Fagundes