2
Good afternoon everyone, I am doing an exploratory analysis of data in Python using the classic DF kc_house_data, and I came across the following problem.
I would like to make some comparisons to see whether or not there is a statistical difference between two sets of data:
d1 = data['price_m2_living'].loc[data['floors'] % 1 != 0]
d2 = data['price_m2_living'].loc[data['floors'] % 1 == 0]
First of all I had to check the normality of the sets, for that I made a histogram:
I found that the histogram does not resemble a normal curve, so I used skew and kstest in both df:
print(skew(d1),skew(d2))
0.7013599441290586 1.3217600229412982
print(kstest(d1, 'norm'), kstest(d2,'norm'))
Kstestresult(statistic=1.0, Pvalue=0.0) Kstestresult(statistic=1.0, Pvalue=0.0)
That is, these two dfs cannot be considered normal, so I would have two options; 1- normalize the curves, 2- check if there is difference between the data using some test for 'non-normal' data (in this case, Wilcoxon)
For this I tried using scypy.stats.lognormal, but I’m having difficulty, I don’t know if this is the right way to normalize the curve, I’ve tried some variations, but never at the end my histogram seems to have normalized. And even if my skew(x) shows a lower value than the D1 and D2 initial my value p of the Kolmogorov-Smirnov test is always 0.
df1 = lognorm.pdf(d1, d1.std())
df2 = lognorm.pdf(d1, d1.std())
I thought this would be the way to normalize these curves, but it’s not working. While the Wilcoxon test, as far as I can tell, only applies to curves with the same number n of values.
If anyone can help me, I would be very grateful, I’m already a few days stuck in it.
By the type of asymmetry you can log the data. numpy has log function
– lmonferrari
See if that helping...
– Paulo Marques