I'm working with this dataset and applied Random Forest to build a price-forecasting model, but the model's score is so high that I suspect something is wrong. The train and test sets are apparently different, so the score should not be this high. Is there a mistake somewhere?
Output of print(score2) and print(accu2):
0.9981901132115226
[0.99086244 0.99562853 0.99551529 0.9988478 0.99997931]
# Random forest
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, ExtraTreesRegressor, GradientBoostingRegressor, BaggingRegressor
from sklearn.model_selection import train_test_split, cross_val_score
rf = RandomForestRegressor()
# dataset without date or id
df2 = df.drop(['date', 'id'], axis=1)
# drop price from df2 and keep the remaining features in x
x = df2.drop(['price'], axis=1)
# y holds only the price
y = df2['price']
# split x and y in a single call so features and targets stay aligned
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
print(x_train.count())
print(x_test.count())
print(x_train.head(2))
print(x_test.head(2))
rf.fit(x_train,y_train)
score2 = rf.score(x_test,y_test)
accu2 = cross_val_score(rf,x_train,y_train,cv=5)
print("____ Random Forest Regressor____\n")
print(score2)
print(accu2)
The same thing happens when I apply Gradient Boosting Regressor:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import explained_variance_score
# dataset without date or id
df2 = df.drop(['date', 'id'], axis=1)
# drop price from df2 and keep the remaining features in x
x = df2.drop(['price'], axis=1)
# y holds only the price
y = df2['price']
# sklearn.cross_validation was removed in scikit-learn 0.20; use train_test_split from model_selection
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
gb = GradientBoostingRegressor(n_estimators=1000)
gb.fit(x_train,y_train)
score4 = gb.score(x_test,y_test)
pred = gb.predict(x_test)
exp_est = explained_variance_score(y_test, pred)  # argument order is (y_true, y_pred)
print("exp_est: ")
print(exp_est)
accu4 = cross_val_score(gb, x_train, y_train, cv=5)  # must be defined for print(accu4) below
print("____ Gradient Boosting Regressor____\n")
print(score4)
print(accu4)
0.998862149174232
[0.99741288 0.9989814 0.99979751 0.99906217 0.9999443 ]
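For context, `score` on a scikit-learn regressor returns the coefficient of determination R², not a classification accuracy, so the values above measure how much of the price variance the model explains. The sketch below shows that calculation in plain Python. The function name `r2_score_manual` and the sample prices are made up for illustration and do not come from the dataset.

```python
# Minimal sketch of the R^2 value that a regressor's score() reports.
# The numbers below are hypothetical, not taken from the dataset in the question.

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical prices and predictions
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

r2 = r2_score_manual(y_true, y_pred)
print(r2)
```

A value near 1 means the predictions track the target almost perfectly, which is why scores around 0.998 on held-out data deserve a second look.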
Are you sure you didn't make any changes to the dataset? I created a notebook on Kaggle with your code, and the accuracy comes out well below what you are posting.
– Begnini
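One way to follow up on the comment's question is to check whether any feature column is nearly a copy of `price`, which would explain a score close to 1. The sketch below is only an illustration under assumptions: the column names (`rooms`, `floors`, `price_scaled`), the values, and the 0.99 threshold are all hypothetical, and it uses plain Python so it runs without the dataset. With the real DataFrame, something like `df2.corr()['price']` should give the same kind of information.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)

def suspicious_features(columns, target, threshold=0.99):
    """Names of columns whose absolute correlation with the target reaches the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]

# Hypothetical data: 'price_scaled' is a rescaled copy of the target,
# the kind of column that can slip in when a dataset is modified.
price = [100.0, 250.0, 180.0, 320.0, 90.0]
columns = {
    "rooms": [2.0, 4.0, 3.0, 4.0, 2.0],
    "floors": [1.0, 2.0, 1.0, 1.0, 2.0],
    "price_scaled": [p * 2.0 for p in price],
}

flagged = suspicious_features(columns, price)
print(flagged)
```

A high correlation alone does not prove leakage, but a column flagged this way is worth inspecting before trusting the reported score.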