Random Forest with very high accuracy


I’m working with this dataset and applied Random Forest to build a price forecast model, but the model’s accuracy is coming out too high, so I suspect something is wrong. The train and test sets appear to be different, so it shouldn’t give such high accuracy... is there a mistake somewhere?

print(score2) and print(accu2):

0.9981901132115226

[0.99086244 0.99562853 0.99551529 0.9988478 0.99997931]

#Random forest
from sklearn.ensemble import RandomForestRegressor,AdaBoostRegressor,ExtraTreesRegressor,GradientBoostingRegressor,BaggingRegressor
from sklearn.model_selection import train_test_split, cross_val_score
rf = RandomForestRegressor()

#dataset without date or id
df2 = df.drop(['date', 'id'], axis=1)

#remove price from df2 and put the remaining features in x
x = df2.drop(['price'], axis=1)
#put only price in y
y = df2['price']

x_train, x_test = train_test_split(x,test_size=0.2, random_state=42)
y_train, y_test = train_test_split(y, test_size=0.2, random_state=42)

print(x_train.count())
print(x_test.count())

print(x_train.head(2))
print(x_test.head(2))


rf.fit(x_train,y_train)

score2 = rf.score(x_test,y_test)
accu2 = cross_val_score(rf,x_train,y_train,cv=5)

print("____ Random Forest Regressor____\n")
print(score2)
print(accu2)
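As a side note on the split above: the two separate `train_test_split` calls only stay aligned because both use `random_state=42`; the usual idiom splits `x` and `y` in one call, which keeps the rows paired regardless of the seed. A minimal sketch with a synthetic frame standing in for the housing data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing features and price target.
rng = np.random.default_rng(0)
x = pd.DataFrame({"sqft": rng.integers(500, 5000, 100),
                  "bedrooms": rng.integers(1, 6, 100)})
y = pd.Series(rng.uniform(1e5, 1e6, 100), name="price")

# One call splits features and target together, so the rows stay paired.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

# The indices of x_train and y_train match row for row.
assert (x_train.index == y_train.index).all()
print(len(x_train), len(x_test))
```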

The same thing happens when I apply Gradient Boosting Regressor:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import explained_variance_score


#dataset without date or id
df2 = df.drop(['date', 'id'], axis=1)

#remove price from df2 and put the remaining features in x
x = df2.drop(['price'], axis=1)
#put only price in y
y = df2['price']

#the old cross_validation module was removed from sklearn; use model_selection's train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)


gb = GradientBoostingRegressor(n_estimators=1000)
gb.fit(x_train,y_train)

score4 = gb.score(x_test,y_test)
pred = gb.predict(x_test)
exp_est = explained_variance_score(y_test, pred)  #y_true comes first, then the predictions

print("exp_est: ") 
print(exp_est)

accu4 = cross_val_score(gb,x_train,y_train,cv=5)  #must run, otherwise print(accu4) raises NameError
print("____ Gradient Boosting Regressor____\n")
print(score4)
print(accu4)

0.998862149174232

[0.99741288 0.9989814 0.99979751 0.99906217 0.9999443 ]
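One detail worth flagging in the snippet above: `explained_variance_score` expects `(y_true, y_pred)`, so passing the predictions first silently computes a different quantity. A small illustration with made-up values (not from the housing data) showing the score is not symmetric in its arguments:

```python
from sklearn.metrics import explained_variance_score

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 8.0]   # deliberately imperfect predictions

correct = explained_variance_score(y_true, y_pred)   # y_true first
swapped = explained_variance_score(y_pred, y_true)   # argument order reversed

# The two calls generally disagree because the denominator is the
# variance of whichever argument comes first.
print(round(correct, 4), round(swapped, 4))
```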

  • Are you sure you didn’t make any changes to the dataset? I created a notebook on Kaggle with your code and the accuracy comes out well below what you are posting.

1 answer



Leila, strangely I couldn’t reproduce the same accuracy and score values as you, with the same code and dataset (downloaded from Kaggle).

For this dataset I got the following score and cross-validation values:

0.8388415464783893

[0.85563895 0.86273709 0.8589165 0.87741256 0.85294125]

Check whether you aren’t forcing overfitting by reusing artifacts from training runs performed earlier in the same code... (I am assuming you have not changed the Kaggle dataset, but that is a point worth checking.)
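On the point about reusing earlier training state: `cross_val_score` clones the estimator internally, but when in doubt an explicit `clone` makes a fresh, unfitted copy and rules this out. A minimal sketch on synthetic data (not the Kaggle set):

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))
y = x @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(random_state=0).fit(x, y)

# clone() copies the hyperparameters but discards the fitted state,
# so fresh_rf carries nothing over from the earlier fit.
fresh_rf = clone(rf)
print(hasattr(rf, "estimators_"), hasattr(fresh_rf, "estimators_"))
```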

Below is the full code I used (I took the liberty of removing the unused references):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
rf = RandomForestRegressor()
import pandas as pd

df = pd.read_csv('kc_house_data.csv')

#dataset without date or id
df2 = df.drop(['date', 'id'], axis=1)

#remove price from df2 and put the remaining features in x
x = df2.drop(['price'], axis=1)
#put only price in y
y = df2['price']

#one call splits x and y together, keeping the rows aligned
#(with the same random_state the two separate calls happen to align too)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

print(x_train.count())
print(x_test.count())

print(x_train.head(2))
print(x_test.head(2))

rf.fit(x_train,y_train)

score2 = rf.score(x_test,y_test)
accu2 = cross_val_score(rf,x_train,y_train,cv=5)

print("____ Random Forest Regressor____\n")
print(score2)
print(accu2)
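A quick sanity check for a suspiciously high R² is to refit on a shuffled target: if the score stays high with `y` permuted, something in the features is leaking the answer. A hedged sketch on synthetic data (not the Kaggle set; thresholds are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
x = rng.normal(size=(300, 5))
y = x[:, 0] * 4.0 + rng.normal(scale=0.2, size=300)  # genuine signal in column 0

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

real = RandomForestRegressor(random_state=0).fit(x_train, y_train)
shuffled = RandomForestRegressor(random_state=0).fit(
    x_train, rng.permutation(y_train))

# A genuine signal scores well; a shuffled target should score near
# (or below) zero on held-out data.
print(real.score(x_test, y_test) > 0.5)
print(shuffled.score(x_test, y_test) < 0.2)
```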
