I am working with the house-price dataset from the book "Hands-On Machine Learning with Scikit-Learn and TensorFlow":
https://github.com/ageron/handson-ml
Objective: build a model that predicts house prices.
I got the histogram below:
It shows that house prices are "imbalanced": some price ranges have far more houses than others.
Resampling with two classes (binary classification) is relatively easy. But how do I do the same in the problem above, where the class is not binary? Each house value is a class...
Source code (it is a Jupyter notebook):
#!/usr/bin/env python
# coding: utf-8
import sys  # to check the Python path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
get_ipython().run_line_magic('matplotlib', 'inline')
housing = pd.read_csv('/Dados/Estudo_ML/handson-ml-master/datasets/housing/housing.csv',dtype={"srcip":object ,})
housing.head(20)
# # Exploratory data analysis and preprocessing
#
# I want to predict "median_house_value"
# Named feature_columns rather than vars, which would shadow the Python built-in
feature_columns = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms',
                   'population', 'households', 'median_income']
sns.pairplot(housing)
housing['median_house_value']  # the target (the "class")!
housing
housing.isna().sum()
# We can see that the missing data is exclusively in the 'total_bedrooms' column
housing.isnull().sum()
housing['total_bedrooms']
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(housing.isnull(), yticklabels=False, cbar=False, cmap='viridis', ax=ax)  # missing data: concentration by column
housing['median_house_value'].hist()  # imbalanced classes? YES
sns.boxplot(x='median_house_value', y='total_bedrooms', data=housing)
# Creating dummy variables!
housing.columns
housing[['ocean_proximity']]  # ocean_proximity is a categorical variable
pd.get_dummies(housing['ocean_proximity']) #Dummy
# # Feature scaling
# (important for algorithms that compute distances such as the Euclidean distance, e.g. KNN)
import sklearn
from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler()
type(housing['ocean_proximity'])
scaler.fit_transform(housing[['median_income']])
# # Normalizing
min_max_scaler = MinMaxScaler()
min_max_scaler.fit_transform(housing[['population']]) #normalize the columns of this dataframe where each value is between 0 and 1
# # Balancing (resampling) and scaling the dataset:
## Balance the training set to even out the number of house samples in each price category
housing[housing['median_house_value']>=100000].count()
housing[housing['median_house_value']>=500000].count()
housing['median_house_value'].hist()  # imbalanced classes!
# # Chosen approach: down-sample the majority class
from sklearn.utils import resample
# Separate majority and minority classes
df_majority_down = housing[housing.median_house_value<=300000]
df_minority_down = housing[housing.median_house_value>=400000]
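One possible way to extend the two-group split above to more than two price ranges is to bin the continuous target into buckets and down-sample every bucket to the size of the smallest one. This is only a minimal sketch on synthetic data: the `demo` frame, the `price_bucket` column and the bucket edges are assumptions made for illustration and are not part of the notebook above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for housing['median_house_value'] (assumption: not the real data)
rng = np.random.default_rng(0)
prices = np.clip(rng.lognormal(mean=12.1, sigma=0.5, size=5000), 15000, 500001)
demo = pd.DataFrame({"median_house_value": prices})

# Hypothetical bucket edges; in practice they would be chosen from the real histogram
bin_edges = [0, 100000, 200000, 300000, 400000, np.inf]
demo["price_bucket"] = pd.cut(demo["median_house_value"], bins=bin_edges, labels=False)

# Down-sample every bucket to the size of the smallest bucket
smallest = demo["price_bucket"].value_counts().min()
balanced = demo.groupby("price_bucket").sample(n=smallest, random_state=42)

print(demo["price_bucket"].value_counts().sort_index())
print(balanced["price_bucket"].value_counts().sort_index())
```

With the real data, `demo` would be replaced by `housing`. Whether balancing like this helps at all depends on whether the model is a regressor or a classifier over price buckets.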
Are you building an algorithm to predict the price? Does that mean it is a regression algorithm (it outputs a number that is the predicted price) or a classification algorithm (it outputs a number that identifies a price-range bucket)? If it is the price, why do you need to balance? You don't have classes, just final values, right?
– Victor Capone
@Victor Capone: That is exactly where my doubt starts... Does it make sense to balance in this case? The thing is, there are price ranges that have few houses...
– Ed S