How to balance classes in a machine learning regression problem with Python?

Viewed 1,020 times

2

I am working with the dataset from the book "Hands-On Machine Learning with Scikit-Learn and TensorFlow":

https://github.com/ageron/handson-ml

It is a dataset of house prices. Objective: build a model that predicts house prices.

I got the histogram below:

[image: histogram of house prices]

You can see that the house prices are "unbalanced".

Resampling with 2 classes is relatively easy. But how do I do the same in the problem above, where the class is not binary? Each house value is a class...

Source code (a Jupyter notebook):

#!/usr/bin/env python
# coding: utf-8

import sys  # to inspect the Python path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


get_ipython().run_line_magic('matplotlib', 'inline')

housing = pd.read_csv('/Dados/Estudo_ML/handson-ml-master/datasets/housing/housing.csv')

housing.head(20)

# # Exploratory data analysis and preprocessing
# 

# I want to predict "median_house_value"


feature_cols = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms',
                'population', 'households', 'median_income']

sns.pairplot(housing)

housing['median_house_value']  # the target (the "class")!

housing

housing.isna().sum()
# We can see that the missing data are exclusively in the 'total_bedrooms' column


housing.isnull().sum()

housing['total_bedrooms']


fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(housing.isnull(), yticklabels=False, cbar=False, cmap='viridis', ax=ax)  # missing data: concentration per column

housing['median_house_value'].hist()  # unbalanced classes? YES


sns.boxplot(x='median_house_value',y='total_bedrooms',data =housing )


# Creating dummy variables

housing.columns


housing[['ocean_proximity']]  # ocean_proximity is a categorical variable

pd.get_dummies(housing['ocean_proximity']) #Dummy


# # Feature scaling
# (important for algorithms that compute distances, such as the Euclidean distance (KNN...))


import sklearn
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler = StandardScaler()

type(housing['ocean_proximity'])

scaler.fit_transform(housing[['median_income']])



# # Normalizing

min_max_scaler = MinMaxScaler()

min_max_scaler.fit_transform(housing[['population']]) #normalize the columns of this dataframe where each value is between 0 and 1


# # Balancing (resampling) and scaling of the dataset:

# Balance the training data so that each price category has a similar number of house samples

housing[housing['median_house_value']>=100000].count()


housing[housing['median_house_value']>=500000].count()

housing['median_house_value'].hist()  # unbalanced classes!

# # Chosen approach: down-sample the majority class

from sklearn.utils import resample

# Separate majority and minority classes

df_majority_down = housing[housing.median_house_value<=300000]
df_minority_down = housing[housing.median_house_value>=400000]

  • Are you building an algorithm to predict the price? Is it a regression algorithm (it outputs a number, the predicted price) or a classification algorithm (it outputs a label for a price-range bucket)? If it is the price, why do you need to balance? You don't have classes, just continuous target values, right?

  • @Victor Capone: That is exactly where my doubt starts... Does it make sense to balance in this case? The thing is, some price ranges have very few houses...

2 answers

3

I don't have extensive experience with the subject, but I believe that if this is a regression problem, there is no need for class balancing: as noted in the comments on the question, the target consists of continuous values, not classes as such.

Maybe you can use one of the regression algorithms available in Scikit-Learn, such as Support Vector Regression (SVR) or nearest-neighbors regression.
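As a rough illustration of that suggestion, here is a minimal sketch that fits both kinds of regressor. The data is synthetic (an assumption, since the housing CSV path is specific to the asker's machine), and the hyperparameters `C=10.0` and `n_neighbors=5` are arbitrary choices, not tuned values.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the housing features and target (assumption).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 2))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Both models depend on distances, so the features are scaled first.
models = {
    "svr": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)),
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns the R^2 coefficient on the held-out data.
    scores[name] = model.score(X_test, y_test)

print(scores)
```

The output of each model is a continuous number, so no class labels are involved at any point.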

  • But couldn't I create "price ranges" and treat them as classes in order to do the balancing?
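The idea in this comment could be sketched roughly as follows: bin the continuous target into price ranges, then down-sample every range to the size of the smallest one. The synthetic prices, the bin edges, and the column name `price_bin` are all assumptions made for illustration; whether this kind of balancing actually helps a regression model is a separate question that the sketch does not settle.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed prices standing in for median_house_value (assumption).
rng = np.random.default_rng(0)
prices = np.clip(rng.lognormal(mean=12.0, sigma=0.5, size=5000), 15000, 500001)
df = pd.DataFrame({"median_house_value": prices})

# Hypothetical bin edges; labels=False yields integer bin indices 0..4.
bin_edges = [0, 100000, 200000, 300000, 400000, np.inf]
df["price_bin"] = pd.cut(df["median_house_value"], bins=bin_edges, labels=False)

counts_before = df["price_bin"].value_counts().sort_index()

# Down-sample every price range to the size of the rarest one.
n_per_bin = int(counts_before.min())
balanced = df.groupby("price_bin").sample(n=n_per_bin, random_state=0)

counts_after = balanced["price_bin"].value_counts().sort_index()
print(counts_before)
print(counts_after)
```

Down-sampling discards data from the common price ranges, so in practice this would be applied only to the training split, never to the test split.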

-3

A class is a label for a set of items that share distinguishing characteristics, such as fruits, colors, or brands. Classes can and should be balanced to help avoid underfitting or overfitting. Something like price is not a class, so it cannot be balanced; it can, however, be normalized so that the numbers are well behaved, and outliers can be removed.
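A minimal sketch of what this answer describes, removing outliers and then normalizing a numeric column, might look like the following. The data is synthetic, and the 1.5 × IQR rule is only one common convention for flagging outliers, assumed here because the answer does not name a method.

```python
import numpy as np
import pandas as pd

# Synthetic prices with a few injected extreme values (assumption).
rng = np.random.default_rng(1)
values = pd.Series(rng.normal(200000, 50000, size=1000), name="median_house_value")
values.iloc[:5] = 5_000_000

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
mask = values.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
filtered = values[mask]

# Standardize the remaining values to zero mean and unit variance.
standardized = (filtered - filtered.mean()) / filtered.std(ddof=0)

print(len(values), len(filtered))
```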
