Use drop or iloc in Machine Learning modeling in Pandas?

Question

Asked 4 years, 8 months ago

Viewed 56 times

-1

I’m learning Machine Learning for Data Science through Pandas. I made a few algorithms and performed the division of my predictive variables and class as follows:

dados = pd.read_csv(...)
(...)
previsores = dados.drop('income', axis=1)
classe = dados['income']

However, I started noticing some people using iloc, for example:

dados = pd.read_csv(...)
(...)
previsores = dados.iloc[:,0:14].values
classe = dados.iloc[:,14].values

I assumed both were correct, but then I noticed that the first approach gives a Pandas DataFrame and the second a numpy.array.

Could someone tell me whether the way I was doing it before was wrong, and what the implications are of doing it one way or the other?

1

André, good morning! It's not a matter of right or wrong. Some algorithms require a NumPy array, while others accept a DataFrame directly. In your example, dados['income'].values also gives you a NumPy array, regardless of whether you use loc or iloc. With iloc or slicing you usually don't need to worry about the column names.

– lmonferrari

2020/11/12 at 14:35
Got it, thank you very much. In my case I have to use OneHotEncoder, so I need arrays!

– André

2020/11/12 at 18:14

1 answer


by Flavio Moraes • **351** points · Answer 1 · 2020-11-12T17:42:41+00:00

What happens is this: iloc takes two arguments, rows and columns: iloc[linhas,colunas]. The rows can be a single row or several (the same goes for the columns). If it is just one, you simply give its position; if it is more than one, you can use a list or a range, which in Python is written with :.

How this range works: começo:fim:passo (start:stop:step) represents a range that starts at começo and ends before fim, advancing by passo each time. For example, 1:10:2 uses the numbers from 1 up to but not including 10, in steps of 2, i.e. 1,3,5,7,9. When you do not set the step, Python assumes it is 1, so 0:5 is 0,1,2,3,4. The step can also be negative: 4:1:-1 is 4,3,2. When you do not set the start, Python assumes it is 0 (for a positive step), so 0:14 is the same thing as :14. Finally, if you do not set the end, Python takes the range through to the last element.
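The range rules above can be checked on an ordinary Python list, which uses the same start:stop:step notation. This is only a small sketch; the list `numbers` and the variable names are made up for illustration.

```python
# A plain list of 0..19 is enough to see how start:stop:step behaves.
numbers = list(range(20))

odd_below_ten = numbers[1:10:2]   # start at 1, stop before 10, step of 2
first_five = numbers[0:5]         # step omitted, so it defaults to 1
descending = numbers[4:1:-1]      # negative step walks backwards, stops before 1
same_start = numbers[:14] == numbers[0:14]  # omitted start defaults to 0
everything = numbers[:]           # omitted start and stop take the whole list

print(odd_below_ten)  # [1, 3, 5, 7, 9]
print(first_five)     # [0, 1, 2, 3, 4]
print(descending)     # [4, 3, 2]
print(same_start)     # True
```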

In your code, two ranges are used in iloc. The first case is dados.iloc[:,0:14]. For the rows the range is :, that is, neither the start, nor the end, nor the step is defined, which means pandas will use all rows, from row 0 to the last, inclusive. For the columns, 0:14 means all columns from column 0 up to but not including column 14 (from 0 to 13). So you have multiple rows and 14 columns, that is, a 2-dimensional matrix. In the second case you have iloc[:,14], which again means all rows, but only column 14, so you have only one dimension.
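To see the two shapes concretely, here is a minimal sketch. The DataFrame below is a made-up stand-in for dados (5 rows, 14 invented feature columns plus income), since the real CSV is not shown in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `dados`: 14 predictor columns followed by 'income'.
columns = [f"feature_{i}" for i in range(14)] + ["income"]
dados = pd.DataFrame(np.arange(5 * 15).reshape(5, 15), columns=columns)

previsores = dados.iloc[:, 0:14].values  # all rows, columns 0 to 13
classe = dados.iloc[:, 14].values        # all rows, column 14 only

print(previsores.shape)  # (5, 14) -> two dimensions
print(classe.shape)      # (5,)    -> one dimension
```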

In pandas, dados.iloc[:,14] is a Series (a single column, one dimension) rather than a DataFrame, and .values asks pandas to return the underlying values as a NumPy array: a 1-dimensional array for a Series, and a 2-dimensional array for a DataFrame such as dados.iloc[:,0:14].
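One way to check the types involved, and to confirm that the drop approach from the question and the iloc approach hold the same data, is sketched below. Again, the DataFrame is a hypothetical stand-in for dados, assuming income is the last of 15 columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `dados`: 14 predictor columns followed by 'income'.
columns = [f"feature_{i}" for i in range(14)] + ["income"]
dados = pd.DataFrame(np.arange(5 * 15).reshape(5, 15), columns=columns)

by_position = dados.iloc[:, 14]   # single integer position
by_name = dados["income"]         # same column, selected by name

print(type(by_position).__name__)         # Series
print(type(by_position.values).__name__)  # ndarray

# Passing a list of positions keeps the result two-dimensional.
print(type(dados.iloc[:, [14]]).__name__)  # DataFrame

# Both ways of splitting the data hold the same values.
same_class = np.array_equal(by_position.values, by_name.values)
same_predictors = np.array_equal(
    dados.iloc[:, 0:14].values,
    dados.drop("income", axis=1).values,
)
print(same_class, same_predictors)  # True True
```

The practical difference is that drop and column names keep working if the column order changes, while iloc depends on income staying at position 14.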