Use drop or iloc in Machine Learning modeling in Pandas?

I’m learning Machine Learning for Data Science with Pandas. I built a few models and split my predictor variables and class as follows:

dados = pd.read_csv(...)
(...)
previsores = dados.drop('income', axis=1)
classe = dados['income']

However, I started to notice some people using iloc, for example:

dados = pd.read_csv(...)
(...)
previsores = dados.iloc[:,0:14].values
classe = dados.iloc[:,14].values

I assumed both were correct, but I noticed that the first approach produces pandas objects (a DataFrame and a Series), while the second produces numpy arrays.

Could someone tell me whether the way I was doing it before was wrong, and what the implications are of doing it one way or the other?

  • André, good morning! It’s not a matter of right or wrong. Some algorithms require a numpy array; others accept a DataFrame directly. In your example, dados['income'].values also gives you a numpy array, regardless of whether you use loc or iloc. With iloc or slicing you usually don’t need to worry about the column names.

  • Got it, thank you very much. In my case I have to use OneHotEncoder, so I need arrays!

1 answer

Here is what happens: iloc takes two arguments, rows and columns: iloc[rows, columns]. The rows can be a single row or more than one (the same goes for the columns). If it is just one, you simply give its number; if it is more than one, you can use a list or a slice, which in Python is written with :.

How a slice works: start:stop:step represents a range that begins at start and ends before stop, advancing by step each time. For example, 1:10:2 uses the numbers from 1 up to 10, not including 10, in steps of 2, i.e. 1, 3, 5, 7, 9. When you omit the step, Python assumes it is 1, so 0:5 gives 0, 1, 2, 3, 4. The step can also be negative: 4:1:-1 gives 4, 3, 2. When you omit the start, Python assumes 0 (for a positive step), so 0:14 is the same as :14. Finally, if you omit the stop, Python runs to the end of the sequence.
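As a quick check of the slice rules above, here is a minimal sketch using a plain Python list (the variable names are only for illustration and are not part of the original code):

```python
# Slices follow start:stop:step; the stop value is never included.
numbers = list(range(10))  # [0, 1, ..., 9]

odd_numbers = numbers[1:10:2]   # start at 1, stop before 10, step 2
first_five = numbers[0:5]       # step defaults to 1
same_first_five = numbers[:5]   # start defaults to 0
descending = numbers[4:1:-1]    # negative step walks backwards
from_seven = numbers[7:]        # no stop: runs to the end of the list

print(odd_numbers)   # [1, 3, 5, 7, 9]
print(first_five)    # [0, 1, 2, 3, 4]
print(descending)    # [4, 3, 2]
print(from_seven)    # [7, 8, 9]
```

The same start:stop:step notation is what iloc accepts for its row and column positions.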

Your code uses two slices in iloc. The first is dados.iloc[:,0:14]. For the rows the slice is :, meaning that neither start, stop, nor step is set, so pandas uses all rows, from row 0 to the last one, inclusive. For the columns, 0:14 selects columns 0 through 13 (column 14 is not included). So you get multiple rows and 14 columns, that is, a 2-dimensional matrix. In the second case you have iloc[:,14], which again means all rows, but only column 14, so the result has a single dimension.
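To see the resulting shapes, here is a small sketch. It assumes a made-up DataFrame with 5 rows and 15 columns standing in for the real file (the col_0 ... col_13 names and the numeric contents are hypothetical; only 'income' in position 14 mirrors the question):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset: 5 rows, 14 predictor
# columns plus an 'income' column in position 14.
columns = [f"col_{i}" for i in range(14)] + ["income"]
dados = pd.DataFrame(np.arange(5 * 15).reshape(5, 15), columns=columns)

previsores = dados.iloc[:, 0:14].values  # all rows, columns 0..13
classe = dados.iloc[:, 14].values        # all rows, column 14 only

print(previsores.shape)  # (5, 14)
print(classe.shape)      # (5,)
```

The predictors come out two-dimensional (rows by columns), while the class comes out as a flat, one-dimensional array.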

For pandas, dados.iloc[:,14] is a Series (a single column, not a DataFrame), and .values asks pandas to return the underlying values. Since this selection has a single dimension, the result is a one-dimensional numpy array.
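The difference in types between the two approaches from the question can be checked directly. The sketch below assumes a tiny made-up DataFrame (the 'age' and 'hours' columns and all values are hypothetical, and 'income' is encoded as 0/1 just to keep the example numeric):

```python
import numpy as np
import pandas as pd

# Hypothetical data: two predictor columns plus 'income' as the last column.
dados = pd.DataFrame({
    "age": [25, 38, 52],
    "hours": [40, 50, 35],
    "income": [0, 1, 1],  # encoded as 0/1 for this sketch
})

# Approach 1: select by column name, results stay pandas objects.
previsores_df = dados.drop("income", axis=1)
classe_series = dados["income"]

# Approach 2: select by position and take .values, results are numpy arrays.
previsores_arr = dados.iloc[:, 0:2].values
classe_arr = dados.iloc[:, 2].values

print(type(previsores_df).__name__)       # DataFrame
print(type(classe_series).__name__)       # Series
print(type(dados.iloc[:, 2]).__name__)    # Series (integer position)
print(type(dados.iloc[:, [2]]).__name__)  # DataFrame (list of positions)
print(type(previsores_arr).__name__)      # ndarray
print(type(classe_arr).__name__)          # ndarray

# Both routes carry the same values; only the container differs.
print(np.array_equal(previsores_df.values, previsores_arr))  # True
print(np.array_equal(classe_series.values, classe_arr))      # True
```

So neither approach is wrong: calling .values on the result of drop gives the same array as the iloc version, and the choice mostly depends on what the next step in the pipeline expects.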

  • Very good explanation. I already understood how the slices work, but your comment was still very illuminating. My question was more about the difference between the types and how that affects the code. As answered above, in some situations it is better to keep a DataFrame (copying it and just dropping the unneeded columns), and in others an array is better. In my case I am using OneHotEncoder, which requires arrays, so my previous method would be 'wrong'. Thanks for the explanation, I’ll even save the comment, haha.

  • Thank you. If you found it helpful, please mark it as useful.

  • I upvoted it!! Thanks a lot

  • To mark it as useful, you need to click once on the up arrow to the left of the question.
