Manipulation of columns with pandas

I’m running a regression where I have 3 parameters and a column with categories.

Since sklearn does not handle categorical columns directly, I turn them into dummies (one column per category, filled with 1 if the row belongs to that category and 0 otherwise):

from sklearn import preprocessing
myEncoder = preprocessing.OneHotEncoder()
myEncoder.fit(df_c_f[['segment_id']])
dummies = myEncoder.transform(df_c_f[['segment_id']]).toarray()

So my matrix, which initially had n rows and 4 columns, now has 3 columns plus c category columns.

My question is how to combine each of my first 3 columns with all the dummies, so that I end up with n rows and 3 * c columns.

I ran the following code to do this, but it only works for small matrices. With anything even slightly larger, the code hangs:

import numpy as np

matrix = []
def itera_parametros_e_dummies(matrix1,matrix2):
    print(len(matrix1))
    if len(matrix1) != len(matrix2):
        print("matrices of different sizes")
    else:
        for i in range(len(matrix1)):
            matrix.append(np.dot(matrix1[i:i+1],(matrix2[i:i+1]))[0])
    return(matrix)

itera_parametros_e_dummies(log_orgc_traf,df_dummies)
  • I didn’t quite understand what you want to do. What should the structure of the final DataFrame look like?

1 answer

The first point is about creating the dummies. Whenever you create dummies for a regression, you should drop one of the columns: for n categories there should be n-1 dummy columns. Keeping all n columns leads to what is called the Dummy Variable Trap.
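As a minimal sketch of the n-1 rule, here is one way to drop a dummy column using pandas' get_dummies with drop_first=True. The toy data and the names df_example and dummies_example are made up for illustration; only the column name segment_id comes from the question.

```python
import pandas as pd

# Hypothetical toy data; only the column name segment_id is taken from the question.
df_example = pd.DataFrame({"segment_id": ["a", "b", "c", "a"]})

# drop_first=True drops the first category ("a"), leaving n-1 = 2 dummy columns.
dummies_example = pd.get_dummies(df_example["segment_id"], drop_first=True)

print(dummies_example.shape)
print(list(dummies_example.columns))
```

The dropped category becomes the baseline that the remaining dummy coefficients are measured against.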

OneHotEncoder always produces an output with the same number of rows as the dataset it is given. Instead of calling myEncoder.fit(df_c_f[['segment_id']]) and then transform, you can use dummies = myEncoder.fit_transform(df_c_f[['segment_id']]), followed by .toarray() if you need a dense array as in your code. It saves a line.
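A small sketch of the single-call version, assuming scikit-learn is installed. The DataFrame df_toy and its values are hypothetical stand-ins for df_c_f from the question.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for df_c_f; only the column name comes from the question.
df_toy = pd.DataFrame({"segment_id": [10, 20, 10, 30]})

encoder = OneHotEncoder()

# fit_transform learns the categories and encodes in one call.
# The default output is a sparse matrix, so .toarray() makes it dense.
dense = encoder.fit_transform(df_toy[["segment_id"]]).toarray()

# One row per input row, one column per distinct category.
print(dense.shape)
```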

Also, I did not really understand the reason for the multiplication or what result you expect from it.
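If the intent is to multiply each of the 3 parameter columns by each of the c dummy columns (interaction terms), one possible reading can be sketched with NumPy broadcasting instead of a Python loop. This is an assumption about the goal, not a confirmed interpretation. The function name interact_with_dummies and the toy arrays are hypothetical.

```python
import numpy as np

def interact_with_dummies(params, dummies):
    """Return an (n, p*c) array whose column j*c + k is params[:, j] * dummies[:, k]."""
    params = np.asarray(params, dtype=float)
    dummies = np.asarray(dummies, dtype=float)
    if params.shape[0] != dummies.shape[0]:
        raise ValueError("matrices of different sizes")
    # (n, p, 1) * (n, 1, c) broadcasts to (n, p, c): a row-wise outer product.
    product = params[:, :, None] * dummies[:, None, :]
    # Flatten the last two axes to get n rows and p*c columns.
    return product.reshape(params.shape[0], -1)

# Hypothetical toy input: n = 2 rows, p = 3 parameters, c = 2 categories.
toy_params = np.array([[1, 2, 3], [4, 5, 6]])
toy_dummies = np.array([[1, 0], [0, 1]])
toy_result = interact_with_dummies(toy_params, toy_dummies)
print(toy_result.shape)
```

Because the whole computation is one vectorized operation, it avoids appending to a list row by row, which is usually what makes the loop version slow on larger inputs.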
