Problems with incorrect data in a dataset (using Pandas)

I have a dataset called Auto.csv, which has the form:

mpg cylinders   displacement    horsepower  weight  acceleration    year    origin  name
18  8   307 130 3504    12  70  1   chevrolet chevelle malibu
15  8   350 165 3693    11.5    70  1   buick skylark 320
18  8   318 150 3436    11  70  1   plymouth satellite
16  8   304 150 3433    12  70  1   amc rebel sst
17  8   302 140 3449    10.5    70  1   ford torino
15  8   429 198 4341    10  70  1   ford galaxie 500
14  8   454 220 4354    9   70  1   chevrolet impala
14  8   440 215 4312    8.5 70  1   plymouth fury iii
14  8   455 225 4425    10  70  1   pontiac catalina
?   8   390 190 3850    8.5 70  1   amc ambassador dpl
15  8   383 170 3563    10  70  1   dodge challenger se
14  8   340 160 3609    8   70  1   plymouth 'cuda 340
15  8   400 150 3761    9.5 70  1   chevrolet monte carlo
14  8   455 225 3086    10  70  1   buick estate wagon (sw)
24  4   113 95  2372    15  70  3   toyota corona mark ii
22  6   198 95  2833    15.5    70  1   plymouth duster
18  6   199 97  2774    15.5    70  1   amc hornet
21  6   200 85  2587    16  70  1   ford maverick

Imagine that the dataset has 1 million rows and that, by mistake, some row has a stray character in place of a number. For example, the column mpg, which should contain only numbers, has a '?' somewhere.

import pandas as pd
import numpy as np

carros = pd.read_csv('Auto.csv',sep =',')
print(carros['mpg'])

How can I test whether a column (mpg, for example) contains only numbers and, if it does not, return a new Pandas DataFrame with the offending rows removed?

Dataset source: http://www-bcf.usc.edu/~Gareth/ISL/data.html

1 answer

A safe way to test whether a value is numeric (for the data in the question) is to define a function that attempts the conversion to int and checks whether an error was raised:

def e_numero(x):
    try:
        # try to convert to an integer
        int(x)
        # return True if the conversion succeeds...
        return True
    except ValueError:
        # ... or False if it does not
        return False
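Note that int() rejects decimal strings: int('11.5') raises ValueError, so a column holding values such as '11.5' would have those rows flagged as invalid. A minimal sketch of a more tolerant check, assuming decimal values should count as numbers; the name is_finite_number is hypothetical and not part of the original answer:

```python
import math


def is_finite_number(x):
    """Return True if x can be read as a finite float (hypothetical helper)."""
    try:
        # float() accepts both '18' and '11.5'; isfinite() rejects NaN and infinity
        return math.isfinite(float(x))
    except (TypeError, ValueError):
        # TypeError covers None; ValueError covers strings such as '?'
        return False
```

It can be passed to apply in the same way as e_numero.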

After defining the function, list the rows that have an invalid value in the column:

dt[dt['mpg'].apply(lambda x: not(e_numero(x)))]

And to generate the new DataFrame without the invalid rows:

dt_novo = dt[dt['mpg'].apply(lambda x: e_numero(x))]
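For a large file, calling a Python function on every row with apply can be slow. As a possible alternative, not part of the original answer, pd.to_numeric with errors='coerce' converts the whole column at once and turns anything unparseable into NaN. The small frame below is made-up sample data, and the names mpg_num, invalid and dt_clean are illustrative:

```python
import pandas as pd

# Made-up sample: the '?' forces the column to be stored as strings
dt = pd.DataFrame({"mpg": ["18", "15", "?", "14"],
                   "cylinders": [8, 8, 8, 8]})

# Values that cannot be parsed as numbers become NaN
mpg_num = pd.to_numeric(dt["mpg"], errors="coerce")

# Rows whose mpg could not be parsed
invalid = dt[mpg_num.isna()]

# New DataFrame with a numeric mpg column and the unparseable rows dropped
dt_clean = dt.assign(mpg=mpg_num).dropna(subset=["mpg"])
```

One caveat: cells that were already missing also end up as NaN, so this treats missing and invalid values alike.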

Here is an example session:

# Data
dt
Out[54]:
   mpg  cylinders  displacement  horsepower  weight  acceleration  year  origin                       name
0   18          8           307         130    3504          12.0    70       1  chevrolet chevelle malibu
1   15          8           350         165    3693          11.5    70       1          buick skylark 320
2   18          8           318         150    3436          11.0    70       1         plymouth satellite
3   16          8           304         150    3433          12.0    70       1              amc rebel sst
4   17          8           302         140    3449          10.5    70       1                ford torino
5   15          8           429         198    4341          10.0    70       1           ford galaxie 500
6   14          8           454         220    4354           9.0    70       1           chevrolet impala
7   14          8           440         215    4312           8.5    70       1          plymouth fury iii
8   14          8           455         225    4425          10.0    70       1           pontiac catalina
9    ?          8           390         190    3850           8.5    70       1         amc ambassador dpl
10  15          8           383         170    3563          10.0    70       1        dodge challenger se
11  14          8           340         160    3609           8.0    70       1         plymouth 'cuda 340
12  15          8           400         150    3761           9.5    70       1      chevrolet monte carlo
13  14          8           455         225    3086          10.0    70       1    buick estate wagon (sw)
14  24          4           113          95    2372          15.0    70       3      toyota corona mark ii
15  22          6           198          95    2833          15.5    70       1            plymouth duster
16  18          6           199          97    2774          15.5    70       1                 amc hornet
17  21          6           200          85    2587          16.0    70       1              ford maverick

# HERE => List the rows with non-numeric values in the mpg column
dt[dt['mpg'].apply(lambda x: not(e_numero(x)))]
Out[55]:
  mpg  cylinders  displacement  horsepower  weight  acceleration  year  origin                name
9   ?          8           390         190    3850           8.5    70       1  amc ambassador dpl

# HERE => Create the DataFrame dt_novo without the rows with invalid values
dt_novo = dt[dt['mpg'].apply(lambda x: e_numero(x))]

dt_novo
Out[57]:
   mpg  cylinders  displacement  horsepower  weight  acceleration  year  origin                       name
0   18          8           307         130    3504          12.0    70       1  chevrolet chevelle malibu
1   15          8           350         165    3693          11.5    70       1          buick skylark 320
2   18          8           318         150    3436          11.0    70       1         plymouth satellite
3   16          8           304         150    3433          12.0    70       1              amc rebel sst
4   17          8           302         140    3449          10.5    70       1                ford torino
5   15          8           429         198    4341          10.0    70       1           ford galaxie 500
6   14          8           454         220    4354           9.0    70       1           chevrolet impala
7   14          8           440         215    4312           8.5    70       1          plymouth fury iii
8   14          8           455         225    4425          10.0    70       1           pontiac catalina
10  15          8           383         170    3563          10.0    70       1        dodge challenger se
11  14          8           340         160    3609           8.0    70       1         plymouth 'cuda 340
12  15          8           400         150    3761           9.5    70       1      chevrolet monte carlo
13  14          8           455         225    3086          10.0    70       1    buick estate wagon (sw)
14  24          4           113          95    2372          15.0    70       3      toyota corona mark ii
15  22          6           198          95    2833          15.5    70       1            plymouth duster
16  18          6           199          97    2774          15.5    70       1                 amc hornet
17  21          6           200          85    2587          16.0    70       1              ford maverick
  • I tested the code carros[carros['mpg'].apply(lambda x: not(e_numero(x)))] but it did not return anything!

  • Did you test with the data from the question? Are you sure the data has an invalid row in the 'mpg' column?
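If the filter returns no rows, one quick check is the dtype pandas assigned to the column when reading the file: a column parsed as numeric cannot contain a '?'. The sketch below uses two made-up in-memory CSV strings in place of Auto.csv. It also shows the na_values parameter of read_csv, which treats '?' as missing at read time, under the assumption that '?' is the only invalid marker in the file:

```python
import io

import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up CSV snippets standing in for the real file
clean_csv = "mpg,cylinders\n18,8\n15,8\n"
dirty_csv = "mpg,cylinders\n18,8\n?,8\n"

clean = pd.read_csv(io.StringIO(clean_csv))
dirty = pd.read_csv(io.StringIO(dirty_csv))

# A numeric dtype means every value in the column was parsed as a number
clean_is_numeric = is_numeric_dtype(clean["mpg"])
dirty_is_numeric = is_numeric_dtype(dirty["mpg"])

# Treat '?' as missing while reading, then drop the rows where mpg is missing
reread = pd.read_csv(io.StringIO(dirty_csv), na_values="?")
cleaned = reread.dropna(subset=["mpg"])
```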
