Problems with incorrect data in a dataset (using Pandas)

Question

Problems with incorrect data in a dataset (using Pandas)

Asked 7 years, 2 months ago

Viewed 95 times

2

I have a dataset called Auto.csv, which has the form:

mpg cylinders   displacement    horsepower  weight  acceleration    year    origin  name
18  8   307 130 3504    12  70  1   chevrolet chevelle malibu
15  8   350 165 3693    11.5    70  1   buick skylark 320
18  8   318 150 3436    11  70  1   plymouth satellite
16  8   304 150 3433    12  70  1   amc rebel sst
17  8   302 140 3449    10.5    70  1   ford torino
15  8   429 198 4341    10  70  1   ford galaxie 500
14  8   454 220 4354    9   70  1   chevrolet impala
14  8   440 215 4312    8.5 70  1   plymouth fury iii
14  8   455 225 4425    10  70  1   pontiac catalina
?   8   390 190 3850    8.5 70  1   amc ambassador dpl
15  8   383 170 3563    10  70  1   dodge challenger se
14  8   340 160 3609    8   70  1   plymouth 'cuda 340
15  8   400 150 3761    9.5 70  1   chevrolet monte carlo
14  8   455 225 3086    10  70  1   buick estate wagon (sw)
24  4   113 95  2372    15  70  3   toyota corona mark ii
22  6   198 95  2833    15.5    70  1   plymouth duster
18  6   199 97  2774    15.5    70  1   amc hornet
21  6   200 85  2587    16  70  1   ford maverick

Imagine that the dataset has 1 million lines and possibly somewhere there is a character in place of a number, ie by an error, somewhere in the column mpg which should contain only numbers there is a strange character like '?'

import pandas as pd
import numpy as np

carros = pd.read_csv('Auto.csv',sep =',')
print(carros['mpg'])

How can I test whether the column mpg for example, it contains only numbers and, if not, return a new Pandas dataframe with the line removed?

Dataset source: http://www-bcf.usc.edu/~Gareth/ISL/data.html

1 answer

Browser other questions tagged python-3.x pandas

You are not signed in. Login or sign up in order to post.

by Gomiero • **5,286** points · Answer 1 · 2018-05-28T00:55:16+00:00

A safe way to test whether the line value is numerical (for the question data), is to define a function that attempts to perform the conversion to int and check if there was a mistake:

def e_numero(x):
    try:
        # tenta converter para inteiro
        int(x)
        # retorna verdadeiro se conseguir...
        return True
    except ValueError:
        # ... ou falso, caso não consiga
        return False

After setting the function, to test if there are lines with incorrect value in the column:

dt[dt['mpg'].apply(lambda x: not(e_numero(x)))]

And to generate the new Dataframe without the incorrect lines:

dt_novo = dt[dt['mpg'].apply(lambda x: e_numero(x))]

Follow an example section:

# Dados
dt
Out[54]:
   mpg  cylinders  displacement  horsepower  weight  acceleration  year  origin                       name
0   18          8           307         130    3504          12.0    70       1  chevrolet chevelle malibu
1   15          8           350         165    3693          11.5    70       1          buick skylark 320
2   18          8           318         150    3436          11.0    70       1         plymouth satellite
3   16          8           304         150    3433          12.0    70       1              amc rebel sst
4   17          8           302         140    3449          10.5    70       1                ford torino
5   15          8           429         198    4341          10.0    70       1           ford galaxie 500
6   14          8           454         220    4354           9.0    70       1           chevrolet impala
7   14          8           440         215    4312           8.5    70       1          plymouth fury iii
8   14          8           455         225    4425          10.0    70       1           pontiac catalina
9    ?          8           390         190    3850           8.5    70       1         amc ambassador dpl
10  15          8           383         170    3563          10.0    70       1        dodge challenger se
11  14          8           340         160    3609           8.0    70       1         plymouth 'cuda 340
12  15          8           400         150    3761           9.5    70       1      chevrolet monte carlo
13  14          8           455         225    3086          10.0    70       1    buick estate wagon (sw)
14  24          4           113          95    2372          15.0    70       3      toyota corona mark ii
15  22          6           198          95    2833          15.5    70       1            plymouth duster
16  18          6           199          97    2774          15.5    70       1                 amc hornet
17  21          6           200          85    2587          16.0    70       1              ford maverick

# AQUI => Lista as linhas com valores não numéricos na coluna mpg
dt[dt['mpg'].apply(lambda x: not(e_numero(x)))]
Out[55]:
  mpg  cylinders  displacement  horsepower  weight  acceleration  year  origin                name
9   ?          8           390         190    3850           8.5    70       1  amc ambassador dpl

# AQUI => Cria o DataFrame dt_novo sem as linhas com valores incorretos
dt_novo = dt[dt['mpg'].apply(lambda x: e_numero(x))]

dt_novo
Out[57]:
   mpg  cylinders  displacement  horsepower  weight  acceleration  year  origin                       name
0   18          8           307         130    3504          12.0    70       1  chevrolet chevelle malibu
1   15          8           350         165    3693          11.5    70       1          buick skylark 320
2   18          8           318         150    3436          11.0    70       1         plymouth satellite
3   16          8           304         150    3433          12.0    70       1              amc rebel sst
4   17          8           302         140    3449          10.5    70       1                ford torino
5   15          8           429         198    4341          10.0    70       1           ford galaxie 500
6   14          8           454         220    4354           9.0    70       1           chevrolet impala
7   14          8           440         215    4312           8.5    70       1          plymouth fury iii
8   14          8           455         225    4425          10.0    70       1           pontiac catalina
10  15          8           383         170    3563          10.0    70       1        dodge challenger se
11  14          8           340         160    3609           8.0    70       1         plymouth 'cuda 340
12  15          8           400         150    3761           9.5    70       1      chevrolet monte carlo
13  14          8           455         225    3086          10.0    70       1    buick estate wagon (sw)
14  24          4           113          95    2372          15.0    70       3      toyota corona mark ii
15  22          6           198          95    2833          15.5    70       1            plymouth duster
16  18          6           199          97    2774          15.5    70       1                 amc hornet
17  21          6           200          85    2587          16.0    70       1              ford maverick