I’m starting to learn R and I came across a situation I don’t understand. I downloaded the ENEM 2014 data (a CSV file) and read it using:
dados_enem <- read.csv(file="MICRODADOS_ENEM_2014.csv", header = TRUE, sep = ",")
When I ask to calculate the maximum, minimum or average of a given numeric field, it returns perfectly. For example, the NU_NOTA_REDACAO field:
max(dados_enem$NU_NOTA_REDACAO)
min(dados_enem$NU_NOTA_REDACAO)
mean(dados_enem$NU_NOTA_REDACAO)
> max(dados_enem$NU_NOTA_REDACAO)
[1] 1000
> min(dados_enem$NU_NOTA_REDACAO)
[1] 0
> mean(dados_enem$NU_NOTA_REDACAO)
[1] 323.4219
However, when I do the same for the NOTA_CN or NOTA_CH fields, both in the same format as NU_NOTA_REDACAO, I get NA:
max(dados_enem$NOTA_CN)
min(dados_enem$NOTA_CN)
mean(dados_enem$NOTA_CN)
> max(dados_enem$NOTA_CN)
[1] NA
> min(dados_enem$NOTA_CN)
[1] NA
> mean(dados_enem$NOTA_CN)
[1] NA
I tried to force the conversion to numeric, but the result was the same:
dados_enem$NOTA_CN = as.numeric(as.character(dados_enem$NOTA_CN))
max(dados_enem$NOTA_CN)
[1] NA
The file is quite large (almost 9 million records and 166 columns), but here is a sample of the data from this column:
[4513] NA NA 462.1 483.1 541.7 NA 527.8 NA NA 456.9 639.5 527.9 535.1 NA NA NA
[4529] 505.7 389.3 391.7 764.9 527.5 459.3 481.1 NA 438.7 609.3 591.8 438.3 538.2 NA 493.5 NA
[4545] NA 396.8 NA 486.3 566.1 NA NA NA 529.8 620.5 477.0 404.4 446.2 547.4 NA 460.5
[4561] NA NA 541.8 NA NA 544.2 605.2 584.5 NA NA 523.2 541.7 NA 523.1 528.7 NA
What am I doing wrong?
Grateful to all!
Have you tried removing the rows with NA? You can use dados_enem = na.omit(dados_enem), or make the calculation call as follows: mean(dados_enem$NOTA_CN, na.rm=TRUE). More details here: http://www.statmethods.net/input/missingdata.html – Luiz Vieira
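As a minimal sketch of why the functions return NA and what na.rm=TRUE changes, here is a toy vector standing in for dados_enem$NOTA_CN. The name nota_cn is made up for illustration, and the values are just the first few from the sample in the question:

```r
# Toy stand-in for dados_enem$NOTA_CN (hypothetical name, values from the question's sample)
nota_cn <- c(NA, 462.1, 483.1, 541.7, NA, 527.8)

max(nota_cn)                 # NA: a single missing value makes the whole result NA
max(nota_cn, na.rm = TRUE)   # 541.7
min(nota_cn, na.rm = TRUE)   # 462.1
mean(nota_cn, na.rm = TRUE)  # 503.675 (average of the 4 non-missing values)

sum(is.na(nota_cn))          # 2: how many values are missing
```

With na.rm=TRUE the missing values are skipped only for that one calculation; the vector itself is not modified.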
Perfect! It worked with the na.rm=TRUE option. Thank you very much! One question: when I use na.omit(), does it just ignore the missing data or actually remove it?
– Sandro
@Sandro with na.omit the data is removed from the object. Also, it is important to remember that it deletes every row that has at least one NA value. – Daniel Falbel
As Daniel Falbel commented, na.omit removes all rows where at least one column has NA. So if a given row has an NA in one column that you could ignore case by case (for example, when calculating the average of that column), remember that if you use na.omit the whole row will be deleted (and the valid data in the other columns will be lost). This has to be analyzed case by case, because sometimes that NA is not a problem, and sometimes the missing value is important enough to invalidate the rest of the columns. – Luiz Vieira
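A small sketch of that row-dropping behavior, using a made-up four-row data frame. The object names dados and completos and all the values are hypothetical; only the column names come from the question:

```r
# Hypothetical miniature of dados_enem; the values are invented for illustration
dados <- data.frame(
  NU_NOTA_REDACAO = c(600, 0, 720, 540),
  NOTA_CN         = c(462.1, NA, 541.7, NA),
  NOTA_CH         = c(NA, 505.7, 389.3, 391.7)
)

# na.omit keeps only the rows that have no NA in any column
completos <- na.omit(dados)
nrow(dados)      # 4
nrow(completos)  # 1: only the third row is complete

# Per-column na.rm uses every available value in that column
mean(dados$NOTA_CH, na.rm = TRUE)  # average of 3 values
mean(completos$NOTA_CH)            # only 1 value survived na.omit
```

Here three valid NOTA_CH values shrink to one after na.omit, because two of those rows were dropped for having an NA in NOTA_CN.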
In your case, for example, the score seems to be quite important. So a row that has NA in exactly that column can perhaps be eliminated entirely, because the rest of the data in that row would be useless for the analysis you want to do. – Luiz Vieira
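If you do decide that only an NA in the score column should disqualify a row, one possible approach is to filter on that single column with is.na instead of calling na.omit on the whole data frame. This is a sketch on invented data; dados and com_nota_cn are hypothetical names:

```r
# Hypothetical miniature of dados_enem; the values are invented for illustration
dados <- data.frame(
  NU_NOTA_REDACAO = c(600, 0, 720, 540),
  NOTA_CN         = c(462.1, NA, 541.7, NA),
  NOTA_CH         = c(NA, 505.7, 389.3, 391.7)
)

# Keep the rows where NOTA_CN is present, whatever the other columns contain
com_nota_cn <- dados[!is.na(dados$NOTA_CN), ]

nrow(com_nota_cn)          # 2: rows 1 and 3 are kept
mean(com_nota_cn$NOTA_CN)  # no na.rm needed, this column has no NA left
anyNA(com_nota_cn$NOTA_CH) # TRUE: NAs in other columns are still there
```

The first row is kept even though its NOTA_CH is missing, so calculations on other columns would still need na.rm=TRUE.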
Got it @Luizvieira, thanks for the explanation!
– Sandro