Max of a numeric field returning NA


6

I’m starting to learn R and I came across a situation I don’t understand. I downloaded the ENEM 2014 data (a CSV file) and read it using:

dados_enem <- read.csv(file="MICRODADOS_ENEM_2014.csv", header = TRUE, sep = ",")

When I ask to calculate the maximum, minimum or average of a given numeric field, it returns perfectly. For example, the NU_NOTA_REDACAO field:

max(dados_enem$NU_NOTA_REDACAO)  
min(dados_enem$NU_NOTA_REDACAO)  
mean(dados_enem$NU_NOTA_REDACAO)

    > max(dados_enem$NU_NOTA_REDACAO)  
    [1] 1000  
    > min(dados_enem$NU_NOTA_REDACAO)  
    [1] 0  
    > mean(dados_enem$NU_NOTA_REDACAO)  
    [1] 323.4219 

However, when I do the same for the NOTA_CN or NOTA_CH fields, both in the same format as NU_NOTA_REDACAO, I get NA:

max(dados_enem$NOTA_CN)  
min(dados_enem$NOTA_CN)  
mean(dados_enem$NOTA_CN) 

    > max(dados_enem$NOTA_CN)
    [1] NA
    > min(dados_enem$NOTA_CN)
    [1] NA
    > mean(dados_enem$NOTA_CN)
    [1] NA

I tried to force the conversion to numeric, but the result was the same:

dados_enem$NOTA_CN = as.numeric(as.character(dados_enem$NOTA_CN))
max(dados_enem$NOTA_CN)
[1] NA

The file is quite large (almost 9 million records and 166 columns), but here is a sample of the data from this column:

[4513]    NA    NA 462.1 483.1 541.7    NA 527.8    NA    NA 456.9 639.5 527.9 535.1    NA    NA    NA  
 [4529] 505.7 389.3 391.7 764.9 527.5 459.3 481.1    NA 438.7 609.3 591.8 438.3 538.2    NA 493.5    NA  
 [4545]    NA 396.8    NA 486.3 566.1    NA    NA    NA 529.8 620.5 477.0 404.4 446.2 547.4    NA 460.5  
 [4561]    NA    NA 541.8    NA    NA 544.2 605.2 584.5    NA    NA 523.2 541.7    NA 523.1 528.7    NA  

What am I doing wrong?

Grateful to all!

  • 2

    Have you tried removing the rows with NA? You can use dados_enem = na.omit(dados_enem) or call the function like this: mean(dados_enem$NOTA_CN, na.rm=TRUE). More details here: http://www.statmethods.net/input/missingdata.html

  • 1

    Perfect! It worked with the na.rm=TRUE option. Thank you very much! One question: when I use na.omit(), does it just ignore the missing data or actually remove it?

  • 3

    @Sandro with na.omit the data is removed from the object. Also, keep in mind that it deletes every row that has at least one NA value.

  • As Danielfalbel commented, na.omit removes every row where at least one column has NA. So if a given row has an NA in a column that you could simply skip in one specific calculation (for example, when computing the average of that column), remember that na.omit will delete the whole row, and the valid data in its other columns will be lost. It has to be analyzed case by case, because sometimes the NA is not a problem, and sometimes the missing value is important enough to invalidate the rest of the row.

  • 2

    In your case, for example, the score seems to be quite important. So a row that has NA in exactly that column can probably be eliminated entirely, because the other data in that row would be useless for the analysis you want to do.

  • 2

    Got it @Luizvieira, thanks for the explanation!
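The difference between na.rm = TRUE and na.omit() discussed in the comments above can be sketched on a small example. This is only an illustration: the three-row data frame df and the name df_completo are made up, with column names borrowed from the question.

```r
# Hypothetical three-row data frame; column names borrowed from the question
df <- data.frame(
  NU_NOTA_REDACAO = c(600, 0, 1000),
  NOTA_CN         = c(462.1, NA, 541.7)
)

mean(df$NOTA_CN, na.rm = TRUE)     # skips the NA for this calculation only
nrow(df)                           # 3: df itself is unchanged

df_completo <- na.omit(df)         # drops every row that has any NA
nrow(df_completo)                  # 2
mean(df_completo$NU_NOTA_REDACAO)  # 800: the valid 0 in row 2 is gone too
mean(df$NU_NOTA_REDACAO)           # about 533.33 with all three rows
```

With na.rm = TRUE the missing value is skipped only inside that one call, whereas na.omit() changes which rows every later calculation sees.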


1 answer

4


Try the following:

max(dados_enem$NOTA_CN, na.rm = TRUE)  
min(dados_enem$NOTA_CN, na.rm = TRUE)  
mean(dados_enem$NOTA_CN, na.rm = TRUE)

By default, these functions return NA when the vector contains any NA. You need to state explicitly that you want the missing values removed from the calculation.
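A minimal sketch of this behavior, using a made-up vector (notas is a hypothetical stand-in for dados_enem$NOTA_CN, with values taken from the sample in the question):

```r
# Hypothetical toy vector standing in for dados_enem$NOTA_CN
notas <- c(462.1, NA, 541.7, 527.8, NA)

max(notas)                 # NA: a single missing value propagates
max(notas, na.rm = TRUE)   # 541.7
min(notas, na.rm = TRUE)   # 462.1
mean(notas, na.rm = TRUE)  # mean of the three non-missing values

sum(is.na(notas))          # 2: how many values are missing
```

Checking sum(is.na(...)) first is a quick way to see how much data a result with na.rm = TRUE actually rests on.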

This confuses many people who are starting out in R, because there is no consistent pattern across its functions. For example, the summary and table functions ignore the presence of NAs by default.
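For comparison, here is a small sketch of how summary and table treat missing values, again on a made-up vector (notas is hypothetical). The useNA argument of table is part of base R.

```r
# Hypothetical toy vector with two missing values
notas <- c(462.1, NA, 541.7, 541.7, NA)

summary(notas)                  # includes an "NA's" entry with the count
table(notas)                    # counts only non-missing values by default
table(notas, useNA = "ifany")   # also counts the NAs
```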

  • It worked! So it would be prudent to always use this option, correct? Thank you very much!

  • 1

    @Sandro, it depends a lot on what you want. Often NAs carry information that could be used in calculating the average, so they should not necessarily be excluded.

  • 1

    I understand, but forgive my ignorance: if the fields containing NA prevent the mean from being calculated, since the function returns NA when it finds them, how could they be included in the calculation? I’m not sure I understand...

  • 1

    @Sandro The default of FALSE is more prudent. It prevents a function applied to 1000 values, 900 of which are NA, from returning a numeric result as if there were no missing values, which could produce incorrect or biased information. Requiring na.rm = TRUE to be set explicitly makes the programmer take on the risk of ignoring the NA values.

  • 1

    Daniel, I think it’s a good idea to always write FALSE and TRUE in answers, as recommended in several places. T and F are just variables that can be reassigned, which can cause problems. This is especially important for beginners who may not understand what is happening.

  • 1

    By the way: table only counts occurrences, and can count NAs as well, so it does not carry the risks I mentioned above. summary reports the number of NAs when there are any, which is not as silent as a default of na.rm = TRUE would be.

  • 1

    @Molx I think the OP's question is more along these lines: even if the default of FALSE is more prudent, what is the point of it, given that a single NA is enough for the function to always return NA?

  • Exactly @Luizvieira, if you leave the default as FALSE the function (max, min or mean) returns NA. The question would be: in these cases, does the mean function, for example, calculate the average over all records (including those with an NA value) or only over those with some value (even zero) in the column? For example, for a set of 100 rows where 30 have NA in the column used, is the average calculated over 100 or over 70?
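The 100-versus-70 question in the last comment can be checked directly. The sketch below uses a made-up vector x with 70 values and 30 NAs; it shows that mean with na.rm = TRUE divides by the number of non-missing values only.

```r
# Hypothetical column of 100 values, 30 of them missing
x <- c(rep(500, 70), rep(NA, 30))

mean(x)                        # NA with the default na.rm = FALSE
mean(x, na.rm = TRUE)          # 500: sum of 70 values divided by 70
sum(x, na.rm = TRUE) / 100     # 350: what dividing by all 100 rows would give
sum(!is.na(x))                 # 70: number of values actually used
```

A value of zero is not missing, so it stays in the calculation. Only NA entries are dropped.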
