Max of a numeric field returning NA

Question

Asked 9 years, 5 months ago

Viewed 80 times

6

I’m starting to learn R and I came across a situation I don’t understand. I downloaded the ENEM 2014 data (a CSV file) and read it using:

dados_enem <- read.csv(file="MICRODADOS_ENEM_2014.csv", header = TRUE, sep = ",")

When I ask for the maximum, minimum or mean of a given numeric field, it works perfectly. For example, for the NU_NOTA_REDACAO field:

max(dados_enem$NU_NOTA_REDACAO)  
min(dados_enem$NU_NOTA_REDACAO)  
mean(dados_enem$NU_NOTA_REDACAO)

    > max(dados_enem$NU_NOTA_REDACAO)  
    [1] 1000  
    > min(dados_enem$NU_NOTA_REDACAO)  
    [1] 0  
    > mean(dados_enem$NU_NOTA_REDACAO)  
    [1] 323.4219

However, when I do the same for the NOTA_CN or NOTA_CH fields, both in the same format as NU_NOTA_REDACAO, I get NA:

max(dados_enem$NOTA_CN)  
min(dados_enem$NOTA_CN)  
mean(dados_enem$NOTA_CN)

    > max(dados_enem$NOTA_CN)
    [1] NA
    > min(dados_enem$NOTA_CN)
    [1] NA
    > mean(dados_enem$NOTA_CN)
    [1] NA

I tried to force the conversion to numeric, but the result was the same:

    > dados_enem$NOTA_CN <- as.numeric(as.character(dados_enem$NOTA_CN))
    > max(dados_enem$NOTA_CN)
    [1] NA

The file is quite large (almost 9 million records and 166 columns), but here is a sample of the data from this column:

[4513]    NA    NA 462.1 483.1 541.7    NA 527.8    NA    NA 456.9 639.5 527.9 535.1    NA    NA    NA  
 [4529] 505.7 389.3 391.7 764.9 527.5 459.3 481.1    NA 438.7 609.3 591.8 438.3 538.2    NA 493.5    NA  
 [4545]    NA 396.8    NA 486.3 566.1    NA    NA    NA 529.8 620.5 477.0 404.4 446.2 547.4    NA 460.5  
 [4561]    NA    NA 541.8    NA    NA 544.2 605.2 584.5    NA    NA 523.2 541.7    NA 523.1 528.7    NA

What am I doing wrong?

Thanks, everyone!

2

Have you tried removing the rows with NA? You can use dados_enem = na.omit(dados_enem), or make the call as follows: mean(dados_enem$NOTA_CN, na.rm=TRUE). More details here: http://www.statmethods.net/input/missingdata.html

– Luiz Vieira

2016/02/25 at 14:50
1

Perfect! It worked with the na.rm=TRUE option. Thank you very much! One question: when I use na.omit(), does it just ignore the missing data or does it remove it?

– Sandro

2016/02/25 at 15:08
3

@Sandro with na.omit the data is removed from the object. Also, it is important to remember that it deletes every row that has at least one NA value.

– Daniel Falbel

2016/02/25 at 15:11
As Daniel Falbel commented, na.omit removes every row in which at least one column has an NA. So if a row has an NA in one column that you could ignore for a specific calculation (for example, when computing the mean of that column), remember that with na.omit the whole row is deleted, and the valid data in its other columns is lost. This has to be analyzed case by case, because sometimes that NA is not a problem, and sometimes the missing value is important enough to invalidate the rest of the columns.

– Luiz Vieira

2016/02/25 at 18:07
2

In your case, for example, the score seems to be quite important. So a row that has NA in exactly that column could perhaps be eliminated entirely, because the other data in that row would be useless for the analysis you want to do.

– Luiz Vieira

2016/02/25 at 18:08
2

Got it @Luizvieira, thanks for the explanation!

– Sandro

2016/02/25 at 18:50
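
The difference between na.rm = TRUE and na.omit discussed in the comments above can be sketched on a small example. The data frame `df` and its values are made up for illustration (only the column names come from the question). Note that na.omit returns a new object, so the original only changes if you assign the result back to it.

```r
# Hypothetical data frame: two score columns with NAs in different rows
df <- data.frame(
  NOTA_CN = c(462.1, NA, 541.7, 527.8),
  NOTA_CH = c(505.7, 389.3, NA, 764.9)
)

# na.rm = TRUE skips NAs only in the column being summarised
mean(df$NOTA_CN, na.rm = TRUE)   # uses 3 values
mean(df$NOTA_CH, na.rm = TRUE)   # uses 3 values

# na.omit() returns a copy without any row that has an NA in any column
df_completo <- na.omit(df)
nrow(df_completo)                # 2 rows remain (rows 1 and 4)
mean(df_completo$NOTA_CH)        # the valid 389.3 from row 2 is no longer included
```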

1 answer

by Daniel Falbel • **12,504** points · Answer 1 · 2016-02-25T14:56:31+00:00

4

Try the following:

max(dados_enem$NOTA_CN, na.rm = TRUE)  
min(dados_enem$NOTA_CN, na.rm = TRUE)  
mean(dados_enem$NOTA_CN, na.rm = TRUE)

By default, these functions return NA when the vector contains any NA value. You need to state explicitly that you want those values removed before the calculation.

This confuses many people who are starting out in R, because the behavior is not consistent across functions. For example, by default the summary and table functions still return a result when NA values are present.
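
A minimal sketch of this behavior, using a small made-up vector `notas` (hypothetical, standing in for `dados_enem$NOTA_CN`):

```r
# Hypothetical sample standing in for dados_enem$NOTA_CN
notas <- c(462.1, NA, 483.1, 541.7, NA, 527.8)

# Default na.rm = FALSE: a single NA makes the result NA
max(notas)
mean(notas)

# With na.rm = TRUE the NA values are dropped before the calculation
max(notas, na.rm = TRUE)
min(notas, na.rm = TRUE)
mean(notas, na.rm = TRUE)
```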

It worked! So it would be prudent to always use this option, correct? Thank you very much!

– Sandro

2016/02/25 at 15:09
1

@Sandro, it depends a lot on what you want. Often the NAs represent information that could be used in calculating the mean and should not necessarily be excluded.

– Daniel Falbel

2016/02/25 at 15:13
1

I understand, but forgive my ignorance: if the fields containing NA prevent the mean from being calculated, since the function returns NA when it finds them, how could they be included in the calculation? I’m not sure I follow...

– Sandro

2016/02/25 at 15:19
1

@Sandro The default of FALSE is the more prudent choice. It prevents a function applied to 1000 values, 900 of which are NA, from returning a numeric result as if nothing were missing, which could produce incorrect or biased information. Having to set na.rm = TRUE explicitly makes the programmer take responsibility for the risk of ignoring the NA values.

– Molx

2016/02/26 at 04:04
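
One way to act on this point is to check how much of a column is missing before passing na.rm = TRUE. This is only a sketch on a made-up vector `x` that reproduces the 900-out-of-1000 scenario from the comment:

```r
# Hypothetical vector: 1000 values, of which 900 are missing
set.seed(1)
x <- c(runif(100, min = 0, max = 1000), rep(NA, 900))

sum(is.na(x))    # number of missing values: 900
mean(is.na(x))   # proportion missing: 0.9

# The result looks ordinary even though it rests on only 10% of the data
mean(x, na.rm = TRUE)
```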
1

Daniel, I think it’s a good idea to always use FALSE and TRUE in answers, as recommended in several places. T and F are just variables, which can be modified and cause problems. This is especially important for beginners, who may not understand what is happening.

– Molx

2016/02/26 at 04:06
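
A short sketch of the problem with T and F, on a made-up vector `x`. The names `resultado_t` and `resultado_true` are only illustrative:

```r
# T and F are ordinary variables that can be reassigned;
# TRUE and FALSE are reserved words and cannot
x <- c(1, NA, 3)

T <- FALSE                               # an accidental reassignment
resultado_t <- mean(x, na.rm = T)        # now equivalent to na.rm = FALSE
resultado_true <- mean(x, na.rm = TRUE)

resultado_t      # NA
resultado_true   # 2

rm(T)  # remove the reassigned variable so T refers to base R's TRUE again
```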
1

By the way: table only counts occurrences, and it can count NAs as well, so it does not carry the risk I mentioned above. summary reports the number of NAs when there are any, so it is not as silent as a default of na.rm = TRUE would be.

– Molx

2016/02/26 at 04:12
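
To illustrate how table and summary treat missing values, here is a sketch on a made-up vector `x`. The `useNA` argument of table is what makes it count NAs:

```r
# Hypothetical vector with two missing values
x <- c(500, 500, NA, 600, NA)

table(x)                    # counts 500 and 600 only; NAs are not shown
table(x, useNA = "ifany")   # adds an <NA> entry with its count
summary(x)                  # includes an "NA's" entry reporting 2
```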
1

@Molx I think the OP's question is more along these lines: even if the default of FALSE is more prudent, what is it good for, given that a single NA is enough to make the function always return NA?

– Luiz Vieira

2016/02/26 at 11:42
Exactly, @Luizvieira. If you leave the default as FALSE, the function (max, min or mean) returns NA. The question is: in the cases mentioned, does the mean function, for example, calculate the average over all records (including those with NA) or only over those with some value (even zero) in the column? For example, for a set of 100 rows where 30 have NA in the column used, is the mean calculated over 100 or over 70?

– Sandro

2016/02/26 at 19:57
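
The 100-versus-70 question can be checked directly. In this sketch the vector `x` is made up to match the scenario in the comment: with na.rm = TRUE, mean drops the NA values first, so it divides by the 70 remaining values rather than by 100.

```r
# Hypothetical column: 100 rows, 30 of them NA, the other 70 all equal to 500
x <- c(rep(500, 70), rep(NA, 30))

mean(x)                 # NA
mean(x, na.rm = TRUE)   # 500 (sum of the 70 values divided by 70)

# Equivalent manual calculation
sum(x, na.rm = TRUE) / sum(!is.na(x))

# Dividing by all 100 rows would treat each NA as a zero
sum(x, na.rm = TRUE) / length(x)   # 350
```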