I cannot reproduce the government data of PNAD 2013 from microdata

Asked

Viewed 425 times

4

The government announced last year the result of PNAD 2013, it appears that there 85.3% of the houses had piped water.

I downloaded the microdata from PNAD and with R tried to reproduce the analysis below the code:

setwd( "/saneamento_basico/")

# pacotes
# -------------------------------------------------------------------------------
library(xlsx)
library(descr)
library(survey)

# paths
# -------------------------------------------------------------------------------
dicionarioPath <- "PNAD/Dicionаrios e input/Dicionаrio de variаveis de domicбlios - PNAD 2013.xls"
pnadPath <- "PNAD/Dados/DOM2013.txt"
pnadCsvOut <- "PNAD/DOM2013.txt"

# traduz arquivo
# -------------------------------------------------------------------------------

# dicionario
dicionario <- read.xlsx(dicionarioPath, sheetIndex = 1, 
                        stringsAsFactors = FALSE, 
                        startRow = 4)[,1:3]

dicionario <- dicionario[complete.cases(dicionario),]
colnames(dicionario) <- c("start", "size", "var")
dicionario$end <- as.numeric(dicionario$start) + dicionario$size - 1

# traducao
 fwf2csv(fwffile = pnadPath, 
         csvfile = pnadCsvOut,
         names = dicionario$var,
          begin = dicionario$start,
          end = dicionario$end)

# le e formata pnad
# -------------------------------------------------------------------------------
 pnad <- read.csv(pnadCsvOut, sep = '\t')

# "Forma de abastecimento     de água"
coluna_formaDeAbastecimentoDeAgua <- "V4624" 
colunas_pesoDomicilio <- "V4611" # "Peso do domicílio"
strat <- "V4602" # "Estrato" 
id <- "V4618" # "PSU - Unidade primária de amostragem"

 # retira quem tem NA nos pesos
 pnad <- pnad[!is.na(pnad[,colunas_pesoDomicilio]),]
 # retira quem tem NA em forma de abastecimento
 pnad <- pnad[!is.na(pnad[,coluna_formaDeAbastecimentoDeAgua]),]

 # tranforma quem tem agua encanada em 1 e os outros em 0

 temAguaEncanada <- as.numeric(pnad[,coluna_formaDeAbastecimentoDeAgua] == 1)
 pnad[,coluna_formaDeAbastecimentoDeAgua] <- temAguaEncanada 

 sample.pnad <- svydesign(ids = ~V4618, 
                          strata = ~V4602,
                          weights = ~V4611, 
                          data = pnad,
                          nest = TRUE)

 # para evitar erro "has only one PSU at stage 1"
 options(survey.lonely.psu = "adjust")

 # deveria retornar a porcentagem de que​m tem​​ agua encanada (0.853)
 # mas retorna 0.83964
 svymean(~V4624, design = sample.pnad, na.rm=TRUE)

With this I get the result of 84%. 1.3% less than the government.

This is the first time I use the Survey package, so I wanted to know what I’m doing wrong.

  • I don’t know if this is the case, but the PNAD dictionary has a problem: two variables share characters from the database (the UF and City). Make sure your file doesn’t have that. If it does, delete the UF line (it has to be this one) from the dictionary and try again!

  • Small correction: State and Control Number (V0102), not State and Municipality

  • 1

    I checked your code and that’s not the problem... maybe not even a problem: in IBGE itself it says that the percentage is 83.96% (http://goo.gl/qAzi1X)

  • Thank you very much @Rcoster. Do you have any idea why this number diverges from government data?

  • It seems to me to be error of the newspaper sources. A website, for example, cite the Sheet as the source of the 85.6% value. But in own web page The figure is 86.4%. EBC is a state-owned news organization, but it’s not the research team. Your data has to match those of IBGE, and apparently are correct.

  • 1

    I signaled the question to be closed because there is no problem in the code, the result matches the official reference. The error is in the other sources, there is no doubt about programming here.

Show 1 more comment
No answers

Browser other questions tagged

You are not signed in. Login or sign up in order to post.