SAS and R run Merge differently

Asked

Viewed 647 times

3

What is the difference between how R and SAS perform merge?
The SAS Merge command returns 205546 lines whereas the R command returns 207208 lines.
Following example.

I am working with the IBGE file available at:
ftp://ftp.ibge.gov.br/PNS/2013/microdados/pns_2013_microdados.zip

DOMPNS2013.txt and PESPNS2013.txt bases will be used

SAS:
1) Variable assignment: run the files "DOMPNS2013 input" and "PESPNS2013 input"
2) Selection of an interest value and Merge:

data dompns2013v3;  
set dompns2013;  
if V0015 = 1;  
run;  
/*NOTE: There were 81187 observations read from the data set WORK.DOMPNS2013.
NOTE: The data set WORK.DOMPNS2013V2 has 64348 observations and 20 variables.*/  

data arq.dompes2013v3;  
merge dompns2013v3 pespns2013;   
by v0001 v0024 upa_pns v0006;  
run;  
/*NOTE: There were 64348 observations read from the data set WORK.DOMPNS2013V2.
NOTE: There were 205546 observations read from the data set WORK.PESPNS2013.
NOTE: The data set ARQ.DOMPES2013V2 has 205546 observations and 388 variables.
NOTE: DATA statement used (Total process time):*/  
#

R: 1) assignment of variables:

d2013 = read.fwf(file='DOMPNS2013.txt',widths=c(2,8,7,4,2,6,1,1))  

names(d2013) = c("v0001","v0024","upa_pns","v0006","v0015","skip1","v0026","v0031")  

d2013 = subset(d2013,select=c("v0001","v0024","upa_pns","v0006","v0015","v0026","v0031"))  

p2013 = read.fwf(file='PESPNS2013.txt',widths=c(2,8,7,4,1,2,2,2,1,8,3))  

names(p2013)=c("v0001","v0024","upa_pns","v0006","v0025","skip1","c00301","c004","c006","skip2","c008")  

p2013=subset(p2013,select=c("v0001","v0024","upa_pns","v0006","v0025","c00301","c004","c006","c008"))  

2) Selection of an interest value and Merge:

dim(d2013)  
[1] 81187     7  

d2013 = subset(d2013, d2013$v0015 == 1)  
dim(d2013)  
[1] 64348     7  

dim(p2013)  
[1] 205546      9  

dpmerge = merge( p2013,d2013,by=c("v0001","v0024","upa_pns","v0006"))  
dim(dpmerge)  
[1] 207208     12  
  • 1

    Henry, it seems to me that the SAS is removing the duplicate records of DOMPNS before merging. If you do d2013 <- unique(d2013) before merging into R, the number of observations will be equal.

2 answers

1

SAS is removing duplicate records from DOMPNS before merging.

If you do d2013 <- unique(d2013) before merging into R, the number of observations will be equal.

0

Hello, Henrique! I don’t know how to merge in R, but in SAS yes. To use the merge in SAS you need to pre-sort the two datasets by the crossing key, because if we happen to have a lower key value at later values, they will be ignored.

In your case, the following code (placed just before using the merge) would solve the problem;

proc sort data dompns2013v3; by v0001 v0024 upa_pns v0006; run; proc sort data pespns2013; by v0001 v0024 upa_pns v0006; run;

What I would recommend is to use proc sql to cross tables. The merge is the most efficient option for crossing, but the tables need to be sorted. Using the proc sql, On the other hand, you don’t need to sort the databases, besides the fact that using the SQL language is something more versatile, because you can use it in several databases. Since I don’t know the structure of the tables you are using, I can’t give you an equivalent SQL code, but you can find it on the internet easily.

Some considerations/tips that are not directly related to doubt but may be useful:

-to filter records, use the clause WHERE rather than if, as it is more efficient. In your case you exchange this code snippet (if V0015 = 1;) this way (Where V0015 = 1;); - if there is a possibility of having duplicate keys and assuming also that you only want unique records, add the option nodupkey in the proc Sort (after the code date=base_input). Remembering that it will only keep the first value that the key appears at the base; then, if the order in which the key appears, and not just the key itself, is important, you first use the proc Sort without the option nodupkey (placing in ordination - clause by - the keys and criteria) and then uses the proc Sort with the option nodupkey (putting in the clause by only the key).

Browser other questions tagged

You are not signed in. Login or sign up in order to post.