Return to Homepage

Goto the Tip of the Month Archive

Other interesting pages ...
LinkedIn Profile
SAS Cheat Sheet
Useful SAS Code
Full SAS Example
Basic Statistics
Contact Information

SAS Tip of the Month
July 2013
(for SAS and WPS)

One of the most significant software packages to complement SAS/WPS in recent years is R. That is a very bold statement but increasingly statisticians outside the academic environment are using R to do analysis of data, not SAS.

Cost is a factor, but the easy availability of specialist packages written by outside users to do tasks is certainly something to take note of, even by a SAS/WPS Programmer. I have heard the arguments saying that SAS/WPS software is checked by quality control people and therefore their statistics are correct but one argument against that claim is that the software to do the statistics is "in a black box" and users do not have access to the source code behind the calculations, unlike R where the source code is available.

An example I use is a paper I have presented entitled "Oh Quartile, Where Art Thou?" where the quartile calculation is discussed and compared against different packages, including SAS and R -- each have their own way of doing the calculation, sometimes multiple ways (which one is best, I leave that up to the Statisticians to decide). For the record, SAS has five different ways to calculate the quartile and I would bet most statisticians do not know what the default calculation is without looking up a reference. Anyway, I digress.

This month I look at a couple of examples getting data to R from SAS/WPS, doing some calculations (including a quartile that is not available in SAS) and then retrieve that data back into SAS/WPS. We will also have a quick look at an example of producing a graph.

First some data:

   data demog;
      infile cards;
      input id sex $ age htcm wtkg;
   cards;
   01 F 41 170 82.0
   02 M 23 172 72.8
   03 F 59 194 81.6
   04 F 24 174 72.3
   05 F 37 160 90.4
   06 M 38 184 83.2
   07 F 30 160 99.8
   08 M 39 175 78.3
   09 F 24 185 83.7
   10 F 64 188 92.3
   11 M 42 160 86.8
   12 F 37 176 80.6
   13 M 34 170 64.3
   14 M 22 172 96.8
   15 M 19 157 70.6
   16 F 19 155 95.1
   17 F 52 184 82.2
   18 F 26 160 83.6
   19 M 54 187 94.9
   20 F 30 190 63.1
   ;
   run;

R can't read SAS native datasets, but can read SAS XPORT Transport Files, this the code would be:

   libname demogxpt XPORT 'c:\temp\demogxpt.xpt';
   data demogxpt.demogxpt;
      set demog;
   run;

One reason for transferring the data via a SAS XPORT Transport File is that numeric precision of the transport file is far better than via a CSV file or most other type of files.

Now that we have the SAS XPORT Transport File, we need to write the code needed to run inside R -- this will be first placed in an external file called program.r. Note that our first statistics will be just getting the N, mean, standard deviation, median, minimum and maximum values for each SEX value.

   *Create PROGRAM.R file;
   DATA _null_;
      FILE 'c:\temp\program.r' LRECL=1024;

      put '#Bring in additional packages needed for analysis';
      put 'library(SASxport)';
      put 'library(doBy)';

      put '#Read in data -- note "/" for directory names, even in Windows';
      put 'mydata = read.xport("c:/temp/demogxpt.xpt")';

      put '#Result calculation for AGE by SEX';
      put 'r1<-summaryBy(AGE~SEX,data=mydata,FUN=function(x){c(N=NROW(x),MA=mean(x),SD=sd(x),MD=median(x),MN=min(x),MX=max(x))})';
      put 'r1$IDVar <- "AGEYR"  #Add analysis variable to first column';
      put 'names(r1) <- c("SEX","N","MEAN","STDDEV","MEDIAN","MIN","MAX","IDVar") #Do renaming of vars';

      put '#Result calculation for HTCM by SEX';
      put 'r2<-summaryBy(HTCM~SEX,data=mydata,FUN=function(x){c(N=NROW(x),MA=mean(x),SD=sd(x),MD=median(x),MN=min(x),MX=max(x))})';
      put 'r2$IDVar <- "HTCM"';
      put 'names(r2) <- c("SEX","N","MEAN","STDDEV","MEDIAN","MIN","MAX","IDVar")';

      put '#Result calculation for WTKG by SEX';
      put 'r3<-summaryBy(WTKG~SEX,data=mydata,FUN=function(x){c(N=NROW(x),MA=mean(x),SD=sd(x),MD=median(x),MN=min(x),MX=max(x))})';
      put 'r3$IDVar <- "WTKG"';
      put 'names(r3) <- c("SEX","N","MEAN","STDDEV","MEDIAN","MIN","MAX","IDVar")';

      put '#Combine the three calculations into one data frame';
      put 'r <- rbind(r1,r2,r3)';

      put '#Write the results out as a CSV file';
      put 'write.csv(r, file = "c:/temp/rdata_demog.csv", row.names=FALSE)';

      put 'q() #Exit';
   RUN;

Now it is running this PROGRAM.R file using R in batch mode, as seen below:

   OPTIONS XWAIT XSYNC;
   DATA _null_;
      LENGTH _str $200;
      *Location for R version 3.0.1, calling PROGRAM.R and output a LOG to PROGRAM.LOG;
      _str='"C:\Program Files\R\R-3.0.1\bin\r.exe" --no-save --quiet < "c:\temp\program.r" > "c:\temp\program.log"';
      CALL SYSTEM(_str);
   RUN;

Running these two datasteps in SAS or WPS create a CSV file with the N, mean, standard deviation, median, minimum and maximum values for each SEX value for age, height and weight. This CSV file can then be read in using a datastep or PROC IMPORT, among others, and processed in SAS/WPS. Yes, we can get the same result using PROC MEANS, PROC SUMMARY or PROC UNIVARITE but it is useful to see want the same can be done in R.

It must be noted that while I present a solution here that writes an external file and then has another step to run that code in R, there is a product called Bridge to R for WPS developed by MineQuest Business Analytics that can do this for you.

As mentioned previously, R has a large number of specialist packages that help in doing analysis, in this case we have used SASxport to bring the data into an R data frame, and use a doBy package to do analysis by a variable, in this case SEX. Note that comments are preceded by a '#' character.

One very important thing to note is that variable and data frame names are case sensitive -- case does matter in R!

Now for the case where the statistic cannot be calculated in SAS/WPS, getting the quartiles based on the Hyndman and Fan definition (yes, I have a paper out there called "Oh Quartile, Where Art Thou?" than shows a macro to do it, but we are using R instead for this example). Again we shall use the same data we brought in above:

   *Create PROGRAM.R file;
   DATA _null_;
      FILE 'c:\temp\program.r' LRECL=1024;

      put '#Bring in additional packages needed for analysis';
      put 'library(SASxport)';

      put '#Read in data -- note "/" for directory names, even in Windows';
      put 'mydata = read.xport("c:/temp/demogxpt.xpt")';

      put '#Subset data into Male and Female';
      put 'datasexF <- mydata[ which(mydata$SEX=='F'), ]';
      put 'datasexM <- mydata[ which(mydata$SEX=='M'), ]';

      put '#Female Weight';
      put 'r1 <- quantile(datasexF$WTKG,  probs = c(0.25, 0.5, 0.75), type=8)';
      put 'r1$IDVar <- "WTKG"';
      put 'r1$SEX <- "F"';

      put '#Male Weight';
      put 'r2 <- quantile(datasexM$WTKG,  probs = c(0.25, 0.5, 0.75), type=8)';
      put 'r2$IDVar <- "WTKG"';
      put 'r2$SEX <- "M"';

      put '#Combine the two calculations into one data frame';
      put 'r <- rbind(r1,r2)';

      put '#Write the results out as a CSV file';
      put 'write.csv(r, file = "c:/temp/rdata_demog_quantile.csv", row.names=FALSE)';

      put 'q() #Exit';
   RUN;

If we run this newly created file with the datastep containing the CALL SYSTEM call we get a CSV file called rdata_demog_quantile with the quartile results for the 1st, 2nd and 3rd quartiles (quintile of 0.25 = first quartile, 0.50 = second quartile or median, and 0.75 = third quartile) that we can import this CSV file into SAS/WPS using a datastep or PROC IMPORT, among others.

In the two examples given above we have managed to write the output to a CSV file for import into SAS/WPS, however there are occasions when a data frame is not created so it is necessary to capture the output as a text file and process that file inside SAS/WPS. The following R code illustrates this:

   out<-capture.output(r_code)
   cat(out,file="c:/temp/out.txt",sep="\n",append=FALSE) #First result, do not append
   out<-capture.output(more_r_code)
   cat(out,file="c:/temp/out.txt",sep="\n",append=TRUE) #Second result, do append

People who promote R use the ease at which graphs can be created as one of the reasons for why R is better than SAS/WPS -- I am not going to go into that debate as it can be argued for a very long time. What I will do is create a simple graph where the output is in JPG format:

   #Bring in additional packages needed for analysis
   library(SASxport)

   #Read in data -- note "/" for directory names, even in Windows
   mydata = read.xport("c:/temp/demogxpt.xpt")

   jpeg(file='c:/temp/SexWtKg.jpg', width=640, height=480) #Setup output JPEG file
   z <- mydata  #Copy mydata to z

   #Recode SEX variables
   attach(z)
   z$SEXCAT[SEX=="F"] <- "Female" # Relabel F for Female
   z$SEXCAT[SEX=="M"] <- "Male" # Relabel M for Male
   detach(z)

   #Mean and Median for weight by sex;
   m1 <- tapply(z$WTKG, list(z$SEXCAT), mean)
   m2 <- tapply(z$WTKG, list(z$SEXCAT), median)

   #Combine the two calculations into one data frame
   r <- rbind(m1,m2)

   #Generate graph
   barplot(r,col=c("yellow","light green"),ylim=c(0,150),beside=T,main="Mean and Median Weights by Gender",ylab="Weight (kg)",xlab="Sex")
   legend("topright",c("Mean","Median"),col=c("yellow","light green"),pch=15)

   dev.off() #Switch off device for creating graph

Again, we can wrap this in a datastep and call the R code in a datastep as shown previously.

For changing the output to a PDF format instead of JPEG, we could change the

   jpeg(file='c:/temp/SexWtKg.jpg', width=640, height=480) #Setup output JPEG file

for

   pdf(file="C:/temp/SexWtKg.pdf", height=6.5, width=9, paper="USr") #Setup output PDF file

This months' tip has only touched the surface of R. Is SAS/WPS better than R? This is a question that I am not prepared to get involved in, but R is certainly a package that complements SAS and WPS -- the two software packages can co-exist together achieving the output that is desired.

Hope this is useful. See you in August.

________________________________
Updated July 9, 2013