Using PROC SUMMARY for Descriptive and Frequency Statistics Additional Codes to an Existing Format Deleting SAS Datasets based on a Date Reading Variable Length Record Files Changing the Height of a HEADLINE in PROC REPORT when using ODS RTF Variance Calculation Differences Getting the Comment text of an Excel Cell Getting the Background Color of a Cell in Excel A DOS .BAT File for Backing Up a File with Date A Stamp Calculating the Distance Between Two Points on the Earth's Surface Other interesting pages ... |
The following is some of my more useful SAS Tips, or other pieces of code that may or may not be directly SAS related, that I have collected over time. Some may have been seen in the Tip of the Month page, although it must be noted that only some pass from that page to this one, and others are just tidbits that are interesting and/or useful. There is no order to these nuggets so please just look carefully at the index on the left for what is available as it is updated from time to time. ## Counting the Number of Observations in a DatasetThere are a number of ways that SAS code can be written to get the number of observations in a SAS dataset. My favorite and an oldie is: %macro numobs(dsn); %global num; data _null_; if 0 then set &dsn nobs=nobs; call symput('num',trim(left(put(nobs,8.)))); stop; run; %mend numobs; Another adaption of this counts not only the number of observations but also the number of variables in a dataset: %macro numobs(dsn); %global numobs numvars; data _null_; set &dsn (obs=1) NOBS=obscnt; array aa {*} $ _character_; array nn {*} _numeric_; vars=dim(aa)+dim(nn); call symput('numvars',vars); call symput('numobs',obscnt); stop; run; %mend numobs; Version SAS 6.12 introduced the BASE SAS programmer to some of the SAS functions that were only available in SCL. One of these was the ATTRN function and this was able to be used to get the number of observations with the following code: %macro numobs(dsn); %global num; %let dsid=%sysfunc(open(&dsn)); %if &dsid %then %do; %let num=%sysfunc(attrn(&dsid,nobs)); %let rc=%sysfunc(close(&dsid)); %end; %else %put Opening dataset &dsn failed - %sysfunc(sysmsg()); %mend numobs; In both cases calling the macro counts the number of observations inside the dataset specified by the macro variable DSN and puts the number found in the global macro variable NUM. Which one is best is down to personal preference but remember that the latter can only be used where you have access to SAS version 6.12 or above. ## Using PROC SUMMARY for Descriptive and Frequency StatisticsTo many, the PROC MEANS and PROC SUMMARY SAS procedures are the same. There are however two differences. The first difference is that the SUMMARY procedure does have as default to print no output to an output file while the MEANS procedure does by default. The option that controls this is PRINT/NOPRINT so it is possible to print the output from the SUMMARY procedure and have no output from the MEANS procedure. The second difference is not widely known but it is a useful. When the VAR statement is missing in the MEANS procedure analysis is carried out on all numeric variables, as shown in the following example (output below): data vitals; infile cards; input patid $3. heart_rate temperature trtcd $1.; cards; 001 72 35.8 A 002 80 36.4 B 003 99 36.6 A ; run; proc means data=vitals nway; class trtcd; run; The SAS System The MEANS Procedure N trtcd Obs Variable N Mean Std Dev Minimum Maximum -------------------------------------------------------------------------- A 2 heart_rate 2 85.5000000 19.0918831 72.0000000 99.0000000 temperature 2 36.2000000 0.5656854 35.8000000 36.6000000 B 1 heart_rate 1 80.0000000 . 80.0000000 80.0000000 temperature 1 36.4000000 . 36.4000000 36.4000000 -------------------------------------------------------------------------- However, what happens when the SUMMARY procedure is used instead on the same data? proc summary data=vitals nway print; class trtcd; run; The SAS System The SUMMARY Procedure N trtcd Obs ------------ A 2 B 1 ------------ Notice that the only result that came out from the SUMMARY procedure was the number of observations from each treatment group, similar to what the FREQ procedure will produce. The same result will prevail if the only variables in the VITALS dataset were PATID and TRTCD, that is only the character variables. So what does this finding mean? In most cases SAS programmers use the FREQ procedure for frequency counts and the MEANS procedure for summary statistics however these two sets of statistics most commonly carried out can be done by one procedure. A possible macro for calculating both from one procedure is given below: %macro summstat(dsin=, /*Input file -REQUIRED*/ dsout=, /*Results filst -REQUIRED*/ classvar=, /*Class Variable(s) -REQUIRED*/ vvars=, /*Analysis Variables, only if descriptive statistics requested. Can use one or numeric variable names or _NUMERIC_ for all numeric variables. AUTONAME option will set variable names in output file*/ outprt=NOPRINT /*Output to listings file. Values: PRINT|NOPRINT*/ ); proc summary data=&dsin nway &outprt; class &classvar; %if (&vvars ne ) %then %do; var &vvars; output out=&dsout (drop=_type_ _freq_) n= mean= std= median= min= max= /autoname; %end; %else %do; output out=&dsout (drop=_type_ rename=(_freq_=N)); %end; run; %mend summstat; ## Finding the Last date of the MonthSometimes it is necessary to find the last date of a month. Remembering that the last date for a month is not the same across all months of the year the following code will help: Last_Day_of_Month=INTNX('MONTH',SAS_Date,0,'END'); Note the use of the INTNX function which is useful for time interval calculations. ## Creating a CSV file of a SAS DatasetPassing a SAS dataset to Excel? Transferring data from one application to another is always a tricky issue. One of the usual ways this is done is converting the data into a common format that both applications can read and write - CSV is one such established format. There are a number of ways that SAS will create the CSV file. The first is using the PROC EXPORT procedure and using the CSV as the filename extension, as the following example shows: PROC EXPORT DATA=sashelp.class OUTFILE="c:\taui\class.csv"; RUN; Before the EXPORT procedure came into SAS, the data step was used to generate the CSV file, as the following example shows: DATA _NULL_; FILE "c:\tuai\class.csv"; SET sashelp.class; PUT (_all_) (','); RUN; Also available is the DEXPORT command that is run inside the command line of the SAS Desktop, which is shown in the following example: DEXPORT sashelp.class "c:\taui\class.csv" The last method that will be looked at here is using a form of SAS ODS that was introduced in SAS 8.2 - the best way to show it is using the following example: ODS CSV FILE='c:\taui\class.csv'; PROC PRINT DATA = sashelp.class NOOBS; RUN; ODS CSV CLOSE; With this usage it is possible to use the standard VAR and WHERE statements to limit the observations and variables being passed to the CSV file. CSV files can be easily read into Excel or most other spreadsheet or database programs and is a useful format when transferring data from SAS to another application. ## Checking a SAS Dataset ExistsWant to check if a particular SAS dataset exists? If you have SAS version 6.12 or above the following code fragment will be useful: exist = %SYSFUNC(EXIST(dataset_name)); where the variable EXIST will contain a value of 1 if the dataset is present or 0 if not. ## Reordering VariablesVolumes have been written in the past on how to reorder variables in a SAS Dataset. There is the LENGTH, ATTRIB and retain statements inside a SAS dataset but I have found in the past that the best way is the use of the SQL procedure. Lets say you have a SAS dataset DEMOG with the variables AGE, GENDER, SUBJECTID, HEIGHT and WEIGHT, and you want the variable SUBJECTID first (the placement of the other variables is okay), the following code is useful: PROC SQL; CREATE TABLE demog AS SELECT subjectid, * FROM demog; QUIT; RUN; The resulting dataset DEMOG will have the variables in the order of SUBJECTID, AGE, GENDER, HEIGHT and WEIGHT. ## Additional Codes to an Existing FormatYour program has a variable that is coded as 1=YES and 2=NO with an associated format called YN. Now there is a new value of 3=DON'T KNOW in your data but the format YN has not been changed. In your reporting program you can create another format, in this example YNX, using the following code: proc format; value ynx 1='YES' 2='NO' 3="DON'T KNOW"; run; A better way though, avoiding transcription problems, is to make the new YNX format by nesting the old format YN into the new as shown in the following code: proc format; value ynx 3="DON'T KNOW" other=[yn.]; run; Note that you must enclose the existing format name in square brackets, as shown above, or with parentheses and vertical bars, for example (|yn.|). ## Concatenating DatasetsConcatenating datasets can be one of the most tricky activities in SAS as you have to know the structure and content of the dataset before this task is done. However, one trick that is very useful is to use SQL to do the concatenation therefore avoiding a number of the problems associated with similar tasks using either the DATA step or APPEND procedure. The following example shows the SQL code used: PROC SQL; CREATE TABLE outa AS SELECT *, 'in1' AS dset FROM in1 OUTER UNION CORR SELECT *, 'in2' AS dset FROM in2 QUIT; RUN; ## Deleting SAS Datasets based on a DateOccasionally it is necessary to cleanup old datasets in a directory based on a date. The following SAS code is an example where datasets in the directory referenced by the LIBNAME OLDDATA that are older than 15MAR2007 are deleted: %let dslist=%str(); proc sql; select memname into :dslist separated by ' ' from sashelp.vtable where libname='OLDDATA' and datepart(crdate) < "15MAR2007"d; quit; run; proc datasets library=OLDDATA nolist nodetails; delete &dslist; quit; run; This code can be modified to any datasets in a specified directory as well as and date and/or time that the user may choose. ## Reading Variable Length Record FilesEver try to read in a file with variable length records? The following example may be useful and looks at the file INTEXT.TXT which as columns 1-10 as a FLAG for the record and columns 11 onwards as input text: DATA a; INFILE 'intext.txt' LENGTH=len; INPUT flag 1-10 @; varlen=len-10; INPUT @11 text $VARYING200. varlen; RUN; ## Changing the Height of a HEADLINE in PROC REPORT when using ODS RTFSometimes when creating a report using ODS RTF in SASv8.2, the line under the columns using the HEADLINE option in the REPORT procedure is too thin to be displayed properly on a computer screen but yet can be seen on a printout. The simple reason is that most computer screens use a display width of 72 cells per inch while the basic of printers have the capability of 300 or more cells per inch. To alter the height of the headline line some RTF code has to be sent to the RTF file being created. The following illustrates the usage: %let hdrbrdr=%str(style(header)=[pretext="\brdrb\brdrs\brdrw30"]); %let hdropt =%str(style(header column)=[protectspecialchars=off]); proc report data=pop1 nowindows headline &hdropt split='\' missing; columns state pop landarea psqmile; define state /group order=internal style=[cellwidth=80 just=left] &hdrbrdr 'State'; define pop /display style=[cellwidth=30 just=center] &hdrbrdr 'Population\(2003)' format=comma11.; define landarea /display style=[cellwidth=30 just=center] &hdrbrdr 'Land Area\(sq. miles)' format=comma11.; define psqmile /display style=[cellwidth=30 just=center] &hdrbrdr 'People per square mile' format=comma11.1; quit; run; In the example above it is the HRDBRDR macro variable that controls the width of the line, specifically the number at the end of the definition. Note that in this example the width is set to '30' twips but can be altered to any value. ## How Quantiles are CalculatedThe way a statistic is calculated may be more important than the result it produces.
Recently an example showed up when some statistics were being checked using Excel on
results produced with SAS, specifically with the calculation of a first and third
quartile, also known as the 25 For the data 1, 2, 3, 4, 5, 6, 7 and 8 the following results are calculated for the
25 SAS Method 5 (default) = 2.5 Why the difference? There actually is no standard for the calculation of percentile and it does depend on what a statistician is looking for. SAS has six methods of calculating the percentile, the two common ones being: Method 5: y = (x Excel, S-Plus and StarOffice Calc by comparison uses a different method, specifically: y = (1-g)*x Among the major statistical software packages only Excel, S-Plus and StarOffice Calc use this method. For those using Minitab or SPSS they use the SAS Method 4 for their calculation. The moral of this example is that you should know how your software calculates a statistic before blindly reporting the result. ## Variance Calculation DifferencesI received an email in early 2005 from a SAS programmer asking me if I could help explain why he was getting differences when comparing the variance results between SAS and Excel using the same set of numbers. Traditionally the variance is calculated as (a) Alternately the equation can be written as (b) The first equation (a) requires two passes of the data - the first pass to calculate the mean and the second to calculate the variance. This second equation (b) is more commonly known as the Desktop Calculator Formula as it is possible to calculate the variance with one pass of the data. There is one issue with this formula as if the numbers are big and the differences between the numbers are small an incorrect result will prevail. This is because computers and calculators store results as real numbers and some precision may be lost when storing the sum of the square values across a long list of numbers. As an example calculate the variance of 10,000,001, 10,000,003 and 10,000,005 - the answer is 4 but some calculators may produce an answer of 0. Now what about SAS and Excel, what do they do. SAS uses the
Traditional Formula. Microsoft used the Desktop Calculator
Formula until Excel 2003 (for Windows) and Excel (for Mac) when
it changed to use the Traditional Formula Is there any calculations that will produce a result with one pass of data? The answer is yes. One formula is known as the Method of Provisional Means where and There are other methods that calculate the variance in one pass of the data that are found in some statistical textbooks. The moral of the story is to always check with your software documentation to see how statistics are being calculated. -------------- ## Getting the Comment text of an Excel CellNot a SAS tip but one that is useful within Excel. The following VBA can be used to get the text of a comment behind a cell in an Excel Spreadsheet and place it in another specified cell: ```
Function GComTxt(rCellComment As Range)
```
To enter the VBA code into Excel select ` Tools > Macro > Visual Basic Editor`
then ` Insert > Module`
Then paste the code above and then press ` File > Close and Return to Microsoft Excel`
The function is now ready for use within Excel and can be accessed from the User Defined function list. ## Getting the Background Color of a Cell in ExcelAgain, not a SAS tip but one that is useful within Excel. The following VBA can used to get the background color a cell in an Excel Spreadsheet and place it in another specified cell: Function showColorCode(rcell) showColorCode = rcell.Interior.ColorIndex End Function To enter the VBA code into Excel select Tools > Macro > Visual Basic Editor then Insert > Module Then paste the code above and then press File > Close and Return to Microsoft Excel The function is now ready for use within Excel and can be accessed from the User Defined function list. The codes returned from the function will show the background color of the cell. Microsoft unfortunately did not keep codes consistent across different releases of Excel. The following list is a general guide that is useful when referring to color codes for Excel 97: - -4142 = No Color
- 1 = Black
- 2 = White
- 3 = Red
- 5 = Blue
- 6 = Yellow
- 10 = Green
## A DOS .BAT File for Backing Up a File with Date A StampThis is not a direct SAS tip but a DOS .BAT file that I find useful when writing programs and wanting to keep a copy of a file with a date/time stamp embedded in the file name, in a backup directory, directly under the directory I am working in. The code for this file is the following:
@ECHO OFF The usage for this file, where the code is stored in the file BKUP.BAT, is BKUP An example of its usage is the file DEMO.SAS where after running the command BKUP DEMO.SAS the file would appear in the BKUP directory with the name DEMO-01-13-2004-1238.SAS where the
"01-13-2004" is the date in local format from the DOS country setting and the "1238" is the
time.
## Calculating the Distance Between Two Points on the Earth's SurfaceNot actually SAS related but here is something that may be useful to someone. Before GPS could tell us what the distance was between two points on the Earth's Surface there were many ways that were used to estimate that distance. One of the most simplist methods was the Great Circle Distance which was to treat the earth as a sphere and use the following formula:
l) + COS(_{2}l) * COS(_{1}l)
* COS(_{2}g- _{2 }g))_{1}where
l = latitude at the
second point (degrees)_{2}g = longitude at the
first point (degrees)_{1}g = longitude at the
second point (degrees)_{2}d = computed distance (nautical
miles)This method of calculation does assume that 1 minute of arc is
1 nautical mile. To convert the distance from nautical miles to
kilometers, multiply the result |

Updated January 11, 2011