Number of Obs in a Dataset

Using PROC SUMMARY for Descriptive and Frequency Statistics

Last Date of Month

SAS to CSV

Does a Dataset Exist

Reordering Variables

Additional Codes to an Existing Format

Concatenating Datasets

Deleting SAS Datasets based on a Date

Reading Variable Length Record Files

Changing the Height of a HEADLINE in PROC REPORT when using ODS RTF

How Quantiles are Calculated

Variance Calculation Differences

Getting the Comment text of an Excel Cell

Getting the Background Color of a Cell in Excel

A DOS .BAT File for Backing Up a File with Date A Stamp

Calculating the Distance Between Two Points on the Earth's Surface

Return to Homepage


Other interesting pages ...
SAS Cheat Sheet
SAS Tip of the Month
Full SAS Example
Basic Statistics
Contact Information

The following is some of my more useful SAS Tips, or other pieces of code that may or may not be directly SAS related, that I have collected over time. Some may have been seen in the Tip of the Month page, although it must be noted that only some pass from that page to this one, and others are just tidbits that are interesting and/or useful. There is no order to these nuggets so please just look carefully at the index on the left for what is available as it is updated from time to time.

Counting the Number of Observations in a Dataset

There are a number of ways that SAS code can be written to get the number of observations in a SAS dataset. My favorite and an oldie is:

            %macro numobs(dsn);
              %global num;
              data _null_;
                if 0 then set &dsn nobs=nobs;
                call symput('num',trim(left(put(nobs,8.))));
                stop;
              run;
            %mend numobs;

Another adaption of this counts not only the number of observations but also the number of variables in a dataset:

            %macro numobs(dsn);
              %global numobs numvars;
              data _null_;
                set &dsn (obs=1) NOBS=obscnt;
                array aa {*} $ _character_;
                array nn {*} _numeric_;
                vars=dim(aa)+dim(nn);
                call symput('numvars',vars);
                call symput('numobs',obscnt);
                stop;
              run;
            %mend numobs;

Version SAS 6.12 introduced the BASE SAS programmer to some of the SAS functions that were only available in SCL. One of these was the ATTRN function and this was able to be used to get the number of observations with the following code:

            %macro numobs(dsn);
              %global num;
              %let dsid=%sysfunc(open(&dsn));
              %if &dsid %then %do;
                %let num=%sysfunc(attrn(&dsid,nobs));
                %let rc=%sysfunc(close(&dsid));
              %end;
              %else %put Opening dataset &dsn failed - %sysfunc(sysmsg());
            %mend numobs;

In both cases calling the macro counts the number of observations inside the dataset specified by the macro variable DSN and puts the number found in the global macro variable NUM. Which one is best is down to personal preference but remember that the latter can only be used where you have access to SAS version 6.12 or above.


Using PROC SUMMARY for Descriptive and Frequency Statistics

To many, the PROC MEANS and PROC SUMMARY SAS procedures are the same. There are however two differences.

The first difference is that the SUMMARY procedure does have as default to print no output to an output file while the MEANS procedure does by default. The option that controls this is PRINT/NOPRINT so it is possible to print the output from the SUMMARY procedure and have no output from the MEANS procedure.

The second difference is not widely known but it is a useful. When the VAR statement is missing in the MEANS procedure analysis is carried out on all numeric variables, as shown in the following example (output below):

    data vitals;
        infile cards;
        input patid $3. heart_rate
              temperature trtcd $1.;
    cards;
    001 72 35.8 A
    002 80 36.4 B
    003 99 36.6 A
    ;
    run;

    proc means data=vitals nway;
        class trtcd;
    run;

                                     The SAS System
                                   The MEANS Procedure

             N
    trtcd  Obs  Variable     N        Mean     Std Dev     Minimum     Maximum
    --------------------------------------------------------------------------
    A        2  heart_rate   2  85.5000000  19.0918831  72.0000000  99.0000000
                temperature  2  36.2000000   0.5656854  35.8000000  36.6000000

    B        1  heart_rate   1  80.0000000           .  80.0000000  80.0000000
                temperature  1  36.4000000           .  36.4000000  36.4000000
    --------------------------------------------------------------------------

However, what happens when the SUMMARY procedure is used instead on the same data?

    proc summary data=vitals nway print;
        class trtcd;
    run;

          The SAS System
       The SUMMARY Procedure

                    N
          trtcd    Obs
          ------------
          A          2
          B          1
          ------------

Notice that the only result that came out from the SUMMARY procedure was the number of observations from each treatment group, similar to what the FREQ procedure will produce. The same result will prevail if the only variables in the VITALS dataset were PATID and TRTCD, that is only the character variables.

So what does this finding mean? In most cases SAS programmers use the FREQ procedure for frequency counts and the MEANS procedure for summary statistics however these two sets of statistics most commonly carried out can be done by one procedure. A possible macro for calculating both from one procedure is given below:

    %macro summstat(dsin=,
                      /*Input file -REQUIRED*/
                    dsout=,
                      /*Results filst -REQUIRED*/
                    classvar=,
                      /*Class Variable(s) -REQUIRED*/
                    vvars=,
                      /*Analysis Variables, only if
                       descriptive statistics
                       requested. Can use one or
                       numeric variable names or
                       _NUMERIC_ for all numeric
                       variables.  AUTONAME option will
                       set variable names in output
                       file*/
                    outprt=NOPRINT
                      /*Output to listings file.
                       Values: PRINT|NOPRINT*/
                   );
        proc summary data=&dsin nway &outprt;
            class &classvar;
            %if (&vvars ne ) %then %do;
              var &vvars;
              output out=&dsout (drop=_type_ _freq_)
                     n= mean= std= median= min= max=
                     /autoname;
            %end;
            %else %do;
                output out=&dsout
                        (drop=_type_
                          rename=(_freq_=N));
            %end;
        run;
    %mend summstat;

Finding the Last date of the Month

Sometimes it is necessary to find the last date of a month. Remembering that the last date for a month is not the same across all months of the year the following code will help:

            Last_Day_of_Month=INTNX('MONTH',SAS_Date,0,'END');

Note the use of the INTNX function which is useful for time interval calculations.


Creating a CSV file of a SAS Dataset

Passing a SAS dataset to Excel? Transferring data from one application to another is always a tricky issue. One of the usual ways this is done is converting the data into a common format that both applications can read and write - CSV is one such established format.

There are a number of ways that SAS will create the CSV file. The first is using the PROC EXPORT procedure and using the CSV as the filename extension, as the following example shows:

    PROC EXPORT DATA=sashelp.class
        OUTFILE="c:\taui\class.csv";
    RUN;

Before the EXPORT procedure came into SAS, the data step was used to generate the CSV file, as the following example shows:

    DATA _NULL_;
        FILE "c:\tuai\class.csv";
        SET sashelp.class;
        PUT (_all_) (',');
    RUN;

Also available is the DEXPORT command that is run inside the command line of the SAS Desktop, which is shown in the following example:

    DEXPORT sashelp.class "c:\taui\class.csv"

The last method that will be looked at here is using a form of SAS ODS that was introduced in SAS 8.2 - the best way to show it is using the following example:

    ODS CSV FILE='c:\taui\class.csv';
    PROC PRINT DATA = sashelp.class NOOBS;
    RUN;
    ODS CSV CLOSE;

With this usage it is possible to use the standard VAR and WHERE statements to limit the observations and variables being passed to the CSV file.

CSV files can be easily read into Excel or most other spreadsheet or database programs and is a useful format when transferring data from SAS to another application.


Checking a SAS Dataset Exists

Want to check if a particular SAS dataset exists? If you have SAS version 6.12 or above the following code fragment will be useful:

            exist = %SYSFUNC(EXIST(dataset_name));

where the variable EXIST will contain a value of 1 if the dataset is present or 0 if not.


Reordering Variables

Volumes have been written in the past on how to reorder variables in a SAS Dataset. There is the LENGTH, ATTRIB and retain statements inside a SAS dataset but I have found in the past that the best way is the use of the SQL procedure. Lets say you have a SAS dataset DEMOG with the variables AGE, GENDER, SUBJECTID, HEIGHT and WEIGHT, and you want the variable SUBJECTID first (the placement of the other variables is okay), the following code is useful:

            PROC SQL;
                CREATE TABLE demog AS
                    SELECT subjectid, *
                    FROM demog;
                QUIT;
            RUN;

The resulting dataset DEMOG will have the variables in the order of SUBJECTID, AGE, GENDER, HEIGHT and WEIGHT.


Additional Codes to an Existing Format

Your program has a variable that is coded as 1=YES and 2=NO with an associated format called YN. Now there is a new value of 3=DON'T KNOW in your data but the format YN has not been changed. In your reporting program you can create another format, in this example YNX, using the following code:

            proc format;
                value ynx
                    1='YES'
                    2='NO'
                    3="DON'T KNOW";
            run;

A better way though, avoiding transcription problems, is to make the new YNX format by nesting the old format YN into the new as shown in the following code:

            proc format;
                value ynx
                    3="DON'T KNOW"
                    other=[yn.];
            run;

Note that you must enclose the existing format name in square brackets, as shown above, or with parentheses and vertical bars, for example (|yn.|).


Concatenating Datasets

Concatenating datasets can be one of the most tricky activities in SAS as you have to know the structure and content of the dataset before this task is done. However, one trick that is very useful is to use SQL to do the concatenation therefore avoiding a number of the problems associated with similar tasks using either the DATA step or APPEND procedure. The following example shows the SQL code used:

            PROC SQL;
                CREATE TABLE outa AS
                    SELECT *, 'in1' AS dset
                    FROM in1
                            OUTER UNION CORR
                    SELECT *, 'in2' AS dset
                    FROM in2
                QUIT;
            RUN;

Deleting SAS Datasets based on a Date

Occasionally it is necessary to cleanup old datasets in a directory based on a date. The following SAS code is an example where datasets in the directory referenced by the LIBNAME OLDDATA that are older than 15MAR2007 are deleted:

    %let dslist=%str();
    proc sql;
        select memname into :dslist separated by ' '
          from sashelp.vtable
          where libname='OLDDATA' and datepart(crdate) < "15MAR2007"d;
        quit;
    run;
    proc datasets  library=OLDDATA nolist nodetails;
        delete &dslist;
        quit;
    run;

This code can be modified to any datasets in a specified directory as well as and date and/or time that the user may choose.


Reading Variable Length Record Files

Ever try to read in a file with variable length records? The following example may be useful and looks at the file INTEXT.TXT which as columns 1-10 as a FLAG for the record and columns 11 onwards as input text:

            DATA a;
              INFILE 'intext.txt' LENGTH=len;
              INPUT flag 1-10 @;
              varlen=len-10;
              INPUT @11 text $VARYING200. varlen;
            RUN;

Changing the Height of a HEADLINE in PROC REPORT when using ODS RTF

Sometimes when creating a report using ODS RTF in SASv8.2, the line under the columns using the HEADLINE option in the REPORT procedure is too thin to be displayed properly on a computer screen but yet can be seen on a printout. The simple reason is that most computer screens use a display width of 72 cells per inch while the basic of printers have the capability of 300 or more cells per inch. To alter the height of the headline line some RTF code has to be sent to the RTF file being created. The following illustrates the usage:

  %let hdrbrdr=%str(style(header)=[pretext="\brdrb\brdrs\brdrw30"]);
  %let hdropt =%str(style(header column)=[protectspecialchars=off]);

  proc report data=pop1 nowindows headline &hdropt split='\' missing;
      columns state pop landarea psqmile;
      define state    /group
                       order=internal
                       style=[cellwidth=80 just=left]
                       &hdrbrdr
                       'State';
      define pop      /display
                       style=[cellwidth=30 just=center]
                       &hdrbrdr
                       'Population\(2003)'
                       format=comma11.;
      define landarea /display
                       style=[cellwidth=30 just=center]
                       &hdrbrdr
                       'Land Area\(sq. miles)'
                       format=comma11.;
      define psqmile  /display
                       style=[cellwidth=30 just=center]
                       &hdrbrdr 'People per square mile'
                       format=comma11.1;
      quit;
  run;

In the example above it is the HRDBRDR macro variable that controls the width of the line, specifically the number at the end of the definition. Note that in this example the width is set to '30' twips but can be altered to any value.


How Quantiles are Calculated

The way a statistic is calculated may be more important than the result it produces. Recently an example showed up when some statistics were being checked using Excel on results produced with SAS, specifically with the calculation of a first and third quartile, also known as the 25th and 75th percentile.

For the data 1, 2, 3, 4, 5, 6, 7 and 8 the following results are calculated for the 25th percentile:

SAS Method 5 (default) = 2.5
SAS Method 4 = 2.25
Excel = 2.75

Why the difference? There actually is no standard for the calculation of percentile and it does depend on what a statistician is looking for. SAS has six methods of calculating the percentile, the two common ones being:

Method 5: y = (xj - xj+1)/2 if g=0 or y = xj+1 if g>0, where n*p=j+g
Method 4: y = (1-g)*xj + g*xj+1, where (n+1)*p=j+g and xn+1 is taken to be xn

Excel, S-Plus and StarOffice Calc by comparison uses a different method, specifically:

y = (1-g)*xj+1+g*xj+2 where (n-1)*p=j+g, and both xn+1 and xn+2 is taken to be xn

Among the major statistical software packages only Excel, S-Plus and StarOffice Calc use this method. For those using Minitab or SPSS they use the SAS Method 4 for their calculation.

The moral of this example is that you should know how your software calculates a statistic before blindly reporting the result.


Variance Calculation Differences

I received an email in early 2005 from a SAS programmer asking me if I could help explain why he was getting differences when comparing the variance results between SAS and Excel using the same set of numbers.

Traditionally the variance is calculated as

(a)

Alternately the equation can be written as

(b)

The first equation (a) requires two passes of the data - the first pass to calculate the mean and the second to calculate the variance.

This second equation (b) is more commonly known as the Desktop Calculator Formula as it is possible to calculate the variance with one pass of the data. There is one issue with this formula as if the numbers are big and the differences between the numbers are small an incorrect result will prevail. This is because computers and calculators store results as real numbers and some precision may be lost when storing the sum of the square values across a long list of numbers. As an example calculate the variance of 10,000,001, 10,000,003 and 10,000,005 - the answer is 4 but some calculators may produce an answer of 0.

Now what about SAS and Excel, what do they do. SAS uses the Traditional Formula. Microsoft used the Desktop Calculator Formula until Excel 2003 (for Windows) and Excel (for Mac) when it changed to use the Traditional Formula1. The Desktop Calculator Formula appears to have been used by Microsoft as it was faster than doing the Traditional Formula, the latter requiring two passes of the data.

Is there any calculations that will produce a result with one pass of data? The answer is yes. One formula is known as the Method of Provisional Means where

and

There are other methods that calculate the variance in one pass of the data that are found in some statistical textbooks.

The moral of the story is to always check with your software documentation to see how statistics are being calculated.

--------------
1 Reference Microsoft Knowledgebase, article 828888, dated March 10, 2005


Getting the Comment text of an Excel Cell

Not a SAS tip but one that is useful within Excel. The following VBA can be used to get the text of a comment behind a cell in an Excel Spreadsheet and place it in another specified cell:

  Function GComTxt(rCellComment As Range)
    On Error Resume Next
    GComTxt = WorksheetFunction.Clean(rCellComment.Comment.Text)
    On Error GoTo 0
  End Function

To enter the VBA code into Excel select

  Tools > Macro > Visual Basic Editor

then

  Insert > Module

Then paste the code above and then press

  File > Close and Return to Microsoft Excel

The function is now ready for use within Excel and can be accessed from the User Defined function list.


Getting the Background Color of a Cell in Excel

Again, not a SAS tip but one that is useful within Excel. The following VBA can used to get the background color a cell in an Excel Spreadsheet and place it in another specified cell:

            Function showColorCode(rcell)
                showColorCode = rcell.Interior.ColorIndex
            End Function

To enter the VBA code into Excel select

            Tools > Macro > Visual Basic Editor

then

            Insert > Module

Then paste the code above and then press

            File > Close and Return to Microsoft Excel

The function is now ready for use within Excel and can be accessed from the User Defined function list.

The codes returned from the function will show the background color of the cell. Microsoft unfortunately did not keep codes consistent across different releases of Excel. The following list is a general guide that is useful when referring to color codes for Excel 97:

  • -4142 = No Color
  • 1 = Black
  • 2 = White
  • 3 = Red
  • 5 = Blue
  • 6 = Yellow
  • 10 = Green

A DOS .BAT File for Backing Up a File with Date A Stamp

This is not a direct SAS tip but a DOS .BAT file that I find useful when writing programs and wanting to keep a copy of a file with a date/time stamp embedded in the file name, in a backup directory, directly under the directory I am working in. The code for this file is the following:

@ECHO OFF
if z%1 == z goto usage
If not exist bkup\Nul MD bkup
FOR /F "TOKENS=2-4 DELIMS=/ " %%F IN ('DATE /T') DO (SET TODAY=%%F-%%G-%%H)
FOR /F "TOKENS=1-2 DELIMS=: " %%I IN ('TIME /T') DO (SET NOW=%%I%%J)
copy %~n1%~x1 bkup\%~n1-%today%-%now%%~x1
goto end
:usage
echo. Usage is BKUP filename.
:end

The usage for this file, where the code is stored in the file BKUP.BAT, is

BKUP filename

An example of its usage is the file DEMO.SAS where after running the command

BKUP DEMO.SAS

the file would appear in the BKUP directory with the name DEMO-01-13-2004-1238.SAS where the "01-13-2004" is the date in local format from the DOS country setting and the "1238" is the time. This code will only work when your operating system is Windows NT, Windows 2000 or Windows XP - it will not work in any other environment. Note that if the directory BKUP is not present then it will be created.


Calculating the Distance Between Two Points on the Earth's Surface

Not actually SAS related but here is something that may be useful to someone.

Before GPS could tell us what the distance was between two points on the Earth's Surface there were many ways that were used to estimate that distance. One of the most simplist methods was the Great Circle Distance which was to treat the earth as a sphere and use the following formula:

d = 60 * ARCOS ( SIN(l1) * SIN(l2) + COS(l1) * COS(l2) * COS(g2 - g1))

where

l1 = latitude at the first point (degrees)
l2 = latitude at the second point (degrees)
g1 = longitude at the first point (degrees)
g2 = longitude at the second point (degrees)
d = computed distance (nautical miles)

This method of calculation does assume that 1 minute of arc is 1 nautical mile. To convert the distance from nautical miles to kilometers, multiply the result d by 1.852. Similarly to convert the distance from nautical miles to standard miles multiply the result by 1.150779. Note that if your calculator returns the ARCOS result as radians you will have to convert the radians to degrees before multiplying by 60, i.e. where degrees = (radians/PI)*180, where PI is approximately 3.141592654.


Updated January 11, 2011