David Franklin
Programmers Blog
Back dfranklinnh@gmail.com

The old TheProgrammersCabin.com site had a monthly Tip of the Month starting 2003. This has been reworked with a new vision, cleaned-up, those that were not looked at removed and edited into this blog. Please visit this from time as the blog is updated with new tips.

INDEX

POPULAR

SAS Cheat Sheet

DM Example, Pre ODS

AE and DM Example, ODS RTF

SAS RELATED

Number of Obs in a Dataset

Using PROC SUMMARY for Descriptive and Frequency Statistics

Last Date of Month

SAS to CSV

Does a Dataset Exist

Reordering Variables

Additional Codes to an Existing Format

Concatenating Datasets

Deleting SAS Datasets based on a Date

Reading Variable Length Record Files

Changing the Height of a HEADLINE in PROC REPORT when using ODS RTF

How Quantiles are Calculated

Variance Calculation Differences

WHERE, KEEP/DROP or RENAME -- Which comes first?

Making that Permanent SAS Dataset

Getting rid of non-printable characters

Creating directories within your SAS Program

Using PROC SQL Produce a Listing with Wrapped Lines

Kaplan Meier Survival Table

Creating a DATETIME from DATE and TIME Variables

The LABEL for a Variable

DO Loops

OTHER

Getting the Comment text of an Excel Cell

Getting the Background Color of a Cell in Excel

A DOS .BAT File for Backing Up a File with Date A Stamp

Calculating the Distance Between Two Points on the Earth's Surface

Keep a Notebook handy

SAS Cheat Sheet

All good programmers have a piece of paper with some of the most common things that they do, whether it be a function they cannot remember of a SAS procedure. The one available here is one that I have used for a while now -- it does not have everything but is a good start to just look up "DATA or COMPARE for the PROC COMPARE". Download here and I hope it helps.


Counting the Number of Observations in a Dataset

There are a number of ways that SAS code can be written to get the number of observations in a SAS dataset. My favorite and an oldie is:

  %macro numobs(dsn);
      %global num;
      data _null_;
      if 0 then set &dsn nobs=nobs;
      call symput('num',trim(left(put(nobs,8.))));
      stop;
    run;
  %mend numobs;

Another adaption of this counts not only the number of observations but also the number of variables in a dataset:

  %macro numobs(dsn);
    %global numobs numvars;
    data _null_;
      set &dsn (obs=1) NOBS=obscnt;
      array aa {*} $ _character_;
      array nn {*} _numeric_;
      vars=dim(aa)+dim(nn);
      call symput('numvars',vars);
      call symput('numobs',obscnt);
      stop;
    run;
  %mend numobs;

Version SAS 6.12 introduced the BASE SAS programmer to some of the SAS functions that were only available in SCL. One of these was the ATTRN function and this was able to be used to get the number of observations with the following code:

  %macro numobs(dsn);
    %global num;
    %let dsid=%sysfunc(open(&dsn));
    %if &dsid %then %do;
      %let num=%sysfunc(attrn(&dsid,nobs));
      %let rc=%sysfunc(close(&dsid));
    %end;
    %else %put Opening dataset &dsn failed - %sysfunc(sysmsg());
  %mend numobs;

There is another way that it is possible to do this and that is with PROC SQL as the following code demonstrates:

  %macro numobs(dsn);
    proc sql noprint;
      select count(*) into :num from &dsn;
      quit;
    run;
  %mend numobs;

In all cases calling the macro counts the number of observations inside the dataset specified by the macro variable DSN and puts the number found in the global macro variable NUM. Which one is best is down to personal preference but remember that the last two can only be used where you have access to SAS version 6.12 or above.


Using PROC SUMMARY for Descriptive and Frequency Statistics

To many, the PROC MEANS and PROC SUMMARY SAS procedures are the same. There are however two differences.

The first difference is that the SUMMARY procedure does have as default to print no output to an output file while the MEANS procedure does by default. The option that controls this is PRINT/NOPRINT so it is possible to print the output from the SUMMARY procedure and have no output from the MEANS procedure.

The second difference is not widely known but it is a useful. When the VAR statement is missing in the MEANS procedure analysis is carried out on all numeric variables, as shown in the following example (output below):

  data vitals;
    infile cards;
    input patid $3. heart_rate temperature trtcd $1.;
  cards;
  001 72 35.8 A
  002 80 36.4 B
  003 99 36.6 A
  ;
  run;

  proc means data=vitals nway;
    class trtcd;
  run;

                                   The SAS System
                                 The MEANS Procedure

           N
  trtcd  Obs  Variable     N        Mean     Std Dev     Minimum     Maximum
  --------------------------------------------------------------------------
  A        2  heart_rate   2  85.5000000  19.0918831  72.0000000  99.0000000
              temperature  2  36.2000000   0.5656854  35.8000000  36.6000000

  B        1  heart_rate   1  80.0000000           .  80.0000000  80.0000000
              temperature  1  36.4000000           .  36.4000000  36.4000000
  --------------------------------------------------------------------------

However, what happens when the SUMMARY procedure is used instead on the same data?

  proc summary data=vitals nway print;
    class trtcd;
  run;

        The SAS System
     The SUMMARY Procedure

                  N
        trtcd    Obs
        ------------
        A          2
        B          1
        ------------

Notice that the only result that came out from the SUMMARY procedure was the number of observations from each treatment group, similar to what the FREQ procedure will produce. The same result will prevail if the only variables in the VITALS dataset were PATID and TRTCD, that is only the character variables.

So what does this finding mean? In most cases SAS programmers use the FREQ procedure for frequency counts and the MEANS procedure for summary statistics however these two sets of statistics most commonly carried out can be done by one procedure. A possible macro for calculating both from one procedure is given below:

  %macro summstat(dsin=,
                    /*Input file -REQUIRED*/
                  dsout=,
                    /*Results filst -REQUIRED*/
                  classvar=,
                    /*Class Variable(s) -REQUIRED*/
                  vvars=,
                    /*Analysis Variables, only if
                      descriptive statistics
                      requested. Can use one or
                      numeric variable names or
                      _NUMERIC_ for all numeric
                      variables.  AUTONAME option will
                      set variable names in output
                      file*/
                  outprt=NOPRINT
                    /*Output to listings file.
                      Values: PRINT|NOPRINT*/
                 );
    proc summary data=&dsin nway &outprt;
      class &classvar;
      %if (&vvars ne ) %then %do;
        var &vvars;
        output out=&dsout (drop=_type_ _freq_)
               n= mean= std= median= min= max=
               /autoname;
      %end;
      %else %do;
        output out=&dsout (drop=_type_ rename=(_freq_=N));
      %end;
    run;
  %mend summstat;

Finding the Last date of the Month

Sometimes it is necessary to find the last date of a month. Remembering that the last date for a month is not the same across all months of the year the following code will help:

            Last_Day_of_Month=INTNX('MONTH',SAS_Date,0,'END');

Note the use of the INTNX function which is useful for time interval calculations.


Creating a CSV file of a SAS Dataset

Passing a SAS dataset to Excel? Transferring data from one application to another is always a tricky issue. One of the usual ways this is done is converting the data into a common format that both applications can read and write - CSV is one such established format.

There are a number of ways that SAS will create the CSV file. The first is using the PROC EXPORT procedure and using the CSV as the filename extension, as the following example shows:

    PROC EXPORT DATA=sashelp.class
        OUTFILE="c:\taui\class.csv";
    RUN;

Before the EXPORT procedure came into SAS, the data step was used to generate the CSV file, as the following example shows:

    DATA _NULL_;
        FILE "c:\tuai\class.csv";
        SET sashelp.class;
        PUT (_all_) (',');
    RUN;

Also available is the DEXPORT command that is run inside the command line of the SAS Desktop, which is shown in the following example:

    DEXPORT sashelp.class "c:\taui\class.csv"

The last method that will be looked at here is using a form of SAS ODS that was introduced in SAS 8.2 - the best way to show it is using the following example:

    ODS CSV FILE='c:\taui\class.csv';
    PROC PRINT DATA = sashelp.class NOOBS;
    RUN;
    ODS CSV CLOSE;

With this usage it is possible to use the standard VAR and WHERE statements to limit the observations and variables being passed to the CSV file.

CSV files can be easily read into Excel or most other spreadsheet or database programs and is a useful format when transferring data from SAS to another application.


Checking a SAS Dataset Exists

Want to check if a particular SAS dataset exists? If you have SAS version 6.12 or above the following code fragment will be useful:

  exist = %SYSFUNC(EXIST(dataset_name));

where the variable EXIST will contain a value of 1 if the dataset is present or 0 if not.


Reordering Variables

Volumes have been written in the past on how to reorder variables in a SAS Dataset. There is the LENGTH, ATTRIB and retain statements inside a SAS dataset but I have found in the past that the best way is the use of the SQL procedure. Lets say you have a SAS dataset DEMOG with the variables AGE, GENDER, SUBJECTID, HEIGHT and WEIGHT, and you want the variable SUBJECTID first (the placement of the other variables is okay), the following code is useful:

  PROC SQL;
    CREATE TABLE demog AS
      SELECT subjectid, *
      FROM demog;
    QUIT;
  RUN;

The resulting dataset DEMOG will have the variables in the order of SUBJECTID, AGE, GENDER, HEIGHT and WEIGHT.


Additional Codes to an Existing Format

Your program has a variable that is coded as 1=YES and 2=NO with an associated format called YN. Now there is a new value of 3=DON'T KNOW in your data but the format YN has not been changed. In your reporting program you can create another format, in this example YNX, using the following code:

  proc format;
    value ynx
      1='YES'
      2='NO'
      3="DON'T KNOW";
  run;

A better way though, avoiding transcription problems, is to make the new YNX format by nesting the old format YN into the new as shown in the following code:

  proc format;
    value ynx
      3="DON'T KNOW"
      other=[yn.];
  run;

Note that you must enclose the existing format name in square brackets, as shown above, or with parentheses and vertical bars, for example (|yn.|).


Concatenating Datasets

Concatenating datasets can be one of the most tricky activities in SAS as you have to know the structure and content of the dataset before this task is done. However, one trick that is very useful is to use SQL to do the concatenation therefore avoiding a number of the problems associated with similar tasks using either the DATA step or APPEND procedure. The following example shows the SQL code used:

  PROC SQL;
    CREATE TABLE outa AS
      SELECT *, 'in1' AS dset FROM in1
             OUTER UNION CORR
      SELECT *, 'in2' AS dset FROM in2
    QUIT;
  RUN;

Deleting SAS Datasets based on a Date

Occasionally it is necessary to cleanup old datasets in a directory based on a date. The following SAS code is an example where datasets in the directory referenced by the LIBNAME OLDDATA that are older than 15MAR2007 are deleted:

  %let dslist=%str();
  proc sql;
    select memname into :dslist separated by ' '
      from sashelp.vtable
      where libname='OLDDATA' and datepart(crdate) < "15MAR2007"d;
    quit;
  run;
  proc datasets  library=OLDDATA nolist nodetails;
    delete &dslist;
    quit;
  run;

This code can be modified to any datasets in a specified directory as well as and date and/or time that the user may choose.


Reading Variable Length Record Files

Ever try to read in a file with variable length records? The following example may be useful and looks at the file INTEXT.TXT which as columns 1-10 as a FLAG for the record and columns 11 onwards as input text:

  DATA a;
    INFILE 'intext.txt' LENGTH=len;
    INPUT flag 1-10 @;
    varlen=len-10;
    INPUT @11 text $VARYING200. varlen;
  RUN;

Changing the Height of a HEADLINE in PROC REPORT when using ODS RTF

Sometimes when creating a report using ODS RTF in SASv8.2, the line under the columns using the HEADLINE option in the REPORT procedure is too thin to be displayed properly on a computer screen but yet can be seen on a printout. The simple reason is that most computer screens use a display width of 72 cells per inch while the basic of printers have the capability of 300 or more cells per inch. To alter the height of the headline line some RTF code has to be sent to the RTF file being created. The following illustrates the usage:

  %let hdrbrdr=%str(style(header)=[pretext="\brdrb\brdrs\brdrw30"]);
  %let hdropt =%str(style(header column)=[protectspecialchars=off]);

  proc report data=pop1 nowindows headline &hdropt split='\' missing;
    columns state pop landarea psqmile;
    define state    /group
                     order=internal
                     style=[cellwidth=80 just=left]
                     &hdrbrdr
                     'State';
    define pop      /display
                     style=[cellwidth=30 just=center]
                     &hdrbrdr
                     'Population\(2003)'
                     format=comma11.;
    define landarea /display
                     style=[cellwidth=30 just=center]
                     &hdrbrdr
                     'Land Area\(sq. miles)'
                     format=comma11.;
    define psqmile  /display
                     style=[cellwidth=30 just=center]
                     &hdrbrdr 'People per square mile'
                     format=comma11.1;
    quit;
  run;

In the example above it is the HRDBRDR macro variable that controls the width of the line, specifically the number at the end of the definition. Note that in this example the width is set to '30' twips but can be altered to any value.


How Quantiles are Calculated

The way a statistic is calculated may be more important than the result it produces. Recently an example showed up when some statistics were being checked using Excel on results produced with SAS, specifically with the calculation of a first and third quartile, also known as the 25th and 75th percentile.

For the data 1, 2, 3, 4, 5, 6, 7 and 8 the following results are calculated for the 25th percentile:

SAS Method 5 (default) = 2.5
SAS Method 4 = 2.25
Excel = 2.75

Why the difference? There actually is no standard for the calculation of percentile and it does depend on what a statistician is looking for. SAS has six methods of calculating the percentile, the two common ones being:

Method 5: y = (xj - xj+1)/2 if g=0 or y = xj+1 if g>0, where n*p=j+g
Method 4: y = (1-g)*xj + g*xj+1, where (n+1)*p=j+g and xn+1 is taken to be xn

Excel, S-Plus and StarOffice Calc by comparison uses a different method, specifically:

y = (1-g)*xj+1+g*xj+2 where (n-1)*p=j+g, and both xn+1 and xn+2 is taken to be xn

Among the major statistical software packages only Excel, S-Plus and StarOffice Calc use this method. For those using Minitab or SPSS they use the SAS Method 4 for their calculation.

The moral of this example is that you should know how your software calculates a statistic before blindly reporting the result.


Variance Calculation Differences

I received an email in early 2005 from a SAS programmer asking me if I could help explain why he was getting differences when comparing the variance results between SAS and Excel using the same set of numbers.

Traditionally the variance is calculated as

(a)

Alternately the equation can be written as

(b)

The first equation (a) requires two passes of the data - the first pass to calculate the mean and the second to calculate the variance.

This second equation (b) is more commonly known as the Desktop Calculator Formula as it is possible to calculate the variance with one pass of the data. There is one issue with this formula as if the numbers are big and the differences between the numbers are small an incorrect result will prevail. This is because computers and calculators store results as real numbers and some precision may be lost when storing the sum of the square values across a long list of numbers. As an example calculate the variance of 10,000,001, 10,000,003 and 10,000,005 - the answer is 4 but some calculators may produce an answer of 0.

Now what about SAS and Excel, what do they do. SAS uses the Traditional Formula. Microsoft used the Desktop Calculator Formula until Excel 2003 (for Windows) and Excel (for Mac) when it changed to use the Traditional Formula1. The Desktop Calculator Formula appears to have been used by Microsoft as it was faster than doing the Traditional Formula, the latter requiring two passes of the data.

Is there any calculations that will produce a result with one pass of data? The answer is yes. One formula is known as the Method of Provisional Means where

and

There are other methods that calculate the variance in one pass of the data that are found in some statistical textbooks.

The moral of the story is to always check with your software documentation to see how statistics are being calculated.

--------------
1 Reference Microsoft Knowledgebase, article 828888, dated March 10, 2005


Getting the Comment text of an Excel Cell

Not a SAS tip but one that is useful within Excel. The following VBA can be used to get the text of a comment behind a cell in an Excel Spreadsheet and place it in another specified cell:

  Function GComTxt(rCellComment As Range)
    On Error Resume Next
    GComTxt = WorksheetFunction.Clean(rCellComment.Comment.Text)
    On Error GoTo 0
  End Function

To enter the VBA code into Excel select

  Tools > Macro > Visual Basic Editor

then

  Insert > Module

Then paste the code above and then press

  File > Close and Return to Microsoft Excel

The function is now ready for use within Excel and can be accessed from the User Defined function list.


Getting the Background Color of a Cell in Excel

Again, not a SAS tip but one that is useful within Excel. The following VBA can used to get the background color a cell in an Excel Spreadsheet and place it in another specified cell:

            Function showColorCode(rcell)
                showColorCode = rcell.Interior.ColorIndex
            End Function

To enter the VBA code into Excel select

            Tools > Macro > Visual Basic Editor

then

            Insert > Module

Then paste the code above and then press

            File > Close and Return to Microsoft Excel

The function is now ready for use within Excel and can be accessed from the User Defined function list.

The codes returned from the function will show the background color of the cell. Microsoft unfortunately did not keep codes consistent across different releases of Excel. The following list is a general guide that is useful when referring to color codes for Excel 97:

  • -4142 = No Color
  • 1 = Black
  • 2 = White
  • 3 = Red
  • 5 = Blue
  • 6 = Yellow
  • 10 = Green

A DOS .BAT File for Backing Up a File with Date A Stamp

This is not a direct SAS tip but a DOS .BAT file that I find useful when writing programs and wanting to keep a copy of a file with a date/time stamp embedded in the file name, in a backup directory, directly under the directory I am working in. The code for this file is the following:

  @ECHO OFF
  if z%1 == z goto usage
  If not exist bkup\Nul MD bkup
  FOR /F "TOKENS=2-4 DELIMS=/ " %%F IN ('DATE /T') DO (SET TODAY=%%F-%%G-%%H)
  FOR /F "TOKENS=1-2 DELIMS=: " %%I IN ('TIME /T') DO (SET NOW=%%I%%J)
  copy %~n1%~x1 bkup\%~n1-%today%-%now%%~x1
  goto end
  :usage
  echo. Usage is BKUP filename.
  :end

The usage for this file, where the code is stored in the file BKUP.BAT, is

BKUP filename

An example of its usage is the file DEMO.SAS where after running the command

BKUP DEMO.SAS

the file would appear in the BKUP directory with the name DEMO-01-13-2004-1238.SAS where the "01-13-2004" is the date in local format from the DOS country setting and the "1238" is the time. This code will only work when your operating system is Windows NT, Windows 2000 or Windows XP - it will not work in any other environment. Note that if the directory BKUP is not present then it will be created.


Calculating the Distance Between Two Points on the Earth's Surface

Not actually SAS related but here is something that may be useful to someone.

Before GPS could tell us what the distance was between two points on the Earth's Surface there were many ways that were used to estimate that distance. One of the most simplist methods was the Great Circle Distance which was to treat the earth as a sphere and use the following formula:

d = 60 * ARCOS ( SIN(l1) * SIN(l2) + COS(l1) * COS(l2) * COS(g2 - g1))

where

l1 = latitude at the first point (degrees)
l2 = latitude at the second point (degrees)
g1 = longitude at the first point (degrees)
g2 = longitude at the second point (degrees)
d = computed distance (nautical miles)

This method of calculation does assume that 1 minute of arc is 1 nautical mile. To convert the distance from nautical miles to kilometers, multiply the result d by 1.852. Similarly to convert the distance from nautical miles to standard miles multiply the result by 1.150779. Note that if your calculator returns the ARCOS result as radians you will have to convert the radians to degrees before multiplying by 60, i.e. where degrees = (radians/PI)*180, where PI is approximately 3.141592654.


Keep a Notebook Handy

When a programmer is starting out the one piece of advice I give is have a notebook handy. When you are coding, there will be pieces of code that you will see or use and think "that is interesting". Don't be afraid to copy, or print and paste, that code into a notebook and keep it handy -- you don't know when you are going to need it. You will be amazed at the number of times you will refer to it for that 'little trick'.


WHERE, KEEP/DROP or RENAME -- Which comes first?

In the datastep which is first -- WHERE, KEEP/DROP or RENAME? Think alphabetical. First is the DROP/KEEP, the RENAME then the WHERE. So you could see a piece of code that looks like:

  data final;
    set sashelp.class;
    keep name height weight;
    rename height=htin weight=wtlb;
    where htin ge 65 and wtlb ge 50;
  run;

which would create a dataset FINAL with the variables NAME, HTIN and WTLB, and subset where HTIN ge 65 and WTLB ge 50.


Making that Permanent SAS Dataset

Before a dataset is sent to a permanent area we often want to rename, drop or modify variables, including the order. Unfortunately SAS PROC DATASETS and SAS PROC SQL; ALTER statements can only rename or modify a dataset variable, but cannot drop it, or conversly keep a set of varibles. We also want to make certain the varabies are in a particular order.

You can certainly do something like:

  data final;
    attrib name length=$40 label='Name'
           htin length=8 label='Height (in)'
           wtlb length=8 label='Weight (lb)';
    set sashelp.class (rename=(height=htin weight=wtlb));
    keep name htin wtlb;
  run;

This certainly works BUT it means typing some variable names in multiple times. What about PROC DATASETS?

  proc datasets library=work nolist nodetails nowarn;
    append base=final data=sashelp.class (keep=name height weight);
    modify final;
      rename height=htin weight=wtlb;
      label htin='Height (in)' wtlb='Weight (lb)';
    quit;
  run;

The PROC DATASETS solution does not allow us to reposition a variable in a SAS dataset as does the datastep, but does have fewer cases where we have to write the dataset variable multiple times. But for efficiency, how can we only write a variable once? Enter stage left, PROC SQL, as the following example shows:

  proc sql;
    create table final as
      select name,
             height as htin label='Height (in)',
             weight as wtlb label='Weight (lb)'
      from sashelp.class;
    quit;
  run;

As noted in the PROC SQL we only mentioned variables once, not multiple times as in the DATASTEP example.

You do not need to be a SQL expert to use SAS, infact in some cases it hinders things by those who do not know SQL past a basic level, supporting the code. However, a good SAS programmer does need to have some knowledge of what proc SQL can do to do their work efficiently.


Getting rid of non-printable characters

When data comes there are sometimes characters that are in the data which cannot be printed, e.g. carriage returns or tab characters. How do we fix this?

There are two common ways to do this.

First is use the COMPRESS funtion with the 'kw ' modifier, e.g.

 
  data class;
    set sashelp.class;
    name=compress(name,'kw');
  run;

Second is a little more complicated, but it is more controllable, and uses the RANK funtion. Most systems and applications will handle the ASCII characters 32 through 126 inclusive, which is the standard set of LATIN1 characters -- it is possible to adapt this for your particular situation and characters. The code in this case would be something like:

  data class;
    set class;
    do _k=1 to lengthn(name); *Look at every character in string variable NAME;
      if not(32 le substr(rank(name),_k,1) ge 126) then substr(name,_k,1)='$';
    end;
    name=compress(name,'$'); *Remove 'bad' characters;
    drop _k;
  run;

In this data we took a 'bad' character and replaced it with something to later remove, in this case the '$' symbol as this is generally not in normal text in clinical trials.

To extend the last code to do all character variables in a dataset the use of an array of all character varabies is crtical, as the following code uses;

  data class;
    set class;
    array zaz {*} _character_;
    do _i=1 to dim(zaz);
      do _k=1 to lengthn(zaz(Ii)); *Look at every character in string variable NAME;
        if not(32 le substr(rank(zaz(_i)),_k,1) ge 126) then substr(zaz(_i),_k,1)='$';
      end;
      zaz(_i)=compress(zaz(_i),'$'); *Remove 'bad' characters;
    end;
    drop _i _k;
  run;

Creating directories within your SAS Program

It is possible to create a directory within a SAS program on some operating systems using the DLCREATEDIR option, available since SAS 9.3. Why is this useful? It is easy to create a standard set of directories easily, as shown in the following example:

  options DLCREATEDIR;
  %let basedir=%str(c:\tuai);
  libname rawdata "&basedir.\data\rawdata";
  libname sdtm "&basedir.\data\sdtm";
  libname adam "&basedir.\data\adam";
  libname psdtm "&basedir.\programs\sdtm";
  libname padam "&basedir.\programs\adam";
  libname listings "&basedir.\programs\listings";
  libname tables "&basedir.\programs\tables";
  libname graphs "&basedir.\programs\graphs";
  libname pmacros "&basedir.\programs\macros";
  run;

The program above will create the directories:

  c:\taui
  c:\taui\data\rawdata
  c:\taui\data\sdtm
  c:\taui\data\adam
  c:\taui\programs\sdtm
  c:\taui\programs\adam
  c:\taui\programs\listings
  c:\taui\programs\tables
  c:\taui\programs\graphs
  c:\taui\programs\macros

It is possible to use concatenated to libnames to also do this, as shown in the following example, to create the data and sub-directories:

  options DLCREATEDIR;
  %let basedir=%str(c:\tuai);
  libname data ("&basedir.\data" "&basedir.\data\rawdata" 
                "&basedir.\data\sdtm" "&basedir.\data\adam");

If you get a note indicating that the option DLCREATEDIR is restricted at your site, this means that your administrator has restricted you from creating directories -- in this case you have to talk to your IT department.


Using PROC SQL Produce a Listing with Wrapped Lines

PROC SQL has a little secret -- it can wrap long lines of text onto multiple lines of output using a FLOW option, as seen below:

  options ls=64 nonumber nodate;
  title1 "Ford Sports Cars (USA)";
  proc sql;
    reset flow;
    select strip(model) label='Model', msrp,
           enginesize, mpg_city, mpg_highway
      from sashelp.cars
      where make='Ford' and type='Sports';
    reset noflow;
    quit;
  run;

will produce the output

  Ford Sports Cars (USA)

                                   Engine     MPG        MPG
  Model                    MSRP  Size (L)  (City)  (Highway)
  ----------------------------------------------------------

  Mustang 2dr           $18,345.      3.8.     20         29
  (convertible)
  Mustang GT Premium    $29,380       4.6      17         25
  convertible 2dr
  Thunderbird Deluxe    $37,530.      3.9      17         24
  convert w/hardtop 2d

Note that the Model column has been wrapped for when there are multiple lines - the FLOW option using the RESET statement switched this option on. Also note the line under the column labels.


Kaplan Meier Survival Table

A programmer recently asked me to help out on a survival table that I thought I would share. The request was for an analysis of the number of Patients at Risk, with events, censored, Kaplan Meier estimate and Median Survival Time, by visit intervals. See below for the template (this is not the actual intervals in the request but is used for display purposes):

                                                               Median
                            Patients                          Survival
                  Patients    with    Patients  KM estimate  Time (days)
  Visit Interval  at risk    events   censored  % [95% CI]   % [95% CI]

  0 to <4 days       xx        xx        xx     xx [xx,xx]   xx [xx,xx]
  4 to < 8 days      xx        xx        xx     xx [xx,xx]   xx [xx,xx]
  8 to < 12 days     xx        xx        xx     xx [xx,xx]   xx [xx,xx]
  12 to <16 days     xx        xx        xx     xx [xx,xx]   xx [xx,xx]

In R this is very easy to do as one of the functions actually produces this output, but not SAS where you have to do some work. Just as an aside, in the past five years or so I am seeing more specs have a similar output to that which R can produce by default. For purposes of this article here is the data that is being used:

  data _indat;
    infile cards;
    input patid t status;
  cards;
  1 1 1
  2 2 0
  3 3 0
  4 4 1
  5 5 0
  6 10 1
  7 12 0
  ;
  run;

The T variable is the number of days, while the STATUS variable gives the status of the at the end of the study with 1=an event, 0=ended study without event (i.e. censored).

The first part of the table deals with the number at risk during each interval, and this is calculated using the following code:

  ** Calculate At Risk numbers at start of each time peiod;
  proc lifetest data=_indat atrisk
                timelist=0 4 8 12 16;
    time T * Status(0);
    ods output productlimitestimates=pi (keep=timelist numberatrisk);
  run;

Note use of the ATRISK and TIMELIST options which tell SAS we want the number at risk at those timepoints. We will get the actual variables needed.

Next is the number of failed and censored:

  ** Calculate survival results for
     days 0, 4, 8, 12 and 16;
  proc lifetest data=_indat intervals=0 to 16 by 4 
                method=lt maxtime=12
                outsurv=si (keep=T survival sdf_lcl sdf_ucl);
    time T * Status(0);
    ods output lifetableestimates=ki (keep=uppertime failed censored);
  run;

Two datasets are created here -- dataset KI has thenumber that failed (had event) and were censored, while SI has the KM estimate with 95% CI.

The of the LIFETEST calls brings in the median survival time:

  ** Median Survival;
  proc lifetest data=sashelp.bmt;
    time T * Status(0); where group='ALL';
    ods output quartiles=qi (keep=percent estimate lowerlimit upperlimit);
  run;

Bringing these datasets together now is the tricky part. The time (variable TIMELIST) in PI refers to the time at the start of the interval, which the time variables in datasets KI and SI refer to the period immediately before the end of the time in question - we actually have to realign them to the start of the interval, which can be easily done by:

  data ki0;
    set ki;
    timelist=uppertime-4;
  data si0;
    set si;
    timelist=t-4;
  run;

The next task is to get the median survival data and align it to the first interval starting time, i.e. TIMELIST=0 as in the PI dataset:

  data qi0;
    set qi (where=(percent=50));
    timelist=0;
  run;
p>Now we bring this altogether:

  proc sort data=pi;
    by timelist;
  proc sort data=ki0;
    by timelist;
  proc sort data=si0;
    by timelist;
  proc sort data=qi0;
    by timelist;
  data all0;
    merge pi ki0 si0 qi0;
    by timelist;
    if timelist in(0,4,8,12);
  run;

What we have now is the single dataset needed, with a little more manipulation, the results we requested. Note that we have restricted the dataset to deal with only the time intervals we are dealing with - this will remove a few rows that have been created when we realigned the time intervals and merged the data.

The variables relating to Patients at risk, Patients with events, and Patients censored, all refer to numbers, but the last two have results from two or more variables that need to be set:

  data all1;
    length _kmestimate _median $60;
    set all0;
    _kmestimate=put(survival,4.2)||' ['||put(sdf_lcl,4.2)||
                ','||put(sdf_ucl,4.2)||']';
    if timelist=0 then
     _median=put(estimate,4.1)|| '['||put(lowerlimit,4.1)|| ','||
             put(upperlimit,4.1)||']';
  run;

Now we have our dataset it is just time to add a format for our first column:

  proc format;
    value timef 0='0 to <4 days' 4='4 to <8 days'
                8='8 to <12 days' 12='12 to <16 days';
  run;

and then output our table, in this case just using PROC PRINT although you would generally use PROC REPORT:

  proc print data=all1 noobs label split='|';
    var timelist numberatrisk failed censored _kmestimate _median;
    format timelist timef.;
    label timelist='Visit Interval'
          numberatrisk='Patients|at|risk'
          failed='Patients|with|events'
          censored='Patients|censored'
          _kmestimate='KM estimate|%|[95% CI]'
          _median='Median Survival|Time|(days) [95% CI]';
  run;

getting

                                                                       Median
                            Patients                                  Survival
                  Patients    with    Patients     KM estimate       Time (days)
  Visit Interval  at risk    events   censored     % [95% CI]        % [95% CI]

  0 to <4 days        7         1         2     0.83 [0.27,0.97]   10.0 [ 1.0, . ]
  4 to < 8 days       4         1         1     0.60 [0.11,0.89]
  8 to < 12 days      2         1         0     0.30 [0.01,0.72]
  12 to <16 days      1         0         1     0.30 [0.01,0.72]

As you can see this is not an easy table to produce, but the code given here is a basis for it. Try it.


Creating a DATETIME from DATE and TIME Variables

There are many ways to combine a date variable and a time variable into a datetime variable but the easist way is using the DHMS funtion using the following:

  datetime = DHMS ( datevar, 0 , 0 , timevar );

While this does seem to be strange at first it is important to note that a time variable is the number of seconds from mid-night, hence using the time variable value in the seconds parameter.


The LABEL for a Variable

There are many parts to a variable, key among them are the name of the variable (NAME), type (TYPE), and length (LENGTH). Under SAS version 8 and above the name of a variable can be 32 characters long, its type can be either Character or Numeric, and the length for a character variable can be set between 1 and 32767 characters while a numeric variable can be set between 3 and 8 (yes, some operating systems allow for 2 but not all).

There are other parts to a variable, but the three most common are the informat, format and label - we shall look at this last part in this post.

What exactly is a label and why is it used? In simple terms the label is a short descriptive text giving a more user friendly description of the variable. Take for example a very simple example, the variable WEIGHT in a dataset - sure it tells me what the variable is (it is weight) but it does not tell me other things I may need to know like what unit it is in(inches) or when the weight data point was collected, all of this being very important information. Given that I am allowed 32 characters for a variable name I could change the variable name to reflect this information, but in most cases it is not possible to do this and it indeed impractical. So here comes the use of the LABEL statement.

The label itself can be 256 characters, including blanks, in length (40 characters if you are still using SAS version 6.xx) - this gives us plenty of room to write a good description of the variable. Setting the label is commonly achieved by using the LABEL statement inside a datastep or the LABEL option under the MODIFY statement in the DATASETS procedure, the syntax of which is given below:

  data class; *** Inside a datastep;
    set class;
    label weight='Weight (kg) at Start of Study';
  run;
  proc datasets library=work; *** DATASETS procedure;
    modify class;
      label weight='Weight (kg) at Start of Study';
    quit;
  run;

I personally use the datastep method if I am creating or modifying a variable inside that datastep, otherwise I use the DATASETS procedure, but is only my convention - there is no set rule with this.

Before going on to a real world example, lets first see how we would look at what labels are set, if any, inside a dataset. The two easiest ways to look at the data is to either run a CONTENTS procedure call on the dataset, or if you have the SAS Viewer installed, look at the attributes window for that dataset.

Now lets look at a real world example where it will all come clearer. For this example I will use the dataset SASHELP.CLASS. Lets look first at the structure of the dataset using the CONTENTS procedure (will actually make a copy first so I don't overwrite the original data):

  data class; *** Make a copy of the dataset;
    set sashelp.class;
  run;
  proc contents data=class;
   *** Get structure of the dataset;
  run;

Running this code we get the following output (abridged):

  Alphabetic List of Variables and Attributes

  #   Variable   Type   Len 

  3   Age        Num     8
  4   Height     Num     8
  1   Name       Char    8 
  2   Sex        Char    1
  5   Weight.    Num     8

If a label existed for a variable we would see a column headed "Label", but in this case there is no labels applied to the dataset. So now lets set one for WEIGHT as we indicated above (this time I shall do the content structure not from the CONTENTS procedure, but the CONTENTS statement inside the DATASETS procedure):

  proc datasets library=work;
    modify class;
      label weight='Weight (kg) at Start of Study';
    contents data=class;
    quit;
  run;

Running this code we get the following output (abridged):

  Alphabetic List of Variables and Attributes
 
  #   Variable   Type   Len   Label
  3   Age        Num    8
  4   Height     Num    8
  1   Name       Char   8
  2   Sex        Char   1
  5   Weight     Num    8     Weight (kg) at Start of Study

As you can see the variable WEIGHT now has a label - the reason why it is useful will become clear shortly. For the purposes of this example I will also now add a label to the variable AGE and HEIGHT using the DATASETS procedure as above:

  proc datasets library=work;
    modify class;
      label height='Height (in) at Start of Study'
            age='Age (years) at Start of Study';
    contents data=class;
    quit;
  run;

Note that I did not redo the label for the variable WEIGHT in the above code as it was already done previously, although I could have put in the step and even replaced it with new text.

Now why is the label useful? As you can see already it gives a useful description of what the variable is. Now lets extend that to a small report using the PRINT procedure:

  proc print data=class noobs;
  run;

that produces the following output (abridged):

  Name       Sex    Age    Height    Weight
   
  Alfred      M      14     69.0      112.5
  Alice       F      13     56.5       84.0
  Barbara     F      13     65.3       98.0
  Carol       F      14     62.8      102.5
  Henry       M      14     63.5      102.5

The column headers are the same as the variable names, and we have the same problem of not knowing what the units are for Age, Height and Weight (but did we not put that in a label earlier) but lets add another option to get the labels:

  proc print data=class noobs LABEL;
  run;

that produces the following output (abridged):

                 Age (years)  Height (in)   Weight (kg)
                  at Start     at Start      at Start
  Name     Sex    of Study     of Study      of Study
 
  Alfred     M         14         69.0          112.5
  Alice      F         13         56.5           84.0
  Barbara    F         13         65.3           98.0
  Carol      F         14         62.8          102.5
  Henry      M         14         63.5          102.5

Now, as you can see, we have a report that gives a clear description of the variables though the use of dataset variable labels. There will be those reading this who will say that I could have put the label in the PRINT procedure call (yes, the LABEL statement is a global statement that can be used in almost any procedure) and would be done this way using the following code:

  proc print data=class noobs LABEL;
    label weight='Weight (kg) at Start of Study'
          height='Height (in) at Start of Study'
          age='Age (years) at Start of Study';
  run;

But the one reason I don't normally use this method is that the label does not carry forward in the dataset. Now see a complete example where the CLASS data is copied from the SASHELP directory, useful labels put on the dataset, then doing a PRINT, MEANS and TABULATE call on the same data, all with the same labels:

  proc datasets library=work;
    copy in=sashelp out=work;
      select class;
    modify class;
      label weight='Weight (kg) at Start of Study'
            height='Height (in) at Start of Study'
            age='Age (years) at Start of Study';
     quit;
   run;
   proc print data=class noobs LABEL;
   run;
   proc means data=class;
     var age height weight;
   run;
   proc tabulate data=class;
     class sex;
     var age height weight;
     tables age*
              (n*f=8.0 mean*f=8.2 std*f=8.3
               median*f=8.2 min*f=8.0 max*f=8.0)
              (height weight)*
               (n*f=8.0 mean*f=8.3 std*f=8.4
                median*f=8.3 min*f=8.1 max*f=8.1),
            sex all='Total';
   run;

produces the following output (abridged):

                The PRINT Procedure

                    Age (years)    Height (in) Weight (kg)
                      at Start       at Start   at Start
   Name      Sex      of Study       of Study   of Study

   Alfred     M          14            69.0          112.5
   Alice      F          13            56.5           84.0
   Barbara    F          13            65.3           98.0


                  The MEANS Procedure

   Var    Label                         N Mean STD Min Max      
   Age    Age (years) at Start of Study 19  13  1   11  16
   Height Height (in) at Start of Study 19  62  5   51  72
   Weight Weight (kg) at Start of Study 19 100 23   51 150


                The TABULATE Procedure

   ----------------------------------------------------
   |                       |       Sex       |        |
   |                       |-----------------|        |
   |                       |   F    |   M    | Total  |
   |-----------------------+--------+--------+--------|
   |Age (years)|N          |       9|      10|      19|
   |at Start of|-----------+--------+--------+--------|
   |Study      |Mean       |   13.22|   13.40|   13.32|
   |           |-----------+--------+--------+--------|
   |           |Std        |   1.394|   1.647|   1.493|
   |           |-----------+--------+--------+--------|
   |           |Median     |   13.00|   13.50|   13.00|
   |           |-----------+--------+--------+--------|
   |           |Min        |      11|      11|      11|
   |           |-----------+--------+--------+--------|
   |           |Max        |      15|      16|      16|
   |-----------+-----------+--------+--------+--------|
   |Height (in)|N          |       9|      10|      19|
   |at Start of|-----------+--------+--------+--------|
   |Study      |Mean       |  60.589|  63.910|  62.337|

Note that the label we defined in the DATASETS procedure carried forward to the PRINT, MEANS and TABULATE procedures.


DO Loops

DO loops with an index variable is something that is a basic task in our code, e.g.

  data _null_;
    do k=1 to 10;
      put k=;
    end;
  run;

This will simply put out a set of numbers in the SAS LOG from 1 to 10.

We can also use a BY option to step the count, for example:

  data _null_;
    do k=1 to 10 by 2;
      put k=;
    end;
  run;

This will produce the values 1, 3, 5, 7 and 9 in the SAS LOG.

Another form is without the 'TO' that uses set values, as the folowing example shows:

  data _null_;
    do k=1,7,5,2,9;
      put k=;
    end;
  run;

This will put out a set of numbers in the SAS LOG of 1, 7, 5, 2 and 9. But you are not restricted to numbers, you can also do character values, as the following example shows:

  data _null_;
    do name="Luna","Milo","Bella","Willow";
      put name=;
    end;
  run;

This will put out the set of names to the SAS LOG of "Luna", "Milo", "Bella" and "Willow".

DO OVER

The DO OVER is not one that is commonly used but it is useful to know.

The DO OVER arrayname is equivalent to a DO i=1 TO DIM(arrayname). An example of its use is:

  data _null_;
    set sashelp.class;
    array zaz height weight age;
    do over zaz;
      if missing(zaz) then zaz=.M;
    end;
  run;

The above will go through each record, looking at the variables HEIGHT, WEIGHT and AGE, and if missing set to '.M' a value that we will represent as missing.

Another more common way you will see this is with the DO ... TO statements as shown below:

  data _null_;
    set sashelp.class;
    drop i;
    array zaz(*) height weight age;
    DO i=1 TO DIM(zaz);
      if missing(zaz(i)) then zaz(i)=.M;
    END;
  run;

The use of the DIM function tells the DO loop how many variables that are being used in the array. Whether you like one or the other form is up to personal preference but the DO ... TO ... form is a little more contolling since you can specify what variables in the array to process depending on some specified condition, e.g.

  data _null_;
    set sashelp.class;
    drop i;
    array zaz(*) height weight age;
    DO i=1 TO DIM(zaz);
      if missing(zaz(i)) then do;
        if i=3 and missing(zaz(i)) then call missing(zaz(1),zaz(2));
      end;
      else zaz(i)=.M;
    end;
  run;

If variable AGE is missing then it will set WEIGHT and HEIGHT to missing, else it will set WEIGHT, HEIGHT and AGE to '.M'.

DO WHILE / DO UNTIL

The DO WHILE condition continutes to execute statements in a DO loop while a condition is true, checking the condition BEFORE each iteration of the DO loop. An example of this is:

  data _null_;
    k=1;
    do while (k<8);
      put k=;
      k+1;
    end;
  run;

This will put out a a set of numbers in the SAS LOG of 1, 2, 3, 4, 5, 6 and 7 -- it fails when the index variable k is 8.

The DO UNTIL condition continutes to execute statements in a DO loop while a condition is true, checking the condition AFTER each iteration of the DO loop. An example of this is:

  data _null_;
    k=1;
    do until (k=8);
      put k=;
      k+1;
    end;
  run;

This will put out a a set of numbers in the SAS LOG of 1, 2, 3, 4, 5, 6, 7 and 8 -- it fails when the index variable k is 8 but still puts out the value '8' as it does not end until the END of the DO loop.

Using the WHILE or UNTIL options can result in an infinite loop where the condition is and never can be met. Remembering that for TRUE and FALSE is represented by 1 and 0 respecively, using DO UNTIL(0) or DO WHILE(1) will result in an endless loop, or something simple as DO I=1 BY 1 will cause the same -- to stop this we will look at some techniques later in the year.

Infinite loops

Since the TO stop is optional for the index-variable specification, the following code is perfectly syntactically correct:

  data C;
    do j=1 by 1;
      output;
    end;
  run;

It will result in an infinite (endless) loop in which resulting data set will be growing indefinitely.

While unintentional infinite looping is considered to be a bug and programmers’ anathema, sometimes it may be used intentionally. For example, to find out what happens when data set size reaches the disk space capacity… Or instead of supplying a “big enough” hard-coded number (which is not a good programming practice) for the loop’s TO expression, we may want to define an infinite DO-loop and take care of its termination and exit inside the loop. For example, you can use IF exit-condition THEN LEAVE; or IF exit-condition THEN STOP; construct:

  • LEAVE statement immediately stops processing the current DO-loop and resumes with the next statement after its END.
  • STOP statement immediately stops execution of the current DATA step and SAS resumes processing statements after the end of the current DATA step.

The exit-condition may be unrelated to the index-variable and be based on some events occurrence. For instance, the following code will continue running syntactically “infinite” loop, but the IF-THEN-LEAVE statement will limit it to 200 seconds:

  data D;
    start = datetime();
    do k=1 by 1;
      if datetime()-start gt 200 then leave;
      /* ... some processing ...*/
      output; 
    end;
  run;

You can also create endless loop using DO UNTIL(0); or DO WHILE(1); statement, but again you would need to take care of its termination inside the loop.

Changing TO stop within DO-loop will not affect the number of iterations If you think you can break out of your DO loop prematurely by adjusting TO stop expression value from within the loop, you may want to run the following code snippet to prove to yourself it’s not going to happen:

  data E;
    n = 4;
    do i=1 to n;
      put i=;
      output;
      if i eq 2 then n = 2;
    end;
  run;

This code will execute DO-loop 4 times despite that you change value of n from 4 to 2 within the loop.

According to the iterative DO statement documentation, any changes to stop made within the DO group do not affect the number of iterations. Instead, in order to stop iteration of DO-loop before index variable surpasses stop, change the value of index-variable so that it becomes equal to the value of stop, or use LEAVE statement to jump out of the loop. The following two examples will do just that:

  data F;
    do i=1 to 4;
      put i=;
      output;
      if i eq 2 then i = 4;
    end;
  run;
 
  data G;
    do i=1 to 4;
      put i=;
      output;
      if i eq 2 then leave;
    end;
  run;

DO LOOPS are very powerful and something the SAS Programmer must understand to do good programming.


Updated March 14th, 2022