SAS Tip of the Month
June 2015
(for SAS)

Recently I was reminded that knowing your data is one of the most important things when you start programming.

As always, let's start by looking at some data. In this example it is the number of beans in a cup for each of 50 people:

  data _testdat;
     infile cards;
     input beans @@;
  cards;
  2 7 32 10 11 7 20 0 20 37
  26 2 26 41 8 16 23 21 23 .
  1 6 26 22 47 18 18 12 -5 10
  ;
  run;

The request is simple: we want the N, minimum and maximum values for the groups 0 to 20, 21 to 40 and >40.

What programmers would typically write is something along the lines of:

  proc format;
     value beancatf
       1='0-20'
       2='21-40'
       3='>40';
  run;
  data _test1;
     set _testdat;
     * Create a categorized beans variable;
     if (beans le 20) then bean_cat=1;
     else if (21 le beans le 40) then bean_cat=2;
     else if (beans > 40) then bean_cat=3;
     format bean_cat beancatf.;
  run;
  proc means data=_test1
             nmiss min max missing maxdec=0;
     class bean_cat;
     var beans;
  run;

which results in the following output:

  The MEANS Procedure

                Analysis Variable : beans

                N       N
  bean_cat    Obs    Miss        Minimum        Maximum
  -----------------------------------------------------
  0-20         33       3             -5             20
  21-40        15       0             21             37
  >40           2       0             41             47
  -----------------------------------------------------

For some, we have done what was asked, so let's save the output and send it off. There are, however, two interesting results here which we as programmers should look at: the N Miss of 3 and the Minimum of -5, both in the 0-20 row.

The first is the N Miss of 3. This is usually important to those analyzing the data, as the number of missing values can influence the results. This count should be treated as a separate category.
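
The three missing values were counted in the 0-20 group because SAS treats a numeric missing value as smaller than any number, so the condition beans le 20 is true for them. A minimal sketch of that behaviour, using a throwaway DATA _NULL_ step that is not part of the original program:

```sas
data _null_;
   beans = .;
   * A missing value sorts below every number, so this test is true;
   if beans le 20 then put 'beans le 20      : true for missing';
   * Adding a lower bound makes the test false for missing values;
   if 0 le beans le 20 then put '0 le beans le 20 : true for missing';
   else put '0 le beans le 20 : false for missing';
run;
```

The log should report the first test as true and the second as false, which is why a lower bound on the first category keeps missing values out of it.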

The second number of interest is the minimum value in the 0-20 category, where we expect a value somewhere between 0 and 20. This raises the question of whether the value of -5 was entered correctly. For the purposes of this article we shall treat it as an outlier: a value that is present but ‘out of range’.
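
Before changing the categories it can help to list the records behind both findings. The sketch below assumes the _testdat data set created above, and the lower limit of 0 is an assumption taken from the requested groups rather than a documented rule for this data:

```sas
proc print data=_testdat;
   * List records with no value or a value below the assumed lower limit of 0;
   where missing(beans) or beans < 0;
   var beans;
run;
```

Listing the observations this way makes it easier to go back to the source and ask whether each value was entered correctly.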

So let's now alter the program to deal with these two situations, the missing and ‘outlier’ data:

  proc format;
     value beancatf
       .='Missing'
       0='Other'
       1='0-20'
       2='21-40'
       3='>40';
  run;
  data _test1;
     set _testdat;
     * Create a categorized beans variable;
     if (0 le beans le 20) then bean_cat=1;
     else if (21 le beans le 40) then bean_cat=2;
     else if (beans > 40) then bean_cat=3;
     else if ^missing(beans) then bean_cat=0;
     format bean_cat beancatf.;
  run;
 
  proc means data=_test1
             nmiss min max missing maxdec=0;
     class bean_cat;
     var beans;
  run;

which results in the following output:

  The MEANS Procedure

                Analysis Variable : beans

                N       N
  bean_cat    Obs    Miss        Minimum        Maximum
  -----------------------------------------------------
  Missing       3       3              .              .
  Other         3       0             -5             -1
  0-20         27       0              0             20
  21-40        15       0             21             37
  >40           2       0             41             47
  -----------------------------------------------------

Now we have something presentable that makes sense to our audience.

The moral of this article is that although results do appear in the output, it is your responsibility to check that they are reasonable.

As an aside, the next question is: can we get the same result without having to code the data into the bean_cat variable? The answer is yes, as we can use the format to do this for us. Let's look at the following code:

  proc format;
     value beancatf
       . ='Missing'
       0-20='0-20'
       21-40='21-40'
       41-HIGH='>40'
       OTHER ='Other';
  run;
  proc means data=_test1 nmiss min max missing maxdec=0;
     class beans;
     var beans;
     format beans beancatf.;
  run;

which results in the following output:

  The MEANS Procedure

                Analysis Variable : beans

               N       N
  beans      Obs    Miss         Minimum        Maximum
  -----------------------------------------------------
  Missing      3       3               .              .
  Other        3       0              -5             -1
  0-20        27       0               0             20
  21-40       15       0              21             37
  >40          2       0              41             47
  -----------------------------------------------------

Note that in this code the variable BEANS is used as both the analysis variable and the class variable, with the classes being formed by the format BEANCATF.
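
The same idea should carry over to other procedures that group by formatted values. As a sketch, assuming the range-based BEANCATF format defined above, PROC FREQ can count the records per group directly from BEANS. The MISSING option on the TABLES statement keeps the missing values as their own row, and NOCUM only suppresses the cumulative columns:

```sas
proc freq data=_testdat;
   * Group BEANS by its formatted value rather than by each raw value;
   tables beans / missing nocum;
   format beans beancatf.;
run;
```

This gives the group counts only; the minimum and maximum within each group still need PROC MEANS as shown above.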

Hope this was useful.

________________________________
Updated June 2, 2015