of the Month
Here is something from my vault that would like to share with you. It is an example where SAS can produce unexpected answers and has been known to fool even the most experienced SAS programmer.
First, lets look at our three datasets, each indicating the number and color of all the jellybeans they found in a single packet of sweets:
DATASET:TOM Color Number Multiple ----------------------- RED 1 N YELLOW 3 Y GREEN 2 Y BLUE 0 N DATASET:DICK Color Number ------------- RED 4 YELLOW 2 GREEN 3 BLUE 1 DATASET:HARRY Color Number Multiple ----------------------- RED 2 Y YELLOW 4 Y GREEN 3 Y BLUE 1 N
The variable MULTIPLE in the datasets TOM and HARRY indicate that there were multiple counts for that color, thus when the data is brought together the variable will have to be set in DICK via the program. Below is a first attempt at bringing the data together:
data all0; length name $5; set tom (in=a) dick (in=b) harry (in=c); select; when(a) name='TOM'; when(b) do; name='DICK'; if number>1 then multiple='Y'; if missing(multiple) then multiple='N'; end; when(c) name='HARRY'; end; run;
When the program logic is reviewed, the datastep should bring in the data from datasets TOM, DICK and HARRY. If the data from dataset DICK is being processed then the MULTIPLE variable should be set to Y if NUMBER>1, and if the MULTIPLE is still missing then MULTIPLE should be set to N. Lets look at what the dataset ALL0 actually contains:
name color number multiple ------------------------------------- TOM RED 1 N TOM YELLOW 3 Y TOM GREEN 2 Y TOM BLUE 0 N DICK RED 4 Y DICK YELLOW 2 Y DICK GREEN 3 Y DICK BLUE 1 Y HARRY RED 2 Y HARRY YELLOW 4 Y HARRY GREEN 3 Y HARRY BLUE 1 N
Look closely at value in the variable MULTIPLE for the BLUE count from DICK, it is 'Y' when the logic says it should be 'N'! This is due to a little known feature in SAS where if a variable is being set that is not in the original dataset (but is present in other datasets called in the SET statement) the variable will be treated as if a RETAIN statement was in force!
To get around this be sure that you always explicitly define what should be set to a variable for all possibilities. In the case above, all the program would need is the line
if missing(multiple) then multiple='N';
changed to read
and the datastep would produce what was expected.
Updated November 13, 2004