of the Month
This month is about the use and abuse of the NODUPKEY option in the SORT procedure.
First, lets discuss what exactly is the NODUPKEY option? Referring to the SAS Online help, then NODUPKEY option is:
checks for and eliminates observations with duplicate BY values
So why, you may ask, did this discussion come about. I was looking at some code recently and the programmer used a NODUPKEY with a WHERE clause to select some data for analysis -- this would only work successfully if it was known that there was a unique "list" for the variables concerned.
Lets look at some data that I will use as an example.
DATASET: SHOPDATA SHOP_NUMBER BEAN_COLOR COUNT 001 BLUE 5 003 GREEN 2 002 RED 4 003 RED 3 001 GREEN 6 002 RED 2 002 BLUE 1 003 GREEN 5 001 RED 2
This is not sorted (our data is not regularly sorted in a form that we can use) but we can see in this small set of data that there are two GREEN bean records for 003.
Now lets construct a unique set of SHOP_NUMBER and BEAN_COLOR:
proc sort data=shopdata (keep=shop_number bean_color) out=uniquelist nodupkey; by shop_number bean_color; run;
The code above will create a dataset with a unique set of SHOP_NUMBER and BEAN_COLOR values.
But what of the situation where we create the list of SHOP_NUMBER and BEAN_COLOR values using the following code:
proc sort data=shopdata out=uniquelist nodupkey; by shop_number bean_color; run;
and then use the following code to find the mean number of beans per shop:
proc means data=uniquelist mean; class shop_number; var count; run;
The COUNT variable exists in the dataset UNIQUELIST (it has not been dropped) so the PROC MEANS step will be valid, but what value for COUNT does the step use for SHOP_NUMBER=003 and BEAN_COLOR=GREEN? Is it value 5 or 2? The actual answer is that there is no real way to determine which value is used, sometimes it will be 5 and sometimes it will be 2. The only way to get around this is to be explicit in the selection of records where there may be a multiple. Again, be aware that this is based on an actual example of code so even the best of us get it wrong occasionally.
The lesson of this story is do not use the NODUPKEY option in the SORT procedure unless you are creating a unique list of values based on a variable list, but most of all do not use this list (unless you can be absolutely certain that there is no chance of duplicates) for analysis purposes.
Another option inside the SORT procedure you may have heard of or used is NODUPRECS (also known as NODUP) which has the following in the SAS Online help:
checks for and eliminates duplicate observations
The major difference between the NODUPRECS and the NODUPKEY is that the NODUPKEY uses the variables in the BY statement to identify the unique observations, as opposed to the NODUPRECS option that finds unique records based on all the variable values for the entire record -- in our data above, no records would be removed as they are unique.
I hope this is useful. See you next month.
Updated October 6, 2009