Return to Homepage

Goto the Tip of the Month Archive

Other interesting pages ...
LinkedIn Profile
SAS Cheat Sheet
Useful SAS Code
Full SAS Example
Basic Statistics
Contact Information

SAS Tip of the Month
December 2011
(for SAS and WPS)

I received an email from Rebecca in Boston asking how some values were being overwritten when two datasets were being joined. A few minutes of discussion and the problem was solved -- a common variable was in both datasets but not in the BY statement. Lets look at an example:

   data _sales (label='Sales Hours');
      label client='Client'
            value='Invoiced Amount'
            number='Number of Hours';
      infile cards;
      input client $ value number;
   cards;
   Huey 1200 120
   Dewey 1500 110
   Louie 800 5
   ;
   run;

   data _budgeted (label='Budgeted Hours');
      label client='Client'
            number='Budgeted Hours'
            manager='Manager';
      infile cards;
      input client $ number manager $;
   cards;
   Huey 120 Mickey
   Dewey 80 Donald
   Louie 50 Goofy
   ;
   run;

Here we have two datasets which we are going to merge by client -- it is easy to see in this example that there are two common variables, CLIENT and NUMBER, the former which we are using to merge the data by.

Now lets look at what happens after we merge the two datasets:

   proc sort data=_sales;
      by client;
   proc sort data=_budgeted;
      by client;
   data _merged;
      merge _sales _budgeted;
      by client;
   proc print data=_merged;
      title1 "Merged Sales and Budgeted Numbers";
   run;

   *=== Output ===;

         Merged Sales and Budgeted Numbers

                    Invoiced     Number
   Obs    Client     Amount     of Hours    Manager

    1     Dewey       1500          80      Donald
    2     Huey        1200         120      Mickey
    3     Louie        800          50      Goofy

   *=== LOG ===;

   106  data _merged;
   107     merge _sales _budgeted;
   108     by client;

   NOTE: There were 3 observations read from the data set
         WORK._SALES.
   NOTE: There were 3 observations read from the data set
         WORK._BUDGETED.
   NOTE: The data set WORK._MERGED has 3 observations and 3
         variables.

   109  proc print data=_merged;
   110     title1 "Merged Sales and Budgeted Numbers";
   111  run;

   NOTE: There were 3 observations read from the data set
         WORK._MERGED.

   112  proc print data=_merged label;
   113     title1 "Merged Sales and Budgeted Numbers";
   114  run;

   NOTE: There were 3 observations read from the data set
         WORK._MERGED.

Looking at the LOG and the OUTPUT everything looks reasonable, particularly if you are unfamiliar with what the expected output is. But look at the Number of Hours which the data, according to the label, comes from the _SALES dataset -- these are not the values from _SALES, but the values from _BUDGETED! No one would detect the issue and the report could go out to the public.

Preventing the issue looks easy, just make sure each dataset has unique variable names except those that are being used on the MERGE statement. But it is not so easy if you are merging two datasets where one or both have several hundred variables.

So what can we do to detect this issue in our programs? SAS/WPS does provide an option called MSGLEVEL that controls the level of detail in messages that are written to the SAS LOG. This has two values:

   N - prints notes, warnings, and error messages (default)
   I - prints additional notes that refer to index usage, merge processing,
       sort utilities, and those using option N

Lets use the MSGLEVEL option now with a value of I and see what comes up in the SAS LOG:

   *=== Output ===;

         Merged Sales and Budgeted Numbers

                    Invoiced     Number
   Obs    Client     Amount     of Hours    Manager

    1     Dewey       1500          80      Donald
    2     Huey        1200         120      Mickey
    3     Louie        800          50      Goofy

   *=== LOG ===;

   115  options msglevel=I;
   116  proc sort data=_sales;
   117     by client;

   NOTE: Input data set is already sorted, no sorting done.
   NOTE: PROCEDURE SORT used (Total process time):
         real time           0.03 seconds
         cpu time            0.00 seconds


   118  proc sort data=_budgeted;
   119     by client;

   NOTE: Input data set is already sorted, no sorting done.
   NOTE: PROCEDURE SORT used (Total process time):
         real time           0.00 seconds
         cpu time            0.00 seconds


   120  data _merged;
   121     merge _sales _budgeted;
   122     by client;

   INFO: The variable number on data set WORK._SALES will be
         overwritten by data set WORK._BUDGETED.
   NOTE: There were 3 observations read from the data set
         WORK._SALES.
   NOTE: There were 3 observations read from the data set
         WORK._BUDGETED.
   NOTE: The data set WORK._MERGED has 3 observations and 3
         variables.
   NOTE: DATA statement used (Total process time):
         real time           0.01 seconds
         cpu time            0.01 seconds


   123  proc print data=_merged label;
   124     title1 "Merged Sales and Budgeted Numbers";
   125  run;

   NOTE: There were 3 observations read from the data set
         WORK._MERGED.
   NOTE: PROCEDURE PRINT used (Total process time):
         real time           0.00 seconds
         cpu time            0.00 seconds

The output is the same, but an INFO message now appears in the SAS LOG with the following text:

   INFO: The variable number on data set WORK._SALES will be
         overwritten by data set WORK._BUDGETED.

Please note that the message may not be exactly the same depending on whether what version of SAS or WPS you are using, but the meaning is the same.

This is our only indication in the SAS LOG, without knowing the data or what to expect, that numbers are being overwritten from _BUDGETED to _SALES when the merge occurs -- this is not an ERROR or WARNING message in the SAS LOG. With this information we can amend our programming accordingly so the report is correct either by dropping the NUMBER variable in _SALES or _BUDGETED, or renaming one of the two NUMBER variables and adding an additional column:

             Merged Sales and Budgeted Numbers
        (with NUMBER dropped from _BUDGETED dataset)

                       Invoiced     Number
      Obs    Client     Amount     of Hours    Manager

       1     Dewey       1500         110      Donald
       2     Huey        1200         120      Mickey
       3     Louie        800           5      Goofy


                   Merged Sales and Budgeted Numbers
   (with NUMBER in _BUDGETED renamed to another variable name during MERGE)

                         Invoiced     Number     Budgeted
        Obs    Client     Amount     of Hours      Hours     Manager

         1     Dewey       1500         110          80      Donald
         2     Huey        1200         120         120      Mickey
         3     Louie        800           5          50      Goofy

When you are writing your program, it is useful to use the MSGLEVEL option with the value of I, and always do little checks of the output at each step to make sure that what is being output is what you are expecting!

Have a safe than happy holiday season. See you all next year.

________________________________
Updated December 8, 2011