Return to Archive

SAS Tip of the Month
December 2009

I recently had an email from Nancy in New Jersey on how to do a many-to-many merge in SAS without using SQL, so for this month is a discussion on this issue that often occurs when dealing with data.

Lets look at some data first -- this is in the form of SAS card statements (you could also use a DATALINES statement, but for this example I am using the older CARDS statement) so you can copy this data an load it into your SAS Program window:

   *----------------------------------------------------*;
   * Lets load some data;
   *----------------------------------------------------*;
   data ae;  ** Adverse Event Data;
      infile cards;
      input ptnum $ 1-3 @5 date date9. event $ 15-35;
      format date date9.;
   cards;
   001 16NOV2009 Nausea
   002 16NOV2009 Heartburn
   002 16NOV2009 Acid Indigestion
   002 18NOV2009 Nausea
   003 17NOV2009 Fever
   003 18NOV2009 Fever
   005 17NOV2009 Fever
   ;
   run;

   data cm;  ** Concomitant Medication Data;
      infile cards;
      input ptnum $ 1-3 @5 date date9. medication $ 15-35;
      format date date9.;
   cards;
   001 16NOV2009 Dopamine
   002 16NOV2009 Antacid
   002 16NOV2009 Sodium bicarbonate
   002 18NOV2009 Dopamine
   003 18NOV2009 Asprin
   004 19NOV2009 Asprin
   005 17NOV2009 Asprin
   ;
   run;

The merge that is going to be done here is a merge by PTNUM (subject number) and DATE.

Some notes on the data:

  • Subjects 001 and 005 have a one-to-one match.
  • Subject 002 has a many to many match on 16NOV2009, but a one to one match on 18NOV2009.
  • Subject 003 has no match for 17NOV2009 but does have a match for 18NOV2009.
  • Subject 004 has no match.

Lets jump ahead to what we are looking for when we merge the two datasets -- note that we are doing an inner join, i.e. where there is no match the observation is excluded:

   Obs    ptnum         date    event               medication

    1      001     16NOV2009    Nausea              Dopamine
    2      002     16NOV2009    Heartburn           Antacid
    3      002     16NOV2009    Heartburn           Sodium bicarbonate
    4      002     16NOV2009    Acid Indigestion    Antacid
    5      002     16NOV2009    Acid Indigestion    Sodium bicarbonate
    6      002     18NOV2009    Nausea              Dopamine
    7      003     18NOV2009    Fever               Asprin
    8      005     17NOV2009    Fever               Asprin

When first looking at merging the two datasets, the first thought may be to use a simple MERGE statement as shown below:

   ** SAS Code;
   *----------------------------------------------------*;
   * This will not work!;
   *----------------------------------------------------*;
   data all0;
      merge ae cm;
      by ptnum date;
   run;
   title1 "Merge using the MERGE statement -- this fails";
   proc print data=all0;
   run;

   ** SAS Output;
   Merge using the MERGE statement -- this fails

   Obs    ptnum         date    event               medication

    1      001     16NOV2009    Nausea              Dopamine
    2      002     16NOV2009    Heartburn           Antacid
    3      002     16NOV2009    Acid Indigestion    Sodium bicarbonate
    4      002     18NOV2009    Nausea              Dopamine
    5      003     17NOV2009    Fever
    6      003     18NOV2009    Fever               Asprin
    7      004     19NOV2009                        Asprin
    8      005     17NOV2009    Fever               Asprin

As can be seen, this clearly does not work, so how can a many to many merge be done successfully?

The most common way that this match is done is with an SQL call, as the following code demonstrates:

   ** SAS Code;
   *----------------------------------------------------*;
   * SQL, the most common way that it is seen done;
   *----------------------------------------------------*;
   proc sql;
      create table all0 as
         select a.*, b.medication
         from ae a inner join cm b
         on a.ptnum=b.ptnum and
            a.date=b.date;
      quit;
   run;
   title1 "Merge using SQL -- most common way this is seen done";
   proc print data=all0;
   run;

   ** SAS Output;
   Merge using SQL -- most common way this is seen done

   Obs    ptnum         date    event               medication

    1      001     16NOV2009    Nausea              Dopamine
    2      002     16NOV2009    Heartburn           Antacid
    3      002     16NOV2009    Heartburn           Sodium bicarbonate
    4      002     16NOV2009    Acid Indigestion    Antacid
    5      002     16NOV2009    Acid Indigestion    Sodium bicarbonate
    6      002     18NOV2009    Nausea              Dopamine
    7      003     18NOV2009    Fever               Asprin
    8      005     17NOV2009    Fever               Asprin

This works, but it is not what we want -- we do not want to use SQL in the solution.

This brings about the use of the POINT option in the SET statement -- take a look at the following code and try it out in your Program window:

   ** SAS Code;
   *----------------------------------------------------*;
   * The datastep with POINT option -- not often seen,;
   * but gives most control.  Key to this is that the;
   * datastep takes each observation in AE, and then;
   * tries to match this with each observation in;
   * CM -- this is basically a loop within a loop!;
   *----------------------------------------------------*;
   data all1;
      set ae;
      drop _:;  ** Drop temporary variables;
      match=0;  ** Match flag;

      ** Our loop within a loop -- output if match;
      do i=1 to xnobs;
         ** Need to rename the "merging" variables within the CM
            dataset;
         set cm (rename=(ptnum=_ptnum date=_date)) nobs=xnobs point=i;
            ** Have to rename matching variables so that they do not overwrite
               the original values in AE;
         if ptnum=_ptnum and date=_date then do;
            match=1;  ** Yes, there is a match my the "merging" variables;
            output;
         end;
      end;

   run;
   title1 "Merge using using the POINT option in a SET statement";
   proc print data=all1;
   run;

   ** SAS Output;
   Merge using using the POINT option in a SET statement

   Obs    ptnum         date    event               match    medication

    1      001     16NOV2009    Nausea                1      Dopamine
    2      002     16NOV2009    Heartburn             1      Antacid
    3      002     16NOV2009    Heartburn             1      Sodium bicarbonate
    4      002     16NOV2009    Acid Indigestion      1      Antacid
    5      002     16NOV2009    Acid Indigestion      1      Sodium bicarbonate
    6      002     18NOV2009    Nausea                1      Dopamine
    7      003     18NOV2009    Fever                 1      Asprin
    8      005     17NOV2009    Fever                 1      Asprin

This is a lot of code. An important note here is that I have had to rename the matching variables in CM so that they do not overwrite the original values in AE (very important) but use the DROP statment to get rid of these when the dataset ALL1 is created. Note also that I have printed out the variable MATCH so that it is easy to see where the match is.

Yippie, the output is what we have expected AND the we have not used SQL but instead used a datastep.

There are a couple of other ways to do this merge without SQL, notably a variation on the "Loading Unique Dataset into an Array" which I presented first at NESUG in 2008, which can be seen here, but the technique shown above is a shorter method with similar control (however the use of the array will do a many to many merge with the equivilant SQL OUTER JOIN). Personally I prefer the datastep approach given above since it has the most control.

Hope this helps. Have a very happy and safe holiday season. See you all next year.

________________________________
Updated December 1, 2009