Return to Archive

SAS Tip of the Month
April 2009

Last month I presented a paper at SAS Global Forum on merging data eight different ways (the link to the paper is here). One of the common questions raised during the Question and Answer session around the posters was "You show for one variable, TRT_CODE, but lets say I want to add multiple variables, like AGE and SEX, with the PROC FORMAT statement?" This month I will address that question.

Lets first look at some data -- for this example I will load the data using a CARDS statement:

   data PATDATA;
      infile CARDS;
      input SUBJECT $ 1-6
            TRT_CODE $ 8
            AGE 10-11
            SEX $ 13;
   cards;
   124263 A 25 M
   124264 A 31 F
   124265 B 30 M
   124266 B 26 F
   ;
   run;

   data ADVERSE;
      infile CARDS;
      input SUBJECT $ 1-6
            EVENT $ 8-40;
   cards;
   124263 HEADACHE
   124266 FEVER
   124266 NAUSEA
   124267 FRACTURE
   run;    

When merging the data this way, it is necessary to work out which dataset has a unique key for the merging variable -- in this case the dataset PATDATA is unique.

The next step is create the format from the PATDATA dataset. The following code will do this:

*** To create a format from a dataset there needs to be four variables
    present -- FMTNAME (format name), TYPE (type of format, usually
    (C)haracter or (N)umeric), START (value), and LABEL (decoded value).
    For this example will also create a HLO variable -- this will signal
    that OTHER is to be set, and for our purpose it will be set to blank
    (this makes sure that subjects in ADVERSE not in PATDATA have a
    blank value coming through.;
data FMT;

   *** Define START (subject number), out LABEL variable (this will
       contain TRT_CODE, AGE and SEX) and HLO (OTHER flag);
   length START $8 LABEL $200 HLO $1;

   *** FMTNAME is the Format Name, and the TYPE is character;
   retain FMTNAME 'TRT_FMT' TYPE 'C';

   *** Get our data and create the START and LABEL variable;
   set PATDATA end=EOF;
   START=cats(SUBJECT);
   LABEL=catx('|',TRT_CODE,put(AGE,best.),SEX);

   *** Output each record;
   output;

   *** Create one last record to deal with OTHER;
   if EOF then do;
      START='**OTHER**';
      call missing(LABEL);
      HLO='O';
      output;
   end;
run;

*** Create the format;
proc format cntlin=FMT;
run;

*** Do the merge;
data ALLDATA0;

   *** Read in each record of ADVERSE dataset;
   set ADVERSE;

   *** Create the variables to contain Treatment Code
       (TRT_CODE), Age (AGE) and SEX (Sex);
   attrib TRT_CODE length=$1 label='Treatment Code'
          AGE length=8 label='Age (years)'
          SEX length=$1 label='Sex';

   *** Now get TRT_CODE, AGE and SEX from format,
       using the SCAN function to get the correct data
       into the variable. Will use a temporary variable
       _TMP to contain the value from format. Also, will
       only need to extract if format comes across.;
   length _TMP $200;
   drop _TMP;
   _TMP=put(SUBJECT,$TRT_FMT.);
   if ^missing(strip(_TMP)) then do;
      TRT_CODE=scan(_tmp,1,'|');
      AGE=input(scan(put(subject,$trt_fmt.),2,'|'),best.);
      SEX=scan(put(SUBJECT,$TRT_FMT.),3,'|');
   end;
run;  

Using the PROC FORMAT as in this example is a very interesting method to merge two datasets, from a one-to-one or one-to-many. I have found in the past that the efficiency of this method increases as the number of observations increase when compared against either a MERGE statement inside a datastep or using PROC SQL.

I hope you find this interesting.

________________________________
Updated April 2, 2009