of the Month
Last month I presented a paper at SAS Global Forum on merging data eight different ways (the link to the paper is here). One of the common questions raised during the Question and Answer session around the posters was "You show for one variable, TRT_CODE, but lets say I want to add multiple variables, like AGE and SEX, with the PROC FORMAT statement?" This month I will address that question.
Lets first look at some data -- for this example I will load the data using a CARDS statement:
data PATDATA; infile CARDS; input SUBJECT $ 1-6 TRT_CODE $ 8 AGE 10-11 SEX $ 13; cards; 124263 A 25 M 124264 A 31 F 124265 B 30 M 124266 B 26 F ; run; data ADVERSE; infile CARDS; input SUBJECT $ 1-6 EVENT $ 8-40; cards; 124263 HEADACHE 124266 FEVER 124266 NAUSEA 124267 FRACTURE run;
When merging the data this way, it is necessary to work out which dataset has a unique key for the merging variable -- in this case the dataset PATDATA is unique.
The next step is create the format from the PATDATA dataset. The following code will do this:
*** To create a format from a dataset there needs to be four variables present -- FMTNAME (format name), TYPE (type of format, usually (C)haracter or (N)umeric), START (value), and LABEL (decoded value). For this example will also create a HLO variable -- this will signal that OTHER is to be set, and for our purpose it will be set to blank (this makes sure that subjects in ADVERSE not in PATDATA have a blank value coming through.; data FMT; *** Define START (subject number), out LABEL variable (this will contain TRT_CODE, AGE and SEX) and HLO (OTHER flag); length START $8 LABEL $200 HLO $1; *** FMTNAME is the Format Name, and the TYPE is character; retain FMTNAME 'TRT_FMT' TYPE 'C'; *** Get our data and create the START and LABEL variable; set PATDATA end=EOF; START=cats(SUBJECT); LABEL=catx('|',TRT_CODE,put(AGE,best.),SEX); *** Output each record; output; *** Create one last record to deal with OTHER; if EOF then do; START='**OTHER**'; call missing(LABEL); HLO='O'; output; end; run; *** Create the format; proc format cntlin=FMT; run; *** Do the merge; data ALLDATA0; *** Read in each record of ADVERSE dataset; set ADVERSE; *** Create the variables to contain Treatment Code (TRT_CODE), Age (AGE) and SEX (Sex); attrib TRT_CODE length=$1 label='Treatment Code' AGE length=8 label='Age (years)' SEX length=$1 label='Sex'; *** Now get TRT_CODE, AGE and SEX from format, using the SCAN function to get the correct data into the variable. Will use a temporary variable _TMP to contain the value from format. Also, will only need to extract if format comes across.; length _TMP $200; drop _TMP; _TMP=put(SUBJECT,$TRT_FMT.); if ^missing(strip(_TMP)) then do; TRT_CODE=scan(_tmp,1,'|'); AGE=input(scan(put(subject,$trt_fmt.),2,'|'),best.); SEX=scan(put(SUBJECT,$TRT_FMT.),3,'|'); end; run;
Using the PROC FORMAT as in this example is a very interesting method to merge two datasets, from a one-to-one or one-to-many. I have found in the past that the efficiency of this method increases as the number of observations increase when compared against either a MERGE statement inside a datastep or using PROC SQL.
I hope you find this interesting.
Updated April 2, 2009