Basic Statistics - Part I, Location & Variation

This month starts a series on basic statistics and how SAS, and maybe a few other software packages, compute them. This first part looks at statistics dealing with the Location and Variation of data. The examples use a random series of heights, collected in centimeters:

175, 183, 165, 170, 165, 193, 183, 175, 175, 163, 150, 147

The results presented are from the SAS UNIVARIATE procedure, although the MEANS, SUMMARY and other procedures that produce the same statistics could also be used.

Location

The first section of this series deals with measures of location, or central tendency. The three most common measures are the Mode, Median and Mean. The following two figures use a frequency curve to show how these statistics relate to one another and where each one sits.


Figure 1


Figure 2

Figure 1 shows a symmetrical curve where the mode, median and mean coincide. Figure 2 shows a skewed frequency curve and the relative positions of the measures of location.

Mode

The mode is defined as the most frequent non-missing value in the data. Data may have two or more equally frequent values: with two such values the data are bimodal, and with more they are multimodal. Where there is only one most frequent value the data are unimodal. In the height data the mode is 175, which occurs three times. By default SAS reports a single mode only; if several values tie, the lowest of them is the one reported.
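
If every tied mode is wanted, PROC UNIVARIATE offers a MODES option on the PROC statement. The following is a minimal sketch rather than a definitive program: the data set name heights is an assumption made for illustration, and the DATA step simply reads the twelve heights listed above into a variable called height.

```sas
/* Read the twelve heights into a data set (the name HEIGHTS is assumed) */
data heights;
    input height @@;   /* @@ holds the line so several values are read per row */
    datalines;
175 183 165 170 165 193 183 175 175 163 150 147
;
run;

/* MODES requests a table listing every mode, not just the lowest one */
proc univariate data=heights modes;
    var height;
run;
```

For these data only 175 occurs three times, so the table of modes should list just that one value.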

Median

If the non-missing values are arranged from smallest to largest, the median is the middle observation when the number of non-missing observations is odd, and the value halfway between the two middle observations when that number is even. The height data arranged in order are:

147, 150, 163, 165, 165, 170, 175, 175, 175, 183, 183, 193

With the number of observations being even (N=12), the two middle values are 170 and 175, so the median is

$$\text{median} = \frac{170 + 175}{2} = 172.5$$

Mean

This is sometimes known as the Average and is defined as the sum of the non-missing observed values divided by the number of non-missing observations. It is usually denoted by the following formula:

$$\bar{x} = \frac{\sum x}{n}$$

where x is the observed values and n is the number of observations. Using the height data the mean is:

$$\bar{x} = \frac{2044}{12} = 170.33$$

Variation

As well as location, the next important set of statistics describes how the observations are scattered. These are usually called the measures of variation. The two most common are the Range and the Standard Deviation.

Range (and the related Minimum and Maximum)

The Minimum and Maximum values are the lowest and highest non-missing values within the data. The Range is the difference between the Maximum and Minimum values. Using the height data given above, the Minimum=147, Maximum=193 and the Range is 193-147=46.

Variance and Standard Deviation

The variance and standard deviation are statistics that show how spread out the data are around the mean. The smaller the standard deviation, the closer the data lie to the mean. Figure 3 shows what the standard deviation represents.


Figure 3

For normally distributed data, one standard deviation either side of the mean, represented as the green area in Figure 3, accounts for around 68 percent of a sample. Two standard deviations either side of the mean, represented as the yellow and green areas in Figure 3, account for around 95 percent of a sample. And three standard deviations either side of the mean, represented as the red, yellow and green areas in Figure 3, account for around 99.7 percent of a sample.

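The variance is the sample mean squared deviation and is defined as

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

and is represented more commonly in books as

$$s^2 = \frac{\sum (x - \bar{x})^2}{n-1}$$

The divisor is n-1 rather than n because the mean itself has been estimated from the sample: dividing by n-1 gives an unbiased estimate of the population variance, whereas dividing by n tends to underestimate it.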
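
In SAS the divisor can be chosen with the VARDEF= option of PROC MEANS (PROC UNIVARIATE accepts it as well). The sketch below is illustrative only: the data set name heights is assumed, and the DATA step just reads the twelve heights given at the start of this paper.

```sas
/* Read the twelve heights (data set name HEIGHTS is assumed) */
data heights;
    input height @@;
    datalines;
175 183 165 170 165 193 183 175 175 163 150 147
;
run;

/* VARDEF=DF, the default, divides the sum of squares by n-1 */
proc means data=heights n mean var std vardef=df;
    var height;
run;

/* VARDEF=N divides the sum of squares by n instead */
proc means data=heights n mean var std vardef=n;
    var height;
run;
```

With the n-1 divisor the variance should match the 178.97 reported by PROC UNIVARIATE; with n it is 1968.67/12, roughly 164.06, which is noticeably smaller for a sample of only twelve.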

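Since the variance is given in squared units, its square root is in the original units of the data. The square root of the variance is therefore the sample standard deviation, which has the formula (a)

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}$$

After some algebra this becomes formula (b)

$$s = \sqrt{\frac{\sum x^2 - \frac{\left(\sum x\right)^2}{n}}{n-1}}$$

and after a little bit more the formula can be written as (c)

$$s = \sqrt{\frac{n\sum x^2 - \left(\sum x\right)^2}{n(n-1)}}$$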

I have presented all three formulae here as they are the ones most commonly found in textbooks, with (a) the most common of all. Formula (b) is often termed the “Calculator” method as it needs only one pass through the data, but it can cause rounding problems, particularly when the number of observations gets large, so most statistical programs use (a) for their calculations. This was not always the case: some, notably Excel, used formula (b) right up until version 2003! SAS uses formula (a) for its calculation.

Now using the height data given above and using formula (a) for the computation the following calculations would be done:

        Obs  x    x-mean(x)  (x-mean(x))^2
        -----------------------------------
         1   175     4.67        21.81
         2   183    12.67       160.53
         3   165    -5.33        28.41
         4   170    -0.33         0.11
         5   165    -5.33        28.41
         6   193    22.67       513.93
         7   183    12.67       160.53
         8   175     4.67        21.81
         9   175     4.67        21.81
        10   163    -7.33        53.73
        11   150   -20.33       413.31
        12   147   -23.33       544.29
        -----------------------------------
        N(x)=12
        SUM(x)=2044
        MEAN(x)=2044/12=170.33
        SUM((x-mean(x))^2)=1968.68

Hence the variance is 1968.68/(12-1) = 178.97 and the Standard Deviation is its square root, 13.38. Note that the calculations in this example were rounded to two decimal places; a software package like SAS works to a far higher precision, so its results will be more accurate. The same result is obtained if the Standard Deviation is calculated using formula (b) or (c), but the sums involved, in particular the sum of the squared values, may exceed what a calculator allows, particularly if the values are large.
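
The hand calculation can be checked with a short DATA step. This is only a sketch under stated assumptions: the data set name heights and the variable names n, sum_x, sum_x2, mean_x, css, sd_a and sd_b are invented for illustration, the data are assumed to contain no missing values, and the one-pass calculation uses the usual textbook "calculator" form, sqrt((sum(x**2) - sum(x)**2/n)/(n-1)).

```sas
/* Read the twelve heights (data set name HEIGHTS is assumed) */
data heights;
    input height @@;
    datalines;
175 183 165 170 165 193 183 175 175 163 150 147
;
run;

data _null_;
    /* Pass 1: count the values and accumulate sum(x) and sum(x**2) */
    do until (end1);
        set heights end=end1;
        n + 1;
        sum_x + height;
        sum_x2 + height**2;
    end;
    mean_x = sum_x / n;

    /* Pass 2: accumulate the squared deviations from the mean */
    do until (end2);
        set heights end=end2;
        css + (height - mean_x)**2;
    end;

    sd_a = sqrt(css / (n - 1));                      /* two-pass form, as in (a) */
    sd_b = sqrt((sum_x2 - sum_x**2 / n) / (n - 1));  /* one-pass "calculator" form */
    put n= sum_x= mean_x= css= sd_a= sd_b=;          /* results go to the SAS log */
    stop;                                            /* both passes done, end the step */
run;
```

Both values written to the log should agree with the 13.378 reported by PROC UNIVARIATE. On data with a large mean and a small spread, the one-pass form is the one more likely to drift.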

SAS Code

As stated at the start of this paper, these statistics can be calculated using the SAS UNIVARIATE procedure, although the MEANS, SUMMARY and other procedures that produce the same statistics can also be used. The SAS code needed is:

        /* With no DATA= option, SAS uses the most recently created data set */
        proc univariate;
            var height;
        run;

The output from this program is shown below:


                       The UNIVARIATE Procedure
                           Variable:  height

                                Moments

    N                          12    Sum Weights                 12
    Mean               170.333333    Sum Observations          2044
    Std Deviation      13.3779556    Variance            178.969697
    Skewness           -0.2639558    Kurtosis             -0.153035
    Uncorrected SS         350130    Corrected SS        1968.66667
    Coeff Variation     7.8539857    Std Error Mean      3.86188314


                       Basic Statistical Measures

             Location                    Variability

         Mean     170.3333     Std Deviation           13.37796
         Median   172.5000     Variance               178.96970
         Mode     175.0000     Range                   46.00000
                               Interquartile Range     15.00000


                       Tests for Location: Mu0=0

            Test           -Statistic-    -----p Value------

            Student's t    t  44.10629    Pr > |t|    <.0001
            Sign           M         6    Pr >= |M|   0.0005
            Signed Rank    S        39    Pr >= |S|   0.0005


                        Quantiles (Definition 5)

                         Quantile      Estimate

                         100% Max         193.0
                         99%              193.0
                         95%              193.0
                         90%              183.0
                         75% Q3           179.0
                         50% Median       172.5
                         25% Q1           164.0
                         10%              150.0
                         5%               147.0
                         1%               147.0
                         0% Min           147.0


                          Extreme Observations

                  ----Lowest----        ----Highest---

                  Value      Obs        Value      Obs

                    147       12          175        8
                    150       11          175        9
                    163       10          183        2
                    165        5          183        7
                    165        3          193        6

From the output the statistics discussed above can be retrieved. The only statistics that may be hard to find are the Minimum and Maximum values, but these can be found in the Quantiles section of the output (the 0% Min and 100% Max rows). In this case the Minimum value is 147 and the Maximum value is 193.
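
If only the statistics covered here are wanted, PROC MEANS can list them by keyword, which puts the Minimum and Maximum alongside the rest. This is a minimal sketch, assuming a data set named heights (a made-up name) built from the twelve values used in this paper; the MODE keyword may not be available in older SAS releases.

```sas
/* Read the twelve heights (data set name HEIGHTS is assumed) */
data heights;
    input height @@;
    datalines;
175 183 165 170 165 193 183 175 175 163 150 147
;
run;

/* Request only the location and variation statistics discussed here,
   printed to two decimal places */
proc means data=heights n mean median mode min max range var std maxdec=2;
    var height;
run;
```

Listing the keywords keeps the output to a single row, which is easier to scan than the full UNIVARIATE listing when only these few statistics are needed.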

I hope this short paper has been useful. In coming months other papers in this mini-series will be presented.


Created March 31, 2006
Updated January 11, 2011