Basic Statistics - Part I, Location & Variation
This month marks the start of a series on basic statistics and how SAS, and maybe a few other software packages, compute them. This first part looks at statistics dealing with the Location and Variation of data. The examples use a random series of heights collected in centimeters:
175, 183, 165, 170, 165, 193, 183, 175, 175, 163, 150, 147
The results presented are from the SAS UNIVARIATE procedure, although the MEANS, SUMMARY and other procedures that produce the same statistics could also be used.
The first section of this series deals with measures of location, or central tendency. The three most common measures are the Mode, Median and Mean. The following two figures show the relationship between these statistics and their relative positions on a frequency curve.
Figure 1 shows a symmetrical curve where the mode, median and mean coincide. Figure 2 shows a skewed frequency curve and the relative positions of the measures of location.
The mode is defined as the most frequent non-missing value in the data. Data may have two or more equally frequent values, meaning that the data is bimodal (or multimodal). Where there is only one most frequent value the data is considered unimodal. In the height data the mode is 175, which occurs three times. SAS will report a single mode only; when several values tie, PROC UNIVARIATE reports the lowest of them.
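One way to see the mode, and any ties for it, is a frequency table sorted by count. The following is only a minimal sketch: the data set name heights is an assumption made for illustration, while the variable name height matches the one used in the PROC UNIVARIATE call in this article.

```sas
/* HEIGHTS is an assumed data set name used for illustration only */
data heights;
    input height @@;   /* @@ lets several values be read from one line */
    datalines;
175 183 165 170 165 193 183 175 175 163 150 147
;
run;

/* ORDER=FREQ lists the values by descending frequency,
   so the most frequent value (the mode) comes first */
proc freq data=heights order=freq;
    tables height;
run;
```

With the height data, 175 should head the table with a frequency of 3, followed by the values that occur twice.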
If the non-missing values are arranged from smallest to largest, the median is defined as the middle observation if the number of non-missing observations is odd; if the number is even, it is the value halfway between the two middle observations. The following is the height data arranged in order:
147, 150, 163, 165, 165, 170, 175, 175, 175, 183, 183, 193
With the number of observations being even (N=12), the two middle results are the sixth and seventh values, 170 and 175, so the median is (170 + 175) / 2 = 172.5.
The mean is sometimes known as the Average and is defined as the sum of the non-missing observed values divided by the number of non-missing observations. This is usually denoted by the following formula:

$$\bar{x} = \frac{\sum x}{n}$$
where x represents the observed values and n is the number of observations. Using the height data the mean is:

$$\bar{x} = \frac{2044}{12} = 170.33$$
As well as location, the next important set of statistics describes how the observations are scattered. These statistics are usually called the measures of variation. The two most common are the Range and the Standard Deviation.
Range (and the related Minimum and Maximum)
The Minimum and Maximum values are the lowest and highest non-missing values within the data. The Range is the difference between the Maximum and Minimum values. Using the height data above, the Minimum is 147, the Maximum is 193 and the Range is 193 - 147 = 46.
Variance and Standard Deviation
The variance and standard deviation are statistics that show how widely the data is spread around the mean. The smaller the standard deviation, the closer the data is to the mean. Figure 3 shows what the standard deviation represents.
For normally distributed data, one standard deviation either side of the mean, represented as the green area in Figure 3, accounts for around 68 percent of a sample. Two standard deviations either side of the mean, represented as the yellow and green areas in Figure 3, account for around 95 percent of a sample. And three standard deviations either side of the mean, represented as the red, yellow and green areas in Figure 3, account for around 99.7 percent of a sample.
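The variance is the sample mean squared deviation, defined as the sum of the squared deviations from the mean divided by n-1, and is represented more commonly in books as

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$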
The reason for the n-1 divisor is the theoretical argument that dividing by n-1 gives a better (unbiased) estimate of the population variance than dividing by n.
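Since the variance is given in squared units, the square root of the variance is given in the original units. Thus the square root of the variance is the sample standard deviation, which has the formula (a)

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

After some algebra this becomes the formula (b)

$$s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}}$$

and after a little bit more the formula can be written as (c)

$$s = \sqrt{\frac{n \sum x^2 - (\sum x)^2}{n(n - 1)}}$$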
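The algebra linking the three forms is short. The following is a sketch, assuming (a) is the definitional form built on deviations from the mean and that (b) and (c) are the usual sum-of-squares rearrangements of it. It expands the sum of squared deviations, with $\bar{x}$ as the mean and n as the number of observations:

$$
\begin{aligned}
\sum (x - \bar{x})^2 &= \sum x^2 - 2\bar{x}\sum x + n\bar{x}^2 \\
&= \sum x^2 - \frac{(\sum x)^2}{n} \qquad \text{since } \bar{x} = \frac{\sum x}{n} \\
&= \frac{n\sum x^2 - (\sum x)^2}{n}
\end{aligned}
$$

Dividing each line by n-1 and taking the square root gives the three forms of the standard deviation in turn.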
I have presented the three formulae here as these are the ones most commonly found in textbooks, although (a) is the most common. Formula (b) is often termed the “Calculator” method as it only needs one pass of the data to do the calculation, but this can cause rounding problems, particularly when the number of observations gets large; hence most statistical programs use (a) for their calculations. This was not always the case: some, notably Excel, used formula (b) right up until version 2003! SAS uses formula (a) for its calculation.
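To make the "one pass" idea concrete, the sketch below accumulates the count, the sum and the sum of squares in a single read of the data and then applies the sum-of-squares form of the formula. It is an illustration only, not how SAS computes the statistic internally. The data set name heights and the variable names n, sumx, sumx2, var_b and sd_b are all assumptions made for this example.

```sas
/* HEIGHTS is an assumed data set name used for illustration only */
data heights;
    input height @@;
    datalines;
175 183 165 170 165 193 183 175 175 163 150 147
;
run;

/* One pass over the non-missing values, accumulating n, SUM(x) and SUM(x**2) */
data _null_;
    set heights(where=(not missing(height))) end=last;
    n     + 1;            /* sum statements retain their totals across rows */
    sumx  + height;
    sumx2 + height**2;
    if last then do;
        /* sum-of-squares ("calculator") form of the variance */
        var_b = (sumx2 - sumx**2 / n) / (n - 1);
        sd_b  = sqrt(var_b);
        put n= sumx= sumx2= var_b= sd_b=;   /* results are written to the log */
    end;
run;
```

For well-behaved data such as these heights, the logged standard deviation should agree with the value PROC UNIVARIATE reports. The rounding risk arises because the two terms being subtracted can both be very large and nearly equal.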
Now using the height data given above and using formula (a) for the computation the following calculations would be done:
Obs      x    x-mean(x)    (x-mean(x))²
---------------------------------------
  1    175       4.67          21.81
  2    183      12.67         160.53
  3    165      -5.33          28.41
  4    170      -0.33           0.11
  5    165      -5.33          28.41
  6    193      22.67         513.93
  7    183      12.67         160.53
  8    175       4.67          21.81
  9    175       4.67          21.81
 10    163      -7.33          53.73
 11    150     -20.33         413.31
 12    147     -23.33         544.29
---------------------------------------
N(x)=12
SUM(x)=2044
MEAN(x)=2044/12=170.33
SUM((x-mean(x))²)=1968.68
Hence the variance is 1968.68/11 = 178.97 and the Standard Deviation, its square root, is 13.38. Note that the example rounds its calculations to a low precision; a software package like SAS works at far higher precision, so its accuracy will be better. The same result will come out if the Standard Deviation is calculated using formula (b) or (c), but the sums involved, especially the sum of the squared values, may exceed what a calculator allows, particularly if the values are large.
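As a check, the sum-of-squares form can be worked through with the same data. The sum of the squared heights, 350130, is the figure SAS reports as the Uncorrected SS in the PROC UNIVARIATE output. The layout below is only an illustration and takes (b) to be the usual sum-of-squares rearrangement:

$$\sum x = 2044, \qquad \sum x^2 = 350130, \qquad n = 12$$

$$s = \sqrt{\frac{350130 - \frac{2044^2}{12}}{11}} = \sqrt{\frac{350130 - 348161.33}{11}} = \sqrt{\frac{1968.67}{11}} = \sqrt{178.97} = 13.38$$

The numerator, 1968.67, is the same sum of squared deviations found in the table, apart from the rounding in the table's individual rows.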
As stated at the start of this paper, these statistics can be calculated using the SAS UNIVARIATE procedure, although the MEANS, SUMMARY and other procedures that produce the same statistics can also be used. The SAS code needed is:
proc univariate;
   var height;
run;
The output from this program is shown below:
The UNIVARIATE Procedure
Variable: height

                          Moments
N                         12    Sum Weights               12
Mean              170.333333    Sum Observations        2044
Std Deviation     13.3779556    Variance          178.969697
Skewness          -0.2639558    Kurtosis           -0.153035
Uncorrected SS        350130    Corrected SS      1968.66667
Coeff Variation    7.8539857    Std Error Mean    3.86188314

              Basic Statistical Measures
    Location                    Variability
Mean     170.3333     Std Deviation            13.37796
Median   172.5000     Variance                178.96970
Mode     175.0000     Range                    46.00000
                      Interquartile Range      15.00000

           Tests for Location: Mu0=0
Test           -Statistic-    -----p Value------
Student's t    t  44.10629    Pr > |t|    <.0001
Sign           M         6    Pr >= |M|   0.0005
Signed Rank    S        39    Pr >= |S|   0.0005

Quantiles (Definition 5)
Quantile      Estimate
100% Max         193.0
99%              193.0
95%              193.0
90%              183.0
75% Q3           179.0
50% Median       172.5
25% Q1           164.0
10%              150.0
5%               147.0
1%               147.0
0% Min           147.0

        Extreme Observations
----Lowest----        ----Highest---
Value      Obs        Value      Obs
  147       12          175        8
  150       11          175        9
  163       10          183        2
  165        5          183        7
  165        3          193        6
From the output the statistics discussed above can be retrieved. The only statistics that may be hard to find are the Minimum and Maximum values, but these can be found in the Quantiles section of the output, in the 0% Min and 100% Max rows. In this case the Minimum value is 147 and the Maximum value is 193.
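When only the statistics discussed in this paper are wanted, they can also be requested by name. The following is a minimal sketch using PROC MEANS, one of the alternative procedures mentioned earlier. The data set name heights is an assumption made for illustration, and MAXDEC=2 is simply a display choice.

```sas
/* HEIGHTS is an assumed data set name used for illustration only */
data heights;
    input height @@;
    datalines;
175 183 165 170 165 193 183 175 175 163 150 147
;
run;

/* Request the location and variation statistics by keyword,
   so Minimum, Maximum and Range are printed directly */
proc means data=heights n mean median mode min max range var std maxdec=2;
    var height;
run;
```

This prints one row for height with a column per requested statistic, which avoids having to search the Quantiles section for the Minimum and Maximum.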
I hope this short paper has been useful. In coming months other papers in this mini-series will be presented.