Basic Statistics - Part II, Tests About the Mean of a Normal Distribution (t-test)

A common question is whether a sample has a particular mean, or whether two samples have the same mean. Just looking at a listing of the data does not give a definitive answer; some sort of test is required.

To carry out a test it is important to formulate what statisticians call a "hypothesis". Two competing statements are set up: the null hypothesis, denoted by H0, which is the claim being tested, and the alternative hypothesis, denoted by H1, which is the claim that is accepted if the data give enough evidence to reject H0.

There are tests for the case where the population standard deviation is known, but this very rarely occurs. So this article will jump straight to the situation where the standard deviation of the population is unknown, which is where the t-test is used to test the hypothesis.

SINGLE SAMPLE

The first part of this paper looks at the case where there is a single sample and the sample mean is used to test the hypothesis. The process for the test is summarized below:

  1. The first step is to state the null and alternative hypotheses. As the test is on the mean, the question being asked, for a two-tailed test, is

    H0: μ = μ0
    H1: μ ≠ μ0

    For a single tailed test the alternative hypothesis would be written as

    H1: μ > μ0

    or

    H1: μ < μ0

    depending on the question being asked.

  2. The next step in the procedure is to decide what the level of significance should be, usually denoted by α in most textbooks, and to calculate the boundaries of the critical region on the t curve. If a two-sided test is required then the α value is split evenly between the two tails of the t curve. To find the critical t-values it is necessary to consult a t-distribution table, an example of which is given below (the column headings are the area in one tail):

    
            t-Distribution
    
                                    significance
             Degrees    ------------------------------------
            of Freedom  0.100  0.050   0.025   0.010   0.005
            ------------------------------------------------
                 1      3.078  6.314  12.706  31.821  63.657
                 2      1.886  2.920   4.303   6.965   9.925
                 3      1.638  2.353   3.182   4.541   5.841
                 4      1.533  2.132   2.776   3.747   4.604
                 5      1.476  2.015   2.571   3.365   4.032
                 6      1.440  1.943   2.447   3.143   3.707
                 7      1.415  1.895   2.365   2.998   3.499
                 8      1.397  1.860   2.306   2.896   3.355
                 9      1.383  1.833   2.262   2.821   3.250
                10      1.372  1.812   2.228   2.764   3.169
                11      1.363  1.796   2.201   2.718   3.106
                12      1.356  1.782   2.179   2.681   3.055
                13      1.350  1.771   2.160   2.650   3.012
                14      1.345  1.761   2.145   2.624   2.977
                15      1.341  1.753   2.131   2.602   2.947
                16      1.337  1.746   2.120   2.583   2.921
                17      1.333  1.740   2.110   2.567   2.898
                18      1.330  1.734   2.101   2.552   2.878
                19      1.328  1.729   2.093   2.539   2.861
                20      1.325  1.725   2.086   2.528   2.845
                21      1.323  1.721   2.080   2.518   2.831
                22      1.321  1.717   2.074   2.508   2.819
                23      1.319  1.714   2.069   2.500   2.807
                24      1.318  1.711   2.064   2.492   2.797
                25      1.316  1.708   2.060   2.485   2.787
                26      1.315  1.706   2.056   2.479   2.779
                27      1.314  1.703   2.052   2.473   2.771
                28      1.313  1.701   2.048   2.467   2.763
                29      1.311  1.699   2.045   2.462   2.756
                30      1.310  1.697   2.042   2.457   2.750

    Before the correct t-value can be obtained it is necessary to decide on the "degrees of freedom", which is equal to n-1. If there are 15 observations then the degrees of freedom to look up in the table is 15-1=14. If the test is two-sided and uses an α level of 0.10 then 0.05 lies in each tail and the boundaries are t0.05=1.761 and -t0.05=-1.761.

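    3. The only real computation to do in this test is to calculate a value for t from the data using the following formula:

    $$ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \qquad (1.1) $$

    where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation and $n$ is the number of observations.

    4. With the critical boundaries known and a value of t computed from the formula above it is now decision time: the hypothesis H0 is rejected (and H1 accepted) if the value of t is in the critical region, otherwise H0 is accepted.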
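
    As an alternative to reading the table, the critical values in step 2 can be computed directly. The following is a minimal sketch that assumes the TINV function of Base SAS, which returns a quantile of the t distribution; the variable names are illustrative and do not come from the original text.

    ```sas
    /* Critical t-values for a two-sided test, computed instead of looked up */
    data _null_;
        n     = 15;                      /* sample size from the example in step 2 */
        alpha = 0.10;                    /* two-sided significance level */
        df    = n - 1;                   /* degrees of freedom */
        t_hi  = tinv(1 - alpha/2, df);   /* upper boundary: quantile at 1 - alpha/2 */
        t_lo  = -t_hi;                   /* lower boundary, by symmetry */
        put df= t_lo= 8.3 t_hi= 8.3;
    run;
    ```

    For the 15-observation example the upper boundary should agree with the table entry of 1.761 for 14 degrees of freedom.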

    The following examples demonstrate the procedure used for the tests.

    Example 1

    A sample of eight bottles of a certain product was taken and the liquid content of each was measured; the results are below:

    369, 357, 356, 364, 348, 361, 345, 364

    The researcher wants to test the null hypothesis that the mean equals 355 versus the alternative that it does not. Let α = 0.01.

    1. H0: μ = 355, H1: μ ≠ 355

    2. As this is a two-tailed test, and α=0.01 with (8-1)=7 degrees of freedom then from the t-table t0.005=3.499, -t0.005=-3.499.

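    3. Computation: the sample mean is 358 and the sample standard deviation is 8.246, so from (1.1) t = (358 - 355) / (8.246 / √8) = 1.029.

    4. As -3.499 < 1.029 < 3.499, t is outside the critical region, so accept H0.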
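
    To reproduce the figures used in step 3 with SAS, the eight measurements first need to be in a dataset. A minimal sketch follows; the dataset name PRODA and the variable name VOLUME are the ones used in the SAS code in this article, while the use of PROC MEANS to print the summary statistics is an assumption of this sketch.

    ```sas
    /* Load the Example 1 measurements; @@ reads several values per line */
    data proda;
        input volume @@;
        datalines;
    369 357 356 364 348 361 345 364
    ;
    run;

    /* Summary statistics that feed formula (1.1) */
    proc means data=proda n mean std stderr maxdec=4;
        var volume;
    run;
    ```

    The listing should show N=8, a mean of 358, a standard deviation of 8.2462 and a standard error of 2.9155, matching the procedure output shown in this article.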

    Now looking at SAS, how is the same test done? There is a procedure called TTEST that will do the calculations; however, the default output and its interpretation are rather different. Using the same data as in Example 1 (loaded into a dataset called PRODA with a variable VOLUME) and running the following code, in which the H0= option supplies the hypothesised mean of 355,

    
        proc ttest data=proda h0=355;   /* without H0= the test is against a mean of 0 */
            var volume;
        run;

    will produce the following output

    
                                           The TTEST Procedure
    
                                                Statistics
    
                          Lower CL            Upper CL   Lower CL             Upper CL
       Variable       N       Mean     Mean       Mean    Std Dev   Std Dev    Std Dev   Std Err
    
       volume         8     346.25      358     369.75     4.6472    8.2462     24.477    2.9155
    
    
                                                 T-Tests
    
                                 Variable      DF    t Value    Pr > |t|
    
                                 volume         7       1.03      0.3377

    From the output, how is a decision made as to whether to accept or reject the hypothesis? The number to look for is under the label "Pr > |t|", known as the p-value. This is already a two-sided probability, so it is compared directly against the significance level α: if the p-value is greater than α then accept H0, otherwise reject H0 and accept H1. In this case α was 0.01, and as 0.3377 > 0.01, accept H0.

    Is there a way to check that the p-value from the TTEST procedure agrees with the hand calculation? After all, the hand procedure for deciding between the hypotheses is well known and appears in countless textbooks. The PROBT function in Base SAS gives the cumulative probability of the t distribution, from which the p-value for the t value computed in Example 1 can be obtained, as the data step below shows together with its result:

    
        data _null_;
           x=1.029;
           df=7;
           p=(1-probt(abs(x),df))*2; /* p-value of a two-tailed t test */
           put p=;
        run;
        p=0.3377168268

    For the single sample case the t-test can also be done with the UNIVARIATE procedure and its MU0= option, as the following SAS code and output show (look at the Student's t result in the Tests for Location section):

    
        proc univariate data=proda mu0=355;
            var volume;
        run;
    
                            The UNIVARIATE Procedure
                                Variable:  volume
    
                                    Moments
    
        N                           8    Sum Weights                  8
        Mean                      358    Sum Observations          2864
        Std Deviation      8.24621125    Variance                    68
        Skewness            -0.480995    Kurtosis            -0.7557588
        Uncorrected SS        1025788    Corrected SS               476
        Coeff Variation    2.30341096    Std Error Mean      2.91547595
    
    
                           Basic Statistical Measures
    
                 Location                    Variability
    
             Mean     358.0000     Std Deviation            8.24621
             Median   359.0000     Variance                68.00000
             Mode     364.0000     Range                   24.00000
                                   Interquartile Range     12.00000
    
    
                          Tests for Location: Mu0=355
    
                Test           -Statistic-    -----p Value------
    
                Student's t    t  1.028992    Pr > |t|    0.3377
                Sign           M         2    Pr >= |M|   0.2891
                Signed Rank    S         7    Pr >= |S|   0.3672
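
    When only the test results are of interest, the listing can be trimmed with ODS. This sketch assumes the ODS table name TestsForLocation, which is the name PROC UNIVARIATE gives to its Tests for Location table, and the PRODA dataset described earlier.

    ```sas
    /* Keep only the Tests for Location table from the next procedure step */
    ods select TestsForLocation;

    proc univariate data=proda mu0=355;
        var volume;
    run;
    ```

    The ODS SELECT statement applies to the next procedure step only, so later procedures print their full output again.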

    It is also possible to do the calculations using data step code within Base SAS, as outlined below, and get output like the following:

    
        data _null_;
            ...SAS Statements...
            tsigl=-abs(tinv(alpha,df));
            tsigh=abs(tinv(alpha,df));
            tval=(mean-mju)/(std/sqrt(n));
            p=(1-probt(abs(tval),df))*2;
            ...more SAS Statements...
        run;
    
        --- Output ---
        T-TEST
    
        Dataset = PRODA
        Variable = volume
    
        H0 = 355 , H1 ^= 355
        alpha = 0.01 (2-sided test: alpha/2=0.005)
    
        N=   8
        Mean=358
        S=   8.2462112512
        DF=  7
    
        -3.499483297 < 1.0289915109 < 3.4994832974 : accept H0, reject H1
        Pr > |t| = 0.3377205477

    I keep the data step method as a macro in my macro collection. Every programmer should carry around something that contains their useful or frequently used code.
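
    The elided statements in the data step above are not reproduced here. As a separate illustration only, and not the author's macro, the manual method can be sketched end to end as follows. It assumes the PRODA dataset with the variable VOLUME; the dataset name STATS and the variable names AVG, SD, MU0, TCRIT and DECISION are hypothetical.

    ```sas
    /* Summary statistics for the variable under test */
    proc means data=proda noprint;
        var volume;
        output out=stats n=n mean=avg std=sd;
    run;

    /* Manual single-sample t-test from the summary statistics */
    data _null_;
        set stats;
        length decision $ 16;
        mu0   = 355;                              /* hypothesised mean */
        alpha = 0.01;                             /* two-sided significance level */
        df    = n - 1;                            /* degrees of freedom */
        tcrit = tinv(1 - alpha/2, df);            /* upper critical boundary */
        tval  = (avg - mu0) / (sd / sqrt(n));     /* formula (1.1) */
        p     = 2 * (1 - probt(abs(tval), df));   /* two-sided p-value */
        if abs(tval) >= tcrit then decision = 'reject H0';
        else decision = 'accept H0';
        put n= avg= sd= df=;
        put tcrit= tval= p= decision=;
    run;
    ```

    Wrapping these two steps in a macro with the dataset, variable, hypothesised mean and significance level as parameters would be one way to make the code reusable.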

    Example 2

    A company took a random sample of ten components and clocked the time a machine took to recondition and inspect each component (in seconds). The times were

    5.7, 4.8, 5.9, 4.9, 6.1, 4.2, 6.5, 6.4, 5.8, 5.7

    The goal is to have an average time of 5 seconds. Using a significance level of 0.01, was the goal met?

    1. H0: μ = 5, H1: μ > 5 (only concerned if the average time is greater than 5 seconds)

    2. As this is a one-tailed test and α=0.01 with (10-1)=9 degrees of freedom then from the t-table t0.01=2.821.

    3. Computation: the sample mean is 5.6 and the sample standard deviation is 0.7409, so from (1.1) t = (5.6 - 5) / (0.7409 / √10) = 2.561.

    4. As 2.561 < 2.821, t is outside the critical region, so accept H0.

    In this example, had the significance level been 0.05 (t0.05=1.833) then the result would have been quite different: as 1.833 <= 2.561, reject H0 and accept H1.

    Using SAS and the UNIVARIATE procedure (with MU0=5) the output is:

    
                            The UNIVARIATE Procedure
                                Variable:  time
    
                                    Moments
    
        N                          10    Sum Weights                 10
        Mean                      5.6    Sum Observations            56
        Std Deviation      0.74087036    Variance            0.54888889
        Skewness           -0.7500206    Kurtosis            -0.2612091
        Uncorrected SS         318.54    Corrected SS              4.94
        Coeff Variation    13.2298278    Std Error Mean      0.23428378
    
    
                           Basic Statistical Measures
    
                 Location                    Variability
    
             Mean     5.600000     Std Deviation            0.74087
             Median   5.750000     Variance                 0.54889
             Mode     5.700000     Range                    2.30000
                                   Interquartile Range      1.20000
    
    
                           Tests for Location: Mu0=5
    
                Test           -Statistic-    -----p Value------
    
                Student's t    t  2.560997    Pr > |t|    0.0306
                Sign           M         2    Pr >= |M|   0.3438
                Signed Rank    S        19    Pr >= |S|   0.0488

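    A decision is made as to whether to accept H0 or H1 by comparing the p-value against the significance level. The "Pr > |t|" value is two-sided but this test is one-tailed, so, because t is positive (the direction of H1), the value is halved first: 0.0306/2 = 0.0153. As 0.01 < 0.0153, accept H0.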
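
    The relationship between the two-sided value that SAS prints and the one-sided value needed for this test can be checked with the PROBT function. This is a small sketch; the variable names P_TWO and P_UPPER are illustrative.

    ```sas
    /* One-sided versus two-sided p-value for Example 2 */
    data _null_;
        tval = 2.560997;   /* Student's t from the UNIVARIATE output */
        df   = 9;          /* n - 1 = 10 - 1 */
        p_two   = 2 * (1 - probt(abs(tval), df));   /* two-sided, as printed under Pr > |t| */
        p_upper = 1 - probt(tval, df);              /* one-sided, for H1: mu > mu0 */
        put p_two= p_upper=;
    run;
    ```

    P_TWO should come out at about 0.0306 and P_UPPER at half of that. Halving is only valid when t lies on the side named by H1; had t been negative here, the one-sided p-value would have been greater than 0.5.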

    TWO SAMPLES

    The second part of this paper discusses using the t-test to compare two means when the population variances are unknown but assumed to be equal.

    The hypothesis that is usually tested is that the means are equal, denoted by H0: μ1 = μ2. The hypothesis can also be rewritten as H0: μ1 - μ2 = 0. This is useful as it is then easy to write the test to check whether the difference is larger or smaller than a specified value, sometimes denoted in textbooks as ∆.

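    The test is very much the same as for the single sample except that the calculation for t is now:

    $$ t = \frac{(\bar{x}_1 - \bar{x}_2) - \Delta}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \qquad (2.1) $$

    where

    $$ s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2} \qquad (2.2) $$

    and $\bar{x}_i$, $s_i$ and $n_i$ are the mean, standard deviation and size of sample $i$.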

    Summarizing the procedure the process would be:

    1. The first step is to state the null and alternative hypotheses. As the test is on the means, the question being asked, for a two-tailed test, is

      H0: μ1 - μ2 = ∆
      H1: μ1 - μ2 ≠ ∆

      For a single tailed test the alternative hypothesis would be written as

      H1: μ1 - μ2 >∆

      or

      H1: μ1 - μ2 <∆

      depending on the question being asked.

    2. The next step in the procedure is to decide what the level of significance should be, as above.

      Before the correct t-value can be obtained it is necessary to decide on the "degrees of freedom", which is equal to n1 + n2 - 2. If there are 7 observations in sample 1 and 8 observations in sample 2, then the degrees of freedom to look up in the table is 7+8-2=13. If the test is two-sided and uses an α level of 0.10 then the boundaries are t0.05=1.771 and -t0.05=-1.771.

    3. The only real computation to do in this test is to calculate a value for t from the data using calculations 2.1 and 2.2.

    4. With the critical boundaries known and a value of t computed from the formulas above it is now decision time: the hypothesis H0 is rejected (and H1 accepted) if the value of t is in the critical region, otherwise H0 is accepted.

    The following examples demonstrate the procedure used for the tests.

    Example 3

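    Samples of free range eggs were taken from two farms, and a test was requested to determine whether the mean weight (in ounces) of the eggs from the two farms is the same, using a significance level of 0.01:

        Farm A: 20, 28, 24, 20, 24, 21, 17, 28, 25, 19
        Farm B: 29, 16, 25, 27, 27, 18, 22, 27

    1. H0: μ1 - μ2 = 0, H1: μ1 - μ2 ≠ 0

    2. As this is a two-tailed test, and α=0.01 with (10+8-2)=16 degrees of freedom then from the t-table t0.005=2.921, -t0.005=-2.921.

    3. Computation: the sample means are 22.6 and 23.875 and the sample standard deviations are 3.777 and 4.734. From (2.2) the pooled standard deviation is 4.2225, and from (2.1) t = (22.6 - 23.875 - 0) / (4.2225 × √(1/10 + 1/8)) = -0.637.

    4. As -2.921 < -0.637 < 2.921, t is outside the critical region, so accept H0.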
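
    To run this example in SAS the egg weights need to be in a dataset. A minimal sketch follows, using the dataset name EGGS0 and the variable names FARM and WTOZ that appear in the SAS code in this article; coding FARM as 1 for Farm A and 2 for Farm B is an assumption, chosen to be consistent with the group sizes in the procedure output.

    ```sas
    /* Example 3 data: one observation per egg, farm 1 = Farm A, farm 2 = Farm B */
    data eggs0;
        input farm wtoz @@;
        datalines;
    1 20 1 28 1 24 1 20 1 24 1 21 1 17 1 28 1 25 1 19
    2 29 2 16 2 25 2 27 2 27 2 18 2 22 2 27
    ;
    run;
    ```

    Each pair of numbers is a farm code followed by a weight, and the trailing @@ lets the INPUT statement read several pairs from one line.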

    A two-sample t-test cannot be done with the Base SAS procedures; a procedure from another module is needed, for example TTEST from the SAS/STAT module. Using the data above, with the variable FARM indicating which farm the sample came from and WTOZ holding the weight in ounces, and the following code:

    
        proc ttest data=eggs0;
            class farm;
            var wtoz;
        run;

    the following output appears:

    
                                               The TTEST Procedure
    
                                                   Statistics
    
                                       Lower CL          Upper CL  Lower CL           Upper CL
        Variable  farm              N      Mean    Mean      Mean   Std Dev  Std Dev   Std Dev  Std Err
    
        wtoz                 1     10    19.898    22.6    25.302     2.598   3.7771    6.8956   1.1944
        wtoz                 2      8    19.917  23.875    27.833      3.13    4.734     9.635   1.6737
        wtoz      Diff (1-2)             -5.521  -1.275     2.971    3.1448   4.2225    6.4264   2.0029
    
    
                                                     T-Tests
    
                      Variable    Method           Variances      DF    t Value    Pr > |t|
    
                      wtoz        Pooled           Equal          16      -0.64      0.5334
                      wtoz        Satterthwaite    Unequal      13.3      -0.62      0.5457
    
    
                                              Equality of Variances
    
                          Variable    Method      Num DF    Den DF    F Value    Pr > F
    
                          wtoz        Folded F         7         9       1.57    0.5173

    As with the UNIVARIATE procedure above, the result to look at is the "Pr > |t|" value, in the row for the Pooled method. (If the assumption is that the two populations have the same variance then the Pooled value is used, otherwise the Satterthwaite value is used. The distinction is too advanced for this paper; an interested reader should refer to the SAS documentation.) As with the single sample method, the decision as to whether to accept H0 or H1 is made by comparing the "Pr > |t|" value against the significance level. The value is already two-sided, so it is compared directly with α: in this case 0.01 < 0.5334, so accept H0.

    To check that the p-value from the TTEST procedure agrees with the t value that was calculated by hand above, the following data step code is used:

    
        data _null_;
            x=-0.637;
            df=16;
            p=(1-probt(abs(x),df))*2; /* p-value of a two-tailed t test */
            put p=;
        run;
        p=0.533133971

    The value 0.5331 is close to the 0.5334 from the TTEST procedure; the difference arises because the t value entered here was rounded to three decimal places.

    A good programmer will carry around a piece of SAS code that does this test in Base SAS, that is, the manual method plus the calculation of the p-value.
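
    One possible shape for such code is sketched below. It is an illustration under stated assumptions, not a definitive implementation: it assumes the EGGS0 dataset with FARM taking exactly two values, and the names STATS, AVG, VARIANCE, DELTA and SP2 are hypothetical.

    ```sas
    /* Per-farm summary statistics: one row per farm, in FARM order */
    proc means data=eggs0 noprint nway;
        class farm;
        var wtoz;
        output out=stats n=n mean=avg var=variance;
    run;

    /* Manual pooled two-sample t-test, formulas (2.1) and (2.2) */
    data _null_;
        set stats end=last;
        retain n1 m1 v1;
        if _n_ = 1 then do;              /* first row: sample 1 */
            n1 = n; m1 = avg; v1 = variance;
        end;
        if last then do;                 /* second row: sample 2 */
            n2 = n; m2 = avg; v2 = variance;
            delta = 0;                                     /* hypothesised difference */
            df    = n1 + n2 - 2;                           /* degrees of freedom */
            sp2   = ((n1-1)*v1 + (n2-1)*v2) / df;          /* pooled variance */
            tval  = (m1 - m2 - delta) / sqrt(sp2 * (1/n1 + 1/n2));
            p     = 2 * (1 - probt(abs(tval), df));        /* two-sided p-value */
            put df= sp2= tval= p=;
        end;
    run;
    ```

    For the Example 3 data the t value and p-value written to the log should agree with the Pooled row of the TTEST output.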

    Example 4

    Samples of free range eggs were taken from two farms, and a test was requested to determine whether the mean weight (in ounces) of the eggs from Farm A exceeds that of Farm B by more than three ounces, using a significance level of 0.01:

        Farm A: 26, 26, 28, 33, 32, 27, 24, 24
        Farm B: 19, 16, 26, 18, 28, 20, 18, 23, 18, 27

    1. H0: μ1 - μ2 = 3, H1: μ1 - μ2 > 3

    2. As this is a one-tailed test, and α=0.01 with (8+10-2)=16 degrees of freedom then from the t-table t0.01=2.583.

    3. Computation: the sample means are 27.5 and 21.3 and the pooled standard deviation from (2.2) is 3.9536, so from (2.1) t = (27.5 - 21.3 - 3) / (3.9536 × √(1/8 + 1/10)) = 1.71.

    4. As 1.71 < 2.583, t is outside the critical region, so accept H0.

    Using the TTEST procedure and with the following SAS code

    
        proc ttest data=eggs0 H0=3;
            class farm;
            var wtoz;
        run;

    the following output is generated:

    
                                               The TTEST Procedure
    
                                                   Statistics
    
                                       Lower CL          Upper CL  Lower CL           Upper CL
        Variable  farm              N      Mean    Mean      Mean   Std Dev  Std Dev   Std Dev  Std Err
    
        wtoz                 1      8    23.317    27.5    31.683    1.9863   3.3806    8.9927   1.1952
        wtoz                 2     10    16.832    21.3    25.768    2.6853   4.3474    9.9017   1.3748
        wtoz      Diff (1-2)             0.7224     6.2    11.678    2.7016   3.9536     6.974   1.8754
    
    
                                                     T-Tests
    
                      Variable    Method           Variances      DF    t Value    Pr > |t|
    
                      wtoz        Pooled           Equal          16       1.71      0.1073
                      wtoz        Satterthwaite    Unequal        16       1.76      0.0981
    
    
                                              Equality of Variances
    
                          Variable    Method      Num DF    Den DF    F Value    Pr > F
    
                          wtoz        Folded F         9         7       1.65    0.5198

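    A decision is made as to whether to accept H0 or H1 by comparing the p-value against the significance level. As the test is one-tailed and t is positive (the direction of H1), the two-sided "Pr > |t|" value is halved first: 0.1073/2 is about 0.054. As 0.01 < 0.054, accept H0.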
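
    Later releases of SAS/STAT can report the one-sided p-value directly. The sketch below assumes a release in which PROC TTEST supports the SIDES= option (believed to be SAS 9.2 and later; the output shown above comes from an earlier layout), and it uses a hypothetical dataset name EGGS4 with FARM coded 1 for Farm A and 2 for Farm B.

    ```sas
    /* Example 4 data: farm 1 = Farm A, farm 2 = Farm B */
    data eggs4;
        input farm wtoz @@;
        datalines;
    1 26 1 26 1 28 1 33 1 32 1 27 1 24 1 24
    2 19 2 16 2 26 2 18 2 28 2 20 2 18 2 23 2 18 2 27
    ;
    run;

    /* Upper-tailed test of H0: mu1 - mu2 = 3 against H1: mu1 - mu2 > 3 */
    proc ttest data=eggs4 h0=3 sides=u alpha=0.01;
        class farm;
        var wtoz;
    run;
    ```

    With SIDES=U the p-value column is labelled "Pr > t" and can be compared with α without halving.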


Updated: January 11, 2011