Student's “t”

The purpose of this and previous posts is to show that statistics is basically linear algebra and intrinsically simple. In previous posts I showed that gathered data can be expressed simply by two lines, the lengths of which represent the mean and the standard deviation of the data respectively. It was also established that the two lines are independent of one another. If we knew that they represented the true values of each, we would be through. However, that is not the case. The mean is always sort of “fuzzy” in that we can’t be sure that what we measured is the true value. Measuring that uncertainty is where it gets complicated. There is usually uncertainty in the standard deviation as well, but not always. Facilities that routinely manufacture a product may have enough data on the standard deviation to assume that their value represents the true value.

True Value of the Standard Deviation is Known

 In this case the normal distribution is all that is needed to evaluate the uncertainty of the mean.
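As a rough illustration of the known-SD case, here is a minimal sketch of the usual textbook interval for the true mean. The sample values, the known standard deviation of 3.0 and the function name `mean_interval_known_sd` are all made up for the example. Note that the fuzziness of the mean shrinks with the square root of the number of data points.

```python
from math import sqrt
from statistics import NormalDist, mean


def mean_interval_known_sd(data, sigma, confidence=0.95):
    """Two-sided interval for the true mean when sigma is known exactly."""
    n = len(data)
    xbar = mean(data)
    # Normal quantile that leaves (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width


# Hypothetical measurements; sigma = 3.0 is assumed known from long experience.
sample = [98.0, 103.0, 95.0, 101.0, 99.0, 104.0, 97.0, 100.0, 102.0]
low, high = mean_interval_known_sd(sample, sigma=3.0)
print(f"true mean is probably between {low:.2f} and {high:.2f}")
```

Only the normal distribution is involved here because sigma is treated as exact. Once sigma has to be estimated from the same data, this interval is too narrow.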


True Value of the Standard Deviation is Not Known

Usually the standard deviation is also fuzzy, so both the mean and the standard deviation can be considered random variables. While the mean is normally distributed, the square of the standard deviation (the variance) follows, up to a scale factor, the chi squared distribution. (The chi squared distribution with one degree of freedom is the distribution of the square of a standard normal variable.) However, the distribution that we want is that of the mean divided by the standard deviation, both of which are random variables.

Derivation of the “t” Distribution

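The complication is that what we need now is the distribution of a normal variable divided by the square root of a chi squared variable. For a ratio $Z = Y/X$ of two independent random variables, the density is:

$$f_Z(z) = \int_{-\infty}^{\infty} |x| \, f_X(x) \, f_Y(zx) \, dx$$

where $f_Y$ is the normal density (evaluated at $zx$), $f_X$ is the density of the square root of the chi squared variable (divided by its degrees of freedom), and $f_Z$ is the density of the “t” distribution.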

The full derivation of the “t” distribution may be found in Statistical Inference, Vijay K. Rohatgi, John Wiley & Sons, 1984.
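The ratio integral can also be checked numerically without going through the derivation. The sketch below is not taken from Rohatgi. It assumes the standard construction in which the denominator is the square root of a chi squared variable divided by its degrees of freedom `nu`, and it compares a simple midpoint-rule integration with the well-known closed form of the “t” density. The function names, the integration limit and the step count are my own choices.

```python
from math import exp, lgamma, log, pi, sqrt


def normal_pdf(y):
    """Standard normal density."""
    return exp(-0.5 * y * y) / sqrt(2.0 * pi)


def scaled_chi_pdf(x, nu):
    """Density of X = sqrt(V / nu), where V is chi squared with nu degrees of freedom."""
    if x <= 0.0:
        return 0.0
    v = nu * x * x
    # Log of the chi squared density at v, then the change of variable dv/dx = 2*nu*x.
    log_chi2 = (nu / 2 - 1) * log(v) - v / 2 - (nu / 2) * log(2.0) - lgamma(nu / 2)
    return 2.0 * nu * x * exp(log_chi2)


def t_pdf_by_integral(z, nu, upper=12.0, steps=20000):
    """Midpoint-rule estimate of the ratio integral (x > 0, so |x| = x)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x * scaled_chi_pdf(x, nu) * normal_pdf(z * x)
    return total * h


def t_pdf_closed_form(z, nu):
    """Closed-form density of the t distribution with nu degrees of freedom."""
    log_c = lgamma((nu + 1) / 2) - lgamma(nu / 2) - 0.5 * log(nu * pi)
    return exp(log_c) * (1 + z * z / nu) ** (-(nu + 1) / 2)


for nu in (1, 5, 30):
    for z in (0.0, 1.0, 2.5):
        print(nu, z, t_pdf_by_integral(z, nu), t_pdf_closed_form(z, nu))
```

The two columns should agree to several decimal places, which is a reasonable sanity check that the integral really does produce the “t” distribution.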

So we see that while the basic concepts of statistics are simple, the problem of uncertainty is complex.


Reliability of Data

In a previous entry I showed that the basic concepts of quality control, which depend upon the laws of probability (statistics), are surprisingly simple. All that we are trying to do is measure lengths of lines. The equations used to calculate the mean and standard deviation are those that describe only two lines, so that no matter how many samples are tested, the calculations of those parameters result in just those two lines, which are independent of each other. While “n” data points occupy “n” dimensions, the mean and standard deviation occupy only two. We can use the standard deviation as the ruler to measure the lengths of interest.
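The two-line picture can be sketched in a few lines of code. This is only an illustration with made-up data. It splits the data vector into a part along the all-ones direction (the mean line) and what is left over (the standard deviation line), and checks that the two are at right angles. The scale factors of sqrt(n) and sqrt(n - 1) are the usual ones for the sample mean and sample standard deviation.

```python
from math import isclose, sqrt
from statistics import mean, stdev

data = [98.0, 103.0, 95.0, 101.0, 99.0]  # hypothetical measurements
n = len(data)
xbar = mean(data)

# Component of the data vector along the direction (1, 1, ..., 1).
mean_part = [xbar] * n
# What is left over after removing the mean component.
resid_part = [x - xbar for x in data]

# A dot product of zero means the two lines are perpendicular.
dot = sum(m * r for m, r in zip(mean_part, resid_part))
len_mean = sqrt(sum(m * m for m in mean_part))
len_resid = sqrt(sum(r * r for r in resid_part))

print("dot product:", dot)
print("mean line length / sqrt(n):", len_mean / sqrt(n))
print("residual line length / sqrt(n - 1):", len_resid / sqrt(n - 1))
print("sample SD from the library:", stdev(data))
```

However many data points there are, the result is always just these two perpendicular lengths.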

What makes things difficult is the fuzziness of those lines. In quality control the first thing we want to determine is the distance from the measured length (sample mean) to some desired length. To do that we use a ruler in which the standard deviation is set to be one. For convenience, and because the standard deviation is defined from the second moment around the mean, the targeted mean is subtracted from the data points so that the resulting length of the data vector is reduced to the difference between the sample mean and the target. That length is then divided by the standard deviation. The resulting length is then measured not in inches or millimeters but rather in units of the standard deviation ruler. As an example, assume that 100 was the target value, the measured mean was 85 and the standard deviation was 10. We are not interested in what the actual measured mean is, but rather how close it is to the target, based upon the standard deviation ruler:

1. (100-85)/10 = a distance of 1.5 SD units. In some cases the measurement is not from the desired target, but to upper and lower limits.
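The same ruler calculation as a small sketch. The function name `sd_units` and the specification limits of 70 and 110 are hypothetical. Only the target of 100, the mean of 85 and the SD of 10 come from the example above.

```python
def sd_units(measured_mean, reference, sd):
    """Distance from the measured mean to a reference value, in SD units."""
    return (reference - measured_mean) / sd


measured_mean, sd = 85, 10

# Distance to the target value.
to_target = sd_units(measured_mean, 100, sd)
print(to_target)  # 1.5

# Distance to hypothetical lower and upper limits instead of a target.
to_lower = sd_units(measured_mean, 70, sd)
to_upper = sd_units(measured_mean, 110, sd)
print(to_lower, to_upper)  # -1.5 2.5
```

The sign only records on which side of the mean the reference lies. The length itself is the absolute value.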

However, the mean value is fuzzy and the standard deviation may or may not be fuzzy. The data generated in calculating the mean make up a random variable X = (x1, x2, ..., xn) in vector space. How fuzzy the mean is depends upon the length of the SD and the type of distribution. While there are many distributions, if the SD is not fuzzy, what is called the normal distribution is often used. Because of the uncertainty in the mean, the distribution function tells us the chances of the mean actually being somewhere else. In example 1, with only the mean being fuzzy and using the normal distribution, 6.68% of the distribution lies 1.5 SD units or more to one side, so we can say that there is a 6.68% chance that the true mean is as far away as the desired mean or farther.
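The 6.68% figure is the area in one tail of the normal distribution beyond 1.5 SD units. A minimal check using Python's standard library (the variable names are mine):

```python
from statistics import NormalDist

distance_in_sd_units = 1.5  # from example 1: (100 - 85) / 10

# Area under the standard normal curve beyond 1.5 SD units on one side.
tail = 1.0 - NormalDist().cdf(distance_in_sd_units)
print(f"{tail:.2%}")  # 6.68%
```

This is a one-sided figure. If deviations in either direction matter, the area in both tails would be twice as large.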

Unfortunately, the SD often is fuzzy too and is thus also a random variable. The square of the SD is called the variance, and has its own distribution function called the chi squared distribution. While the form of the normal distribution does not depend upon the number of data points defining the random variable, the form of the chi squared distribution depends upon the degrees of freedom. The chi squared distribution with one degree of freedom is the distribution of the square of a standard normal variable. The chi squared distribution may be used to determine whether a measured standard deviation is consistent with a known value. Deciding whether two measured standard deviations are really the same uses the ratio of two chi squared variables (the F distribution).
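The link between the normal and chi squared distributions can be seen in a small simulation. This is only a sketch: the seed, the sample sizes and the function names are arbitrary choices of mine. It squares standard normal draws and compares them with the chi squared distribution for one degree of freedom, then shows that sums of several squared draws behave differently as the degrees of freedom change.

```python
import random
from math import erf, sqrt

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [rng.gauss(0.0, 1.0) for _ in range(100_000)]


def chi2_cdf_1dof(q):
    """Chi squared CDF with one degree of freedom."""
    return erf(sqrt(q / 2.0))


# Fraction of squared normal draws below q, versus the chi squared CDF at q.
q = 3.841
empirical = sum(1 for z in draws if z * z <= q) / len(draws)
print(empirical, chi2_cdf_1dof(q))


def mean_sum_of_squares(k, trials=20_000):
    """Average of the sum of k squared standard normal draws."""
    total = 0.0
    for _ in range(trials):
        total += sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))
    return total / trials


# The average should come out close to the degrees of freedom k.
for k in (1, 5, 10):
    print(k, mean_sum_of_squares(k))
```

The average of the sum tracks the degrees of freedom, which is one way to see that the chi squared distribution changes form with the number of data points while the normal does not.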

How the fuzziness or uncertainty is handled will be covered later. Although the mathematics gets more complex, especially when multivariate sets of data must be considered, the goal is still to simply measure lengths with a specific ruler.