Reliability of Data
In a previous entry I showed that the basic concepts of quality control, which depends upon the laws of probability (statistics), are surprisingly simple. All that we are trying to do is measure lengths of lines. The equations used to calculate the mean and standard deviation are those that describe only two lines so that no matter how many samples are tested, the calculations of those parameters result in just those two lines which are independent of each other. While “n” data points occupy “n” dimensions, the mean and standard deviation occupy only two. We can use the standard deviation as the ruler to measure the lengths of interest.
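As a reminder of what "two independent lines" means, here is a minimal sketch in Python. The five data values are hypothetical, chosen only so that the mean comes out to 85. It splits the data vector into a part along the mean direction and a leftover part, then checks that the two parts are perpendicular and that the leftover length is the standard deviation scaled by the square root of n - 1.

```python
import math
from statistics import mean, stdev

# Hypothetical sample, chosen only for illustration (its mean is 85).
data = [82.0, 91.0, 78.0, 95.0, 79.0]
n = len(data)
m = mean(data)

# Line 1: the component of the data vector along the (1, 1, ..., 1) direction.
mean_part = [m] * n
# Line 2: what is left over after removing the mean component.
resid_part = [x - m for x in data]

# The two lines are perpendicular: their dot product is zero.
dot = sum(a * b for a, b in zip(mean_part, resid_part))

# The length of the leftover line is sqrt(n - 1) times the sample SD.
resid_len = math.sqrt(sum(r * r for r in resid_part))

print(dot)
print(resid_len, math.sqrt(n - 1) * stdev(data))
```

However many data points are supplied, the calculation always reduces to these two lengths.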
What makes things difficult is the fuzziness of those lines. In quality control the first thing we want to determine is the distance from the measured length (sample mean) to some desired length. To do that we use a ruler in which the standard deviation is set to be one. For convenience, and because the standard deviation is the square root of the variance, which is defined as the second moment around the mean, the targeted mean is subtracted from the data points so that the resulting length of the data vector is reduced to the difference between the sample mean and the target. That length is then divided by the standard deviation. The resulting length is then measured not in inches or millimeters but rather in units of the standard deviation ruler. As an example, assume that 100 was the target value, the measured mean was 85 and the standard deviation was 10. We are not interested in what the actual measured mean is, but rather how close it is to the target, based upon the standard deviation ruler:
1. (100-85)/10 = a distance of 1.5 SD units.
In some cases the measurement is not from the desired target, but to upper and lower limits.
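The arithmetic of example 1 can be written as a small Python sketch. The function name and the upper and lower limits (110 and 70) are my own hypothetical choices for illustration; only the target of 100, the mean of 85 and the SD of 10 come from the example.

```python
def distance_in_sd_units(from_value, to_value, sd):
    """Distance between two values, measured with the SD as the ruler."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (to_value - from_value) / sd

sample_mean = 85.0
sd = 10.0

# Example 1: distance from the sample mean to the target of 100.
z = distance_in_sd_units(sample_mean, 100.0, sd)
print(z)  # 1.5

# Hypothetical limits, not from the example: distances to each limit.
upper_z = distance_in_sd_units(sample_mean, 110.0, sd)  # 2.5 SD units below the upper limit
lower_z = distance_in_sd_units(70.0, sample_mean, sd)   # 1.5 SD units above the lower limit
print(upper_z, lower_z)
```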
However, the mean value is fuzzy and the standard deviation may or may not be fuzzy. The data generated in calculating the mean make up a random variable (X = (x1, x2, ..., xn)) in vector space. How fuzzy it is depends upon the length of the SD, and the type of distribution. While there are many distributions, if the SD is not fuzzy, what is called the normal distribution is often used. Because of the uncertainty in the mean, the distribution function tells us the chances of the mean actually being somewhere else. In example 1, with only the mean being fuzzy and using the normal distribution, 6.68% of the distribution lies 1.5 SD units or more above the measured mean, so we can say that there is a 6.68% chance that the true mean of the data is at or above the desired mean.
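The 6.68% figure in example 1 is the area of the standard normal distribution lying more than 1.5 SD units from its center on one side. A minimal check, assuming Python 3.8 or later for the standard library's NormalDist:

```python
from statistics import NormalDist

# Distance from example 1, in SD units.
z = (100 - 85) / 10

# One-sided tail area of the standard normal distribution beyond z.
tail = 1 - NormalDist().cdf(z)

print(f"{tail:.4f}")  # 0.0668
```

Note that this is a one-sided tail. Counting both sides of the distribution would double the figure.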
Unfortunately, the SD often is fuzzy too and is thus also a random variable. The square of the SD is called the variance, and it has its own distribution function, called the chi-squared distribution. While the form of the normal distribution is independent of the number of data points defining the random variable, the form of the chi-squared distribution depends upon the degrees of freedom. The chi-squared distribution with one degree of freedom is the distribution of the square of a standard normal variable. The chi-squared distribution may be used to determine whether a measured standard deviation is consistent with a specified value. Whether two measured standard deviations are really the same is judged from the ratio of their variances, which follows the F distribution, itself built from two chi-squared variables.
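The link between the chi-squared distribution with one degree of freedom and the normal distribution can be checked numerically. The sketch below is my own illustration, using only the Python standard library. The helper chi2_cdf_1df is a hypothetical name, and it relies on the fact that, for a standard normal Z, the chance that Z squared is at most x equals the chance that Z lies between minus and plus the square root of x.

```python
import math
from statistics import NormalDist

def chi2_cdf_1df(x):
    """CDF of the chi-squared distribution with one degree of freedom."""
    if x <= 0:
        return 0.0
    # P(Z**2 <= x) = P(-sqrt(x) <= Z <= sqrt(x)) = erf(sqrt(x / 2))
    return math.erf(math.sqrt(x / 2))

# Two-sided 95% point of the standard normal distribution (about 1.96).
z = NormalDist().inv_cdf(0.975)

# Squaring it gives the 95% point of chi-squared with one degree of freedom.
chi2_point = z ** 2
print(round(chi2_point, 3))                # 3.841
print(round(chi2_cdf_1df(chi2_point), 3))  # 0.95
```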
How the fuzziness or uncertainty is handled will be covered later. Although the mathematics gets more complex, especially when multivariate sets of data must be considered, the goal is still to simply measure lengths with a specific ruler.