**Basic Statistics**

Collecting data is one of the primary principles of Six Sigma and Lean Six Sigma. We don’t make decisions in the application of statistical methods without data and, more importantly, we don’t get to understand our processes unless we collect the right kind of data. A statistic is any number that can be calculated or deduced from a sample of data.

Organizations can take various approaches to using data to drive problem-solving. The approach used by a particular company reflects the maturity of its problem-solving methodology. These decision-making levels range from intuition and gut feeling, to qualitative and quantitative brainstorming tools, to basic and inferential statistics.

We generate statistics because they allow us to make inferences about the population. What kinds of things can we infer? First, we can infer whether the process is delivering to some target average value. Second, we can add specification limits and determine how often the process delivers within those limits. We can also calculate the capability of a process, or its sigma level. Next, we can decide whether there has been a change in performance or a shift in the mean of a process. We can decide whether there has been a change in the variation of our process and, lastly, we can estimate the defect rate.
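The last of these inferences can be sketched numerically. Below is a minimal Python illustration, using entirely hypothetical spec limits and process figures, of estimating a defect rate for a process whose output is assumed to be normally distributed:

```python
from statistics import NormalDist

# Hypothetical process: mean 10.0 mm, standard deviation 0.2 mm
process = NormalDist(mu=10.0, sigma=0.2)

# Hypothetical specification limits, in mm
lsl, usl = 9.5, 10.5

# Fraction of output expected to fall outside the spec limits:
# the lower tail below the LSL plus the upper tail above the USL
defect_rate = process.cdf(lsl) + (1 - process.cdf(usl))
print(f"Estimated defect rate: {defect_rate:.4%}")
```

With these made-up numbers, both spec limits sit 2.5 standard deviations from the mean, so the estimate is simply the two tail areas of the normal curve added together.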

There are two types of data: discrete and continuous. Within discrete data, we have nominal and ordinal data. Nominal data can be described as categorical or pass/fail data, meaning a response can be lumped into categories: for example, male or female; black, white or other; Operator A, B or C; or a light that is green, yellow or red.

Attribute data is pass/fail, good/bad, yes/no type information. We’ll see in many of our Six Sigma projects that inspection data is discrete, nominal, attribute data that can be used for making decisions about processes.

The next level of maturity of discrete data is something called ordinal data. Ordinal data has slightly more information than nominal data. Ordinal data can be ordered within very discrete categories.

For example, when you fill out a survey and it asks you if you enjoyed a training session and you say disagree, agree or strongly agree, you now have something called ordinal data. The categories that you have belong in a particular order, which is why we use the word “ordinal” to describe this type of information.

Continuous data comes in two categories: interval data and ratio data. Interval data has equal intervals within a continuous scale of measure but no absolute zero; measuring temperature in degrees Celsius or degrees Fahrenheit is an example. Ratio data also has equal intervals but does have an absolute zero; measurements such as force or speed are ratio data types.

We use the following statistics to infer results:

- Measures of location, or central tendency. There are three commonly used measures of location: the mode, the median and the arithmetic mean. The mode is used less often than the other two. Each is described below.

- The mode, which is the observation or value in the data set that occurs most frequently. We look for the number that repeats itself most often to identify the most common result.

- The median, which is the middle value of the data set when the observations are arranged in rank order, otherwise known as the 50th percentile. If we have 11 observations and list them in order of magnitude, the sixth observation is the median: half the data (here, five observations) lie below it and half lie above.

- The mean, which is the most commonly used measure of central tendency. Just as we have population and sample modes and medians, we have population means and sample means. The population mean is denoted μ, the Greek letter representing the population mean. To calculate it, take all of the individual values x_{i} associated with some performance characteristic, sum them together and divide by the total number of observations, N, in the population: μ = Σx_{i} / N.

- Variance, also known as σ^{2}, which is calculated by taking each individual observation of the population, x_{i}, subtracting the population mean, μ, squaring that deviation, summing the squared deviations together and dividing by the total number of observations, N: σ^{2} = Σ(x_{i} − μ)^{2} / N. The sample variance, commonly denoted s^{2}, is calculated by taking each individual sample observation, subtracting the sample mean, x̄ (X-bar), squaring that deviation, summing those squared deviations together and dividing by n − 1, the number of observations in our sample minus one: s^{2} = Σ(x_{i} − x̄)^{2} / (n − 1).

- Standard deviation, which is the square root of the variance. Hence, the population standard deviation, σ, is the square root of the population variance, σ^{2}, and the sample standard deviation, s, is the square root of the sample variance, s^{2}.
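All of the measures described above can be computed with Python’s standard-library `statistics` module. A quick sketch, using 11 made-up sample observations:

```python
import statistics

# Hypothetical sample of 11 measurements, not from any real process
data = [4, 7, 5, 6, 7, 8, 5, 7, 3, 9, 6]

mode = statistics.mode(data)      # most frequently occurring value
median = statistics.median(data)  # middle value in rank order (6th of 11)
mean = statistics.mean(data)      # arithmetic average, sum / n
s2 = statistics.variance(data)    # sample variance s^2 (divides by n - 1)
s = statistics.stdev(data)        # sample standard deviation, sqrt of s^2

print(mode, median, mean, round(s2, 3), round(s, 3))
```

Note that `statistics.variance` and `statistics.stdev` are the sample versions (dividing by n − 1); the population versions, dividing by N, are `statistics.pvariance` and `statistics.pstdev`.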

**Related Page:**

Check out our Statistical Process Control (SPC) Training, Basic Statistics Training, Six Sigma Training, Lean Six Sigma Training, Lean Training, Continuous Improvement Training or the full range of Training Courses for relevant courses on Control Charts, Statistics and how to streamline & improve your business processes.