Measures of variability: the range, inter-quartile range and standard deviation
Measures of average such as the median and mean represent the typical value for a dataset. Within the dataset the actual values usually differ from one another and from the average value itself. The extent to which the median and mean are good representatives of the values in the original dataset depends upon the variability or dispersion in the original data. Datasets are said to have high dispersion when they contain values considerably higher and lower than the mean value.
The range is the most obvious measure of dispersion and is the difference between the lowest and highest values in a dataset.
- The range is simple to compute and is useful when you wish to evaluate the whole of a dataset.
- The range is useful for showing the spread within a dataset and for comparing the spread between similar datasets.
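As a minimal sketch of the calculation, the range can be computed by subtracting the lowest value from the highest. The data and the function name `value_range` below are made up for illustration and are not part of the original guide.

```python
# Hypothetical data: books borrowed by eight students in one month.
books_borrowed = [2, 5, 1, 8, 3, 4, 6, 3]


def value_range(values):
    """Return the difference between the highest and lowest values."""
    return max(values) - min(values)


print(value_range(books_borrowed))  # highest (8) minus lowest (1) gives 7
```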
The Inter-quartile Range
The inter-quartile range is a measure that indicates the extent to which the central 50% of values within the dataset are dispersed. It is based upon, and related to, the median.
In the same way that the median divides a dataset into two halves, the dataset can be further divided into quarters by identifying the upper and lower quartiles. The lower quartile is found one quarter of the way along a dataset when the values have been arranged in order of magnitude; the upper quartile is found three quarters of the way along. In terms of position in the ordered list, therefore, the upper quartile lies halfway between the median and the highest value, whilst the lower quartile lies halfway between the median and the lowest value. The inter-quartile range is found by subtracting the lower quartile from the upper quartile.
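The steps above can be sketched in code. This example assumes the "median of each half" convention for locating the quartiles, which is only one of several conventions in use, so statistical packages may return slightly different quartile values for the same data. The data and the function name `quartiles` are hypothetical.

```python
from statistics import median


def quartiles(values):
    """Return (lower quartile, median, upper quartile).

    Assumes the median-of-halves convention: the lower quartile is the
    median of the values below the middle position, and the upper quartile
    is the median of the values above it. Needs at least two values.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # For an odd count, the middle value itself is left out of both halves.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower_half), median(ordered), median(upper_half)


# Hypothetical dataset of eleven values.
scores = [1, 3, 4, 5, 6, 7, 9, 10, 12, 15, 20]
lower_q, mid, upper_q = quartiles(scores)
iqr = upper_q - lower_q
print(lower_q, mid, upper_q, iqr)  # 4 7 12 8
```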
The Standard Deviation
The standard deviation is a measure that summarises the amount by which every value within a dataset varies from the mean. Effectively it indicates how tightly the values in the dataset are bunched around the mean value. It is the most widely used measure of dispersion since, unlike the range and inter-quartile range, it takes into account every value in the dataset. When the values in a dataset are tightly bunched together the standard deviation is small. When the values are spread apart the standard deviation will be relatively large. The standard deviation is usually presented in conjunction with the mean and is measured in the same units.
In many datasets the values deviate from the mean value due to chance and such datasets are said to display a normal distribution. In a dataset with a normal distribution most of the values are clustered around the mean while relatively few values tend to be extremely high or extremely low. Many natural phenomena display a normal distribution.
For datasets that have a normal distribution the standard deviation can be used to determine the proportion of values that lie within a particular range of the mean value. For such distributions approximately 68% of values are less than one standard deviation (1SD) away from the mean value, approximately 95% of values are less than two standard deviations (2SD) away from the mean and approximately 99.7% (often rounded to 99%) of values are less than three standard deviations (3SD) away from the mean. In other words:
- About 68% of the values in the dataset will lie between MEAN-1SD and MEAN+1SD
- About 95% of the values will lie between MEAN-2SD and MEAN+2SD
- About 99.7% of the values will lie between MEAN-3SD and MEAN+3SD
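These proportions can be checked informally by simulation. The sketch below draws values from a normal distribution with an arbitrary mean of 170 and standard deviation of 10 (both are illustrative assumptions, as is the function name `share_within`) and counts how many fall within 1, 2 and 3 standard deviations of the mean. Because the data are random, the printed shares will be close to, but not exactly, the percentages quoted above.

```python
import random
from statistics import fmean, pstdev


def share_within(values, k):
    """Return the fraction of values within k standard deviations of the mean."""
    mean = fmean(values)
    sd = pstdev(values)
    inside = sum(1 for v in values if abs(v - mean) <= k * sd)
    return inside / len(values)


rng = random.Random(0)  # fixed seed so that repeated runs give the same data
# Simulated values only; not real measurements.
heights = [rng.gauss(170, 10) for _ in range(20_000)]

for k in (1, 2, 3):
    print(f"within {k}SD of the mean: {share_within(heights, k):.3f}")
```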
Population and sample standard deviations
There are two different calculations for the Standard Deviation. Which formula you use depends upon whether the values in your dataset represent an entire population or whether they form a sample of a larger population. For example, if all student users of the library were asked how many books they had borrowed in the past month then the entire population has been studied since all the students have been asked. In such cases the population standard deviation should be used. Sometimes it is not possible to find information about an entire population and it might be more realistic to ask a sample of 150 students about their library borrowing and use these results to estimate library borrowing habits for the entire population of students. In such cases the sample standard deviation should be used.
Formula for the standard deviation
The standard deviation of an entire population is known as σ (sigma) and is calculated using:
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
where x represents each value in the population, μ is the mean value of the population, Σ is the summation (or total), and N is the number of values in the population.
The standard deviation of a sample is known as s and is calculated using:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
where x represents each value in the sample, x̄ (x-bar) is the mean value of the sample, Σ is the summation (or total), and n-1 is the number of values in the sample minus 1.
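To make the difference between the two formulas concrete, here is a small sketch that implements each one directly and compares the results with `pstdev` and `stdev` from Python's standard `statistics` module. The dataset and the function names `population_sd` and `sample_sd` are made up for illustration.

```python
import math
from statistics import pstdev, stdev


def population_sd(values):
    """Square root of the mean squared deviation from the mean (divide by N)."""
    mu = sum(values) / len(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))


def sample_sd(values):
    """As above, but dividing by n - 1 because the values are a sample."""
    x_bar = sum(values) / len(values)
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (len(values) - 1))


# Hypothetical data: books borrowed by eight students.
books = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_sd(books))  # 2.0 (mean is 5, squared deviations total 32)
print(sample_sd(books))      # slightly larger, since 32 is divided by 7, not 8

# The standard library gives the same answers.
print(math.isclose(population_sd(books), pstdev(books)))
print(math.isclose(sample_sd(books), stdev(books)))
```

For the same data the sample standard deviation is always a little larger than the population standard deviation, and the gap narrows as the number of values grows.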
The range, inter-quartile range and standard deviation are all measures that indicate the amount of variability within a dataset. The range is the simplest measure of variability to calculate but can be misleading if the dataset contains extreme values. The inter-quartile range reduces this problem by considering the variability within the middle 50% of the dataset. The standard deviation is the most comprehensive measure of variability since it takes into account how every value in the dataset varies from the mean. However, care must be taken when calculating the standard deviation to consider whether the entire population or a sample is being examined and to use the appropriate formula.
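The effect of an extreme value on each measure can be illustrated with a short sketch. It uses made-up data and a hypothetical helper, `spread_summary`, and it relies on the "inclusive" quartile method of Python's `statistics.quantiles`, which is one convention among several.

```python
from statistics import pstdev, quantiles


def spread_summary(values):
    """Return the range, inter-quartile range and population SD of values."""
    lower_q, _, upper_q = quantiles(values, n=4, method="inclusive")
    return {
        "range": max(values) - min(values),
        "iqr": upper_q - lower_q,
        "sd": pstdev(values),
    }


typical = [2, 3, 3, 4, 4, 5, 5, 6]  # hypothetical values
with_extreme = typical + [40]       # the same data plus one extreme value

print(spread_summary(typical))
print(spread_summary(with_extreme))
```

In this sketch the single extreme value increases the range from 4 to 38 and also inflates the standard deviation, while the inter-quartile range changes little, if at all.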