# Measures of Location and Variability

Posted by Beetle B. on Sat 13 May 2017

## Measures of Location

When reporting a sample mean, use one more significant digit than the observations themselves have.

• Sample mean: $$\bar{x}$$
• Population mean: $$\mu$$
• Sample median: $$\tilde{x}$$
• Population median: $$\tilde{\mu}$$

The mean is very sensitive to outliers, while the median is largely insensitive to them. The trimmed mean is a compromise between the two. For a 10% trimmed mean, you drop the smallest 10% and the largest 10% of the observations, and then calculate the mean of what remains. But what if you want a 10% trimmed mean and your sample size is 22? 10% of 22 is 2.2, which is not a whole number. So calculate the trimmed mean with 2 observations removed from each end, and again with 3 removed from each end, and linearly interpolate between the two results.

A trimming percentage of 10% to 20% is typical.
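The interpolation rule can be sketched in a few lines of Python. This is a minimal illustration, not a library routine: the function name `trimmed_mean` and its `proportion` parameter are made up for this example, and it assumes the interpolation is linear in the number of points dropped per end (which is the same as being linear in the trimming percentage).

```python
import math


def trimmed_mean(data, proportion):
    """Mean after dropping `proportion` of the points from each end.

    When proportion * n is not a whole number, interpolate linearly
    between the two neighbouring whole-number trimmed means.
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    xs = sorted(data)
    n = len(xs)
    k = proportion * n        # points to drop per end; may be fractional
    lo = math.floor(k)

    def mean_dropping(j):
        kept = xs[j:n - j]    # drop the j smallest and the j largest
        if not kept:
            raise ValueError("sample too small for this trimming proportion")
        return sum(kept) / len(kept)

    m_lo = mean_dropping(lo)
    frac = k - lo
    if frac == 0:
        return m_lo
    m_hi = mean_dropping(lo + 1)
    return m_lo + frac * (m_hi - m_lo)


data = [1, 2, 3, 4, 100]
print(trimmed_mean(data, 0.2))  # one value dropped per end -> 3.0
print(trimmed_mean(data, 0.1))  # halfway between the plain mean (22.0) and 3.0 -> 12.5
```

For the case in the text ($$n=22$$, 10%), `k` is 2.2, so the result is 0.8 times the mean with 2 removed per end plus 0.2 times the mean with 3 removed per end.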

Quartiles and percentiles are generalizations of the median.

The mode of a sample is the value that appears the most often.

## Measures of Variability

The range of a data set is simply the difference between the largest and the smallest values.

The sample variance is given by:

\begin{equation*} s^{2}=\frac{\sum\left(x_{i}-\bar{x}\right)^{2}}{n-1}=\frac{S_{xx}}{n-1} \end{equation*}

The population variance is given by:

\begin{equation*} \sigma^{2}=\frac{\sum\left(x_{i}-\mu\right)^{2}}{N} \end{equation*}

Why use $$n-1$$ for the sample variance and not $$n$$? The reason is that, in general, $$\bar{x}\ne\mu$$. The sum of squared deviations $$\sum\left(x_{i}-a\right)^{2}$$ is minimized when the reference point is $$a=\bar{x}$$, so the deviations about $$\bar{x}$$ are, in total, never larger than the deviations about $$\mu$$. Hence we would tend to underestimate the variance if we divided by $$n$$. To account for this, divide by $$n-1$$. Dividing by $$n$$ gives a biased estimator (to be explained elsewhere); dividing by $$n-1$$ makes it an unbiased estimator.
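A small exact check can make the bias concrete. The sketch below uses a made-up four-value population and assumes samples of size 3 are drawn with replacement, so that every ordered sample is equally likely. It averages both versions of the estimator over all possible samples; the helper name `variance` and its `ddof` parameter are just labels chosen for this example.

```python
from fractions import Fraction
from itertools import product


def variance(xs, ddof):
    """Sum of squared deviations about the mean, divided by len(xs) - ddof."""
    n = len(xs)
    mean = Fraction(sum(xs), n)
    sxx = sum((x - mean) ** 2 for x in xs)
    return sxx / (n - ddof)


population = [1, 2, 4, 7]                  # hypothetical population
sigma2 = variance(population, ddof=0)      # population variance (divide by N)

n = 3
samples = list(product(population, repeat=n))  # every ordered sample, with replacement

# Average each estimator over all equally likely samples.
avg_unbiased = sum(variance(s, ddof=1) for s in samples) / len(samples)
avg_biased = sum(variance(s, ddof=0) for s in samples) / len(samples)

print("population variance:     ", sigma2)
print("average with n - 1:      ", avg_unbiased)
print("average with n:          ", avg_biased)
```

Under these assumptions the $$n-1$$ version averages to exactly $$\sigma^{2}$$, while the $$n$$ version averages to $$\frac{n-1}{n}\sigma^{2}$$, which is too small.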

We say $$s^{2}$$ has $$n-1$$ degrees of freedom (df), because the deviations satisfy one constraint: $$\sum\left(x_{i}-\bar{x}\right)=0$$.

Note that:

\begin{equation*} S_{xx}=\sum{x_{i}^{2}}-\frac{1}{n}\left(\sum{x_{i}}\right)^{2} \end{equation*}
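This shortcut follows from expanding the square and using $$\sum x_{i}=n\bar{x}$$:

\begin{equation*} S_{xx}=\sum\left(x_{i}-\bar{x}\right)^{2}=\sum x_{i}^{2}-2\bar{x}\sum x_{i}+n\bar{x}^{2}=\sum x_{i}^{2}-n\bar{x}^{2}=\sum x_{i}^{2}-\frac{1}{n}\left(\sum x_{i}\right)^{2} \end{equation*}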

Properties of the variance:

• It is invariant to translations in the data: adding a constant to every value leaves $$s^{2}$$ unchanged.
• If you scale all values by $$c$$, then the new $$s^{2}$$ is scaled by $$c^{2}$$.
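Both properties are easy to check numerically. The snippet below is a quick sanity check on made-up data, using `statistics.variance` from the Python standard library (which divides by $$n-1$$); the values of `shift` and `c` are arbitrary.

```python
import math
from statistics import variance  # sample variance, divides by n - 1

data = [2.0, 4.0, 4.0, 5.0, 9.0]   # made-up sample
shift, c = 10.0, 3.0               # arbitrary translation and scale factor

s2 = variance(data)
s2_shifted = variance([x + shift for x in data])
s2_scaled = variance([c * x for x in data])

# Translation leaves the variance unchanged; scaling by c multiplies it by c**2.
print(math.isclose(s2_shifted, s2))
print(math.isclose(s2_scaled, c ** 2 * s2))
```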

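The lower/upper fourth is the median of the smallest/largest half of the sorted data (when $$n$$ is odd, a common convention is to include the median in both halves). The fourth spread $$f_{s}$$ is the upper fourth minus the lower fourth. This measure of spread is relatively unaffected by outliers.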

An outlier is any observation farther than $$1.5f_{s}$$ from the closest fourth. It is an extreme outlier if it is more than $$3f_{s}$$ from the closest fourth, and a mild outlier otherwise.
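The fourths and the outlier rule can be sketched as below. The function names `fourths` and `classify_outliers` are invented for this example, and the code assumes that for odd $$n$$ the median is counted in both halves, which is one common convention and may not match every textbook or library.

```python
from statistics import median


def fourths(data):
    """Return (lower fourth, upper fourth) of the data."""
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2  # assumption: for odd n the median belongs to both halves
    return median(xs[:half]), median(xs[n - half:])


def classify_outliers(data):
    """Label each observation as 'none', 'mild', or 'extreme'."""
    lower, upper = fourths(data)
    fs = upper - lower  # fourth spread
    labels = []
    for x in data:
        # Distance beyond the closest fourth; 0 for points between the fourths.
        dist = max(lower - x, x - upper, 0)
        if dist > 3 * fs:
            labels.append("extreme")
        elif dist > 1.5 * fs:
            labels.append("mild")
        else:
            labels.append("none")
    return labels


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 50]   # made-up sample
print(fourths(data))            # (3.5, 8.5), so the fourth spread is 5.0
print(classify_outliers(data))  # 20 is a mild outlier, 50 an extreme one
```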

Midrange: $$(x_{\min}+x_{\max})/2$$

Midfourth: the average of the lower and upper fourths

Exponential smoothing: A way to smooth out data from a time series that has a lot of fluctuations (like my CPU temperature data). Pick $$0<\alpha<1$$. Let $$\bar{x}_{t}$$ be the smoothed value at time $$t$$. Let $$\bar{x}_{1}=x_{1}$$ and, for $$t\ge 2$$, $$\bar{x}_{t}=\alpha x_{t}+(1-\alpha)\bar{x}_{t-1}$$.
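The recurrence translates directly into a loop. This is a minimal sketch: the function name `exponential_smoothing` is made up for this example, and the readings are invented numbers rather than real CPU temperature data.

```python
def exponential_smoothing(xs, alpha):
    """Apply x_bar[t] = alpha * x[t] + (1 - alpha) * x_bar[t-1], with x_bar[1] = x[1]."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    smoothed = []
    for x in xs:
        if not smoothed:
            smoothed.append(x)  # first smoothed value is the first observation
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


readings = [10, 20, 30]                       # invented data
print(exponential_smoothing(readings, 0.5))   # [10, 15.0, 22.5]
```

A smaller $$\alpha$$ puts more weight on the history and gives a smoother curve; a larger $$\alpha$$ tracks the raw data more closely.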