## Measures of Location

When reporting a sample mean, use one more significant digit than the
data values themselves have.

- Sample mean: \(\bar{x}\)
- Population mean: \(\mu\)
- Sample median: \(\tilde{x}\)
- Population median: \(\tilde{\mu}\)

The mean is very sensitive to outliers, while the median is largely
insensitive to them. A compromise between the two is the **trimmed
mean**. A 10% trimmed mean drops the smallest 10% and the largest 10% of
the sample and then takes the mean of what remains. But what if you want
a 10% trimmed mean and your sample size is 22? 10% of 22 is 2.2, which is
not a whole number. In that case, calculate the trimmed mean with 2
elements removed from each end, calculate it again with 3 elements
removed from each end, and linearly interpolate between the two values.
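
The interpolation step can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names `trimmed_mean_k` and `trimmed_mean` are made up for this example, and the interpolation weight is taken to be the fractional part of \(n\) times the trimming proportion.

```python
import math


def trimmed_mean_k(sorted_xs, k):
    """Mean after dropping the k smallest and k largest values."""
    kept = sorted_xs[k:len(sorted_xs) - k]
    return sum(kept) / len(kept)


def trimmed_mean(xs, proportion):
    """Trimmed mean that interpolates when n * proportion is not an integer."""
    xs = sorted(xs)
    n = len(xs)
    target = n * proportion          # e.g. 22 * 0.10 = 2.2
    lo = math.floor(target)          # e.g. 2
    hi = math.ceil(target)           # e.g. 3
    if n - 2 * hi <= 0:
        raise ValueError("trimming proportion too large for this sample")
    if lo == hi:
        return trimmed_mean_k(xs, lo)
    weight = target - lo             # e.g. 0.2, how far we are from lo to hi
    return (1 - weight) * trimmed_mean_k(xs, lo) + weight * trimmed_mean_k(xs, hi)


# Hypothetical sample of size 10 with a heavy right tail.
sample = [0, 1, 2, 4, 8, 16, 32, 64, 128, 256]
ten_percent = trimmed_mean(sample, 0.10)      # drops 1 value from each end
fifteen_percent = trimmed_mean(sample, 0.15)  # halfway between dropping 1 and 2
```

Because the trimming percentage is proportional to the number of elements removed, interpolating in the element count is the same as interpolating in the percentage.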

A trimming percentage of 10-20% is typical.

Quartiles and percentiles are generalizations of the median.

The **mode** of a sample is the value that appears the most often.

## Measures of Variability

The **range** of a data set is simply the difference between the largest
and the smallest values.

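The **sample variance** is given by:

\[s^{2}=\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}{n-1}\]

The **population variance** (for a finite population of size \(N\)) is
given by:

\[\sigma^{2}=\frac{\sum_{i=1}^{N}(x_{i}-\mu)^{2}}{N}\]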

Why use \(n-1\) for the sample variance and not \(n\)? The reason is
that, in general, \(\bar{x}\ne\mu\). The sum of squared deviations
\(\sum(x_{i}-c)^{2}\) is smallest when the reference point \(c\) is
\(\bar{x}\), so deviations measured from \(\bar{x}\) are never larger in
total than deviations measured from \(\mu\). Hence we would tend to
underestimate the variance if we divided by \(n\). To compensate, divide
by \(n-1\). Dividing by \(n\) gives a *biased estimator* (to be
explained elsewhere). Dividing by \(n-1\) makes it an unbiased
estimator.

We say \(s^{2}\) has \(n-1\) degrees of freedom (df), because the deviations satisfy one constraint: \(\sum(x_{i}-\bar{x})=0\).
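
The bias can be checked exactly on a toy example. The sketch below assumes a small made-up population (the faces of a die), enumerates every possible sample of size 3 drawn with replacement, and averages both versions of the estimator. The helper `variance` and its `ddof` parameter are names chosen for this illustration.

```python
from itertools import product


def variance(xs, ddof):
    """Sum of squared deviations from the mean, divided by (n - ddof)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - ddof)


population = [1, 2, 3, 4, 5, 6]          # hypothetical population
sigma2 = variance(population, ddof=0)    # population variance, divisor N

n = 3
# All 6**3 equally likely samples of size n, drawn with replacement.
samples = list(product(population, repeat=n))

avg_div_n = sum(variance(s, ddof=0) for s in samples) / len(samples)
avg_div_n_minus_1 = sum(variance(s, ddof=1) for s in samples) / len(samples)

print(f"population variance:          {sigma2:.4f}")
print(f"average when dividing by n:   {avg_div_n:.4f}")
print(f"average when dividing by n-1: {avg_div_n_minus_1:.4f}")
```

Under these assumptions, the divide-by-\(n\) version averages to \(\frac{n-1}{n}\sigma^{2}\), while the divide-by-\((n-1)\) version averages to \(\sigma^{2}\).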

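Note that:

\[\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}=\sum_{i=1}^{n}x_{i}^{2}-\frac{\left(\sum_{i=1}^{n}x_{i}\right)^{2}}{n}\]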

The variance is:

- Invariant to translations in the data.
- If you scale all values by \(c\), then the new \(s^{2}\) is scaled by \(c^{2}\).

The **lower fourth** is the median of the smallest half of the data, and
the **upper fourth** is the median of the largest half. The **fourth
spread** \(f_{s}\) is the upper fourth minus the lower fourth. This
measure of spread is relatively unaffected by outliers.

An **outlier** is any observation farther than \(1.5f_{s}\) from the
closest fourth. It is an **extreme** outlier if it is more than
\(3f_{s}\) from the closest fourth. It is **mild** otherwise.
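
The outlier rule can be sketched as below. The function names are invented for this example, and one convention is assumed that the text above does not spell out: when the sample size is odd, the median is included in both halves when computing the fourths.

```python
from statistics import median


def fourths(xs):
    """Return (lower fourth, upper fourth) of the data."""
    xs = sorted(xs)
    n = len(xs)
    # Assumption: for odd n, the middle value belongs to both halves.
    half = (n + 1) // 2
    return median(xs[:half]), median(xs[n - half:])


def classify_outliers(xs):
    """Split outliers into (mild, extreme) using the fourth spread."""
    lower, upper = fourths(xs)
    fs = upper - lower
    mild, extreme = [], []
    for x in xs:
        # Distance to the closest fourth; zero for values between the fourths.
        dist = max(lower - x, x - upper, 0)
        if dist > 3 * fs:
            extreme.append(x)
        elif dist > 1.5 * fs:
            mild.append(x)
    return mild, extreme


# Hypothetical data: fourths are 3.5 and 9.5, so the fourth spread is 6.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 40]
mild, extreme = classify_outliers(data)
print(mild, extreme)
```

Here 20 lies 10.5 above the upper fourth (more than \(1.5f_{s}=9\)), and 40 lies 30.5 above it (more than \(3f_{s}=18\)).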

**midrange**: \((x_{min}+x_{max})/2\)

**midfourth**: Average of the lower and upper fourths

**Exponential smoothing**: A way to smooth data from a time series that
has a lot of fluctuations (like my CPU temperature data). Pick
\(0<\alpha<1\). Let \(\bar{x}_{t}\) be the smoothed value at time
\(t\). Set \(\bar{x}_{1}=x_{1}\) and, for \(t>1\),
\(\bar{x}_{t}=\alpha x_{t}+(1-\alpha)\bar{x}_{t-1}\). A smaller
\(\alpha\) gives more weight to the past and therefore heavier
smoothing.
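
A minimal sketch of the recurrence in Python follows. The function name `exponential_smoothing` and the sample readings are made up for illustration.

```python
def exponential_smoothing(xs, alpha):
    """Apply x_bar[t] = alpha * x[t] + (1 - alpha) * x_bar[t-1], with x_bar[1] = x[1]."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    smoothed = []
    for x in xs:
        if not smoothed:
            # The first smoothed value is just the first observation.
            smoothed.append(x)
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical noisy readings that bounce between two levels.
readings = [10, 20, 10, 20]
print(exponential_smoothing(readings, alpha=0.5))
# → [10, 15.0, 12.5, 16.25]
```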