Statistics includes descriptive statistics and inferential statistics. In this chapter, we are going to talk about descriptive statistics.
PART 1 Measuring center in quantitative data
PART 2 Interquartile range (IQR)
PART 3 Variance and Standard Deviation of a Population
PART 4 Variance and Standard Deviation of a Sample
PART 5 Box and Whisker Plots
PART 6 Other Measures of Spread
1. Average: a measure of central tendency that describes the center of a set of data
2. Mean, median, and mode are three kinds of “averages”. Each tries to summarize the dataset with a single number that represents a “typical” data point from the dataset.
(1) Arithmetic mean: the sum of the numbers divided by how many numbers are being averaged.
(2) Median: the middle number in a list of data; found by ordering all data points and picking out the one in the middle (or if there are two middle numbers, taking the mean of those two numbers).
- half of the data points are smaller than the median and half of the data points are larger than the median
(3) Mode: the most commonly occurring data point in a dataset.
- There can be no mode, one mode, or multiple modes in a dataset.
3. Choosing the “best” measure
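[Example] A list of numbers: 70, 72, 74, 76, 80, 114. Mean = 81, median = 75. The outlier 114 pulls the mean above every other value in the list, so the median is the better summary of a “typical” value here.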
4. The impact on mean and median after removing the outlier
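(1) For median: it depends on the data. Removing a value above the median moves the median down or leaves it unchanged; removing a value below the median moves it up or leaves it unchanged.
(2) For mean: it depends on which side the outlier is on. Removing a value above the mean lowers the mean; removing a value below the mean raises it.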
5. The impact on mean and median after increasing the value of the outlier
6. Mean as the balancing point: we can think of the mean as the balancing point, which is a fancy way of saying that the total distance from the mean to the data points below the mean is equal to the total distance from the mean to the data points above the mean.
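The points in this part can be checked with a short Python sketch on the example list above, using the standard `statistics` module. The variable names and the second list used for the mode are illustrative assumptions, not part of the original notes.

```python
import statistics

data = [70, 72, 74, 76, 80, 114]   # example list from above; 114 is the outlier

mean = statistics.mean(data)        # 81
median = statistics.median(data)    # 75, the mean of the two middle values 74 and 76

# Removing the outlier moves the mean much more than the median.
trimmed = [x for x in data if x != 114]
trimmed_mean = statistics.mean(trimmed)       # 74.4
trimmed_median = statistics.median(trimmed)   # 74

# Mean as the balancing point: the total distance from the mean to the
# points below it equals the total distance to the points above it.
below = sum(mean - x for x in data if x < mean)
above = sum(x - mean for x in data if x > mean)

# A dataset can have more than one mode; this list is a made-up illustration.
modes = statistics.multimode([1, 2, 2, 3, 3])   # [2, 3]

print(mean, median, trimmed_mean, trimmed_median, below, above, modes)
```

Here the mean drops from 81 to 74.4 once 114 is removed, while the median only moves from 75 to 74, and both sides of the balance sum to 33.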
The previous part covered how to measure the central tendency of a dataset.
Now: range and IQR both measure the “spread” of a dataset, that is, how far the data lie from the center.
Looking at spread lets us see how much the data varies.
1. IQR: the amount of spread in the middle 50% of a dataset, that is, the difference between the 75th and 25th percentiles (Q3 - Q1)
2.
3. Range is sensitive to outliers, but the interquartile range is largely unaffected by them.
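To see the difference in sensitivity, here is a minimal Python sketch that computes the range and the IQR of the example list from Part 1, with and without its outlier. It assumes the convention of taking Q1 and Q3 as the medians of the lower and upper halves (leaving out the overall median when the number of points is odd); statistical libraries may use other quartile conventions and give slightly different values. The function names are hypothetical.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    With an odd number of points, the overall median is left out of both halves.
    """
    if len(values) < 2:
        raise ValueError("need at least two data points")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return statistics.median(lower), statistics.median(upper)

def value_range(values):
    """Difference between the largest and smallest values."""
    return max(values) - min(values)

def iqr(values):
    """Interquartile range, Q3 - Q1."""
    q1, q3 = quartiles(values)
    return q3 - q1

with_outlier = [70, 72, 74, 76, 80, 114]
without_outlier = [70, 72, 74, 76, 80]

print(value_range(with_outlier), iqr(with_outlier))
print(value_range(without_outlier), iqr(without_outlier))
```

Dropping 114 shrinks the range from 44 to 10, while the IQR only changes from 8 to 7.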
These measures describe how spread out the data is, that is, how far the data points lie from the center.
1. Range: the difference between the largest and smallest values
2. Variance (σ²): the expectation of the squared deviation of a random variable from its mean.
3. Standard deviation (σ): the square root of the variance
4. Variance of a population
5. Standard deviation of a population
6. SD (Standard Deviation) versus MAD (Mean Absolute Deviation)
7. Standard deviation versus Variance
8. Mean and Standard deviation versus Median and IQR
[Example] A dataset: 35, 50, 50, 50, 56, 60, 60, 75, 250
Mean: 76.2 | Median: 56 |
Standard Deviation: 62.3 | Interquartile Range: 17.5 |
9. Alternate variance formula
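The population variance and standard deviation of the example dataset above can be reproduced with a short Python sketch. It assumes the alternate formula meant here is the common identity "mean of the squares minus the square of the mean"; the variable names are illustrative.

```python
import math
import statistics

data = [35, 50, 50, 50, 56, 60, 60, 75, 250]
n = len(data)
mu = sum(data) / n   # population mean

# Definition: the mean of the squared deviations from the population mean.
var_definition = sum((x - mu) ** 2 for x in data) / n

# Alternate form (assumed): mean of the squares minus the square of the mean.
var_alternate = sum(x * x for x in data) / n - mu ** 2

# Standard deviation: the square root of the variance.
sigma = math.sqrt(var_definition)

# Cross-check against the standard library's population functions.
assert math.isclose(var_definition, statistics.pvariance(data))
assert math.isclose(sigma, statistics.pstdev(data))

print(round(mu, 1), round(sigma, 1))   # prints 76.2 62.3
```

Both forms of the variance agree up to floating-point rounding, and the rounded mean and standard deviation match the values in the example.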
1. Sample variance: it is an estimate of population variance
2. Sample standard deviation and bias
3. Why we divide by n-1 in sample variance
(1) When calculating the difference between each value and the mean of those values, all we know is the mean of the sample, rather than the mean of the population.
Except for the rare cases where the sample mean happens to equal the population mean, the data will be closer to the sample mean than to the population mean.
So the sum of the squared distances from the sample mean will be a bit smaller than it would be if we measured each distance from the population mean.
To make up for this, we divide by n-1 rather than n.
(2) Experiment:
4. Bias of an estimator
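(1) Estimator: a rule for calculating an estimate of a given quantity based on observed data
(2) An estimator is used to estimate an unknown population parameter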
(3) Bias of an estimator: the difference between the estimator’s expected value and the true value of the parameter being estimated
(4) Unbiased: an estimator with zero bias
(5) Bias can also be measured with respect to the median rather than the mean (expected value)
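The effect of dividing by n versus n-1 can be checked exactly on a tiny case. The sketch below is one possible experiment, not necessarily the one these notes refer to: it takes a made-up population, enumerates every possible sample of size 3 drawn with replacement, and averages the two variance estimates over all samples. The population values, the sample size, and all names are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product

def variance(values, divisor):
    """Sum of squared deviations from the mean of `values`, divided by `divisor`."""
    mean = Fraction(sum(values), len(values))
    return sum((Fraction(v) - mean) ** 2 for v in values) / divisor

population = [2, 4, 4, 5, 7, 9]   # hypothetical population
n = 3                             # sample size

# True parameter: population variance (divide by the population size).
population_variance = variance(population, len(population))

# Every equally likely sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

# Expected value of each estimator = its average over all possible samples.
avg_div_n = sum(variance(s, n) for s in samples) / len(samples)
avg_div_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

# Bias = expected value of the estimator minus the true parameter.
bias_div_n = avg_div_n - population_variance
bias_div_n_minus_1 = avg_div_n_minus_1 - population_variance

print(population_variance, avg_div_n, avg_div_n_minus_1)
print(bias_div_n, bias_div_n_minus_1)
```

Using exact fractions avoids rounding noise: dividing by n underestimates the population variance on average (negative bias), while dividing by n-1 gives zero bias in this setting.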
1. A box and whisker plot—also called a box plot—displays the five-number summary of a set of data. The five-number summary is the minimum, first quartile, median, third quartile, and maximum.
2. Creating a box plot with an odd number of data points
Creating a box plot with an even number of data points
(1) When computing the quartiles, exclude the median
(2) Steps
3. When we want to show both the spread and the central tendency of a dataset in one graph, we should use a box plot
4. When interpreting a box plot, consider the two cases separately: an even or an odd number of data points
5. Judging outliers in a dataset
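A data point is an outlier if it is below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR
[Example] 1, 1, 6, 13, 13, 14, 14, 14, 15, 15, 16, 18, 18, 18, 19
Here Q1 = 13, Q3 = 18, and IQR = 18 - 13 = 5.
Q1 - 1.5 * IQR = 13 - 1.5 * 5 = 5.5, so values below 5.5 are outliers: the two 1s are outliers
Q3 + 1.5 * IQR = 18 + 1.5 * 5 = 25.5, so values above 25.5 are outliers: there are no high outliers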
6. How to show outliers in a box plot
In this example, the outliers are 5, 7, 10
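The five-number summary and the 1.5 * IQR rule can be sketched in Python using the 15-point dataset from item 5 above. The quartiles are taken as medians of the lower and upper halves with the overall median excluded, as described in item 2. The whisker handling reflects one common convention (outliers drawn as separate points, whiskers ending at the most extreme non-outlier values) and is an assumption here; the function and variable names are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Q1 and Q3 are the medians of the lower and upper halves; with an odd
    number of points the overall median is excluded from both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    return ordered[0], q1, statistics.median(ordered), q3, ordered[-1]

data = [1, 1, 6, 13, 13, 14, 14, 14, 15, 15, 16, 18, 18, 18, 19]
minimum, q1, median, q3, maximum = five_number_summary(data)

iqr = q3 - q1
low_fence = q1 - 1.5 * iqr    # values below this are low outliers
high_fence = q3 + 1.5 * iqr   # values above this are high outliers

outliers = [x for x in data if x < low_fence or x > high_fence]
inliers = [x for x in data if low_fence <= x <= high_fence]

# Assumed convention: whiskers stop at the most extreme non-outlier values.
whisker_low, whisker_high = min(inliers), max(inliers)

print((minimum, q1, median, q3, maximum), outliers, whisker_low, whisker_high)
```

For this dataset the fences are 5.5 and 25.5, the two 1s are the only outliers, and under the assumed convention the lower whisker would end at 6 rather than at the minimum.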
1. Range and mid-range
2. Mean absolute deviation (MAD)
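As a small illustration of these measures, here is a Python sketch that computes the range, the mid-range (taken here as the average of the smallest and largest values), and the mean absolute deviation (the average distance of the data points from the mean) for the example list from Part 1. The function names are hypothetical.

```python
def value_range(values):
    """Difference between the largest and smallest values."""
    return max(values) - min(values)

def mid_range(values):
    """Average of the smallest and largest values."""
    return (max(values) + min(values)) / 2

def mean_absolute_deviation(values):
    """Average absolute distance of the data points from their mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

data = [70, 72, 74, 76, 80, 114]
print(value_range(data), mid_range(data), mean_absolute_deviation(data))
# prints 44 92.0 11.0
```

The mid-range of 92 sits well above the mean of 81, which shows how strongly a single outlier affects any measure built only from the minimum and maximum.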