This is a notebook on statistics, with notes taken while watching the online Statistics course from Khan Academy
Basic notation
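sample size : $n$
population size : $N$
population mean : $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
sample mean : $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
the population variance : $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
the sample variance : $s_n^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$
On average, this sample variance underestimates the population variance, so in order to estimate the population variance from a sample, we need to modify it as follows:
- the unbiased sample variance : $s_{n-1}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
- population standard deviation : $\sigma = \sqrt{\sigma^2}$
- sample standard deviation : $s = \sqrt{s_{n-1}^2}$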
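A minimal sketch of the two variance estimates, assuming a small made-up data set (`data` is hypothetical, not from the course). It divides the same sum of squared deviations by n and by n - 1, and compares against Python's `statistics` module.

```python
import statistics

# Hypothetical sample, chosen only for illustration
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the sample mean
squared_dev = sum((x - mean) ** 2 for x in data)

biased_var = squared_dev / n          # divides by n
unbiased_var = squared_dev / (n - 1)  # divides by n - 1

print(biased_var, statistics.pvariance(data))
print(unbiased_var, statistics.variance(data))
```

The n - 1 version is always a bit larger, which compensates for measuring deviations from the sample mean rather than the true population mean.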
**********************************************************TODO derive the variance
Random Variable
A function that maps the outcome of a random experiment to a number
There are two types:
- discrete random variable
- continuous random variable
Bernoulli distribution
A single trial with two possible outcomes: success with probability $p$ and failure with probability $1-p$
Expectation of Bernoulli distribution
$E[X] = 1 \cdot p + 0 \cdot (1-p) = p$
Binomial distribution
The binomial distribution describes the number of successes in $n$ independent Bernoulli trials, each with the same success probability $p$.
example:
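A person takes 6 shots, each with a 30% chance of going in. Each count of made shots can occur in $\binom{6}{k}$ different orders, so:
$P(X = 0) = \binom{6}{0} \cdot 0.3^0 \cdot 0.7^6 \approx 0.1176$
$P(X = 1) = \binom{6}{1} \cdot 0.3^1 \cdot 0.7^5 \approx 0.3025$
$P(X = 2) = \binom{6}{2} \cdot 0.3^2 \cdot 0.7^4 \approx 0.3241$
$P(X = 3) = \binom{6}{3} \cdot 0.3^3 \cdot 0.7^3 \approx 0.1852$
$P(X = 4) = \binom{6}{4} \cdot 0.3^4 \cdot 0.7^2 \approx 0.0595$
$P(X = 5) = \binom{6}{5} \cdot 0.3^5 \cdot 0.7^1 \approx 0.0102$
$P(X = 6) = \binom{6}{6} \cdot 0.3^6 \cdot 0.7^0 \approx 0.0007$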
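The 6-shot example can be checked with a few lines of Python. This is a sketch: `binomial_pmf` is a helper name made up for these notes. It multiplies the probability of one particular sequence by the number of orderings, `math.comb(n, k)`.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_shots, p_make = 6, 0.3
probs = [binomial_pmf(k, n_shots, p_make) for k in range(n_shots + 1)]

for k, prob in enumerate(probs):
    print(f"P(X = {k}) = {prob:.6f}")

# The probabilities cover every possible outcome, so they sum to 1
total = sum(probs)
# The probability-weighted average number of made shots
mean_made = sum(k * prob for k, prob in enumerate(probs))
print(total, mean_made)
```

The weighted average comes out at 1.8, which is 6 * 0.3.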
Expectation of Binomial distribution
$n$ trials, each succeeding with probability $p$: $E[X] = np$
**********************************************************TODO
Poisson distribution
Based on two assumptions:
- events happen at a stable rate: any period of time is no different from another
- events in different time periods are independent
The expectation
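say X = number of cars passing in 1 hour
$E[X] = \lambda$ cars/hour $= np$
This splits one hour into 60 minutes and asks whether 1 car passes in each minute, with probability $p = \lambda/60$ per minute and number of trials $n = 60$:
$P(X = k) = \binom{60}{k} \left(\frac{\lambda}{60}\right)^k \left(1 - \frac{\lambda}{60}\right)^{60-k}$
What if more than 1 car passes in the same minute? ---> make the intervals MORE GRANULAR
Set the interval to 1 sec:
$P(X = k) = \binom{3600}{k} \left(\frac{\lambda}{3600}\right)^k \left(1 - \frac{\lambda}{3600}\right)^{3600-k}$
--> more and more granular -->
$P(X = k) = \lim_{n \to \infty} \binom{n}{k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k}$
Poisson distribution formula
$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$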
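The "more and more granular" argument can be checked numerically. The sketch below assumes a made-up rate of 9 cars/hour and asks for exactly 7 cars; both numbers and the helper names are hypothetical. It compares binomial probabilities with n = 60, 3600 and 1,000,000 intervals against the limiting value λ^k e^(-λ) / k!.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Limit of the binomial pmf as n grows with n * p = lam held fixed."""
    return lam**k * exp(-lam) / factorial(k)

lam = 9.0  # hypothetical average number of cars per hour
k = 7      # hypothetical count we ask about

for n in (60, 3600, 1_000_000):
    print(n, binomial_pmf(k, n, lam / n))
print("Poisson", poisson_pmf(k, lam))
```

The binomial values move toward the Poisson value as the interval shrinks from a minute to a second and beyond.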
**********************************************************TODO
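Normal Distribution
The Z-score is how many standard deviations a value lies away from the mean, $z = \frac{x - \mu}{\sigma}$, and it can be computed for any distribution
NOTE :
the Z table is a cumulative distribution: it gives the area under the standard normal curve to the left of $z$
Significance level :
- $\mu \pm 1\sigma$ : p = 68%
- $\mu \pm 2\sigma$ : p = 95% -> common confidence interval
- $\mu \pm 3\sigma$ : p = 99.7%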
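The 68 / 95 / 99.7 figures can be reproduced without a Z table. This sketch uses the identity Φ(z) = (1 + erf(z / √2)) / 2 for the standard normal cumulative distribution; `normal_cdf` and `prob_within` are helper names made up for these notes.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def prob_within(k):
    """Probability of landing within k standard deviations of the mean."""
    return normal_cdf(k) - normal_cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {prob_within(k):.4f}")
```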
Central limit theorem
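The sample sum or sample mean drawn from any distribution (with finite variance) approaches a normal distribution as the sample size grows
Standard Error : $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ = standard deviation of the sampling distribution of the sample mean
and in the sampling distribution : $\mu_{\bar{x}} = \mu$
$\mu_{\bar{x}}$ : mean of sampling distribution
$\mu$ : mean of original population
In most cases, we are not able to get the standard deviation of the population, so the standard deviation of the sample, $s$, is used to estimate the true standard deviation of the population, $\sigma \approx s$,
and subsequently to estimate the standard deviation of the sampling distribution, $\sigma_{\bar{x}} \approx \frac{s}{\sqrt{n}}$
p($\bar{x}$ is within 2$\sigma_{\bar{x}}$ of $\mu_{\bar{x}}$) =
p($\mu_{\bar{x}}$ is within 2$\sigma_{\bar{x}}$ of $\bar{x}$)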
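A small simulation makes the theorem concrete. As an assumption for illustration, samples are drawn from an exponential distribution with mean 1 and standard deviation 1, which is clearly not normal. The sample size, number of repetitions and seed are arbitrary choices.

```python
import random
from math import sqrt

random.seed(0)  # fixed seed so the run is repeatable

n = 50         # size of each sample
trials = 5000  # number of samples drawn

# Mean of each sample of size n from an exponential(1) population
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(trials)
]

mean_of_means = sum(sample_means) / trials
observed_se = sqrt(sum((m - mean_of_means) ** 2 for m in sample_means) / trials)
theoretical_se = 1.0 / sqrt(n)  # population sd divided by sqrt(n)

print(mean_of_means, observed_se, theoretical_se)
```

The mean of the sample means lands near the population mean of 1, and their spread lands near the population standard deviation divided by √n.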
One tail test
at a point where the normalized z score is about 2, the area under the curve before this point is about 97.5%, so the p value is 1 - 0.975 = 0.025 (2.5%)
Two tail test
at a point where the normalized z score is about 2, the area under the curve before this point is about 97.5%, and both tails count, so the p value is 2 * 0.025 = 0.05 (5%)
so the two tail test is the stricter choice: the same z score gives a larger p value, which makes the null hypothesis harder to reject
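A sketch of how the two p values relate, using z = 1.96 (the value for which the area to the left is 97.5% to three decimals). The function names are made up for these notes.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def one_tailed_p(z):
    """Area in the single tail beyond |z|."""
    return 1.0 - normal_cdf(abs(z))

def two_tailed_p(z):
    """Area in both tails beyond |z|, twice the one-tailed value."""
    return 2.0 * one_tailed_p(z)

z = 1.96
print(f"one-tailed p = {one_tailed_p(z):.4f}")
print(f"two-tailed p = {two_tailed_p(z):.4f}")
```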
About sample size
if n > 30, the sample standard deviation $s$ is considered close to $\sigma$
if n < 30, we use the t-distribution instead of the normal distribution to determine significance
Independent random variables
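For two independent variables X, Y
For another variable Z, if Z = X + Y:
$E[Z] = E[X] + E[Y]$, $\sigma_Z^2 = \sigma_X^2 + \sigma_Y^2$
For another variable G, if G = X - Y:
$E[G] = E[X] - E[Y]$, $\sigma_G^2 = \sigma_X^2 + \sigma_Y^2$
NOTE :
It is the variances that add (for the difference too), not the standard deviations
If we have samples of x and y, even with different sample sizes (n, m),
we will have their sampling distributions, and if we are interested in $\bar{x} - \bar{y}$ (the difference between their sample means):
We will have a sampling distribution of the difference like this:
$\mu_{\bar{x} - \bar{y}} = \mu_{\bar{x}} - \mu_{\bar{y}}$, $\sigma_{\bar{x} - \bar{y}}^2 = \frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}$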
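That variances add for both the sum and the difference of independent variables can be checked by simulation. The two normal distributions below (sd 2 and sd 3) are made-up examples, so both variances should land near 2² + 3² = 13.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

N = 50_000
xs = [random.gauss(0.0, 2.0) for _ in range(N)]  # variance 4
ys = [random.gauss(1.0, 3.0) for _ in range(N)]  # variance 9

def variance(values):
    """Population-style variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

var_sum = variance([x + y for x, y in zip(xs, ys)])
var_diff = variance([x - y for x, y in zip(xs, ys)])

print(var_sum, var_diff)
```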
NOTE
The one random variable test is used to:
- Given a population mean, calculate the probability of getting the observed sample and do a hypothesis test (test whether the sample mean is consistent with the population mean), then reject or fail to reject the hypothesis based on a p value.
- Calculate the confidence interval when the population mean is not provided.
The two random variable test is used to test how different the two variables are:
- Given the means, calculate the probability of getting the observed samples and do a hypothesis test (null: they are not different, so the mean of the difference is 0 and they should have the same sd; use the overall sd as the estimate if possible), then reject or fail to reject the hypothesis based on a p value.
- Calculate the confidence interval when the means are not provided.
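A rough sketch of the two random variable test, assuming both samples are large enough for the normal approximation (for small samples the t-distribution would be used instead). The function name and all summary numbers are hypothetical. It uses the separate sample variances rather than a pooled sd.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def two_sample_z(mean_x, sd_x, n, mean_y, sd_y, m):
    """z score and two-tailed p value for the null 'mean difference is 0'."""
    # Standard deviation of the sampling distribution of (x_bar - y_bar)
    se_diff = sqrt(sd_x**2 / n + sd_y**2 / m)
    z = (mean_x - mean_y) / se_diff
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
    return z, p_value

# Hypothetical summary statistics for two groups
z, p_value = two_sample_z(mean_x=9.3, sd_x=4.7, n=100, mean_y=7.4, sd_y=4.0, m=100)
print(f"z = {z:.2f}, two-tailed p = {p_value:.4f}")
```

A small p value means a difference this large would be unlikely if the two population means were equal.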
Linear regression
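To find the line that best represents the data points,
the fitted line should minimize the squared error,
with the line being $y = mx + b$
SE_line (squared error against the line): $SE_{line} = \sum_{i=1}^{n} (y_i - (m x_i + b))^2$
We can solve for m and b by setting the partial derivatives of $SE_{line}$ with respect to m and b to zero:
$m = \frac{\bar{x}\,\bar{y} - \overline{xy}}{(\bar{x})^2 - \overline{x^2}}$, $b = \bar{y} - m\bar{x}$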
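Setting both partial derivatives of the squared error to zero gives closed-form expressions for the slope and intercept in terms of the means of x, y, xy and x². The sketch below applies them to made-up points that lie exactly on y = 2x + 1; `fit_line` is a name invented for these notes.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for the line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_x2 = sum(x * x for x in xs) / n

    # Result of solving dSE/dm = 0 and dSE/db = 0 together
    m = (mean_x * mean_y - mean_xy) / (mean_x**2 - mean_x2)
    b = mean_y - m * mean_x
    return m, b

# Hypothetical points lying exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m, b = fit_line(xs, ys)
print(m, b)
```

The denominator is zero when all x values are identical, in which case no unique line exists.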
**********************************************************TODO least squares
Assessment of the fit
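- total variation of y: $SE_{\bar{y}} = \sum_{i=1}^{n} (y_i - \bar{y})^2$
- total variation NOT described by the line : $SE_{line} = \sum_{i=1}^{n} (y_i - (m x_i + b))^2$
Total variation = variation described by the line + variation not described by the line
$R^2 = 1 - \frac{SE_{line}}{SE_{\bar{y}}}$
Here, R squared is the coefficient of determination, showing what % of the total variation in y is described by the variation in x according to the line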
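A minimal sketch of R squared as "one minus the share of variation the line fails to describe". The data points and the candidate lines are hypothetical, and `r_squared` is a name invented for these notes.

```python
def r_squared(xs, ys, m, b):
    """Share of the total variation in y described by the line y = m*x + b."""
    mean_y = sum(ys) / len(ys)
    # Variation NOT described by the line
    se_line = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y around its mean
    se_mean = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - se_line / se_mean

# Hypothetical points lying exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

perfect_fit = r_squared(xs, ys, m=2.0, b=1.0)  # the line passes through every point
flat_line = r_squared(xs, ys, m=0.0, b=6.0)    # horizontal line at the mean of y
print(perfect_fit, flat_line)
```

A line through every point describes all of the variation, while a flat line at the mean of y describes none of it.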
**********************************************************TODO F test & p value
Co-variance
$Cov(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$
Chi-squared Distribution
A test of whether an observed distribution differs from the expected one
Day | Mon | Tue | Wed | Thu | Fri | Sat | Total |
---|---|---|---|---|---|---|---|
Expected % | 10 | 10 | 15 | 20 | 30 | 15 | 100 |
Observed | 30 | 14 | 34 | 45 | 57 | 20 | 200 |
Expected | 20 | 20 | 30 | 40 | 60 | 30 | 200 |
Chi-square statistic
$\chi^2 = \sum_{i=1}^{6} \frac{(O_i - E_i)^2}{E_i}$
df = 6 - 1 = 5, as the sum runs over 6 categories
$O_i$ = observed count for category i
$E_i$ = expected count for category i
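The statistic for the table above can be computed directly from the observed and expected counts. The 5% critical value for 5 degrees of freedom (about 11.07) is taken from a standard chi-square table and is hardcoded here. The choice of a 5% level is an assumption for illustration.

```python
# Counts per day (Mon..Sat) taken from the table
observed = [30, 14, 34, 45, 57, 20]
expected = [20, 20, 30, 40, 60, 30]

# Sum of (observed - expected)^2 / expected over all categories
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
degrees_of_freedom = len(observed) - 1

# Critical value for df = 5 at the 5% level, from a chi-square table
critical_value_5pct = 11.07

print(f"chi-square = {chi_square:.4f}, df = {degrees_of_freedom}")
print("reject at 5%:", chi_square > critical_value_5pct)
```

The statistic comes out at about 11.44, just beyond the 5% critical value.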
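ANOVA - Analysis of variance
F statistic
The F statistic can be considered the ratio of two chi-square distributed quantities, each divided by its degrees of freedom:
$F = \frac{SSB / df_B}{SSW / df_W}$
- SSB : sum of squares between groups
- SSW : sum of squares within groups
- SST : sum of squares total, SST = SSB + SSW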
Example :
c1 | c2 | c3 |
---|---|---|
3 | 5 | 5 |
2 | 3 | 6 |
1 | 4 | 7 |
- dfT = mn - 1 = 8 (m = 3 groups, n = 3 observations per group)
- dfW = mn - m = 6
- dfB = m - 1 = 2
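The sums of squares and the F statistic for the 3 x 3 example can be worked out in a few lines. This is a sketch that assumes every group has the same number of observations, as in the table.

```python
# Columns c1, c2, c3 of the example table
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]
m = len(groups)     # number of groups
n = len(groups[0])  # observations per group

all_values = [x for group in groups for x in group]
grand_mean = sum(all_values) / len(all_values)
group_means = [sum(group) / len(group) for group in groups]

# Total: every value against the grand mean
sst = sum((x - grand_mean) ** 2 for x in all_values)
# Within: every value against its own group mean
ssw = sum((x - gm) ** 2 for group, gm in zip(groups, group_means) for x in group)
# Between: every group mean against the grand mean, weighted by group size
ssb = sum(len(group) * (gm - grand_mean) ** 2 for group, gm in zip(groups, group_means))

f_stat = (ssb / (m - 1)) / (ssw / (m * n - m))
print(sst, ssw, ssb, f_stat)  # prints 30.0 6.0 24.0 12.0
```

Most of the total variation (24 of 30) sits between the groups, which is why the F statistic is large.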
Fisher exact test
G1 | G2 | sum |
---|---|---|
a | b | a+b |
c | d | c+d |
a+c | b+d | n= a+b+c+d |
Example
M | F | sum |
---|---|---|
1 | 9 | 10 |
11 | 3 | 14 |
12 | 12 | 24 |
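NOTE
The question to answer here is: given the row and column sums,
how likely is it to get the observed distribution?
$p = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}$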
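With the row and column sums fixed, the probability of one particular 2 x 2 table follows the hypergeometric distribution. The sketch below computes it for the M / F example; `table_probability` is a name invented for these notes. It gives the probability of exactly the observed table. A full Fisher p value would also add the probabilities of the tables at least as extreme.

```python
from math import comb

def table_probability(a, b, c, d):
    """Probability of the 2x2 table [[a, b], [c, d]] given its row and column sums."""
    n = a + b + c + d
    # Ways to fill the first column from each row, over all ways to pick that column
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

# Observed table from the example: rows (1, 9) and (11, 3)
p_observed = table_probability(1, 9, 11, 3)
print(f"P(observed table) = {p_observed:.6f}")
```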
The end