This is a notebook about statistics, taken while watching the online Statistics course from Khan Academy.

Basic notation

sample size : $n$

population size : $N$

population mean : $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

sample mean : $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

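the population variance : $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$

the sample variance : $s_n^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$

On average, the sample variance $s_n^2$ underestimates the population variance, because the deviations are measured from the sample mean $\bar{x}$ rather than from the population mean $\mu$. To estimate the population variance from a sample, we divide by $n - 1$ instead of $n$:

the unbiased sample variance : $s_{n-1}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$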

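population standard deviation : $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$

sample standard deviation : $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$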
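A minimal sketch of the two variance formulas, using a made-up data set (the numbers are an assumption, not from the course). It computes the variance once with divisor n and once with divisor n - 1 and checks both against Python's `statistics` module.

```python
import statistics

# Hypothetical sample, chosen only so the arithmetic is easy to follow.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(data)
mean = sum(data) / n
squared_deviations = [(x - mean) ** 2 for x in data]

biased_variance = sum(squared_deviations) / n          # divide by n
unbiased_variance = sum(squared_deviations) / (n - 1)  # divide by n - 1

# The standard library implements the same two formulas.
assert abs(biased_variance - statistics.pvariance(data)) < 1e-12
assert abs(unbiased_variance - statistics.variance(data)) < 1e-12

print(mean, biased_variance, unbiased_variance)
```

The n - 1 version is always the larger of the two, which is what compensates for the underestimate.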

**********************************************************TODO derive the variance

Random Variable

A function that maps each outcome of a random experiment to a number

There are two types:

discrete random variable

continuous random variable

Bernoulli distribution

A single trial with two possible outcomes: success with probability p and failure with probability 1 - p

Expectation of Bernoulli distribution : $E[X] = 1 \cdot p + 0 \cdot (1 - p) = p$

Binomial distribution

The binomial distribution describes the number of successes in n independent Bernoulli trials that share the same success probability p.

example:
A person takes 6 shots, each with a 30% chance of going in. With X = number of shots made, $p(X = k) = \binom{6}{k} \cdot 0.3^k \cdot 0.7^{6-k}$, where $\binom{6}{k}$ counts the ways to place k made shots among the 6 attempts:

p(X = 0) = 1 * 0.7^6 = 0.117649

p(X = 1) = 6 * 0.3 * 0.7^5 = 0.302526

p(X = 2) = 15 * 0.3^2 * 0.7^4 = 0.324135

p(X = 3) = 20 * 0.3^3 * 0.7^3 = 0.185220

p(X = 4) = 15 * 0.3^4 * 0.7^2 = 0.059535

p(X = 5) = 6 * 0.3^5 * 0.7 = 0.010206

p(X = 6) = 1 * 0.3^6 = 0.000729
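The shooting example can be checked with a short sketch. Each probability is the chance of one specific sequence of makes and misses multiplied by the number of such sequences, which `math.comb` counts. The function name `binomial_pmf` is my own, not from the course.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_shots, p_make = 6, 0.3  # values from the example above
pmf = [binomial_pmf(k, n_shots, p_make) for k in range(n_shots + 1)]

for k, prob in enumerate(pmf):
    print(f"p(X = {k}) = {prob:.6f}")

# The seven probabilities cover every possible outcome.
print("total =", round(sum(pmf), 12))
```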

Expectation of Binomial distribution

n tries, each with probability p of success : $E[X] = np$

**********************************************************TODO

Poisson distribution

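Based on two assumptions:

events happen at a stable rate, so any period of time is no different from another

events in different time periods are independent

The expectation

say X = number of cars passing in 1 hour

$E[X] = \lambda$ cars/hour $= np$, so $p = \frac{\lambda}{n}$

This splits one hour into 60 minutes and asks whether 1 car passes in each minute, with probability $\frac{\lambda}{60}$ and number of trials n = 60:

$P(X = k) = \binom{60}{k} \left(\frac{\lambda}{60}\right)^k \left(1 - \frac{\lambda}{60}\right)^{60-k}$

What if more than 1 car passes in the same minute? ---> MORE GRANULAR

Set the interval to 1 sec:

$P(X = k) = \binom{3600}{k} \left(\frac{\lambda}{3600}\right)^k \left(1 - \frac{\lambda}{3600}\right)^{3600-k}$

--> more and more granular --> $n \to \infty$

Poisson distribution formula

$P(X = k) = \lim_{n \to \infty} \binom{n}{k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k} = \frac{\lambda^k}{k!} e^{-\lambda}$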
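To see the "more and more granular" idea numerically, here is a rough sketch that compares the binomial probability for finer and finer time slices with the Poisson formula. The rate of 9 cars per hour and the count k = 7 are made-up values for illustration.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """Probability of k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of k events when the expected count is lam."""
    return lam**k * exp(-lam) / factorial(k)


lam = 9.0  # hypothetical rate: 9 cars per hour
k = 7      # hypothetical count of interest

target = poisson_pmf(k, lam)
errors = []
for n in (60, 3600, 360_000):  # minutes, seconds, hundredths of a second
    approx = binomial_pmf(k, n, lam / n)
    errors.append(abs(approx - target))
    print(f"n = {n:>7}: binomial = {approx:.6f}")

print(f"Poisson      : {target:.6f}")
```

The gap between the binomial value and the Poisson value shrinks as n grows, which is the limit the notes describe.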

**********************************************************TODO

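Normal distribution

The z-score is how many standard deviations a value lies from the mean, $z = \frac{x - \mu}{\sigma}$, and it can be computed for any distribution

NOTE :

the Z table is a cumulative distribution: it gives $P(Z \le z)$

Confidence level (probability of falling within k standard deviations of the mean) :

$\mu \pm 1\sigma$ : p = 68%

$\mu \pm 2\sigma$ : p = 95% -> common confidence interval

$\mu \pm 3\sigma$ : p = 99.7%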
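The 68 / 95 / 99.7 figures can be reproduced from the cumulative distribution, which is what a Z table tabulates. A small sketch using `statistics.NormalDist` from the standard library; the helper name `coverage` is my own.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def coverage(k: float) -> float:
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sd: {coverage(k):.4f}")

# A Z table lookup is just the cumulative probability up to z.
print(f"P(Z <= 2) = {standard_normal.cdf(2.0):.4f}")
```

Note that the 95% figure is a rounding: two standard deviations cover slightly more than 95%, and exactly 95% corresponds to about 1.96 standard deviations.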

Central limit theorem

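The sum or mean of a large enough sample from any distribution with finite variance is approximately normally distributed

Standard error : $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ = standard deviation of the sampling distribution of the sample mean
and in the sampling distribution : $\mu_{\bar{x}} = \mu$

$\mu_{\bar{x}}$ : mean of sampling distribution

$\mu$ : mean of original population

In most cases we are not able to get the standard deviation of the population, so the standard deviation of the sample is used to estimate it, $\sigma \approx s$,
and subsequently to estimate the standard deviation of the sampling distribution, $\sigma_{\bar{x}} \approx \frac{s}{\sqrt{n}}$

p($\bar{x}$ is within $2\sigma_{\bar{x}}$ of $\mu_{\bar{x}}$) =
p($\mu_{\bar{x}}$ is within $2\sigma_{\bar{x}}$ of $\bar{x}$)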
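A quick simulation sketch of the central limit theorem and the standard error. It draws repeated samples from a skewed (exponential) distribution and compares the spread of the sample means with the population standard deviation divided by the square root of n. The distribution, sample size, and repetition count are all assumptions picked for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

n = 40              # hypothetical sample size
repetitions = 5000  # hypothetical number of repeated samples

# Exponential distribution with rate 1: mean 1, standard deviation 1, strongly skewed.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(repetitions)
]

observed_se = stdev(sample_means)  # spread of the sampling distribution
theoretical_se = 1.0 / sqrt(n)     # sigma / sqrt(n) with sigma = 1

print("mean of sample means:", round(mean(sample_means), 3))
print("observed standard error:", round(observed_se, 3))
print("theoretical standard error:", round(theoretical_se, 3))
```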

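One tail test

For a point whose normalized z score is 2, the area under the curve before this point is about 97.5%, so the p value is 1 - 0.975 = 0.025 (2.5%)

Two tail test

For a point whose normalized z score is 2, the area under the curve before this point is about 97.5%, and both tails count, so the p value is 2 * 0.025 = 0.05 (5%), leaving 1 - 2 * 0.025 = 95% of the area between the tails

So the two tail test is the stricter choice: for the same z score its p value is twice as large, which makes the null hypothesis harder to reject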
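A short sketch of both p values for z = 2, computed from the standard normal cumulative distribution. The exact tail area is a little smaller than the rounded 2.5% used in these notes.

```python
from statistics import NormalDist

z = 2.0
upper_tail = 1.0 - NormalDist().cdf(abs(z))  # area beyond |z| on one side

one_tailed_p = upper_tail
two_tailed_p = 2.0 * upper_tail  # both tails count

print(f"one-tailed p = {one_tailed_p:.4f}")
print(f"two-tailed p = {two_tailed_p:.4f}")
```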

About sample size

if n > 30, the sample standard deviation s is considered close to $\sigma$
if n < 30, we use the t-distribution instead of the normal distribution to estimate the significance level

Independent random variables

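for two independent variables X, Y

For another variable Z, if Z = X + Y: $E[Z] = E[X] + E[Y]$, $\sigma_Z^2 = \sigma_X^2 + \sigma_Y^2$

For another variable G, if G = X - Y: $E[G] = E[X] - E[Y]$, $\sigma_G^2 = \sigma_X^2 + \sigma_Y^2$

NOTE :

It is the variances that add, not the standard deviations, and they add even when the variables are subtracted

If we have samples of x and y, even with different sample sizes (n, m),
we will have their sampling distributions, and if we are interested in $\bar{x} - \bar{y}$ (the difference between their sample means):

We will have a sampling distribution of the difference like this:

$\mu_{\bar{x} - \bar{y}} = \mu_{\bar{x}} - \mu_{\bar{y}}$, $\sigma_{\bar{x} - \bar{y}}^2 = \frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}$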
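A small sketch of the standard error of a difference between two sample means with different sample sizes. The two samples are made up, and the sample variances stand in for the unknown population variances, which is an assumption.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical samples with different sizes (n = 5, m = 3).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0]

difference = mean(x) - mean(y)

# Variances add for a difference of independent variables;
# each sample mean has variance s^2 / (sample size).
variance_of_difference = variance(x) / len(x) + variance(y) / len(y)
standard_error = sqrt(variance_of_difference)

print("difference of means:", difference)
print("standard error of the difference:", round(standard_error, 4))
```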

NOTE

The one random variable test is used in two ways:

Given a population mean, calculate the probability of getting the observed sample and do a hypothesis test (test whether the sample mean is consistent with the population mean), rejecting or failing to reject the hypothesis based on a p value.
Calculate the confidence interval when the population mean is not provided.

The two random variable test is used to test how different the two variables are:

Given the mean, calculate the probability of getting those samples and do a hypothesis test (null: they are not different, so the mean of the difference is 0 and they should have the same sd; use the overall sd as the estimate if possible), rejecting or failing to reject the hypothesis based on a p value.
Calculate the confidence interval when the mean is not provided.

Linear regression

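To find the line that best represents the data points,
the fitted line should minimize the squared error

with the line being $y = mx + b$

SE_line (squared error against the line): $SE_{line} = \sum_{i=1}^{n} (y_i - (m x_i + b))^2$

We can solve for m and b by setting the partial derivatives of SE_line with respect to m and b to zero:

$m = \frac{\bar{x}\,\bar{y} - \overline{xy}}{(\bar{x})^2 - \overline{x^2}}$, $b = \bar{y} - m\bar{x}$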
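A minimal least-squares sketch. It uses the centered form of the slope (sum of cross-deviations over sum of squared x deviations), which is algebraically the same as setting both partial derivatives to zero. The three data points and the name `fit_line` are assumptions for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    spread = sum((x - x_bar) ** 2 for x in xs)
    m = cross / spread
    b = y_bar - m * x_bar  # the fitted line passes through (x_bar, y_bar)
    return m, b


# Hypothetical data points.
xs = [1.0, 2.0, 4.0]
ys = [2.0, 1.0, 3.0]

m, b = fit_line(xs, ys)
print("m =", m, "b =", b)
```

For these points the slope works out to 3/7 and the intercept to 1.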

**********************************************************TODO least squares

Assessment of the fit

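total variation of y: $SE_{\bar{y}} = \sum_{i=1}^{n} (y_i - \bar{y})^2$

total variation NOT described by the line : $SE_{line} = \sum_{i=1}^{n} (y_i - (m x_i + b))^2$

Total variation = variation described by the line + variation not described by the line

$R^2 = 1 - \frac{SE_{line}}{SE_{\bar{y}}}$

Here, R squared is the coefficient of determination, showing what % of the total variation in y is described by the variation in x according to the line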
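A sketch of the R squared calculation for a given line. The data points are made up, and the slope 3/7 and intercept 1 are the least-squares line for those points; the function name `r_squared` is my own.

```python
from statistics import mean


def r_squared(xs, ys, m, b):
    """Share of the total variation in y that the line y = m*x + b describes."""
    y_bar = mean(ys)
    se_line = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))  # not described
    se_total = sum((y - y_bar) ** 2 for y in ys)                   # total variation
    return 1.0 - se_line / se_total


# Hypothetical data with its least-squares line.
xs = [1.0, 2.0, 4.0]
ys = [2.0, 1.0, 3.0]
slope, intercept = 3 / 7, 1.0

r2 = r_squared(xs, ys, slope, intercept)
print(f"R squared = {r2:.4f}")
```

Here about 43% of the variation in y is described by the line, so the fit is weak.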

**********************************************************TODO F test & p value

Covariance

$Cov(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$

Chi-squared Distribution

A test of whether an observed distribution differs from an expected one

Day	Mon	Tue	Wed	Thu	Fri	Sat
Expected %	10	10	15	20	30	15
Observed	30	14	34	45	57	20	total = 200
Expected	20	20	30	40	60	30

Chi-square statistic

$\chi^2 = \sum_{i=1}^{6} \frac{(O_i - E_i)^2}{E_i} \approx 11.44$

df = 6 - 1 = 5, as we sum over 6 categories and the total is fixed

$O_i$ = observed count for day i
$E_i$ = expected count for day i
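The statistic for the table above can be computed in a few lines. This sketch uses the observed and expected counts exactly as listed. The 5% critical value of 11.07 for 5 degrees of freedom is a standard table value, and choosing the 5% level is my assumption.

```python
# Counts from the table above, Monday through Saturday.
observed = [30, 14, 34, 45, 57, 20]
expected = [20, 20, 30, 40, 60, 30]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
degrees_of_freedom = len(observed) - 1

# Assumed significance level of 5%; 11.07 is the table value for df = 5.
critical_value = 11.07
reject_null = chi_square > critical_value

print(f"chi-square = {chi_square:.2f}, df = {degrees_of_freedom}")
print("reject the expected distribution at 5%:", reject_null)
```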

Anova - Analysis of variance

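F statistic

The F statistic can be considered the ratio of two chi-square distributed quantities, each divided by its degrees of freedom: $F = \frac{SSB / df_B}{SSW / df_W}$

With m groups of n observations each, $\bar{x}_j$ the mean of group j and $\bar{x}$ the mean of all observations:

SSB : sum of squares between groups, $SSB = \sum_{j=1}^{m} n (\bar{x}_j - \bar{x})^2$

SSW : sum of squares within groups, $SSW = \sum_{j=1}^{m} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$

SST : sum of squares total, $SST = \sum_{j=1}^{m} \sum_{i=1}^{n} (x_{ij} - \bar{x})^2 = SSB + SSW$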

Example :

c1	c2	c3
3	5	5
2	3	6
1	4	7

With m = 3 groups (columns) and n = 3 observations per group:

dfT = mn - 1 = 8
dfW = mn - m = 6
dfB = m - 1 = 2
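A sketch that works through the example table column by column: the three sums of squares, the degrees of freedom, and the resulting F statistic. Variable names are my own.

```python
from statistics import mean

# Columns c1, c2, c3 from the example table.
groups = [
    [3.0, 2.0, 1.0],
    [5.0, 3.0, 4.0],
    [5.0, 6.0, 7.0],
]

all_values = [value for group in groups for value in group]
grand_mean = mean(all_values)

# Total: every value against the grand mean.
sst = sum((value - grand_mean) ** 2 for value in all_values)
# Within: every value against its own group mean.
ssw = sum((value - mean(group)) ** 2 for group in groups for value in group)
# Between: every group mean against the grand mean, weighted by group size.
ssb = sum(len(group) * (mean(group) - grand_mean) ** 2 for group in groups)

m = len(groups)
df_between = m - 1
df_within = len(all_values) - m

f_statistic = (ssb / df_between) / (ssw / df_within)
print(f"SST = {sst}, SSW = {ssw}, SSB = {ssb}, F = {f_statistic}")
```

The between-group variation dominates here, so the F statistic is large.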

Fisher exact test

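G1	G2	sum
a	b	a+b
c	d	c+d
a+c	b+d	n = a+b+c+d

$p = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}$

Example

M	F	sum
1	9	10
11	3	14
12	12	24

$p = \frac{\binom{10}{1}\binom{14}{11}}{\binom{24}{12}} = \frac{10 \cdot 364}{2704156} \approx 0.001346$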

NOTE
The question to answer here is: given the row and column sums,
how likely is it to get the observed distribution?
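A sketch of that probability for the example table, using the hypergeometric count of tables with the same row and column sums. The function name `table_probability` is my own. It gives the probability of this one table only; a full test p value would also add up the tables that are at least as extreme.

```python
from math import comb


def table_probability(a: int, b: int, c: int, d: int) -> float:
    """Probability of the 2x2 table [[a, b], [c, d]] given its row and column sums."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


# Cell counts from the example table.
p_observed = table_probability(1, 9, 11, 3)
print(f"p(observed table) = {p_observed:.6f}")
```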

The end
