http://blog.csdn.net/pipisorry/article/details/49515215
统计函数Statistical functions(scipy.stats)
Python有一个很好的统计推断包。那就是scipy里面的stats。
This module contains a large number of probability distributions as well as a growing library of statistical functions.
Each included distribution is an instance of the class rv_continous: For each given name the following methods are
available:
连续的
rv_continuous([momtype, a, b, xtol, ...]) A generic continuous random variable class meant for subclassing.
rv_continuous.pdf(x, *args, **kwds) Probability density function at x of the given RV.
rv_continuous.logpdf(x, *args, **kwds) Log of the probability density function at x of the given RV.
rv_continuous.cdf(x, *args, **kwds) Cumulative distribution function of the given RV.
rv_continuous.logcdf(x, *args, **kwds) Log of the cumulative distribution function at x of the given RV.
rv_continuous.sf(x, *args, **kwds) Survival function (1-cdf) at x of the given RV.
rv_continuous.logsf(x, *args, **kwds) Log of the survival function of the given RV.
rv_continuous.ppf(q, *args, **kwds) Percent point function (inverse of cdf) at q of the given RV.
rv_continuous.isf(q, *args, **kwds) Inverse survival function at q of the given RV.
rv_continuous.moment(n, *args, **kwds) n’th order non-central moment of distribution.
rv_continuous.stats(*args, **kwds) Some statistics of the given RV
rv_continuous.entropy(*args, **kwds) Differential entropy of the RV.
rv_continuous.fit(data, *args, **kwds) Return MLEs for shape, location, and scale parameters from data.
rv_continuous.expect([func, args, loc, ...]) Calculate expected value of a function with respect to the distribution.
离散的
rv_discrete([a, b, name, badvalue, ...]) A generic discrete random variable class meant for subclassing.
rv_discrete.rvs(*args, **kwargs) Random variates of given type.
rv_discrete.pmf(k, *args, **kwds) Probability mass function at k of the given RV.
rv_discrete.logpmf(k, *args, **kwds) Log of the probability mass function at k of the given RV.
rv_discrete.cdf(k, *args, **kwds) Cumulative distribution function of the given RV.
rv_discrete.logcdf(k, *args, **kwds) Log of the cumulative distribution function at k of the given RV
rv_discrete.sf(k, *args, **kwds) Survival function (1-cdf) at k of the given RV.
rv_discrete.logsf(k, *args, **kwds) Log of the survival function of the given RV.
rv_discrete.ppf(q, *args, **kwds) Percent point function (inverse of cdf) at q of the given RV
rv_discrete.isf(q, *args, **kwds) Inverse survival function (1-sf) at q of the given RV.
rv_discrete.stats(*args, **kwds) Some statistics of the given RV
rv_discrete.moment(n, *args, **kwds) n’th order non-central moment of distribution.
rv_discrete.entropy(*args, **kwds) Differential entropy of the RV.
rv_discrete.expect([func, args, loc, lb, ...]) Calculate expected value of a function with respect to the distribution
连续分布
alpha An alpha continuous random variable.
anglit An anglit continuous random variable.
arcsine An arcsine continuous random variable.
beta A beta continuous random variable.
betaprime A beta prime continuous random variable.
bradford A Bradford continuous random variable.
burr A Burr continuous random variable.
cauchy A Cauchy continuous random variable.
chi A chi continuous random variable.
chi2 A chi-squared continuous random variable.
cosine A cosine continuous random variable.
dgamma A double gamma continuous random variable.
dweibull A double Weibull continuous random variable.
erlang An Erlang continuous random variable.
expon An exponential continuous random variable.
exponweib An exponentiated Weibull continuous random variable.
exponpow An exponential power continuous random variable.
f An F continuous random variable.
fatiguelife A fatigue-life (Birnbaum-Sanders) continuous random variable.
fisk A Fisk continuous random variable.
foldcauchy A folded Cauchy continuous random variable.
foldnorm A folded normal continuous random variable.
frechet_r A Frechet right (or Weibull minimum) continuous random variable.
frechet_l A Frechet left (or Weibull maximum) continuous random variable.
genlogistic A generalized logistic continuous random variable.
genpareto A generalized Pareto continuous random variable.
genexpon A generalized exponential continuous random variable.
genextreme A generalized extreme value continuous random variable.
gausshyper A Gauss hypergeometric continuous random variable.
gamma A gamma continuous random variable.
gengamma A generalized gamma continuous random variable.
genhalflogistic A generalized half-logistic continuous random variable.
gilbrat A Gilbrat continuous random variable.
gompertz A Gompertz (or truncated Gumbel) continuous random variable.
gumbel_r A right-skewed Gumbel continuous random variable.
gumbel_l A left-skewed Gumbel continuous random variable.
halfcauchy A Half-Cauchy continuous random variable.
halflogistic A half-logistic continuous random variable.
halfnorm A half-normal continuous random variable.
hypsecant A hyperbolic secant continuous random variable.
invgamma An inverted gamma continuous random variable.
invgauss An inverse Gaussian continuous random variable.
invweibull An inverted Weibull continuous random variable.
johnsonsb A Johnson SB continuous random variable.
johnsonsu A Johnson SU continuous random variable.
ksone General Kolmogorov-Smirnov one-sided test.
kstwobign Kolmogorov-Smirnov two-sided test for large N.
laplace A Laplace continuous random variable.
logistic A logistic (or Sech-squared) continuous random variable.
loggamma A log gamma continuous random variable.
loglaplace A log-Laplace continuous random variable.
lognorm A lognormal continuous random variable.
lomax A Lomax (Pareto of the second kind) continuous random variable.
maxwell A Maxwell continuous random variable.
mielke A Mielke’s Beta-Kappa continuous random variable.
nakagami A Nakagami continuous random variable.
ncx2 A non-central chi-squared continuous random variable.
ncf A non-central F distribution continuous random variable.
nct A non-central Student’s T continuous random variable.
norm A normal continuous random variable.
{高斯[正态]分布随机变量,A normal continuous random variable.}
参数:
The location (loc) keyword specifies the mean.
The scale (scale) keyword specifies the standard deviation.
方法:
例:python(scipy)生成服从高斯分布N(0, 0.01)的随机误差向量(10*1)
高斯分布N(0,0.01)随机偏差y =stats.norm.rvs(loc=0, scale=0.1, size=10) 输出:array([ 0.05419826, 0.04151471, -0.10784729, 0.18283546, 0.02348312, -0.04611974, 0.0069336 , 0.03840133, -0.05015316, 0.23315205])
[如何产生正态分布的随机数?]
[scipy.stats.norm]
[python标准库中的随机分布函数]
pareto A Pareto continuous random variable.
pearson3 A pearson type III continuous random variable.
powerlaw A power-function continuous random variable.
powerlognorm A power log-normal continuous random variable.
powernorm A power normal continuous random variable.
rdist An R-distributed continuous random variable.
reciprocal A reciprocal continuous random variable.
rayleigh A Rayleigh continuous random variable.
rice A Rice continuous random variable.
recipinvgauss A reciprocal inverse Gaussian continuous random variable.
semicircular A semicircular continuous random variable.
t A Student’s T continuous random variable.
triang A triangular continuous random variable.
truncexpon A truncated exponential continuous random variable.
truncnorm A truncated normal continuous random variable.
tukeylambda A Tukey-Lamdba continuous random variable.
uniform A uniform continuous random variable.
vonmises A Von Mises continuous random variable.
wald A Wald continuous random variable.
weibull_min A Frechet right (or Weibull minimum) continuous random variable.
weibull_max A Frechet left (or Weibull maximum) continuous random variable.
wrapcauchy A wrapped Cauchy continuous random variable.
多变量分布Multivariate distributions
multivariate_normal A multivariate normal random variable.
离散分布Discrete distributions
bernoulli A Bernoulli discrete random variable.
binom A binomial discrete random variable.
boltzmann A Boltzmann (Truncated Discrete Exponential) random variable.
dlaplace A Laplacian discrete random variable.
geom A geometric discrete random variable.
hypergeom A hypergeometric discrete random variable.
logser A Logarithmic (Log-Series, Series) discrete random variable.
nbinom A negative binomial discrete random variable.
planck A Planck discrete exponential random variable.
poisson A Poisson discrete random variable.
randint A uniform discrete random variable.
skellam A Skellam discrete random variable.
zipf A Zipf discrete random variable.
统计函数Statistical functions
describe(a[, axis]) Computes several descriptive statistics of the passed array.
gmean(a[, axis, dtype]) Compute the geometric mean along the specified axis.
hmean(a[, axis, dtype]) Calculates the harmonic mean along the specified axis.
kurtosis(a[, axis, fisher, bias]) Computes the kurtosis (Fisher or Pearson) of a dataset.
kurtosistest(a[, axis]) Tests whether a dataset has normal kurtosis
mode(a[, axis]) Returns an array of the modal (most common) value in the passed array.
moment(a[, moment, axis]) Calculates the nth moment about the mean for a sample.
normaltest(a[, axis]) Tests whether a sample differs from a normal distribution.
skew(a[, axis, bias]) Computes the skewness of a data set.
skewtest(a[, axis]) Tests whether the skew is different from the normal distribution.
tmean(a[, limits, inclusive]) Compute the trimmed mean.
tvar(a[, limits, inclusive]) Compute the trimmed variance
tmin(a[, lowerlimit, axis, inclusive]) Compute the trimmed minimum
tmax(a[, upperlimit, axis, inclusive]) Compute the trimmed maximum
tstd(a[, limits, inclusive]) Compute the trimmed sample standard deviation
tsem(a[, limits, inclusive]) Compute the trimmed standard error of the mean.
nanmean(x[, axis]) Compute the mean over the given axis ignoring nans.
nanstd(x[, axis, bias]) Compute the standard deviation over the given axis, ignoring nans.
nanmedian(x[, axis]) Compute the median along the given axis ignoring nan values.
variation(a[, axis]) Computes the coefficient of variation, the ratio of the biased standard deviation to the mean.
describe函数
这个函数的输出太难看了!
age =[23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat_percent =[9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]
age =np.array(age)
fat_percent =np.array(fat_percent)
data =np.vstack([age, fat_percent]).reshape([-1, 2])
print(stats.describe(data)) DescribeResult(nobs=18, minmax=(array([ 7.8, 17.8]), array([ 60., 61.])), mean=array([ 37.36111111, 37.86666667]), variance=array([ 236.58604575, 188.78588235]), skewness=array([-0.30733374, 0.40999364]), kurtosis=array([-0.65245849, -1.26315357]))
修改了一个输出结果形式
forkey, value instats.describe(data)._asdict().items():print(key, ':', value) nobs : 18
minmax : (array([ 7.8, 17.8]), array([ 60., 61.]))
mean : [ 37.36111111 37.86666667]
variance : [ 236.58604575 188.78588235]
skewness : [-0.30733374 0.40999364]
kurtosis : [-0.65245849 -1.26315357]
也可以使用pandas中的函数进行替代,这样输出比较舒服[python数据处理库pandas]
其它。。。
假设检验
ttest_1samp(a, popmean[, axis]) Calculates the T-test for the mean of ONE group of scores.
ttest_ind(a, b[, axis, equal_var]) Calculates the T-test for the means of TWO INDEPENDENT samples of scores.
ttest_rel(a, b[, axis]) Calculates the T-test on TWO RELATED samples of scores, a and b.
kstest(rvs, cdf[, args, N, alternative, mode]) Perform the Kolmogorov-Smirnov test for goodness of fit.
chisquare(f_obs[, f_exp, ddof, axis]) Calculates a one-way chi square test.
power_divergence(f_obs[, f_exp, ddof, axis, ...]) Cressie-Read power divergence statistic and goodness of fit test.
ks_2samp(data1, data2) Computes the Kolmogorov-Smirnov statistic on 2 samples.
mannwhitneyu(x, y[, use_continuity]) Computes the Mann-Whitney rank test on samples x and y.
tiecorrect(rankvals) Tie correction factor for ties in the Mann-Whitney U and Kruskal-Wallis H tests.
rankdata(a[, method]) Assign ranks to data, dealing with ties appropriately.
ranksums(x, y) Compute the Wilcoxon rank-sum statistic for two samples.
wilcoxon(x[, y, zero_method, correction]) Calculate the Wilcoxon signed-rank test.
kruskal(*args) Compute the Kruskal-Wallis H-test for independent samples
friedmanchisquare(*args) Computes the Friedman test for repeated measurements
ttest_1samp实现了单样本t检验。因此,如果我们想检验数据Abra列的稻谷产量均值,通过零假设,这里我们假定总体稻谷产量均值为15000,我们有: from scipy import stats as ss
# Perform one sample t-test using 1500 as the true mean
print ss.ttest_1samp(a = df.ix[:, 'Abra'], popmean = 15000)
# OUTPUT
(-1.1281738488299586, 0.26270472069109496)
返回下述值组成的元祖:
t : 浮点或数组类型
t统计量
prob : 浮点或数组类型
two-tailed p-value 双侧概率值
通过上面的输出,看到p值是0.267远大于α等于0.05,因此没有充分的证据说平均稻谷产量不是150000。将这个检验应用到所有的变量,同样假设均值为15000,我们有: print ss.ttest_1samp(a = df, popmean = 15000)
# OUTPUT
(array([ -1.12817385, 1.07053437, -65.81425599, -4.564575 , 6.17156198]),
array([ 2.62704721e-01, 2.87680340e-01, 4.15643528e-70,
1.83764399e-05, 2.82461897e-08]))
第一个数组是t统计量,第二个数组则是相应的p值。
其它。。。
列联表函数Contingency table functions
chi2_contingency(observed[, correction, lambda_]) Chi-square test of independence of variables in a contingency table.
contingency.expected_freq(observed) Compute the expected frequencies from a contingency table.
contingency.margins(a) Return a list of the marginal sums of the array a.
fisher_exact(table[, alternative]) Performs a Fisher exact test on a 2x2 contingency table.
绘图测试Plot-tests
ppcc_max(x[, brack, dist]) Returns the shape parameter that maximizes the probability plot correlation coefficient for ppcc_plot(x, a, b[, dist, plot, N]) Returns (shape, ppcc), and optionally plots shape vs.
probplot(x[, sparams, dist, fit, plot]) Calculate quantiles for a probability plot, and optionally show the plot.
boxcox_normplot(x, la, lb[, plot, N]) Compute parameters for a Box-Cox normality plot, optionally show it.
蒙面统计函数Masked statistics functions
argstoarray(*args) Constructs a 2D array from a group of sequences.
betai(a, b, x) Returns the incomplete beta function.
chisquare(f_obs[, f_exp, ddof, axis]) Calculates a one-way chi square test.
count_tied_groups(x[, use_missing]) Counts the number of tied values.
describe(a[, axis]) Computes several descriptive statistics of the passed array.
f_oneway(*args) Performs a 1-way ANOVA, returning an F-value and probability given any f_value_wilks_lambda(ER, EF, dfnum, dfden, a, b) Calculation of Wilks lambda F-statistic for multivariate data, per Maxwell find_repeats(arr) Find repeats in arr and return a tuple (repeats, repeat_count).
friedmanchisquare(*args) Friedman Chi-Square is a non-parametric, one-way within-subjects ANOVA.
kendalltau(x, y[, use_ties, use_missing]) Computes Kendall’s rank correlation tau on two variables x and y.
kendalltau_seasonal(x) Computes a multivariate Kendall’s rank correlation tau, for seasonal data.
kruskalwallis(*args) Compute the Kruskal-Wallis H-test for independent samples
kruskalwallis(*args) Compute the Kruskal-Wallis H-test for independent samples
ks_twosamp(data1, data2[, alternative]) Computes the Kolmogorov-Smirnov test on two samples.
ks_twosamp(data1, data2[, alternative]) Computes the Kolmogorov-Smirnov test on two samples.
kurtosis(a[, axis, fisher, bias]) Computes the kurtosis (Fisher or Pearson) of a dataset.
kurtosistest(a[, axis]) Tests whether a dataset has normal kurtosis
linregress(*args) Calculate a regression line
mannwhitneyu(x, y[, use_continuity]) Computes the Mann-Whitney statistic
plotting_positions(data[, alpha, beta]) Returns plotting positions (or empirical percentile points) for the data.
mode(a[, axis]) Returns an array of the modal (most common) value in the passed array.
moment(a[, moment, axis]) Calculates the nth moment about the mean for a sample.
mquantiles(a[, prob, alphap, betap, axis, limit]) Computes empirical quantiles for a data array.
msign(x) Returns the sign of x, or 0 if x is masked.
normaltest(a[, axis]) Tests whether a sample differs from a normal distribution.
obrientransform(*args) Computes a transform on input data (any number of columns).
pearsonr(x, y) Calculates a Pearson correlation coefficient and the p-value for testing non-plotting_positions(data[, alpha, beta]) Returns plotting positions (or empirical percentile points) for the data.
pointbiserialr(x, y) Calculates a point biserial correlation coefficient and the associated p-value.
rankdata(data[, axis, use_missing]) Returns the rank (also known as order statistics) of each data point along scoreatpercentile(data, per[, limit, ...]) Calculate the score at the given ‘per’ percentile of the sequence a.
sem(a[, axis, ddof]) Calculates the standard error of the mean (or standard error of measurement) signaltonoise(data[, axis]) Calculates the signal-to-noise ratio, as the ratio of the mean over standard skew(a[, axis, bias]) Computes the skewness of a data set.
skewtest(a[, axis]) Tests whether the skew is different from the normal distribution.
spearmanr(x, y[, use_ties]) Calculates a Spearman rank-order correlation coefficient and the p-value theilslopes(y[, x, alpha]) Computes the Theil slope as the median of all slopes between paired values.
threshold(a[, threshmin, threshmax, newval]) Clip array to a given value.
tmax(a, upperlimit[, axis, inclusive]) Compute the trimmed maximum
tmean(a[, limits, inclusive]) Compute the trimmed mean.
tmin(a[, lowerlimit, axis, inclusive]) Compute the trimmed minimum
trim(a[, limits, inclusive, relative, axis]) Trims an array by masking the data outside some given limits.
trima(a[, limits, inclusive]) Trims an array by masking the data outside some given limits.
trimboth(data[, proportiontocut, inclusive, ...]) Trims the smallest and largest data values.
trimmed_stde(a[, limits, inclusive, axis]) Returns the standard error of the trimmed mean along the given axis.
trimr(a[, limits, inclusive, axis]) Trims an array by masking some proportion of the data on each end.
trimtail(data[, proportiontocut, tail, ...]) Trims the data by masking values from one tail.
tsem(a[, limits, inclusive]) Compute the trimmed standard error of the mean.
ttest_onesamp(a, popmean[, axis]) Calculates the T-test for the mean of ONE group of scores.
ttest_ind(a, b[, axis]) Calculates the T-test for the means of TWO INDEPENDENT samples of ttest_onesamp(a, popmean[, axis]) Calculates the T-test for the mean of ONE group of scores.
ttest_rel(a, b[, axis]) Calculates the T-test on TWO RELATED samples of scores, a and b.
tvar(a[, limits, inclusive]) Compute the trimmed variance
variation(a[, axis]) Computes the coefficient of variation, the ratio of the biased standard deviation winsorize(a[, limits, inclusive, inplace, axis]) Returns a Winsorized version of the input array.
zmap(scores, compare[, axis, ddof]) Calculates the relative z-scores.
zscore(a[, axis, ddof]) Calculates the z score of each value in the sample, relative to the sample
单变量和多变量核密度估计Univariate and multivariate kernel density estimation (scipy.stats.kde)
gaussian_kde(dataset[, bw_method]) Representation of a kernel-density estimate using Gaussian kernels.
皮皮blog
Statistical functions for masked arrays (scipy.stats.mstats)
。。。
from:http://blog.csdn.net/pipisorry/article/details/49515215
ref:Statistical functions (scipy.stats)