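Python has a good package for statistical inference: the stats module in SciPy.
scipy.stats provides random variables for many probability distributions. Random variables are either continuous or discrete.
Every continuous random variable is an instance of a subclass of rv_continuous, and every discrete random variable is an instance of a subclass of rv_discrete.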
This module contains a large number of probability distributions as well as a growing library of statistical functions.
Each univariate distribution is an instance of a subclass of rv_continuous(rv_discrete for discrete distributions):
rv_continuous([momtype, a, b, xtol, ...]) | A generic continuous random variable class meant for subclassing. |
rv_discrete([a, b, name, badvalue, ...]) | A generic discrete random variable class meant for subclassing. |
皮皮blog
alpha | An alpha continuous random variable. |
anglit | An anglit continuous random variable. |
arcsine | An arcsine continuous random variable. |
beta | A beta continuous random variable. |
betaprime | A beta prime continuous random variable. |
bradford | A Bradford continuous random variable. |
burr | A Burr (Type III) continuous random variable. |
burr12 | A Burr (Type XII) continuous random variable. |
cauchy | A Cauchy continuous random variable. |
chi | A chi continuous random variable. |
chi2 | A chi-squared continuous random variable. |
cosine | A cosine continuous random variable. |
dgamma | A double gamma continuous random variable. |
dweibull | A double Weibull continuous random variable. |
erlang | An Erlang continuous random variable. |
expon | An exponential continuous random variable. |
exponnorm | An exponentially modified Normal continuous random variable. |
exponweib | An exponentiated Weibull continuous random variable. |
exponpow | An exponential power continuous random variable. |
f | An F continuous random variable. |
fatiguelife | A fatigue-life (Birnbaum-Saunders) continuous random variable. |
fisk | A Fisk continuous random variable. |
foldcauchy | A folded Cauchy continuous random variable. |
foldnorm | A folded normal continuous random variable. |
frechet_r | A Frechet right (or Weibull minimum) continuous random variable. |
frechet_l | A Frechet left (or Weibull maximum) continuous random variable. |
genlogistic | A generalized logistic continuous random variable. |
gennorm | A generalized normal continuous random variable. |
genpareto | A generalized Pareto continuous random variable. |
genexpon | A generalized exponential continuous random variable. |
genextreme | A generalized extreme value continuous random variable. |
gausshyper | A Gauss hypergeometric continuous random variable. |
gamma | A gamma continuous random variable. |
gengamma | A generalized gamma continuous random variable. |
genhalflogistic | A generalized half-logistic continuous random variable. |
gilbrat | A Gilbrat continuous random variable. |
gompertz | A Gompertz (or truncated Gumbel) continuous random variable. |
gumbel_r | A right-skewed Gumbel continuous random variable. |
gumbel_l | A left-skewed Gumbel continuous random variable. |
halfcauchy | A Half-Cauchy continuous random variable. |
halflogistic | A half-logistic continuous random variable. |
halfnorm | A half-normal continuous random variable. |
halfgennorm | The upper half of a generalized normal continuous random variable. |
hypsecant | A hyperbolic secant continuous random variable. |
invgamma | An inverted gamma continuous random variable. |
invgauss | An inverse Gaussian continuous random variable. |
invweibull | An inverted Weibull continuous random variable. |
johnsonsb | A Johnson SB continuous random variable. |
johnsonsu | A Johnson SU continuous random variable. |
kappa4 | Kappa 4 parameter distribution. |
kappa3 | Kappa 3 parameter distribution. |
ksone | General Kolmogorov-Smirnov one-sided test. |
kstwobign | Kolmogorov-Smirnov two-sided test for large N. |
laplace | A Laplace continuous random variable. |
levy | A Levy continuous random variable. |
levy_l | A left-skewed Levy continuous random variable. |
levy_stable | A Levy-stable continuous random variable. |
logistic | A logistic (or Sech-squared) continuous random variable. |
loggamma | A log gamma continuous random variable. |
loglaplace | A log-Laplace continuous random variable. |
lognorm | A lognormal continuous random variable. |
lomax | A Lomax (Pareto of the second kind) continuous random variable. |
maxwell | A Maxwell continuous random variable. |
mielke | A Mielke’s Beta-Kappa continuous random variable. |
nakagami | A Nakagami continuous random variable. |
ncx2 | A non-central chi-squared continuous random variable. |
ncf | A non-central F distribution continuous random variable. |
nct | A non-central Student’s T continuous random variable. |
norm | A normal continuous random variable. |
pareto | A Pareto continuous random variable. |
pearson3 | A pearson type III continuous random variable. |
powerlaw | A power-function continuous random variable. |
powerlognorm | A power log-normal continuous random variable. |
powernorm | A power normal continuous random variable. |
rdist | An R-distributed continuous random variable. |
reciprocal | A reciprocal continuous random variable. |
rayleigh | A Rayleigh continuous random variable. |
rice | A Rice continuous random variable. |
recipinvgauss | A reciprocal inverse Gaussian continuous random variable. |
semicircular | A semicircular continuous random variable. |
skewnorm | A skew-normal random variable. |
t | A Student’s T continuous random variable. |
trapz | A trapezoidal continuous random variable. |
triang | A triangular continuous random variable. |
truncexpon | A truncated exponential continuous random variable. |
truncnorm | A truncated normal continuous random variable. |
tukeylambda | A Tukey-Lambda continuous random variable. |
uniform | A uniform continuous random variable. |
vonmises | A Von Mises continuous random variable. |
vonmises_line | A Von Mises continuous random variable. |
wald | A Wald continuous random variable. |
weibull_min | A Frechet right (or Weibull minimum) continuous random variable. |
weibull_max | A Frechet left (or Weibull maximum) continuous random variable. |
wrapcauchy | A wrapped Cauchy continuous random variable. |
rvs(*args, **kwds) | Random variates of given type. Draws random samples from the distribution; the size argument sets the shape of the output array. |
pdf(x, *args, **kwds) | Probability density function at x of the given RV. Returns the density of the distribution at each x. |
logpdf(x, *args, **kwds) | Log of the probability density function at x of the given RV. |
cdf(x, *args, **kwds) | Cumulative distribution function of the given RV. It is the integral of the probability density function, i.e. P(X <= x) at x. |
logcdf(x, *args, **kwds) | Log of the cumulative distribution function at x of the given RV. |
sf(x, *args, **kwds) | Survival function (1 - cdf) at x of the given RV. |
logsf(x, *args, **kwds) | Log of the survival function of the given RV. |
ppf(q, *args, **kwds) | Percent point function (inverse of cdf) at q of the given RV. For example, with q = 0.01, ppf returns the x for which P(X <= x) = 0.01. |
isf(q, *args, **kwds) | Inverse survival function (inverse of sf) at q of the given RV. |
moment(n, *args, **kwds) | n-th order non-central moment of distribution. |
stats(*args, **kwds) | Some statistics of the given RV. By default it returns the mean and variance of the random variable. |
entropy(*args, **kwds) | Differential entropy of the RV. |
expect([func, args, loc, scale, lb, ub, ...]) | Calculate expected value of a function with respect to the distribution. |
median(*args, **kwds) | Median of the distribution. |
mean(*args, **kwds) | Mean of the distribution. |
std(*args, **kwds) | Standard deviation of the distribution. |
var(*args, **kwds) | Variance of the distribution. |
interval(alpha, *args, **kwds) | Confidence interval with equal areas around the median. |
__call__(*args, **kwds) | Freeze the distribution for the given arguments. |
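fit(data, *args, **kwds) | Return MLEs for shape, location, and scale parameters from data. Fits the distribution to a sample and returns the parameters that best match it. For example, stats.norm.fit(x) treats x as a sample from a normal distribution and returns the best-fit parameters (mean, std). |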
fit_loc_scale(data, *args) | Estimate loc and scale parameters from data using 1st and 2nd moments. |
nnlf(theta, x) | Return negative loglikelihood function. |
[scipy.stats.rv_continuous]
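As a minimal sketch of how the methods listed above relate to each other, the following uses the standard normal distribution. The variable names and the evaluation point 1.5 are chosen only for illustration.

```python
from scipy import stats

# Freeze a normal distribution with mean 0 and standard deviation 1.
rv = stats.norm(loc=0.0, scale=1.0)

x = 1.5
cdf_x = rv.cdf(x)          # P(X <= x)
sf_x = rv.sf(x)            # 1 - cdf(x)
x_back = rv.ppf(cdf_x)     # ppf inverts cdf
x_back_sf = rv.isf(sf_x)   # isf inverts sf

# stats() returns the mean and the variance by default.
mean, var = rv.stats()

# interval(0.95) returns the central range holding 95% of the probability.
low, high = rv.interval(0.95)

print(cdf_x + sf_x, x_back, x_back_sf)
print(float(mean), float(var), low, high)
```

Freezing the distribution once (via `__call__`) avoids repeating `loc` and `scale` in every method call.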
multivariate_normal | A multivariate normal random variable. |
matrix_normal | A matrix normal random variable. |
dirichlet | A Dirichlet random variable. |
wishart | A Wishart random variable. |
invwishart | An inverse Wishart random variable. |
special_ortho_group | A matrix-valued SO(N) random variable. |
ortho_group | A matrix-valued O(N) random variable. |
random_correlation | A random correlation matrix. |
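>>> import numpy as np
>>> from scipy.stats import multivariate_normal
>>> x, y = np.mgrid[-1:1:.01, -1:1:.01]
>>> pos = np.dstack((x, y))  # stack the two coordinate grids into a 3-D array of (x, y) points
>>> rv = multivariate_normal([0.5, -0.2], [[2.0, 0.3], [0.3, 0.5]])
>>> rv.pdf(pos)  # the last axis holds one (x, y) point; the first two axes index the grid position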
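When a random variable can take only discrete values, its distribution is called a discrete probability distribution. For example, throwing a six-sided die yields only the integers 1 to 6, so the resulting distribution is discrete.
Discrete distributions are usually described by a probability mass function (PMF). In the stats module, all discrete random variables inherit from the rv_discrete class.
In stats.rv_discrete(values=(x, p)), the arguments are the possible values x of the random variable and their corresponding probabilities p.
Suppose we have an unfair die whose faces come up with unequal probabilities. The array x below holds all possible values of the die, and the array p holds the probability of each value: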
>>> x = range(1,7)
>>> p = (0.4, 0.2, 0.1, 0.1, 0.1, 0.1)
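The following statements define a random variable for this particular die and call its rvs() method to throw the die 20 times, producing random numbers that follow the probabilities p: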
>>> dice = stats.rv_discrete(values=(x,p))
>>> dice.rvs(size=20)
array([2, 5, 1, 2, 1, 1, 2, 4, 1, 3, 1, 1, 4, 3, 1, 1, 1, 2, 6, 4])
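To check that such a custom distribution behaves as intended, one can compare its exact PMF and mean with the frequencies of many simulated throws. This is only a sketch; the sample size and the seed are arbitrary choices.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 7)
p = (0.4, 0.2, 0.1, 0.1, 0.1, 0.1)
dice = stats.rv_discrete(values=(x, p))

# Exact quantities taken from the definition of the distribution.
pmf = dice.pmf(x)             # probabilities of the faces 1..6
mean_exact = dice.mean()      # sum of x * p

# Empirical frequencies from many simulated throws (arbitrary seed).
throws = dice.rvs(size=100_000, random_state=0)
freq = np.array([(throws == face).mean() for face in x])

print(pmf)
print(mean_exact, throws.mean())
print(np.abs(freq - pmf).max())
```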
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

fs_meetsig = np.random.random(30)
fs_xk = np.sort(fs_meetsig)
fs_pk = np.ones_like(fs_xk) / len(fs_xk)
fs_rv_dist = stats.rv_discrete(name='fs_rv_dist', values=(fs_xk, fs_pk))
plt.plot(fs_xk, fs_rv_dist.cdf(fs_xk), 'b-', ms=12, mec='r', label='friend')
plt.show()
[rv_discrete Examples]
bernoulli | A Bernoulli discrete random variable. |
binom | A binomial discrete random variable. |
boltzmann | A Boltzmann (Truncated Discrete Exponential) random variable. |
dlaplace | A Laplacian discrete random variable. |
geom | A geometric discrete random variable. |
hypergeom | A hypergeometric discrete random variable. |
logser | A Logarithmic (Log-Series, Series) discrete random variable. |
nbinom | A negative binomial discrete random variable. |
planck | A Planck discrete exponential random variable. |
poisson | A Poisson discrete random variable. |
randint | A uniform discrete random variable. |
skellam | A Skellam discrete random variable. |
zipf | A Zipf discrete random variable. |
rvs(*args, **kwargs) | Random variates of given type. |
pmf(k, *args, **kwds) | Probability mass function at k of the given RV. |
logpmf(k, *args, **kwds) | Log of the probability mass function at k of the given RV. |
cdf(k, *args, **kwds) | Cumulative distribution function of the given RV. |
logcdf(k, *args, **kwds) | Log of the cumulative distribution function at k of the given RV. |
sf(k, *args, **kwds) | Survival function (1 - cdf) at k of the given RV. |
logsf(k, *args, **kwds) | Log of the survival function of the given RV. |
ppf(q, *args, **kwds) | Percent point function (inverse of cdf) at q of the given RV. |
isf(q, *args, **kwds) | Inverse survival function (inverse of sf) at q of the given RV. |
moment(n, *args, **kwds) | n-th order non-central moment of distribution. |
stats(*args, **kwds) | Some statistics of the given RV. |
entropy(*args, **kwds) | Differential entropy of the RV. |
expect([func, args, loc, lb, ub, ...]) | Calculate expected value of a function with respect to the distribution for discrete distribution. |
median(*args, **kwds) | Median of the distribution. |
mean(*args, **kwds) | Mean of the distribution. |
std(*args, **kwds) | Standard deviation of the distribution. |
var(*args, **kwds) | Variance of the distribution. |
interval(alpha, *args, **kwds) | Confidence interval with equal areas around the median. |
__call__(*args, **kwds) | Freeze the distribution for the given arguments. |
{Top-level functions of scipy.stats that can be applied to many distributions}
Several of these functions have a similar version in scipy.stats.mstats which work for masked arrays.
describe(a[, axis, ddof, bias, nan_policy]) | Computes several descriptive statistics of the passed array. |
gmean(a[, axis, dtype]) | Compute the geometric mean along the specified axis. |
hmean(a[, axis, dtype]) | Calculates the harmonic mean along the specified axis. |
kurtosis(a[, axis, fisher, bias, nan_policy]) | Computes the kurtosis (Fisher or Pearson) of a dataset. |
kurtosistest(a[, axis, nan_policy]) | Tests whether a dataset has normal kurtosis |
mode(a[, axis, nan_policy]) | Returns an array of the modal (most common) value in the passed array. |
moment(a[, moment, axis, nan_policy]) | Calculates the nth moment about the mean for a sample. |
normaltest(a[, axis, nan_policy]) | Tests whether a sample differs from a normal distribution. |
skew(a[, axis, bias, nan_policy]) | Computes the skewness of a data set. |
skewtest(a[, axis, nan_policy]) | Tests whether the skew is different from the normal distribution. |
kstat(data[, n]) | Return the nth k-statistic (1<=n<=4 so far). |
kstatvar(data[, n]) | Returns an unbiased estimator of the variance of the k-statistic. |
tmean(a[, limits, inclusive, axis]) | Compute the trimmed mean. |
tvar(a[, limits, inclusive, axis, ddof]) | Compute the trimmed variance |
tmin(a[, lowerlimit, axis, inclusive, ...]) | Compute the trimmed minimum |
tmax(a[, upperlimit, axis, inclusive, ...]) | Compute the trimmed maximum |
tstd(a[, limits, inclusive, axis, ddof]) | Compute the trimmed sample standard deviation |
tsem(a[, limits, inclusive, axis, ddof]) | Compute the trimmed standard error of the mean. |
variation(a[, axis, nan_policy]) | Computes the coefficient of variation, the ratio of the biased standard deviation to the mean. |
find_repeats(arr) | Find repeats and repeat counts. |
trim_mean(a, proportiontocut[, axis]) | Return mean of array after trimming distribution from both tails. |
cumfreq(a[, numbins, defaultreallimits, weights]) | Returns a cumulative frequency histogram, using the histogram function. |
histogram2(*args, **kwds) | histogram2 is deprecated! |
histogram(*args, **kwds) | histogram is deprecated! |
itemfreq(a) | Returns a 2-D array of item frequencies. |
percentileofscore(a, score[, kind]) | The percentile rank of a score relative to a list of scores. |
scoreatpercentile(a, per[, limit, ...]) | Calculate the score at a given percentile of the input sequence. |
relfreq(a[, numbins, defaultreallimits, weights]) | Returns a relative frequency histogram, using the histogram function. |
binned_statistic(x, values[, statistic, ...]) | Compute a binned statistic for one or more sets of data. |
binned_statistic_2d(x, y, values[, ...]) | Compute a bidimensional binned statistic for one or more sets of data. |
binned_statistic_dd(sample, values[, ...]) | Compute a multidimensional binned statistic for a set of data. |
obrientransform(*args) | Computes the O’Brien transform on input data (any number of arrays). |
signaltonoise(*args, **kwds) | signaltonoise is deprecated! |
bayes_mvs(data[, alpha]) | Bayesian confidence intervals for the mean, var, and std. |
mvsdist(data) | ‘Frozen’ distributions for mean, variance, and standard deviation of data. |
sem(a[, axis, ddof, nan_policy]) | Calculates the standard error of the mean (or standard error of measurement) of the values in the input array. |
zmap(scores, compare[, axis, ddof]) | Calculates the relative z-scores. |
zscore(a[, axis, ddof]) | Calculates the z score of each value in the sample, relative to the sample mean and standard deviation. |
iqr(x[, axis, rng, scale, nan_policy, ...]) | Compute the interquartile range of the data along the specified axis. |
sigmaclip(a[, low, high]) | Iterative sigma-clipping of array elements. |
threshold(*args, **kwds) | threshold is deprecated! |
trimboth(a, proportiontocut[, axis]) | Slices off a proportion of items from both ends of an array. |
trim1(a, proportiontocut[, tail, axis]) | Slices off a proportion from ONE end of the passed array distribution. |
f_oneway(*args) | Performs a 1-way ANOVA. |
pearsonr(x, y) | Calculates a Pearson correlation coefficient and the p-value for testing non-correlation. |
spearmanr(a[, b, axis, nan_policy]) | Calculates a Spearman rank-order correlation coefficient and the p-value to test for non-correlation. |
pointbiserialr(x, y) | Calculates a point biserial correlation coefficient and its p-value. |
kendalltau(x, y[, initial_lexsort, nan_policy]) | Calculates Kendall’s tau, a correlation measure for ordinal data. |
linregress(x[, y]) | Calculate a linear least-squares regression for two sets of measurements. |
theilslopes(y[, x, alpha]) | Computes the Theil-Sen estimator for a set of points (x, y). |
f_value(*args, **kwds) | f_value is deprecated! |
ttest_1samp(a, popmean[, axis, nan_policy]) | Calculates the T-test for the mean of ONE group of scores. |
ttest_ind(a, b[, axis, equal_var, nan_policy]) | Calculates the T-test for the means of two independent samples of scores. |
ttest_ind_from_stats(mean1, std1, nobs1, ...) | T-test for means of two independent samples from descriptive statistics. |
ttest_rel(a, b[, axis, nan_policy]) | Calculates the T-test on TWO RELATED samples of scores, a and b. |
kstest(rvs, cdf[, args, N, alternative, mode]) | Perform the Kolmogorov-Smirnov test for goodness of fit. |
chisquare(f_obs[, f_exp, ddof, axis]) | Calculates a one-way chi square test. |
power_divergence(f_obs[, f_exp, ddof, axis, ...]) | Cressie-Read power divergence statistic and goodness of fit test. |
ks_2samp(data1, data2) | Computes the Kolmogorov-Smirnov statistic on 2 samples. |
mannwhitneyu(x, y[, use_continuity, alternative]) | Computes the Mann-Whitney rank test on samples x and y. |
tiecorrect(rankvals) | Tie correction factor for ties in the Mann-Whitney U and Kruskal-Wallis H tests. |
rankdata(a[, method]) | Assign ranks to data, dealing with ties appropriately. |
ranksums(x, y) | Compute the Wilcoxon rank-sum statistic for two samples. |
wilcoxon(x[, y, zero_method, correction]) | Calculate the Wilcoxon signed-rank test. |
kruskal(*args, **kwargs) | Compute the Kruskal-Wallis H-test for independent samples |
friedmanchisquare(*args) | Computes the Friedman test for repeated measurements |
combine_pvalues(pvalues[, method, weights]) | Methods for combining the p-values of independent tests bearing upon the same hypothesis. |
ss(*args, **kwds) | ss is deprecated! |
square_of_sums(*args, **kwds) | square_of_sums is deprecated! |
jarque_bera(x) | Perform the Jarque-Bera goodness of fit test on sample data. |
ansari(x, y) | Perform the Ansari-Bradley test for equal scale parameters |
bartlett(*args) | Perform Bartlett’s test for equal variances |
levene(*args, **kwds) | Perform Levene test for equal variances. |
shapiro(x[, a, reta]) | Perform the Shapiro-Wilk test for normality. |
anderson(x[, dist]) | Anderson-Darling test for data coming from a particular distribution |
anderson_ksamp(samples[, midrank]) | The Anderson-Darling test for k-samples. |
binom_test(x[, n, p, alternative]) | Perform a test that the probability of success is p. |
fligner(*args, **kwds) | Perform Fligner-Killeen test for equality of variance. |
median_test(*args, **kwds) | Mood’s median test. |
mood(x, y[, axis]) | Perform Mood’s test for equal scale parameters. |
boxcox(x[, lmbda, alpha]) | Return a positive dataset transformed by a Box-Cox power transformation. |
boxcox_normmax(x[, brack, method]) | Compute optimal Box-Cox transform parameter for input data. |
boxcox_llf(lmb, data) | The boxcox log-likelihood function. |
entropy(pk[, qk, base]) | Calculate the entropy of a distribution for given probability values. |
chisqprob(*args, **kwds) | chisqprob is deprecated! |
betai(*args, **kwds) | betai is deprecated! |
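The raw output of this function is hard to read:
age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat_percent = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]
age = np.array(age)
fat_percent = np.array(fat_percent)
# Note: vstack gives a 2 x 18 array, so reshape([-1, 2]) pairs consecutive values of the same variable
# rather than (age, fat_percent); np.column_stack([age, fat_percent]) would put one observation per row.
data = np.vstack([age, fat_percent]).reshape([-1, 2])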
print(stats.describe(data))
DescribeResult(nobs=18, minmax=(array([ 7.8, 17.8]), array([ 60., 61.])), mean=array([ 37.36111111, 37.86666667]), variance=array([ 236.58604575, 188.78588235]), skewness=array([-0.30733374, 0.40999364]), kurtosis=array([-0.65245849, -1.26315357]))
A more readable output format:
for key, value in stats.describe(data)._asdict().items():
    print(key, ':', value)
nobs : 18
The equivalent pandas functions can be used instead and give more readable output [pandas, a Python data-processing library].
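The same data can also be arranged so that each row is one (age, fat_percent) observation, in which case describe() reports the statistics of each variable separately. The sketch below uses np.column_stack for that purpose; this arrangement is an assumption about what the example intends, not the layout used above.

```python
import numpy as np
from scipy import stats

age = np.array([23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61])
fat_percent = np.array([9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
                        34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7])

# column_stack keeps each (age, fat_percent) pair together on one row.
data = np.column_stack([age, fat_percent])

# describe() works column by column (axis=0 by default).
result = stats.describe(data)
for key, value in result._asdict().items():
    print(key, ':', value)
```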
scipy.stats.entropy(pk, qk=None, base=None)
Calculate the entropy of a distribution for given probability values.
If only probabilities pk are given, the entropy is calculated as S = -sum(pk * log(pk), axis=0).
If qk is not None, then compute the Kullback-Leibler divergence S = sum(pk * log(pk / qk), axis=0).
This routine will normalize pk and qk if they don’t sum to 1.
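The two formulas above can be checked against a direct NumPy computation. The probability vectors pk and qk below are made-up illustrative values.

```python
import numpy as np
from scipy import stats

# Illustrative probability vectors over the same three outcomes.
pk = np.array([0.5, 0.25, 0.25])
qk = np.array([0.4, 0.4, 0.2])

# Shannon entropy (natural logarithm) and its direct formula.
shannon = stats.entropy(pk)
shannon_manual = -np.sum(pk * np.log(pk))

# Kullback-Leibler divergence of pk from qk and its direct formula.
kl = stats.entropy(pk, qk)
kl_manual = np.sum(pk * np.log(pk / qk))

# Unnormalised counts are normalised internally before the computation.
shannon_counts = stats.entropy([2, 1, 1])

print(shannon, shannon_manual)
print(kl, kl_manual)
print(shannon_counts)
```

Both arguments are arrays of probabilities (or counts) over the same outcomes; the KL divergence is not symmetric, so entropy(pk, qk) and entropy(qk, pk) generally differ.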
Computing the Shannon entropy with entropy
shannon_entropy = stats.entropy(ij/sum(ij), base=None)
print(shannon_entropy)
A direct Python implementation of entropy
shannon_entropy_func = lambda pij: -sum(pij*np.log(pij))
shannon_entropy = shannon_entropy_func(ij[np.nonzero(ij)])
print(shannon_entropy)
def entropy(counts):
    ...
    return H
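Computing the KL divergence between two distributions
kl = sp.stats.entropy(fs_rv_dist, nonfs_rv_dist)
Note: entropy expects arrays of probabilities (or counts) defined over the same outcomes, so the two arguments must be such arrays rather than rv_discrete objects.
Other implementations of the KL divergence [distance and similarity measures]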
[scipy.stats.entropy?]
ttest_1samp(a, popmean[, axis]) Calculates the T-test for the mean of ONE group of scores.
ttest_ind(a, b[, axis, equal_var]) Calculates the T-test for the means of TWO INDEPENDENT samples of scores.
ttest_rel(a, b[, axis]) Calculates the T-test on TWO RELATED samples of scores, a and b.
kstest(rvs, cdf[, args, N, alternative, mode]) Perform the Kolmogorov-Smirnov test for goodness of fit.
chisquare(f_obs[, f_exp, ddof, axis]) Calculates a one-way chi square test.
power_divergence(f_obs[, f_exp, ddof, axis, ...]) Cressie-Read power divergence statistic and goodness of fit test.
ks_2samp(data1, data2) Computes the Kolmogorov-Smirnov statistic on 2 samples.
mannwhitneyu(x, y[, use_continuity]) Computes the Mann-Whitney rank test on samples x and y.
tiecorrect(rankvals) Tie correction factor for ties in the Mann-Whitney U and Kruskal-Wallis H tests.
rankdata(a[, method]) Assign ranks to data, dealing with ties appropriately.
ranksums(x, y) Compute the Wilcoxon rank-sum statistic for two samples.
wilcoxon(x[, y, zero_method, correction]) Calculate the Wilcoxon signed-rank test.
kruskal(*args) Compute the Kruskal-Wallis H-test for independent samples
friedmanchisquare(*args) Computes the Friedman test for repeated measurements
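ttest_1samp performs a one-sample t-test. Suppose we want to test the mean rice yield in the Abra column of the data, with the null hypothesis that the population mean yield is 15000:
from scipy import stats as ss
It returns a tuple consisting of the t statistic and the p-value.
From the output, the p-value is 0.267, far greater than α = 0.05, so there is not sufficient evidence that the mean rice yield differs from 15000. Applying the test to all variables, again assuming a mean of 15000, we have:
print(ss.ttest_1samp(a=df, popmean=15000))
The first array holds the t statistics and the second array holds the corresponding p-values.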
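The following sketch shows what ttest_1samp computes, using simulated yields rather than the data set discussed above. The sample size, spread and seed are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical yields: simulated values, not the data set from the text.
yields = rng.normal(loc=15000, scale=2000, size=50)

t_stat, p_value = stats.ttest_1samp(yields, popmean=15000)

# The same t statistic computed by hand from the sample mean and standard error.
n = yields.size
t_manual = (yields.mean() - 15000) / (yields.std(ddof=1) / np.sqrt(n))
# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

print(t_stat, p_value)
print(t_manual, p_manual)
```

A p-value above the chosen significance level means the null hypothesis (mean equal to popmean) is not rejected.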
chi2_contingency(observed[, correction, lambda_]) Chi-square test of independence of variables in a contingency table.
contingency.expected_freq(observed) Compute the expected frequencies from a contingency table.
contingency.margins(a) Return a list of the marginal sums of the array a.
fisher_exact(table[, alternative]) Performs a Fisher exact test on a 2x2 contingency table.
ppcc_max(x[, brack, dist]) Returns the shape parameter that maximizes the probability plot correlation coefficient for
ppcc_plot(x, a, b[, dist, plot, N]) Returns (shape, ppcc), and optionally plots shape vs.
probplot(x[, sparams, dist, fit, plot]) Calculate quantiles for a probability plot, and optionally show the plot.
boxcox_normplot(x, la, lb[, plot, N]) Compute parameters for a Box-Cox normality plot, optionally show it.
Statistical functions for masked arrays (scipy.stats.mstats)
Masked statistics functions
argstoarray(*args) Constructs a 2D array from a group of sequences.
betai(a, b, x) Returns the incomplete beta function.
chisquare(f_obs[, f_exp, ddof, axis]) Calculates a one-way chi square test.
count_tied_groups(x[, use_missing]) Counts the number of tied values.
describe(a[, axis]) Computes several descriptive statistics of the passed array.
f_oneway(*args) Performs a 1-way ANOVA, returning an F-value and probability given any
f_value_wilks_lambda(ER, EF, dfnum, dfden, a, b) Calculation of Wilks lambda F-statistic for multivariate data, per Maxwell
find_repeats(arr) Find repeats in arr and return a tuple (repeats, repeat_count).
friedmanchisquare(*args) Friedman Chi-Square is a non-parametric, one-way within-subjects ANOVA.
kendalltau(x, y[, use_ties, use_missing]) Computes Kendall’s rank correlation tau on two variables x and y.
kendalltau_seasonal(x) Computes a multivariate Kendall’s rank correlation tau, for seasonal data.
kruskalwallis(*args) Compute the Kruskal-Wallis H-test for independent samples
ks_twosamp(data1, data2[, alternative]) Computes the Kolmogorov-Smirnov test on two samples.
kurtosis(a[, axis, fisher, bias]) Computes the kurtosis (Fisher or Pearson) of a dataset.
kurtosistest(a[, axis]) Tests whether a dataset has normal kurtosis
linregress(*args) Calculate a regression line
mannwhitneyu(x, y[, use_continuity]) Computes the Mann-Whitney statistic
plotting_positions(data[, alpha, beta]) Returns plotting positions (or empirical percentile points) for the data.
mode(a[, axis]) Returns an array of the modal (most common) value in the passed array.
moment(a[, moment, axis]) Calculates the nth moment about the mean for a sample.
mquantiles(a[, prob, alphap, betap, axis, limit]) Computes empirical quantiles for a data array.
msign(x) Returns the sign of x, or 0 if x is masked.
normaltest(a[, axis]) Tests whether a sample differs from a normal distribution.
obrientransform(*args) Computes a transform on input data (any number of columns).
pearsonr(x, y) Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.
pointbiserialr(x, y) Calculates a point biserial correlation coefficient and the associated p-value.
rankdata(data[, axis, use_missing]) Returns the rank (also known as order statistics) of each data point along
scoreatpercentile(data, per[, limit, ...]) Calculate the score at the given ‘per’ percentile of the sequence a.
sem(a[, axis, ddof]) Calculates the standard error of the mean (or standard error of measurement)
signaltonoise(data[, axis]) Calculates the signal-to-noise ratio, as the ratio of the mean over standard
skew(a[, axis, bias]) Computes the skewness of a data set.
skewtest(a[, axis]) Tests whether the skew is different from the normal distribution.
spearmanr(x, y[, use_ties]) Calculates a Spearman rank-order correlation coefficient and the p-value
theilslopes(y[, x, alpha]) Computes the Theil slope as the median of all slopes between paired values.
threshold(a[, threshmin, threshmax, newval]) Clip array to a given value.
tmax(a, upperlimit[, axis, inclusive]) Compute the trimmed maximum
tmean(a[, limits, inclusive]) Compute the trimmed mean.
tmin(a[, lowerlimit, axis, inclusive]) Compute the trimmed minimum
trim(a[, limits, inclusive, relative, axis]) Trims an array by masking the data outside some given limits.
trima(a[, limits, inclusive]) Trims an array by masking the data outside some given limits.
trimboth(data[, proportiontocut, inclusive, ...]) Trims the smallest and largest data values.
trimmed_stde(a[, limits, inclusive, axis]) Returns the standard error of the trimmed mean along the given axis.
trimr(a[, limits, inclusive, axis]) Trims an array by masking some proportion of the data on each end.
trimtail(data[, proportiontocut, tail, ...]) Trims the data by masking values from one tail.
tsem(a[, limits, inclusive]) Compute the trimmed standard error of the mean.
ttest_onesamp(a, popmean[, axis]) Calculates the T-test for the mean of ONE group of scores.
ttest_ind(a, b[, axis]) Calculates the T-test for the means of TWO INDEPENDENT samples of scores.
ttest_rel(a, b[, axis]) Calculates the T-test on TWO RELATED samples of scores, a and b.
tvar(a[, limits, inclusive]) Compute the trimmed variance
variation(a[, axis]) Computes the coefficient of variation, the ratio of the biased standard deviation
winsorize(a[, limits, inclusive, inplace, axis]) Returns a Winsorized version of the input array.
zmap(scores, compare[, axis, ddof]) Calculates the relative z-scores.
zscore(a[, axis, ddof]) Calculates the z score of each value in the sample, relative to the sample mean and standard deviation.
gaussian_kde(dataset[, bw_method]) Representation of a kernel-density estimate using Gaussian kernels.
{Gaussian (normal) random variable: a normal continuous random variable.}
Parameters:
The location (loc) keyword specifies the mean.
The scale (scale) keyword specifies the standard deviation.
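The loc and scale arguments of norm set the shift and scaling of the random variable. For a normally distributed random variable, they correspond to its mean and standard deviation.
Random deviates from the Gaussian distribution N(0, 0.01), i.e. variance 0.01 and standard deviation 0.1:
y = stats.norm.rvs(loc=0, scale=0.1, size=10)
Output: array([ 0.05419826, 0.04151471, -0.10784729, 0.18283546, 0.02348312, -0.04611974, 0.0069336 , 0.03840133, -0.05015316, 0.23315205])
stats.norm.stats(loc=0, scale=0.1)  # returns (mean, variance), here approximately (0.0, 0.01); y is a plain array and has no stats() method
Note: Gaussian random numbers can also be generated with numpy.random.normal [NumPy: the numpy.random random number module].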
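>>> x = stats.norm.rvs(loc=1.0, scale=2.0, size=100)
The fit() method fits a distribution to the sample x and returns the parameters of the random variable that best match the sample.
>>> stats.norm.fit(x)  # estimates of the mean and standard deviation of the sample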
array([ 1.01810091, 2.00046946])
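For the normal distribution, the maximum-likelihood estimates returned by fit() are simply the sample mean and the (biased, ddof=0) sample standard deviation. A small sketch, with an arbitrary seed and sample size:

```python
import numpy as np
from scipy import stats

# Draw a sample from N(1, 2^2); the seed is arbitrary.
x = stats.norm.rvs(loc=1.0, scale=2.0, size=10_000, random_state=42)

loc_hat, scale_hat = stats.norm.fit(x)

# The MLEs coincide with the sample mean and the ddof=0 standard deviation.
print(loc_hat, x.mean())
print(scale_hat, x.std())
```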
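Value of the probability density function of the normal distribution N(1, 1) at a given x:
lambda x: norm.pdf(x, 1, 1)
Note: from the normal density formula, norm.pdf(x, mu, sigma) is in general not the same as norm.pdf(x - mu); the two agree only when the standard deviation is 1, as it is here.
Value of the cumulative distribution function of the normal distribution N(1, 1) at a given x:
lambda x: norm.cdf(x, 1, 1)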
[scipy.stats.norm]
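The relationship between the loc/scale arguments and the standard normal density can be verified numerically. In this sketch the grid x and the values mu = 1, sigma = 2 are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-3, 5, 9)

# With sigma = 1, shifting the argument is enough.
same_when_sigma_1 = np.allclose(norm.pdf(x, 1, 1), norm.pdf(x - 1))

# With sigma = 2, the argument must be standardised and the density rescaled.
mu, sigma = 1.0, 2.0
pdf_direct = norm.pdf(x, mu, sigma)
pdf_standardised = norm.pdf((x - mu) / sigma) / sigma
cdf_direct = norm.cdf(x, mu, sigma)
cdf_standardised = norm.cdf((x - mu) / sigma)

# A pure shift no longer reproduces the density when sigma != 1.
shift_only_matches = np.allclose(pdf_direct, norm.pdf(x - mu))

print(same_when_sigma_1)
print(np.allclose(pdf_direct, pdf_standardised))
print(np.allclose(cdf_direct, cdf_standardised))
print(shift_only_matches)
```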
mu = uniform.rvs(size=N)  # sample from the uniform distribution
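The gamma distribution requires an additional shape parameter. It can describe the waiting time until k independent random events have occurred, where k is the shape parameter.
The scale parameter theta of the gamma distribution is related to the rate at which the events occur and is set through the scale argument.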
>>> stats.gamma.stats(2.0,scale=2)
(array(4.0), array(8.0))
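By the mathematical definition of the gamma distribution, its mean is k*theta and its variance is k*theta^2. The call above confirms both formulas. When a distribution has additional shape parameters, its rvs(), pdf() and other methods take extra arguments to receive them.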
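As a sketch of how the shape parameter is passed to the other methods, the example below uses k = 2 and theta = 2 and compares the results with the closed-form expressions. The evaluation point, sample size and seed are arbitrary.

```python
import math
from scipy import stats

k, theta = 2.0, 2.0

# The shape parameter comes first, then the loc/scale keywords.
mean, var = stats.gamma.stats(k, scale=theta)
density = stats.gamma.pdf(1.0, k, scale=theta)
sample = stats.gamma.rvs(k, scale=theta, size=100_000, random_state=0)

# Closed-form density at x = 1 for k = 2: x * exp(-x / theta) / theta**2.
density_manual = 1.0 * math.exp(-1.0 / theta) / theta**2

print(float(mean), k * theta)
print(float(var), k * theta**2)
print(density, density_manual)
print(sample.mean(), sample.var())
```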
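Consider an experiment with only two outcomes whose probability of success is p. The binomial distribution gives the probability of k successes in n such independent trials.
The probability mass function of the binomial distribution is:
$$f(k; n, p) = \binom{n}{k} p^{k} (1-p)^{n-k}$$
With the binomial probability mass function pmf() it is easy to compute the probability of rolling k sixes.
The first argument of pmf() is the value of the random variable; the remaining arguments are the parameters of the distribution. For the binomial distribution these are n and p, and the values range over the integers from 0 to n.
The program below uses the binomial probability mass function to compute the probabilities of getting 0 to 5 sixes in 5 throws of a die: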
>>> stats.binom.pmf(range(6), 5, 1/6.0)
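array([0.401878, 0.401878, 0.160751, 0.032150, 0.003215, 0.000129])
The result shows that the probabilities of getting no six and of getting exactly one six are each 40.2%, while the probability of getting three sixes is 3.215%.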
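The binomial probabilities can also be computed directly from the formula C(n, k) p^k (1-p)^(n-k) and compared with stats.binom.pmf. The helper name binom_pmf below is hypothetical and used only for this sketch.

```python
from math import comb

import numpy as np
from scipy import stats

n, p = 5, 1 / 6


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


manual = np.array([binom_pmf(k, n, p) for k in range(n + 1)])
library = stats.binom.pmf(np.arange(n + 1), n, p)

print(np.round(manual, 6))
print(np.allclose(manual, library), manual.sum())
```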
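In a binomial distribution, if the number of trials n is large and the success probability p of each trial is small while the product np stays moderate, the distribution of the number of successes is well approximated by a Poisson distribution.
The Poisson distribution uses lambda for the average rate of random events per unit time (or unit area). If n is regarded as the number of trials carried out per unit time, its product with the event probability p is the average rate of events, i.e. lambda = np.
The probability mass function of the Poisson distribution is:
$$f(k; \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$
The Poisson distribution is suited to describing the number of random events that occur per unit time, for example the number of times a facility is used in a given period, the number of machine failures, or the number of natural disasters.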
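The approximation of the binomial distribution by the Poisson distribution can be illustrated by holding lambda = n*p fixed and increasing n. The values of n below are arbitrary.

```python
import numpy as np
from scipy import stats

lam = 10.0
k = np.arange(0, 31)
poisson_pmf = stats.poisson.pmf(k, lam)

errors = []
for n in (20, 100, 10_000):
    # Keep n * p equal to lam while n grows and p shrinks.
    binom_pmf = stats.binom.pmf(k, n, lam / n)
    max_err = np.abs(binom_pmf - poisson_pmf).max()
    errors.append(max_err)
    print(n, max_err)
```

The maximum difference between the two PMFs shrinks as n grows, which is the sense in which the Poisson distribution is the limit of the binomial distribution.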
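Below, random numbers are used to simulate a Poisson distribution and compare it with its probability mass function; the average number of events per second is lambda = 10. With observation times of 1000 seconds and 50000 seconds one can see that the longer the observation time, the better the number of events per second matches the Poisson distribution.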
>>> import numpy as np
>>> from scipy import stats
>>> _lambda = 10
>>> time = 10000
>>> t = np.random.rand(_lambda*time)*time
>>> count, time_edges = np.histogram(t, bins=time, range=(0, time))
>>> count
array([10, 9, 8, …, 11, 10, 18])
>>> dist, count_edges = np.histogram(count, bins=20, range=(0, 20), density=True)
>>> x = count_edges[:-1]
>>> poisson = stats.poisson.pmf(x, _lambda)
>>> np.max(np.abs(dist-poisson))  # the maximum error is small, consistent with a Poisson distribution
0.0088356241037075706
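Note: rand() generates the times of _lambda*time events, uniformly distributed between 0 and time.
histogram() then counts the number of events that fall within each second, giving count.
By the definition of the Poisson distribution, the values in count should follow a Poisson distribution. A second histogram() call computes the probability distribution of the event counts over the range 0 to 20. When the density argument (normed in older NumPy versions) is True and every bin has width 1, the result equals the probability mass function.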
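The distribution of random events can also be viewed from another angle: the distribution of the time interval between two adjacent events, or between events that are k events apart. According to probability theory, these intervals follow a gamma distribution. Because an interval can take any value, the gamma distribution is a continuous probability distribution. Its probability density function, which describes the distribution of the waiting time until the k-th event occurs, is:
$$f(x; k, \theta) = \frac{x^{k-1} e^{-x/\theta}}{\theta^{k}\,\Gamma(k)}$$
Here Γ is the gamma function; when k is a positive integer, Γ(k) = (k-1)!.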
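For a positive integer k, the gamma function satisfies Γ(k) = (k-1)!. The sketch below checks this numerically and also compares the standard gamma density with shape k and scale theta against stats.gamma.pdf; the values k = 2 and theta = 0.1 are illustrative.

```python
from math import factorial

import numpy as np
from scipy import stats
from scipy.special import gamma as gamma_function

# Gamma function versus factorial for small integers.
for k in range(1, 7):
    print(k, gamma_function(k), factorial(k - 1))

# Gamma density written out explicitly and compared with scipy.stats.
k, theta = 2, 0.1
x = np.linspace(0.01, 1.0, 50)
manual = x ** (k - 1) * np.exp(-x / theta) / (theta**k * gamma_function(k))
library = stats.gamma.pdf(x, k, scale=theta)

print(np.allclose(manual, library))
```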
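The program below simulates the gamma distribution of the time intervals between events; the observation time is 10000 seconds, with an average of 10 events per second.
Here "k=1" denotes the distribution of the interval between two adjacent events, and "k=2" denotes the distribution of the interval between two events separated by one event. Both follow a gamma distribution.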
>>> _lambda = 10
>>> time = 10000
>>> t = np.random.rand(_lambda*time)*time
>>> t.sort()  # sort the event times before computing the intervals between consecutive events
>>> s1 = t[1:] - t[:-1]  # interval between two adjacent events
>>> s2 = t[2:] - t[:-2]  # interval between two events separated by one event
>>> dist1, x1 = np.histogram(s1, bins=100, density=True)
>>> dist2, x2 = np.histogram(s2, bins=100, density=True)
>>> gamma1 = stats.gamma.pdf((x1[:-1]+x1[1:])/2, 1, scale=1.0/_lambda)
>>> gamma2 = stats.gamma.pdf((x2[:-1]+x2[1:])/2, 2, scale=1.0/_lambda)
>>> np.max(np.abs(gamma1 - dist1))
0.13557317865888141
>>> np.max(np.abs(gamma2 - dist2))
0.087375030861794656
>>> np.max(gamma1), np.max(gamma2)
(9.3483221580498537, 3.6767953241013656)  # the density values themselves are fairly large, so the errors above are already small
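Note: simulating the gamma distribution:
First, the times of 100000 random events are generated within 10000 seconds, so events occur on average 10 times per second;
to compute the intervals between consecutive events, the random times must first be sorted;
the second value returned by histogram() holds the bin edges. When evaluating the gamma density with gamma.pdf(), the midpoint of each bin is used. The second argument of pdf() is k, and the scale argument is 1/λ.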
from:http://blog.csdn.net/pipisorry/article/details/49515215
ref:Statistical functions (scipy.stats)
Random distribution functions in the Python standard library