Python Web Scraping and Text Data Analysis
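pingouin is a Python 3 statistics package built on pandas and NumPy. Its main statistical features include:
Analysis of variance (ANOVA)
Multiple linear regression
Mediation analysis
Chi-squared tests
Q-Q plots
Bayes factors
Reliability and validity tests
And more
I am a statistics novice and do not understand all of these. Many features are not listed here; readers with a stronger statistics background can browse the full API at https://pingouin-stats.org/api.html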
Installation
pip3 install pingouin
Generating the experimental data x and y
import numpy as np
# Fix the random seed so every run produces the same data
np.random.seed(666)
n = 30
mean = [4, 5]
cov = [(1, 0.6), (0.6, 1)]
x, y = np.random.multivariate_normal(mean, cov, n).T
x
array([3.04817645, 2.54387965, 4.56033188, 4.40504338, 3.77876203,
3.87177128, 3.4546112 , 4.47317551, 5.23133856, 5.40273745,
5.19344217, 3.37061786, 3.23980982, 2.85574177, 4.67728276,
4.31935242, 4.39440207, 3.87458876, 4.91426293, 3.13673286,
3.73459839, 4.18708647, 5.48558345, 3.7066784 , 3.73400287,
3.49664637, 3.95954844, 2.61545452, 5.11352964, 5.62666503])
y
array([4.47747109, 4.35695696, 5.46239455, 4.56091782, 4.07534588,
4.03904897, 3.79549165, 5.06121364, 5.71635355, 6.60772697,
6.94890455, 5.13347618, 5.41207983, 3.38254684, 5.49705058,
5.93394729, 4.65224366, 4.59491971, 5.17926604, 4.25844527,
5.72809738, 5.14997732, 5.27606588, 4.94570454, 6.02889647,
5.85451666, 4.90231286, 4.69242625, 4.69367432, 6.71644528])
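The covariance matrix passed to `np.random.multivariate_normal` fixes the population correlation between x and y: with unit variances it equals the off-diagonal entry, 0.6. Below is a minimal sketch that checks this on a much larger sample. The sample size of 100,000 and the use of `np.random.default_rng` are choices made for this illustration only and are not part of the original example.

```python
import numpy as np

# Same mean vector and covariance matrix as in the example above
mean = [4, 5]
cov = np.array([(1, 0.6), (0.6, 1)])

# Population correlation implied by the covariance matrix
rho = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

# Large sample (illustrative size) so the estimate lands close to rho
rng = np.random.default_rng(0)
big_x, big_y = rng.multivariate_normal(mean, cov, 100_000).T
r_hat = np.corrcoef(big_x, big_y)[0, 1]

print(f"implied correlation: {rho:.2f}")
print(f"sample correlation:  {r_hat:.3f}")
```

With only 30 observations, as in the data above, the sample correlation can drift noticeably further from 0.6.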
import matplotlib.pyplot as plt
plt.hist(x, bins=10)
plt.hist(y, bins=10)
import pingouin as pg
pg.ttest(x, y)
| | T | dof | tail | p-val | CI95% | cohen-d | BF10 | power |
|---|---|---|---|---|---|---|---|---|
| T-test | -4.597628 | 58 | two-sided | 0.000024 | [-1.47, -0.58] | 1.187102 | 786.346 | 0.994771 |
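The 58 degrees of freedom in the table are consistent with a pooled-variance (Student) two-sample t-test on two groups of 30 (30 + 30 - 2). The sketch below shows what such a test computes, using NumPy only. The helper `student_t` is a hypothetical name introduced for illustration and is not pingouin's implementation. Cohen's d is returned as an absolute value, since the table reports it as positive while T is negative.

```python
import numpy as np

def student_t(a, b):
    """Pooled-variance two-sample t statistics (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = a.size, b.size
    dof = n1 + n2 - 2
    # Pooled variance built from the two unbiased sample variances
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / dof
    diff = a.mean() - b.mean()
    t = diff / np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    # Cohen's d as an absolute standardized mean difference
    d = abs(diff) / np.sqrt(pooled_var)
    return t, dof, d

t, dof, d = student_t([1, 2, 3], [2, 3, 4])
print(t, dof, d)
```

For two groups of equal size n, these quantities satisfy |T| = d × sqrt(n / 2). The table agrees: 1.187102 × sqrt(15) ≈ 4.5976.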
pg.corr(x, y)
| | n | r | CI95% | r2 | adj_r2 | p-val | BF10 | power |
|---|---|---|---|---|---|---|---|---|
| pearson | 30 | 0.60149 | [0.31, 0.79] | 0.36179 | 0.314515 | 0.000439 | 82.116 | 0.955747 |
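The r2 and adj_r2 columns can be recomputed by hand from r and n. In the short check below, the adjustment formula with n - 3 in the denominator is inferred from the numbers in the table. It is an assumption, not something taken from pingouin's documentation, and other versions may adjust differently.

```python
# Values copied from the pg.corr output above
n = 30
r = 0.60149

# Coefficient of determination
r2 = r ** 2

# Adjusted r2; the (n - 3) denominator is an assumption inferred from the table
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 3)

print(round(r2, 5), round(adj_r2, 4))
```

Both values match the table to the printed precision (0.36179 and 0.3145).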
# Add an outlier
x[5] = 18
# Use Shepherd's pi correlation, which is robust to outliers
pg.corr(x, y, method="shepherd")
| | n | outliers | r | CI95% | r2 | adj_r2 | p-val | power |
|---|---|---|---|---|---|---|---|---|
| shepherd | 30 | 1 | 0.569458 | [0.26, 0.77] | 0.324283 | 0.274229 | 0.001263 | 0.926066 |
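To see why a robust method matters once an outlier is present, the sketch below compares Pearson's r with a rank-based correlation on a tiny deterministic data set. This is not Shepherd's pi itself: it swaps in the simpler Spearman rank correlation to show the general idea. The arrays `x_demo` and `y_demo` and both helper functions are hypothetical names introduced for the illustration.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient."""
    return np.corrcoef(a, b)[0, 1]

def spearman(a, b):
    """Spearman correlation: Pearson on ranks (valid when there are no ties)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return pearson(rank_a, rank_b)

# Perfectly correlated toy data
x_demo = np.arange(10, dtype=float)
y_demo = np.arange(10, dtype=float)
r_clean = pearson(x_demo, y_demo)

# Replace one point with an extreme outlier
x_demo[0] = 100.0
r_outlier = pearson(x_demo, y_demo)
rho_outlier = spearman(x_demo, y_demo)

print(r_clean, r_outlier, rho_outlier)
```

A single extreme value flips the sign of Pearson's r on this toy data, while the rank-based coefficient stays positive.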
pg.normality(x)
| | W | pval | normal |
|---|---|---|---|
| 0 | 0.970533 | 0.553863 | True |
pg.normality(y)
| | W | pval | normal |
|---|---|---|---|
| 0 | 0.985161 | 0.939893 | True |
pg.multivariate_normality(np.column_stack((x, y)))
(True, 0.6257634649268228)
import numpy as np
import pingouin as pg
np.random.seed(666)
x = np.random.normal(size=50)
ax = pg.qqplot(x, dist='norm')
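A Q-Q plot pairs the sorted sample values with the quantiles a reference distribution would produce at the same positions. The sketch below computes those pairs for a standard normal reference. The helper `qq_points` and the plotting positions (i - 0.5) / n are assumptions made for this illustration, and pingouin may use a different convention.

```python
import numpy as np
from statistics import NormalDist

def qq_points(sample):
    """Return (theoretical, ordered) quantile pairs for a normal Q-Q plot."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    n = ordered.size
    # Plotting positions (i - 0.5) / n: one common convention, assumed here
    probs = (np.arange(1, n + 1) - 0.5) / n
    std_normal = NormalDist()
    theoretical = np.array([std_normal.inv_cdf(p) for p in probs])
    return theoretical, ordered

rng = np.random.default_rng(666)
theo, obs = qq_points(rng.normal(size=50))

# For normally distributed data the points lie close to a straight line,
# so the correlation between the two axes is close to 1
qq_corr = np.corrcoef(theo, obs)[0, 1]
print(qq_corr)
```

Strong curvature in these points, and therefore a lower correlation, would suggest the sample departs from normality.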
# Load the dataset
df = pg.read_dataset('mixed_anova')
df.sample(10)
| | Scores | Time | Group | Subject |
|---|---|---|---|---|
| 142 | 6.502562 | January | Meditation | 52 |
| 55 | 5.355380 | January | Control | 25 |
| 70 | 4.714565 | June | Control | 10 |
| 167 | 6.586494 | June | Meditation | 47 |
| 169 | 7.388138 | June | Meditation | 49 |
| 107 | 5.031982 | August | Meditation | 47 |
| 135 | 4.837971 | January | Meditation | 45 |
| 163 | 5.483801 | June | Meditation | 43 |
| 37 | 5.177205 | January | Control | 7 |
| 4 | 4.779411 | August | Control | 4 |
# Run the one-way ANOVA
aov = pg.anova(data=df,
               dv='Scores',      # dependent variable
               between='Group',  # between-subject factor
               detailed=True)
aov
| | Source | SS | DF | MS | F | p-unc | np2 |
|---|---|---|---|---|---|---|---|
| 0 | Group | 5.459963 | 1 | 5.459963 | 5.243656 | 0.0232 | 0.028616 |
| 1 | Within | 185.342729 | 178 | 1.041251 | NaN | NaN | NaN |
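The F and np2 columns follow directly from the sums of squares and degrees of freedom. The sketch below computes a one-way ANOVA from raw groups with NumPy, then recomputes F and partial eta-squared from the SS and DF values in the table. The function `one_way_anova` is a hypothetical helper written for this illustration, not pingouin's code.

```python
import numpy as np

def one_way_anova(*groups):
    """Return (F, partial eta-squared) for a one-way between-subject ANOVA."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    # Variation of the group means around the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = all_values.size - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    np2 = ss_between / (ss_between + ss_within)
    return f_stat, np2

# Toy data
f_toy, np2_toy = one_way_anova([1, 2, 3], [2, 3, 4])

# Recompute F and np2 from the SS and DF columns of the table above
ss_group, ss_error = 5.459963, 185.342729
f_table = (ss_group / 1) / (ss_error / 178)
np2_table = ss_group / (ss_group + ss_error)
print(f_toy, np2_toy, f_table, np2_table)
```

The recomputed values agree with the table's F of 5.243656 and np2 of 0.028616.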
pg.rm_anova(data=df,
            dv='Scores',
            within='Time',
            subject='Subject',
            detailed=True)
| | Source | SS | DF | MS | F | p-unc | np2 | eps |
|---|---|---|---|---|---|---|---|---|
| 0 | Time | 7.628428 | 2 | 3.814214 | 3.912796 | 0.022629 | 0.062194 | 0.998751 |
| 1 | Error | 115.027023 | 118 | 0.974805 | NaN | NaN | NaN | NaN |
# Compute the two-way mixed ANOVA and print the summary table
aov = pg.mixed_anova(data=df,
                     dv='Scores',
                     between='Group',
                     within='Time',
                     subject='Subject',
                     correction=False,
                     effsize="np2")
pg.print_table(aov)
=============
ANOVA SUMMARY
=============

Source          SS    DF1    DF2     MS      F    p-unc    np2      eps
-----------  -----  -----  -----  -----  -----  -------  -----  -------
Group        5.460      1     58  5.460  5.052    0.028  0.080      nan
Time         7.628      2    116  3.814  4.027    0.020  0.065    0.999
Interaction  5.167      2    116  2.584  2.728    0.070  0.045      nan
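# Note: `data` is not defined in the snippets shown so far; it is assumed to be
# a pandas DataFrame with numeric columns 'X', 'Y' and 'Z'
pg.linear_regression(data[['X', 'Z']], data['Y'])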
| | names | coef | se | T | pval | r2 | adj_r2 | CI[2.5%] | CI[97.5%] |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Intercept | 2.916901 | 1.444715 | 2.019015 | 0.053516 | 0.26855 | 0.214368 | -0.047409 | 5.881210 |
| 1 | X | 0.610580 | 0.202261 | 3.018775 | 0.005487 | 0.26855 | 0.214368 | 0.195575 | 1.025584 |
| 2 | Z | -0.007227 | 0.192089 | -0.037624 | 0.970264 | 0.26855 | 0.214368 | -0.401361 | 0.386907 |
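The coefficients in a table like this come from an ordinary least-squares fit with an intercept column. The sketch below does the same fit with `np.linalg.lstsq` on synthetic, noise-free data, so the true coefficients are known in advance. The helper `ols` and the arrays `X_demo` and `y_demo` are hypothetical and unrelated to the article's `data`.

```python
import numpy as np

def ols(X, y):
    """Least-squares fit with intercept; returns (coefficients, r2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    centered = y - y.mean()
    r2 = 1 - (residuals @ residuals) / (centered @ centered)
    return coef, r2

# Synthetic data: y = 1 + 2*x1 - 3*x2, without noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(30, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]

coef, r2 = ols(X_demo, y_demo)
print(coef, r2)
```

In the table above, each T value is the coefficient divided by its standard error, for example 0.610580 / 0.202261 ≈ 3.0188 for X.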
pg.mediation_analysis(data=data, x='X', m='Z', y='Y', seed=42, n_boot=1000)
| | path | coef | se | pval | CI[2.5%] | CI[97.5%] | sig |
|---|---|---|---|---|---|---|---|
| 0 | Z ~ X | -0.287032 | 0.191454 | 0.145006 | -0.679207 | 0.105142 | No |
| 1 | Y ~ Z | -0.165299 | 0.209888 | 0.437572 | -0.595235 | 0.264637 | No |
| 2 | Total | 0.612654 | 0.191099 | 0.003354 | 0.221205 | 1.004103 | Yes |
| 3 | Direct | 0.610580 | 0.202261 | 0.005487 | 0.195575 | 1.025584 | Yes |
| 4 | Indirect | 0.002074 | 0.042262 | 0.976000 | -0.088619 | 0.092009 | No |
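The Indirect row can be checked by hand in two ways: as the product of the X → Z path and the Z coefficient from the multiple regression of Y on X and Z, or as the difference between the total and direct effects. The short check below uses only numbers copied from the two tables above. It verifies the arithmetic and does not reproduce the bootstrap that yields the confidence interval.

```python
# Coefficients copied from the regression and mediation tables above
a = -0.287032      # X -> Z path (row "Z ~ X")
b = -0.007227      # Z coefficient in the regression of Y on X and Z
total = 0.612654   # row "Total"
direct = 0.610580  # row "Direct"

# Two equivalent ways to obtain the indirect effect in a linear model
indirect_product = a * b
indirect_difference = total - direct

print(round(indirect_product, 6), round(indirect_difference, 6))
```

Both routes give 0.002074, the value in the Indirect row. Note that b is taken from the multiple regression table and not from the "Y ~ Z" row, whose coefficient (-0.165299) differs.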
Many Pingouin statistical functions can also be called directly as pandas.DataFrame methods, for example:
import pingouin as pg
# Example 1 | ANOVA
df = pg.read_dataset('mixed_anova')
df.anova(dv='Scores', between='Group', detailed=True)
# Example 2 | Pairwise correlations
data = pg.read_dataset('mediation')
data.pairwise_corr(columns=['X', 'M', 'Y'], covar=['Mbin'])
# Example 3 | Partial correlation matrix
data.pcorr()
The pingouin functions available as pandas.DataFrame methods are:
pingouin.anova()
pingouin.ancova()
pingouin.rm_anova()
pingouin.mixed_anova()
pingouin.welch_anova()
pingouin.pairwise_ttests()
pingouin.pairwise_tukey()
pingouin.pairwise_corr()
pingouin.partial_corr()
pingouin.pcorr()
pingouin.rcorr()
pingouin.mediation_analysis()