本文讲解了使用PyMC3进行基本的贝叶斯统计分析过程.
最近我们被客户要求撰写关于贝叶斯统计分析的研究报告,包括一些图形和统计输出。
# 导入
import pymc3 as pm # python的概率编程包
import numpy.random as npr # numpy是用来做科学计算的
import matplotlib.pyplot as plt # matplotlib是用来画图的
import matplotlib as mpl
"真实值是否等于X?"
或者说
"给定数据,对于感兴趣的参数,可能值的概率分布是多少?"
我把我的硬币抛了 n次,正面是 h次。 这枚硬币是有偏的吗?
# 产生所需要的数据
from random import shuffle
total = 30
n_heads = 11
n_tails = total - n_heads
tosses = [1] * n_heads + [0] * n_tails
shuffle(tosses)
fig = plot_coins()
plt.show()
100%|██████████| 2500/2500 [00:00<00:00, 3382.23it/s]
pm.traceplot(coin_trace)
plt.show()
In [10]:
plt.show()
我有一个新开发的X; X在阻止流感病毒复制方面有多好?
测试X的浓度范围, 测量流感活性
计算 IC50: 能够抑制病毒复制活性50%的X浓度.
import pandas as pd
chem_df = pd.DataFrame(chem_data)
chem_df.columns = ['concentration', 'activity']
chem_df['concentration_log'] = chem_df['concentration'].apply(lambda x:np.log10(x))
# df.set_index('concentration', inplace=True)
给定数据, 求出化学物质的IC50值是多少, 并且求出置信区间( 原文中the uncertainty surrounding it, 后面看类似置信区间的含义)?
In [13]:
fig = plot_chemical_data(log=True)
plt.show()
In [16]:
pm.traceplot(ic50_trace[2000:], varnames=['IC50_log10', 'IC50']) # 实时:从步骤2000开始的示例。
plt.show()
In [17]:
pm.plot_posterior(ic50_trace[4000:], varnames=['IC50'],
color='#87ceeb', point_estimate='mean')
plt.show()
该化学物质的 IC50 大约在[2 mM, 2.4 mM] (95% HPD). 这不是个好的药物候选者.
"实验组和对照组之间是否有差别? "
药品治疗是否影响(提高)IQ分数?
def ECDF(data):
x = np.sort(data)
y = np.cumsum(x) / np.sum(x)
return x, y
def plot_drug():
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
x_drug, y_drug = ECDF(drug)
ax.plot(x_drug, y_drug, label='drug, n={0}'.format(len(drug)))
x_placebo, y_placebo = ECDF(placebo)
ax.plot(x_placebo, y_placebo, label='placebo, n={0}'.format(len(placebo)))
ax.legend()
ax.set_xlabel('IQ Score')
ax.set_ylabel('Cumulative Frequency')
ax.hlines(0.5, ax.get_xlim()[0], ax.get_xlim()[1], linestyle='--')
return fig
Out[19]:
Ttest_indResult(statistic=2.2806701634329549, pvalue=0.025011500508647616)
治疗组
vs. 安慰剂组
以下为t分布的几个参数:
In [20]:
fig = plot_drug()
plt.show()
In [21]:
y_vals = np.concatenate([drug, placebo])
labels = ['drug'] * len(drug) + ['placebo'] * len(placebo)
data = pd.DataFrame([y_vals, labels]).T
data.columns = ['IQ', 'treatment']
In [24]:
pm.traceplot(kruschke_trace[2000:],
varnames=['mu_drug', 'mu_placebo'])
plt.show()
In [25]:
pm.plot_posterior(kruschke_trace[2000:], color='#87ceeb',
varnames=['mu_drug', 'mu_placebo', 'diff_means'])
plt.show()
注: IQ的差异在10以上才有点意义. p-value=0.02说明组间有差异, 但没说差异有多大..
In [27]:
ax = adjust_forestplot_for_slides(ax)
plt.show()
森林图:在同一轴上的95%HPD(细线),IQR(粗线)和后验分布的中位数(点),使我们能够直接比较治疗组和对照组。
In [29]:
ax = pm.plot_posterior(kruschke_trace[2000:],
varnames=['effect_size'],
color='#87ceeb')
overlay_effect_size(ax)
比较两种常用的消毒方法, 哪种消毒方法更好
Out[30]:
sample_id int32
treatment int32
colonies_pre int32
colonies_post int32
morphologies_pre int32
morphologies_post int32
year float32
month float32
day float32
perc_reduction morph float32
site int32
phone ID float32
no case float32
frac_change_colonies float64
dtype: object
In [32]:
fig = plot_colonies_data()
plt.show()
菌落计数符合泊松Poisson分布.
In [34]:
with poisson_estimation:
poisson_trace = pm.sample(200000)
Assigned Metropolis to pre_mus
Assigned Metropolis to post_mus
100%|██████████| 200500/200500 [01:15<00:00, 2671.98it/s]
In [35]:
pm.traceplot(poisson_trace[50000:], varnames=['pre_mus', 'post_mus'])
plt.show()
In [39]:
pm.forestplot(poisson_trace[50000:], varnames=['perc_change'],
ylabels=treatment_order) #, xrange=[0, 110])
plt.xlabel('Percentage Reduction')
ax = plt.gca()
ax = adjust_forestplot_for_slides(ax)