Multi-Logistic Regression with Probabilistic Programming
There is an interesting dichotomy in the world of data science between machine learning practitioners (increasingly synonymous with deep learning practitioners) and classical statisticians (both Frequentists and Bayesians). There is generally little overlap between the techniques used in these two camps. However, there are some interesting tools and libraries trying to bridge the gap, especially by using Bayesian inference techniques to estimate the uncertainty of deep learning models. See this post and this paper to learn more about the historical and recent trends in this exciting new area. The biggest benefit of adopting Bayesian thinking is that it forces us to explicitly lay out all the assumptions that go into the model. It is hard to perform Bayesian inference without being fully aware of all the modeling choices made along the way. The biggest downside to Bayesian inference is the time needed to run even moderately sized models.
There are several probabilistic programming languages/frameworks that are becoming more popular thanks to recent advances in computing hardware. The most common and mature language is Stan, which has APIs for other popular programming languages such as Python (PyStan) and R (RStan). There are also some newer players in the field, like PyMC3 (Theano), Pyro (PyTorch), and Turing (Julia). Of these, Turing, written in Julia, seems like a particularly interesting option. It brings with it all the advantages of Julia, and combining it with Flux can, in theory, make it "easy" to estimate the uncertainties of any deep learning model.
There are some amazing books to get you up and running with Bayesian data analysis, and the bible in the field is definitely the book by the great Andrew Gelman. He also writes short articles/opinions on his blog, which is worth following. I personally think the book "Statistical Rethinking" by Richard McElreath is the best introduction to the field for any newcomer. He walks you from the garden of forking paths all the way to multi-level models. He even has his entertaining and engaging lectures up on YouTube! No reason not to get your daily dose of Bayes.
In this blog post, I just wanted to get my feet wet with Julia and Turing. I will use both PyStan and Turing to build multi-category logistic models that predict the species of penguins from features like bill length, island, sex, etc. This is similar to the more popular Iris dataset that is used so commonly in data science tutorials. For more details on the Palmer penguins dataset, see here.
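The exact preprocessing lives in the repo linked below; the following is a minimal sketch, assuming seaborn's bundled copy of the dataset and a simple train/test split:

import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

# Load the Palmer penguins data (seaborn ships a copy) and drop incomplete rows
penguins = sns.load_dataset("penguins").dropna()

# Four continuous features plus one-hot encoded island and sex -> 9 columns total
# (scaling of the continuous columns is omitted here for brevity)
features = pd.get_dummies(
    penguins[["bill_length_mm", "bill_depth_mm", "flipper_length_mm",
              "body_mass_g", "island", "sex"]],
    columns=["island", "sex"],
).astype(float)

# Stan's categorical_logit expects integer classes starting at 1
species = penguins["species"].astype("category").cat.codes + 1

x_train, x_test, y_train, y_test = train_test_split(
    features, species, test_size=0.2, random_state=42)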
PyStan
First, let's use PyStan to build a multi-logit model. Code for the Stan model looks like this:
data {
  int N; //the number of training observations
  int N2; //the number of test observations
  int D; //the number of features
  int K; //the number of classes
  int y[N]; //the response
  matrix[N,D] x; //the model matrix
  matrix[N2,D] x_new; //the matrix for the predicted values
}
parameters {
  matrix[D,K] beta; //the regression parameters
}
model {
  matrix[N, K] x_beta = x * beta;
  to_vector(beta) ~ normal(0, 1);
  for (n in 1:N)
    y[n] ~ categorical_logit(x_beta[n]');
}
This closely follows the example in Stan's documentation. We place a standard normal prior on all parameters. In the case of our penguin dataset, we have a total of 9 features: four continuous ones, namely bill length, bill depth, flipper length, and body mass, and 5 one-hot encoded columns for the island and sex categorical values. Therefore, there are 9 parameters to estimate per category, and since we have 3 categories, that makes a total of 27 parameters. For each category, the weighted sum of the coefficients and the feature values is calculated:
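Written out in index notation (a reconstruction of the x * beta line in the model block above, with beta the D × K coefficient matrix):

$$\eta_{n,k} = \sum_{d=1}^{D} x_{n,d}\,\beta_{d,k}, \qquad k = 1, \dots, K$$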
The probability of each category, and hence the final predicted class for each data point, is computed using softmax:
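Explicitly, the softmax turns the K linear predictors into class probabilities:

$$P(y_n = k \mid x_n) = \frac{\exp(\eta_{n,k})}{\sum_{j=1}^{K} \exp(\eta_{n,j})}$$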
We could also have fixed the parameters for one category at zero and estimated only the remaining 9 × 2 parameters. This is the same idea as in binary classification models, where only one set of coefficients is present:
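For example, with two classes and the first category's parameters pinned at zero, the softmax collapses to the familiar logistic form:

$$P(y_n = 2 \mid x_n) = \frac{\exp(\eta_{n,2})}{\exp(0) + \exp(\eta_{n,2})} = \frac{1}{1 + \exp(-\eta_{n,2})}$$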
I will show what that looks like when we get to the Julia code using the Turing library.
Now that we have the model ready, let's go ahead and perform sampling to get the posteriors for all the parameters:
These are the parameters for Sampling:
Algorithm: No-U-Turn Sampler (NUTS)
Warmup: 500 iterations
Samples: 500 iterations
Chains: 4
Max Tree Depth: 10
Time elapsed per chain: ~140 seconds
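A minimal PyStan sketch of this run (the variable stan_code and the exact data dict layout are assumptions here; x_train, y_train, and x_test come from the preprocessing sketch above, and the actual script is in the repo linked below):

import pystan

# Compile the Stan program shown above (stan_code holds it as a string)
model = pystan.StanModel(model_code=stan_code)

# Stan is 1-indexed, so y holds the classes 1..3
data = {"N": x_train.shape[0], "N2": x_test.shape[0], "D": x_train.shape[1],
        "K": 3, "y": y_train.values, "x": x_train.values, "x_new": x_test.values}

# 500 warmup + 500 sampling iterations per chain, 4 chains, max tree depth 10
fit = model.sampling(data=data, iter=1000, warmup=500, chains=4,
                     control={"max_treedepth": 10})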
The chains show poor mixing and stability, and the recommendation from Stan is to increase the max tree depth for the NUTS sampler to get better stability within and across chains.
The poor stability of the chains is also reflected in the number of effective samples (n_eff), which is quite low for some parameters. The Rhat is significantly above the recommended value of 1.05 for most parameters.
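These diagnostics can be pulled directly out of the fit object; for example, a quick sketch using PyStan 2's summary dict:

import pandas as pd

# Collect n_eff and Rhat for every parameter into a DataFrame
s = fit.summary()
summary_df = pd.DataFrame(s["summary"], columns=s["summary_colnames"],
                          index=s["summary_rownames"])
print(summary_df[["n_eff", "Rhat"]])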
In general though, this is not an issue in most cases, and the samples are usable, as shown below, for predicting the training and test set classes.
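Here is one way those predictions can be computed from the posterior draws of beta, averaging the softmax probabilities over draws (the helper below is my own sketch, not part of the original project):

import numpy as np
from scipy.special import softmax

def predict_classes(fit, x):
    # beta has shape (n_draws, D, K); average the class probabilities over draws
    beta = fit.extract()["beta"]
    probs = softmax(np.einsum("nd,sdk->snk", x, beta), axis=-1).mean(axis=0)
    return probs.argmax(axis=1) + 1  # back to Stan's 1-based classes

train_pred = predict_classes(fit, x_train.values)
test_pred = predict_classes(fit, x_test.values)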
Now, let's increase the maximum tree depth for the NUTS sampler from 10 to 12. This increases the time taken for each chain to converge.
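The only change is in the sampler's control dict (same sketch as above):

# Re-run the same model with a larger max tree depth
fit = model.sampling(data=data, iter=1000, warmup=500, chains=4,
                     control={"max_treedepth": 12})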
Max Tree Depth: 12
Time elapsed per chain: ~570 seconds
The chains show much better mixing and stability, and we could still go higher with the max tree depth for the NUTS sampler to get even better stability within and across chains.
As we can see, the number of effective samples (n_eff) has also increased considerably for some parameters, and the Rhat is approaching the recommended value of 1.05 for some parameters. These samples, as expected, provide good classification predictions.
Increasing the max tree depth further to 15 significantly improves the chain stability (data not shown) but also increases the computational time ~25 fold.
The code for running the above models is here. For the full project that includes setup for AWS, SageMaker, and XGBoost models, refer to my earlier blog post and GitHub repo.
Julia
Now, I will show you the equivalent model using Julia and Turing. The code can be found here in the main project repo. The model is defined like so:
# Assumes the Turing and Distributions packages are loaded; softmax is taken from
# StatsFuns here (the project's actual imports are in the linked repo)
using Turing, Distributions
using StatsFuns: softmax

@model logistic_regression(x, y, n, σ) = begin
    # One intercept and one coefficient per feature for each of the three species
    intercept_Adelie ~ Normal(0, σ)
    intercept_Gentoo ~ Normal(0, σ)
    intercept_Chinstrap ~ Normal(0, σ)
    bill_length_mm_Adelie ~ Normal(0, σ)
    bill_length_mm_Gentoo ~ Normal(0, σ)
    bill_length_mm_Chinstrap ~ Normal(0, σ)
    bill_depth_mm_Adelie ~ Normal(0, σ)
    bill_depth_mm_Gentoo ~ Normal(0, σ)
    bill_depth_mm_Chinstrap ~ Normal(0, σ)
    flipper_length_mm_Adelie ~ Normal(0, σ)
    flipper_length_mm_Gentoo ~ Normal(0, σ)
    flipper_length_mm_Chinstrap ~ Normal(0, σ)
    body_mass_g_Adelie ~ Normal(0, σ)
    body_mass_g_Gentoo ~ Normal(0, σ)
    body_mass_g_Chinstrap ~ Normal(0, σ)
    island_Biscoe_Adelie ~ Normal(0, σ)
    island_Biscoe_Gentoo ~ Normal(0, σ)
    island_Biscoe_Chinstrap ~ Normal(0, σ)
    island_Dream_Adelie ~ Normal(0, σ)
    island_Dream_Gentoo ~ Normal(0, σ)
    island_Dream_Chinstrap ~ Normal(0, σ)
    island_Torgersen_Adelie ~ Normal(0, σ)
    island_Torgersen_Gentoo ~ Normal(0, σ)
    island_Torgersen_Chinstrap ~ Normal(0, σ)
    sex_female_Adelie ~ Normal(0, σ)
    sex_female_Gentoo ~ Normal(0, σ)
    sex_female_Chinstrap ~ Normal(0, σ)
    sex_male_Adelie ~ Normal(0, σ)
    sex_male_Gentoo ~ Normal(0, σ)
    sex_male_Chinstrap ~ Normal(0, σ)

    for i = 1:n
        # Linear predictor for each species, squashed through softmax
        v = softmax([intercept_Adelie +
                     bill_length_mm_Adelie*x[i, 1] +
                     bill_depth_mm_Adelie*x[i, 2] +
                     flipper_length_mm_Adelie*x[i, 3] +
                     body_mass_g_Adelie*x[i, 4] +
                     island_Biscoe_Adelie*x[i, 5] +
                     island_Dream_Adelie*x[i, 6] +
                     island_Torgersen_Adelie*x[i, 7] +
                     sex_female_Adelie*x[i, 8] +
                     sex_male_Adelie*x[i, 9],
                     intercept_Gentoo +
                     bill_length_mm_Gentoo*x[i, 1] +
                     bill_depth_mm_Gentoo*x[i, 2] +
                     flipper_length_mm_Gentoo*x[i, 3] +
                     body_mass_g_Gentoo*x[i, 4] +
                     island_Biscoe_Gentoo*x[i, 5] +
                     island_Dream_Gentoo*x[i, 6] +
                     island_Torgersen_Gentoo*x[i, 7] +
                     sex_female_Gentoo*x[i, 8] +
                     sex_male_Gentoo*x[i, 9],
                     intercept_Chinstrap +
                     bill_length_mm_Chinstrap*x[i, 1] +
                     bill_depth_mm_Chinstrap*x[i, 2] +
                     flipper_length_mm_Chinstrap*x[i, 3] +
                     body_mass_g_Chinstrap*x[i, 4] +
                     island_Biscoe_Chinstrap*x[i, 5] +
                     island_Dream_Chinstrap*x[i, 6] +
                     island_Torgersen_Chinstrap*x[i, 7] +
                     sex_female_Chinstrap*x[i, 8] +
                     sex_male_Chinstrap*x[i, 9]])
        # One-hot encoded response: each row of y is a single draw from a multinomial
        y[i, :] ~ Multinomial(1, v)
    end
end;
I used the default HMC sampler as recommended in the Turing tutorial. One thing I noticed is the much better stability of the chains when using the HMC sampler from Turing:
And the summary of the samples:
Overall, the HMC samples from Turing seem to do a lot better than the NUTS samples from PyStan. Of course, this is not an apples-to-apples comparison, but these are interesting results. In addition, the HMC sampler was also much faster than the max_tree_depth=12 PyStan run shown above. This is something to dig into more.
The predictions from Turing are perfect on both the training and test sets, as expected, since this is an easy prediction problem.
In conclusion, I like Julia and Turing so far! Another great (and fast) tool for Probabilistic Programming!
Some good things:
- Turing is fast! (at least in this example with the default samplers)
- Julia and Turing use 1-based indexing, which lines up with Stan's 1-based indexing; Python's 0-based indexing makes that coordination harder
- Symbolic math ability with Turing and Julia
Some disadvantages compared to PyStan:
- Not enough libraries to make pre-processing easy
- Stan has a more parsimonious model declaration syntax compared to Turing (probably just my ignorance of Turing)
- No straightforward way to combine with Python (PyJulia is an option worth exploring)
Originally published at: https://medium.com/swlh/multi-logistic-regression-with-probabilistic-programming-db9a24467c0d