title: The Category-theoretic Perspective of Statistical Learning for Amateurs
author: Congwei Song
description: A presentation at BIMSA
Abstract Statistical learning is a fascinating field that has long been the mainstream of machine learning/artificial intelligence. A large number of results have been produced that can be widely applied to real-world problems. It has also led to many research topics and continues to stimulate new research. This report summarizes some classic statistical learning models and well-known algorithms, especially for amateurs, and provides a category-theoretic perspective on understanding statistical learning models. The goal is to attract researchers from other fields, including basic mathematics, to participate in research related to statistical learning.
Keywords Statistical Learning, Statistics, Category Theory, Variational Models, Neural Networks, Deep Learning
Abbreviations
distr.: distribution(s)
var.: variable(s)
cat.: category(ies)
rv: random variable(s)
Notations
Definition (Probability Model)
A probability model is a probability measure space, denoted by $(\Omega, \mathcal{A}, P)$; or $(\mathcal{X}, P(X))$, as its pushforward by the rv $X$, where $\mathcal{X}$ is the sample space of $X$.
$X\sim P$: the distr. of $X$ is $P$, or draw $X$ from $P$.
Definition (Statistical Model)
A statistical model is a family/set of probability models, denoted by $(\Omega, \mathcal{A}, \{P_\lambda\})$ (with a common ambient space) or $(\mathcal{X}, \{P_\lambda\})$ (with a common sample space, which is the range of a target rv $X$), written $P(X)$ for short.
Parameterized version: $(\mathcal{X}, \{P(X|\theta)\}, \theta\in\Theta)$, where $\Theta$ is the parameter space, denoted by $P(X|\theta)$ for short.
Example
$N(\mu,\sigma^2)$, $\mathrm{Cat}(p)$
Definition (Bayesian Model)
The Bayesian model is a statistical model with a prior distr. on the parameters, written as
$(M_\theta, p(\theta))$, where $M_\theta$ is a given statistical model.
Definition (Bayesian Hierarchical Model)
$$(M_\theta,\ p(\theta|\alpha),\ p(\alpha))$$
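For instance (a standard textbook example, not taken from this report), a Beta-Bernoulli hierarchical model takes $M_\theta = \mathrm{Bernoulli}(\theta)$ with $p(\theta|\alpha)=\mathrm{Beta}(\theta;\alpha,\alpha)$ and a hyperprior $p(\alpha)$, e.g. $\alpha\sim\mathrm{Exp}(1)$.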
$\mathcal{Stat}$ can be regarded as a sub-cat. of the cat. of Bayesian models, by equipping each statistical model with the flat prior. The Bayesian model gives the joint $P(x,\theta)$. Therefore the category of Bayesian models is a sub-cat. of $\mathcal{Prob}$.
Definition (Statistical model with an estimator)
Model with estimator: $(M_\theta, \hat\theta(X))$, where $\hat\theta: \mathcal{X}^N\to \Theta$ and $X$ is a sample of size $N$.
In most cases, we use the MLE; the estimator is implied by the model.
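As a minimal sketch (the Gaussian family $N(\mu,\sigma^2)$ and the numpy code below are illustrative assumptions, not prescriptions of this report), a statistical model with its MLE estimator $(M_\theta, \hat\theta(X))$ can be written as:

```python
# A minimal sketch of (M_theta, theta_hat(X)): M_theta is the Gaussian family
# N(mu, sigma^2), and theta_hat is the MLE computed from a sample of size N.
# (The Gaussian choice is an illustrative assumption, not from the report.)
import numpy as np

def theta_hat(x: np.ndarray) -> tuple[float, float]:
    """MLE of (mu, sigma^2) from a sample x of size N."""
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()   # the MLE divides by N, not N - 1
    return mu, sigma2

# usage: draw a sample from N(1, 2^2) and estimate the parameters
x = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=1000)
print(theta_hat(x))
```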
Supervised learning (discriminative form):
$(\mathcal{X},\mathcal{Y},\ P(Y|X))$, where $X$ is the input (the conditioning var.) and $Y$ is the output.
Supervised learning based on a sample $X=\{x_i\}$ is identified with the statistical model:
$$(\mathcal{Y}^N,\ P(Y|X)=\prod_i P(y_i|x_i)),$$
where the sample $X$ is fixed, called the design var. (the design matrix, if it forms a matrix), and $Y$ is a single sample point (a sample of size 1).
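For example (a standard instance, not taken from this report), linear regression with Gaussian noise is the conditional model $P(Y|X)=\prod_i N(y_i;\ w^\top x_i,\ \sigma^2)$, whose MLE for $w$ is exactly least squares on the fixed design matrix $X$.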
I claim: statistical learning == conditionalized statistics (model)
Facts in statistical learning are also facts in statistics
Bias-Variance Decomposition in statistics: Error = Bias² + Variance (+ noise); in statistical learning: the same decomposition holds conditioned on the input variable.
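Concretely, for squared error (a standard identity written out for reference; the notation $f$, $\hat f$, $\sigma^2$ is introduced here and not taken from the report):

$$
\mathbb{E}\big[(y-\hat f(x))^2 \mid X=x\big]
= \big(\mathbb{E}[\hat f(x)] - f(x)\big)^2 + \mathrm{Var}\big(\hat f(x)\big) + \sigma^2,
$$

where the expectation is over the training sample and the noise, $f$ is the true regression function, $\hat f$ is the learned predictor, and $\sigma^2$ is the irreducible noise variance.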
A learner is an estimator for a statistical model:
$$(M_\theta,\ \hat{\theta}(X)),$$
where $M_\theta$ is a statistical model.
One can define a latent model as $(P(X,Z), \hat{\theta}(X))$ for unsupervised learning; $(P(X,Y), \hat{\theta}(X,Y))$ for supervised learning.
The classical models all form categories, and we have a diagram relating them. I’d like to call it “the beginners’ magic cube”, since it looks like a cube and beginners in SL should learn these models first.
Another way to describe the statistical (learning) model.
Methods as Functors:
Neural Models: models equipped with neural networks, i.e., neural networks applied within statistical models.
Take VAE as an example
$$
P(X)\sim N(\mu,\sigma^2) \to P(X|Z)\sim N(f(z),\sigma^2),\ P(Z)\sim N(0,1)\\
\to (P(X,Z),\ Q(Z|X=x)\sim N(g(x),h(x)))\\
\to (P(X,Z),\ Q(Z|X=x) = g(x)+\xi h(x)),\ \xi\sim N(0,1)
$$
Writing it in the style of a composition of functors (informally):
$VAE(f,g,h) = \mathrm{Rep}\circ \mathrm{Var}\circ\mathrm{LVM}(P(X))$, regarding the functions $f,g,h$ as parameters.
VAE is implemented by the following NN (with a regularizing term):
$$y \sim f(g(x)+h(x)\xi),$$
through self-supervised learning with data $\{(x_i,x_i)\}$, where $f,g,h$ are all neural layers and $\xi\sim N(0,1)$ is the perturbation variable of the hidden layer $g$. When $\xi\to 0$, $Q$ degenerates, and the VAE degenerates to an ordinary NN $f(g(x))$.
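A minimal sketch of this construction in PyTorch follows; the layer shapes, the log-variance parameterization of $h$, and the squared-error reconstruction term are illustrative assumptions, not prescriptions from this report.

```python
# Minimal VAE sketch: x -> (g(x), h(x)) -> z = g(x) + h(x)*xi -> f(z) ~ x
# (Single linear layers and the KL regularizer below are illustrative assumptions.)
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, dim_x=784, dim_z=16):
        super().__init__()
        self.g = nn.Linear(dim_x, dim_z)   # encoder mean g(x)
        self.h = nn.Linear(dim_x, dim_z)   # encoder log-variance, so h(x) = exp(0.5 * log_var)
        self.f = nn.Linear(dim_z, dim_x)   # decoder f(z)

    def forward(self, x):
        mu, log_var = self.g(x), self.h(x)
        xi = torch.randn_like(mu)               # xi ~ N(0, 1)
        z = mu + torch.exp(0.5 * log_var) * xi  # reparameterization trick
        return self.f(z), mu, log_var

def vae_loss(x_hat, x, mu, log_var):
    # reconstruction error + KL(Q(Z|X) || N(0, I)) as the regularizing term
    rec = ((x_hat - x) ** 2).sum()
    kl = -0.5 * (1 + log_var - mu ** 2 - log_var.exp()).sum()
    return rec + kl

# self-supervised usage on pairs (x_i, x_i): reconstruct the input itself
model = VAE()
x = torch.randn(32, 784)
x_hat, mu, log_var = model(x)
loss = vae_loss(x_hat, x, mu, log_var)
loss.backward()
```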
Take RNN as an example
$$
P(Y)\to \cdots \to P^*(Y,Z|X)
\to y_t=\mathrm{Net}(x_t,z_{t-1}),\ z_{t}=\mathrm{Net}(x_t,z_{t-1})
$$
Say $RNN(w) = \mathrm{NN}\circ \mathrm{TS}\circ \mathrm{Condi}\circ \mathrm{LVM}(P)$.
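For concreteness, here is a minimal sketch of the recurrence $y_t=\mathrm{Net}(x_t,z_{t-1})$, $z_t=\mathrm{Net}(x_t,z_{t-1})$ in PyTorch (the linear layers, the tanh nonlinearity, and the zero initial state are illustrative assumptions, not taken from the report):

```python
# Minimal RNN sketch implementing y_t = Net(x_t, z_{t-1}), z_t = Net(x_t, z_{t-1}).
# (Layer widths, tanh, and the zero initial state are illustrative assumptions.)
import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, dim_x=8, dim_z=16, dim_y=4):
        super().__init__()
        self.net_z = nn.Linear(dim_x + dim_z, dim_z)   # hidden update z_t
        self.net_y = nn.Linear(dim_x + dim_z, dim_y)   # output y_t

    def forward(self, xs):
        # xs: a sequence of shape (T, dim_x); the hidden state starts at zero
        z = xs.new_zeros(self.net_z.out_features)
        ys = []
        for x_t in xs:
            joint = torch.cat([x_t, z])
            ys.append(self.net_y(joint))
            z = torch.tanh(self.net_z(joint))
        return torch.stack(ys), z

# usage on a toy sequence of length T = 10
ys, z_T = SimpleRNN()(torch.randn(10, 8))
```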
Homework
Possible definition of Transfer Learning
Inspired by BiLSTM/BiLM: Tied Model
$(P(X|\theta_1,\theta_0),\ P(X|\theta_2,\theta_0))$ with the same sample.
Tied likelihood: $P(X|\theta_1,\theta_0)\,P(X|\theta_2,\theta_0)$
(a sort of pseudo-likelihood; a product of experts without the normalizing coefficient)
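Estimation then maximizes the tied log-likelihood (a direct restatement of the product above):

$$
(\hat\theta_0,\hat\theta_1,\hat\theta_2)
= \arg\max_{\theta_0,\theta_1,\theta_2}\ \big[\log P(X\mid\theta_1,\theta_0) + \log P(X\mid\theta_2,\theta_0)\big].
$$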
Link: https://pan.baidu.com/s/1GdPiVGG3GIKVS4nWqlBm-w?pwd=1111 Access code: 1111