Natural Language Processing Review Outline

Rule-Based Methods

  • Rules are separated from the program; the program interprets language according to the rules.
  • Morphemes
    • English morphological analysis (restoring base forms)
    • Chinese word segmentation (tokenization / segmentation)
      • Maximum matching (forward / backward / bidirectional disambiguation; see the sketch after this list)
      • Max-min matching (for discovering ambiguities)
      • Full segmentation / most probable segmentation
  • Part-of-speech tagging
    • Rule-based approach (lexicon + rules + disambiguation)
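
A minimal sketch of forward maximum matching may make the greedy longest-match idea concrete; the function name `forward_max_match`, the toy lexicon, and the example sentence are illustrative, not part of the original notes.

```python
# Forward maximum matching (FMM): greedily take the longest dictionary word
# starting at the current position, falling back to a single character.
def forward_max_match(text, lexicon, max_len=4):
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

lexicon = {"研究", "研究生", "生命", "起源"}            # toy dictionary
print(forward_max_match("研究生命起源", lexicon))        # ['研究生', '命', '起源']
```

Backward maximum matching on the same sentence gives ['研究', '生命', '起源']; the disagreement between the two directions is exactly what bidirectional disambiguation is meant to resolve.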

Language Models

High-Dimensional Sparsity

  • Zipf's Law
    • frequency × rank ≈ constant
  • Feature selection
    • Mutual information
  • Smoothing zero probabilities (see the sketch after this list)
    • Additive (add-$d$) counts: $\hat{p}(w) = \frac{c(w) + d}{Z + |\Sigma| d}$
    • Laplace smoothing with back-off: $\hat{p}(w_t, w_{t-1}) = \frac{c(w_t, w_{t-1}) + d\, \hat{p}(w_t)}{Z + d}$
    • Linear interpolation: $\hat{p}(w_t \mid w_{t-2}, w_{t-1}) = \lambda_2\, p(w_t \mid w_{t-2}, w_{t-1}) + \lambda_1\, p(w_t \mid w_{t-1}) + \lambda_0\, p(w_t)$
  • Evaluation metric
    • Perplexity: $p(x_{1:T})^{-\frac{1}{T}} = \sqrt[T]{\frac{1}{p(x_{1:T})}}$
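
As a rough illustration of the formulas above, the following sketch applies add-$d$ smoothing to a unigram model and computes the perplexity of a held-out string; the toy corpus, the choice $d = 0.5$, and treating unseen test words as part of $|\Sigma|$ are assumptions made only for this example.

```python
import math
from collections import Counter

# Add-d (additive) unigram smoothing and perplexity, following the formulas above.
train = "the cat sat on the mat".split()
test  = "the dog sat".split()

counts = Counter(train)
vocab  = set(train) | set(test)          # assumption: unseen test words are part of |Sigma|
Z, d   = len(train), 0.5                 # Z = total count, d = smoothing constant

def p_hat(w):
    # p̂(w) = (c(w) + d) / (Z + |Sigma| * d)
    return (counts[w] + d) / (Z + len(vocab) * d)

# Perplexity = p(x_{1:T}) ** (-1/T)
log_p = sum(math.log(p_hat(w)) for w in test)
ppl   = math.exp(-log_p / len(test))
print(f"perplexity = {ppl:.2f}")
```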

Generative Models

  • Naive Bayes
    • $\arg\max\limits_y p(y \mid x) = \arg\max\limits_y \frac{p(x \mid y)\, p(y)}{p(x)} = \arg\max\limits_y p(x \mid y)\, p(y)$
    • $p(x \mid y) = \frac{\left(\sum_{\sigma \in \Sigma} x_\sigma\right)!}{\prod_{\sigma \in \Sigma} x_\sigma!} \prod\limits_{\sigma \in \Sigma} p(\sigma \mid y)^{x_\sigma}$   (the multinomial coefficient $\binom{\sum_{\sigma} x_\sigma}{\{x_\sigma\}_{\sigma \in \Sigma}} = \frac{(\sum_{\sigma} x_\sigma)!}{\prod_{\sigma} x_\sigma!}$ removes the dependence on word order)
    • $p(y) = \frac{M_y}{\sum_{y \in Y} M_y}$
    • $p(\sigma \mid y) = \frac{N_{y\sigma}}{\sum_{\sigma \in \Sigma} N_{y\sigma}}$
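
A minimal multinomial Naive Bayes sketch under the counts above, where $M_y$ is taken to be the number of training documents with label $y$ and $N_{y\sigma}$ the count of word $\sigma$ in those documents; the toy corpus and the additive smoothing constant $d$ are illustrative.

```python
import math
from collections import Counter, defaultdict

# Multinomial Naive Bayes with class counts M_y and per-class word counts N_{y,sigma}.
docs = [("spam", "buy cheap pills now"), ("spam", "cheap pills cheap"),
        ("ham",  "meeting at noon"),      ("ham",  "lunch meeting today")]

M = Counter(y for y, _ in docs)                        # class document counts
N = defaultdict(Counter)                               # word counts per class
for y, text in docs:
    N[y].update(text.split())
vocab = {w for _, text in docs for w in text.split()}
d = 1.0                                                # additive smoothing constant

def predict(text):
    words = text.split()
    best_y, best_score = None, -math.inf
    for y in M:
        # log p(y) + sum_w x_w * log p(w|y); the multinomial coefficient is
        # constant in y, so it drops out of the argmax.
        score = math.log(M[y] / sum(M.values()))
        denom = sum(N[y].values()) + d * len(vocab)
        score += sum(math.log((N[y][w] + d) / denom) for w in words)
        if score > best_score:
            best_y, best_score = y, score
    return best_y

print(predict("cheap pills"))   # expected: 'spam'
```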

Markov Models

  • n-th order Markov property / n-gram
    • $p(x_{1:T}) = \prod\limits_{t=1}^{T} p(x_t \mid x_{t-n:t-1})$
    • Estimate the probabilities by relative frequency (see the sketch below).
    • Alternatively, parameterize the probabilities and fit the parameters by maximum likelihood.
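
The frequency estimate can be illustrated with a first-order (bigram) model; the toy corpus and sentence markers below are illustrative.

```python
from collections import Counter

# Relative-frequency (MLE) estimate of bigram probabilities:
# p(x_t | x_{t-1}) = c(x_{t-1}, x_t) / c(x_{t-1}).
corpus = "<s> the cat sat </s> <s> the cat ran </s>".split()

unigram = Counter(corpus)
bigram  = Counter(zip(corpus, corpus[1:]))

def p(w, prev):
    return bigram[(prev, w)] / unigram[prev]

print(p("cat", "the"))   # 2/2 = 1.0
print(p("sat", "cat"))   # 1/2 = 0.5; unseen bigrams get 0, which motivates the smoothing above
```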

Neural Networks

  • Bengio 2003 (feed-forward neural language model)
    • $z = C\, x_{1:T}$   (concatenation of the context words' embeddings from the lookup table $C$; a sketch follows this list)
    • $y = b + Wz + U\tanh(d + Hz)$
    • $\ell = \log y + \|\theta\|$   (log-likelihood of the target word plus a regularization term)
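
A rough numpy sketch of this forward pass; all dimensions and parameter names are chosen for illustration, and it follows the equations above rather than the exact setup of the paper.

```python
import numpy as np

# Bengio (2003)-style forward pass: z = concat of embeddings C[x],
# y = b + W z + U tanh(d + H z), softmax over the vocabulary.
V, m, n, h = 100, 16, 3, 32        # vocab size, embedding dim, context length, hidden dim
rng = np.random.default_rng(0)

C = rng.normal(size=(V, m))        # embedding lookup table
H = rng.normal(size=(h, n * m)); d = np.zeros(h)
U = rng.normal(size=(V, h));     b = np.zeros(V)
W = rng.normal(size=(V, n * m))    # direct (skip) connection from z to the output

def forward(context):              # context: n previous word ids
    z = C[context].reshape(-1)                      # (n*m,) concatenated embeddings
    y = b + W @ z + U @ np.tanh(d + H @ z)          # (V,) unnormalized scores
    p = np.exp(y - y.max())
    return p / p.sum()                              # softmax

probs = forward([3, 17, 42])
print(probs.shape, probs.sum())    # (100,) ~1.0
```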

Text Classification

Text Representation

  • BOW (bag of words)
    • $\Sigma^* \to \mathbb{R}^{|\Sigma|}$
    • Bernoulli
    • Multinomial
    • tfidf (see the sketch after this list)
      • tf = number of occurrences of the term in the document / number of occurrences of the most frequent term in the document
      • df = number of documents containing the term / total number of documents
      • tfidf = tf · log(1 / df)   (the log dampens df and emphasizes tf)
  • n-gram BOW
  • Latent Semantic Indexing
    • $X = \begin{bmatrix} | & & | \\ x_1 & \cdots & x_m \\ | & & | \end{bmatrix} = U \Sigma V^T$
    • $x' = \Sigma^{-1} U^T x$   (folding a new document vector into the latent space)
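
A sketch combining the tfidf weighting defined above with the LSI projection $x' = \Sigma^{-1} U^T x$; the toy corpus, the rank $k$, and the helper names are illustrative.

```python
import numpy as np
from collections import Counter

# tfidf as defined above (tf normalized by the most frequent term, idf = log(1/df)),
# then an LSI projection via truncated SVD.
docs = ["cat sat on the mat", "dog sat on the log", "cats and dogs"]
vocab = sorted({w for doc in docs for w in doc.split()})
df = np.array([sum(w in doc.split() for doc in docs) for w in vocab]) / len(docs)

def tfidf(doc):
    c = Counter(doc.split())
    tf = np.array([c[w] for w in vocab], dtype=float) / max(c.values())
    return tf * np.log(1.0 / df)

X = np.stack([tfidf(doc) for doc in docs], axis=1)     # columns are documents
U, S, Vt = np.linalg.svd(X, full_matrices=False)

def project(x, k=2):
    # x' = Sigma^{-1} U^T x, keeping only the top-k latent dimensions
    return np.diag(1.0 / S[:k]) @ U[:, :k].T @ x

print(project(tfidf("a dog on a log")).round(3))       # query folded into LSI space
```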

Linear Discriminative Models

  • Feature engineering: $\psi(x, y)$
  • Learn the weights: $p(x, y) = \frac{1}{Z} \exp(w^T \psi(x, y))$

Sequence Labeling (a simple form of sequence transduction)

Example Tasks

  • Part-of-speech tagging (POS Tagging)
  • Named entity recognition (NER)

HMM Decoding (Viterbi Algorithm)

  • $\alpha_t(n) = \Pr(o_{1:t}, q_t = n \mid \theta)$   (forward variable)
  • $\beta_t(n) = \Pr(o_{t+1:T} \mid q_t = n, \theta)$   (backward variable)
  • $\gamma_t(n) = \frac{1}{Z} \alpha_t(n)\, \beta_t(n)$
  • $\delta_t(n) = \max\limits_{q_{1:t-1}} \Pr(o_{1:t}, q_{1:t-1}, q_t = n \mid \theta)$   (maximum score over all state paths ending in state $n$ at time $t$; a sketch follows this list)
    1. $\delta_1(n) = \pi(n)\, E(n \mapsto o_1)$
    2. $\psi_1(n) = \varnothing$
    3. $\delta_{t+1}(n) = \max\limits_{q_t} \delta_t(q_t)\, T(q_t \to n)\, E(n \mapsto o_{t+1})$
    4. $\psi_{t+1}(n) = \arg\max\limits_{q_t} \delta_t(q_t)\, T(q_t \to n)$   (the emission term $E(n \mapsto o_{t+1})$ does not depend on $q_t$, so it drops out of the argmax)
    5. $q^*_T = \arg\max\limits_{q_T} \delta_T(q_T)$
    6. $q^*_{t-1} = \psi_t(q^*_t)$   (backtrace)
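
A compact sketch of the $\delta/\psi$ recursion above on a toy two-state HMM; the transition, emission, and initial probabilities are made up for illustration.

```python
import numpy as np

# Viterbi decoding following the delta / psi recursion above.
pi = np.array([0.6, 0.4])                    # initial distribution pi(n)
T_ = np.array([[0.7, 0.3],                   # transition T(q_t -> q_{t+1})
               [0.4, 0.6]])
E  = np.array([[0.9, 0.1],                   # emission E(n -> o)
               [0.2, 0.8]])
obs = [0, 1, 1]

delta = pi * E[:, obs[0]]                    # delta_1(n) = pi(n) E(n -> o_1)
psi = []
for o in obs[1:]:
    scores = delta[:, None] * T_             # scores[q_t, n] = delta_t(q_t) T(q_t -> n)
    psi.append(scores.argmax(axis=0))        # psi_{t+1}(n): best predecessor of state n
    delta = scores.max(axis=0) * E[:, o]     # delta_{t+1}(n)

path = [int(delta.argmax())]                 # q*_T
for back in reversed(psi):                   # backtrace q*_{t-1} = psi_t(q*_t)
    path.append(int(back[path[-1]]))
print(path[::-1])                            # most probable state sequence
```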

Conditional Random Fields (linear discriminative)

  • $\Pr(y \mid x) = \frac{1}{Z} \exp(w^T \psi(x, y))$
    $\psi_k(x, y) = \sum\limits_{t} \psi_k(y_{t-1}, y_t, x, t)$   (features are summed over positions $t$)
  • Parameters are trained by maximizing the conditional log-likelihood.

Neural Networks

  • RNN, LSTM, GRU.
  • BERT.   (Attention, Transformer.)

Syntactic Parsing

  • Chomsky normal form
    $X \to YZ$, where $X$, $Y$, $Z$ are all nonterminal symbols.
    $X \to \sigma$, where $X$ is a nonterminal and $\sigma$ is a terminal symbol.
  • CYK algorithm
    Cells on the diagonal hold the nonterminals derived directly from the terminal symbols.
    Cells one step off the diagonal cover spans of length 2, reduced from two diagonal cells.
    Cells d steps off the diagonal cover spans of length d + 1, reduced by combining two shorter spans at every possible split point.
    Worked example (each row starts on the diagonal; X marks a cell with no constituent):

      A X X K L
        B F H X
          C X X
            D G
              E

    BC ~ F
    DE ~ G
    FD ~ H
    AH ~ K
    KE ~ L   # if KG ~ L were used instead, D would belong to both K and G (the spans would overlap)
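
A minimal CYK recognition sketch that reproduces the worked chart above, treating A–E as the preterminals on the diagonal and applying only the binary rules; the data structures and names are illustrative.

```python
# CYK recognition over Chomsky-normal-form binary rules, reproducing the chart above.
rules = {("B", "C"): "F", ("D", "E"): "G", ("F", "D"): "H",
         ("A", "H"): "K", ("K", "E"): "L"}
words = ["A", "B", "C", "D", "E"]     # already the preterminal sequence
n = len(words)

# chart[i][j] = set of nonterminals spanning words[i:j+1]
chart = [[set() for _ in range(n)] for _ in range(n)]
for i, w in enumerate(words):
    chart[i][i].add(w)                                   # diagonal cells

for span in range(2, n + 1):                             # span length
    for i in range(n - span + 1):
        j = i + span - 1
        for k in range(i, j):                            # split point
            for left in chart[i][k]:
                for right in chart[k + 1][j]:
                    if (left, right) in rules:
                        chart[i][j].add(rules[(left, right)])

print(chart[0][n - 1])    # {'L'} -- the whole input is covered
```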

Neural Networks

Distributed Representations

  • word2vec CBOW (Continuous Bag of Words)
    • $\Pr(w_i \mid w_o; V, U) = \frac{1}{Z} \exp(v_i^T u_o)$   ($i = \{t\}$ is the center position, $o = \{t-w, \cdots, t-1, t+1, \cdots, t+w\}$ the context window, $u_o$ the averaged context vector)
    • $\Pr(w_i \mid w_o; V, U) \approx \sigma(v_i^T u_o)\, \sigma(-v_*^T u_o)$   (negative sampling, with $*$ ranging over sampled negative words; see the sketch after this list)
  • word2vec Skip-Gram
    • $\Pr(w_o \mid w_i; V, U)$
  • GloVe
    • $J = \sum\limits_{i,j} f(F_{ij}) \left(v_i^T v_j + b_i + b_j - \log F_{ij}\right)^2$   ($F$: co-occurrence counts, $f$: weighting function)
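
A sketch of one CBOW negative-sampling loss evaluation under the formulation above, assuming negatives are sampled in place of the center word (the usual convention); the vocabulary size, dimensionality, and sampled indices are all illustrative.

```python
import numpy as np

# One CBOW negative-sampling step: -log sigma(v_i^T u_o) - sum log sigma(-v_*^T u_o).
rng = np.random.default_rng(0)
vocab_size, dim = 50, 8
V = rng.normal(scale=0.1, size=(vocab_size, dim))   # center-word vectors v
U = rng.normal(scale=0.1, size=(vocab_size, dim))   # context-word vectors u

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbow_ns_loss(center, context, negatives):
    u_o = U[context].mean(axis=0)                    # averaged context vector u_o
    pos = np.log(sigmoid(V[center] @ u_o))           # log sigma(v_i^T u_o)
    neg = np.log(sigmoid(-V[negatives] @ u_o)).sum() # log sigma(-v_*^T u_o) over negatives
    return -(pos + neg)                              # negative log-likelihood

loss = cbow_ns_loss(center=7, context=[3, 11, 12, 20], negatives=[5, 9, 30])
print(round(float(loss), 4))
```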

Neural Networks

  • Overall framework
    • $J = \mathrm{CE}(\mathrm{SM}(f), y)$   (softmax over the model output, cross-entropy against the label)
    • Compute gradients with the chain rule (backpropagation).
    • Update parameters with the gradients.   (ReLU, residual connections, and gradient clipping help with vanishing / exploding gradients.)
  • RNN
    • $h_t = f(W^x x_t + W^h h_{t-1} + b)$
    • Teacher Forcing
    • Bi RNN
    • Stacked RNN (Deep RNN)
  • LSTM
    $$\begin{aligned}
    g_t &= \tanh(W^{g}_x x_t + W^{g}_h h_{t-1} + b^{g}) \\
    i_t &= \sigma(W^{i}_x x_t + W^{i}_h h_{t-1} + b^{i}) \\
    f_t &= \sigma(W^{f}_x x_t + W^{f}_h h_{t-1} + b^{f}) \\
    c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
    o_t &= \sigma(W^{o}_x x_t + W^{o}_h h_{t-1} + b^{o}) \\
    h_t &= o_t \odot \tanh(c_t)
    \end{aligned}$$
    $$\frac{\partial c_t}{\partial c_{t-1}} = f_t + \left(c_{t-1} \frac{\partial f_t}{\partial h_{t-1}} + i_t \frac{\partial g_t}{\partial h_{t-1}} + g_t \frac{\partial i_t}{\partial h_{t-1}}\right) \frac{\partial h_{t-1}}{\partial c_{t-1}}$$
    (the additive $f_t$ term is what keeps the cell-state gradient from vanishing)
  • Transformer (attention; see the self-attention sketch after this list)
    $$\begin{aligned}
    z^{(0)} &= [P x_1 \cdots P x_N] + Q_{\mathrm{pos}} && Q_{\mathrm{pos}} \in \mathbb{R}^{N \times D} \\
    \hat{z}^{(\ell+1)} &= \mathrm{MSA}(\mathrm{LN}(z^{(\ell)})) + z^{(\ell)} && \ell = 0, \cdots, L-1 \\
    z^{(\ell+1)} &= \mathrm{MLP}(\mathrm{LN}(\hat{z}^{(\ell+1)})) + \hat{z}^{(\ell+1)} && \ell = 0, \cdots, L-1
    \end{aligned}$$
    ($P$ is the token-embedding matrix, $Q_{\mathrm{pos}}$ the positional embedding.)
    $\mathrm{SA}(x) = \sigma(Q^T K \cdot M)\, V$   ($\sigma$: row-wise softmax, $M$: attention mask)
    • Layer Normalization
    • Residual Connections
  • Pretrained models
    • ELMo (BiLSTM)
    • BERT (Transformer Encoder) (Masked Language Model / Next Sentence Prediction)
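
A sketch of causally masked dot-product self-attention in the spirit of the $\mathrm{SA}$ expression above; the $1/\sqrt{d_k}$ scaling is the usual Transformer convention added here, and all shapes and weight matrices are illustrative.

```python
import numpy as np

# Causally masked dot-product self-attention: softmax(Q K^T / sqrt(d_k) + M) V.
rng = np.random.default_rng(0)
N, D, d_k = 5, 16, 8                      # sequence length, model dim, head dim
x = rng.normal(size=(N, D))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(D, d_k)) for _ in range(3))

Q, K, Vmat = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(d_k)                        # (N, N) attention logits
mask = np.triu(np.full((N, N), -np.inf), k=1)          # causal mask: hide future positions
scores = scores + mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
out = weights @ Vmat                                   # (N, d_k) attended values
print(out.shape)
```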

Machine Translation

  • IBM Noisy Channel
    • $\arg\max\limits_{y} p(y \mid x) = \arg\max\limits_{y} p(y)\, p(x \mid y)$   (language model + translation model)
    • $p(y) = \prod\limits_{t=1}^{T} p(y_t \mid y_{1:t-1})$
    • $p(x \mid y) = \sum\limits_{a} p(x, a \mid y)$
    • $p(x, a \mid y) = p(L \mid y) \prod\limits_{l=1}^{L} p(a_l \mid a_{1:l-1}, x_{1:l-1}, L, y)\, p(x_l \mid a_{1:l}, x_{1:l-1}, L, y)$   (length model + alignment model + lexicon model)
    • Model 1 assumptions: $p(L \mid y) = c$, $p(a_l \mid a_{1:l-1}, x_{1:l-1}, L, y) = \frac{1}{T}$, $p(x_l \mid a_{1:l}, x_{1:l-1}, L, y) = p(x_l \mid y_{a_l})$.
    • $p(x, a \mid y) = \frac{c}{T^L} \prod\limits_{l=1}^{L} p(x_l \mid y_{a_l})$
    • $p(x \mid y) = \frac{c}{T^L} \prod\limits_{l=1}^{L} \sum\limits_{t=1}^{T} p(x_l \mid y_t)$
  • Beam search decoding   (keep only a fixed number of frontier hypotheses at each step)
  • BLEU
    • $p_n = \frac{\sum\limits_{f} \sum\limits_{g_n \in f} C_y(g_n)}{\sum\limits_{f} \sum\limits_{g_n \in f} C(g_n)}$   ($C(g_n)$: count of $n$-gram $g_n$ in candidate sentence $f$; $C_y(g_n)$: that count clipped against the reference)
    • $\mathrm{BP} = \begin{cases} 1 & L_f \geqslant L_y \\ \exp(1 - \frac{L_y}{L_f}) & L_f < L_y \end{cases}$
    • $\mathrm{BLEU} = \mathrm{BP} \cdot (p_1 p_2 p_3 p_4)^{1/4}$   (geometric mean of the $n$-gram precisions)
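
A sketch of corpus-level BLEU with clipped $n$-gram counts, the brevity penalty, and the geometric mean of $p_1 \ldots p_4$; the single-reference setup and the example sentences are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(zip(*(tokens[i:] for i in range(n))))

def bleu(candidates, references, max_n=4):
    log_p = 0.0
    for n in range(1, max_n + 1):
        match = total = 0
        for cand, ref in zip(candidates, references):
            c, r = ngrams(cand, n), ngrams(ref, n)
            match += sum(min(c[g], r[g]) for g in c)     # clipped counts C_y(g_n)
            total += sum(c.values())                     # candidate counts C(g_n)
        log_p += math.log(match / total) if match else -math.inf
    L_f = sum(len(c) for c in candidates)                # candidate length
    L_y = sum(len(r) for r in references)                # reference length
    bp = 1.0 if L_f >= L_y else math.exp(1 - L_y / L_f)  # brevity penalty
    return bp * math.exp(log_p / max_n)                  # geometric mean of p_1..p_4

cand = [["the", "cat", "sat", "on", "the", "mat"]]
ref  = [["the", "cat", "sat", "on", "the", "hat"]]
print(round(bleu(cand, ref), 4))   # ~0.7598
```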
