【Machine Learning】Unsupervised Learning

These notes are based on the unsupervised-learning part of the lecture notes for Tsinghua University's Machine Learning course; they are essentially a cheat sheet written a day or two before the exam. The coverage is broad but not detailed, and the notes are intended mainly as material for review and memorization.

Principal Component Analysis

  • Dimension reduction: by the JL lemma, $d=\Omega\left(\frac{\log n}{\epsilon^2}\right)$ dimensions are needed to preserve the pairwise distances of $n$ data points.
  • Goal of PCA
    • maximize variance: $\mathbb{E}[(v^\top x)^2]=v^\top XX^\top v$ for $\|v\|=1$
    • minimize reconstruction error: $\mathbb{E}[\|x-(v^\top x)v\|^2]$
  • Find the $v_i$ iteratively, then project the data points onto the subspace spanned by $v_1,v_2,\ldots,v_d$.
  • How to find $v$?
    • Eigen decomposition: $XX^\top =U\Sigma U^\top$
    • $v_1$ is the eigenvector with the largest eigenvalue.
    • Power method (see the sketch after this list)
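
A minimal numpy sketch of the power method for the top principal component. The data layout ($X$ stored as a $d\times n$ matrix of centered columns, so the relevant matrix is $XX^\top$) and the fixed iteration count are illustrative assumptions, not details from the lecture notes.

```python
import numpy as np

def top_eigenvector(X, num_iters=200, seed=0):
    """Power method on C = X X^T; returns the eigenvector of the largest eigenvalue."""
    C = X @ X.T
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(X.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = C @ v                   # one multiplication by X X^T
        v /= np.linalg.norm(v)      # renormalize so that ||v|| = 1
    return v

# Usage: compare against the eigen decomposition X X^T = U Sigma U^T.
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 200))
X -= X.mean(axis=1, keepdims=True)  # center the data first
v1 = top_eigenvector(X)
_, U = np.linalg.eigh(X @ X.T)      # eigenvalues in ascending order
print(abs(v1 @ U[:, -1]))           # close to 1.0 (agreement up to sign)
```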

Nearest Neighbor Classification

  • KNN: K-nearest neighbor
  • nearest neighbor search: locality-sensitive hashing (LSH)*
    • Randomized $c$-approximate $R$-near neighbor ($(c,R)$-NN): a data structure that, with some probability, returns a $cR$-near neighbor whenever an $R$-near neighbor exists.
    • A family $H$ is called $(R,cR,P_1,P_2)$-sensitive if for any $p,q\in \mathbb{R}^d$
      • if $\|p-q\|\le R$, then $\Pr_H[h(q)=h(p)]\ge P_1$
      • if $\|p-q\|\ge cR$, then $\Pr_H[h(q)=h(p)]\le P_2$
      • where $P_1>P_2$
    • Algorithm based on an LSH family (a minimal sketch follows this list):
      • Construct $g_i(x)=(h_{i,1}(x),h_{i,2}(x),\ldots,h_{i,k}(x))$ for $1\le i\le L$, where all $h_{i,j}$ are drawn i.i.d. from $H$.
      • Check the elements in the buckets $g_1(q),\ldots,g_L(q)$, testing whether each is a $cR$-near neighbor of $q$, until $2L+1$ points have been checked.
      • If an $R$-near neighbor exists, the algorithm finds a $cR$-near neighbor with probability at least $\frac{1}{2}-\frac{1}{e}$.
      • Set $\rho=\frac{\log 1/P_1}{\log 1/P_2}$, $k=\log_{1/P_2}(n)$, $L=n^\rho$.
      • Proof
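
A minimal sketch of the bucketing scheme above, using the p-stable (Gaussian projection) hash family $h(x)=\lfloor (a^\top x+b)/w\rfloor$ for Euclidean distance. The class name, the bucket width $w$, and the specific values of $k$ and $L$ are illustrative assumptions rather than the constants from the analysis.

```python
import numpy as np

class LSHIndex:
    """L hash tables, each keyed by g_i(x) = (h_{i,1}(x), ..., h_{i,k}(x))."""

    def __init__(self, dim, k=8, L=10, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w
        self.A = rng.standard_normal((L, k, dim))   # Gaussian projections a
        self.B = rng.uniform(0, w, size=(L, k))     # random offsets b
        self.tables = [dict() for _ in range(L)]

    def _keys(self, x):
        # One k-tuple key per table.
        H = np.floor((self.A @ x + self.B) / self.w).astype(int)
        return [tuple(row) for row in H]

    def insert(self, idx, x):
        for table, key in zip(self.tables, self._keys(x)):
            table.setdefault(key, []).append(idx)

    def query(self, q, data, c, R):
        # Scan candidates in the buckets g_1(q), ..., g_L(q); return the first
        # point within distance cR, giving up after 2L + 1 checks.
        max_checks = 2 * len(self.tables) + 1
        checked = 0
        for table, key in zip(self.tables, self._keys(q)):
            for idx in table.get(key, []):
                if np.linalg.norm(data[idx] - q) <= c * R:
                    return idx
                checked += 1
                if checked >= max_checks:
                    return None
        return None

# Usage: index random points, then query a slightly perturbed copy of one of them.
rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 32))
index = LSHIndex(dim=32)
for i, x in enumerate(data):
    index.insert(i, x)
print(index.query(data[42] + 0.01 * rng.standard_normal(32), data, c=2.0, R=0.5))
```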

Metric Learning

  • Project each $x_i$ to an embedding $f(x_i)$.
  • Hard version (compare the labels of its neighbors) vs. soft version
  • Neighborhood Component Analysis (NCA)
    • $p_{i,j}\sim \exp(-\|f(x_i)-f(x_j)\|^2)$
    • maximize $\sum_{i}\sum_{j\in C_i}p_{i,j}$
  • LMNN: $L=\max(0,\|f(x)-f(x^+)\|_2-\|f(x)-f(x^-)\|_2+r)$ (see the sketch after this list)
    • $x^+,x^-$ are the worst cases (the hardest positive and negative examples).
    • $r$ is the margin.
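
Minimal numpy sketches of the two objectives above, written for embeddings $Z=f(X)$ that have already been computed; the function names, the `labels` argument, and the toy margin value are illustrative assumptions.

```python
import numpy as np

def nca_objective(Z, labels):
    """NCA objective: p_{i,j} ~ exp(-||z_i - z_j||^2), maximize sum_i sum_{j in C_i} p_{i,j}."""
    D = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)  # squared distances
    P = np.exp(-D)
    np.fill_diagonal(P, 0.0)                  # a point is not its own neighbor
    P /= P.sum(axis=1, keepdims=True)         # normalize each row into probabilities
    same_class = labels[:, None] == labels[None, :]
    return np.sum(P * same_class)

def lmnn_hinge(z, z_pos, z_neg, margin=1.0):
    """LMNN-style hinge: max(0, ||z - z+||_2 - ||z - z-||_2 + r)."""
    return max(0.0, np.linalg.norm(z - z_pos) - np.linalg.norm(z - z_neg) + margin)

# Usage on toy embeddings: two tight same-class pairs, far apart from each other.
Z = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 0.0], [3.1, 0.0]])
labels = np.array([0, 0, 1, 1])
print(nca_objective(Z, labels), lmnn_hinge(Z[0], Z[1], Z[2]))
```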

Spectral Clustering

  • K-means
  • Spectral graph clustering
    • Graph Laplacian: $L=D-A$, where $A$ is the similarity (adjacency) matrix and $D$ is the degree matrix.
    • #zero eigenvalues = #connected components
    • The eigenvectors of the $k$ smallest eigenvalues give a partition into $k$ clusters: run $k$-means on their rows (see the sketch after this list).
    • Minimizing the ratio cut can be relaxed to finding the $k$ smallest eigenvectors of the graph Laplacian, i.e. the same procedure as above.
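
A minimal sketch of the procedure above: build the unnormalized graph Laplacian from a similarity matrix, take the eigenvectors of the $k$ smallest eigenvalues, and run $k$-means on their rows. The Gaussian similarity and the use of scikit-learn's `KMeans` for the last step are illustrative choices, not requirements from the notes.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, k):
    D = np.diag(A.sum(axis=1))              # degree matrix
    L = D - A                               # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    H = eigvecs[:, :k]                      # eigenvectors of the k smallest eigenvalues
    return KMeans(n_clusters=k, n_init=10).fit_predict(H)  # k-means on the rows

# Usage: two well-separated blobs should come out as two clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
A = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))  # Gaussian similarity
print(spectral_clustering(A, k=2))
```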

SimCLR*

  • Intelligence is positioning

  • InfoNCE loss (a minimal numpy sketch appears at the end of this section)
    $L(q,p_1,\{p_i\}_{i=2}^N)=-\log \frac{\exp(-\|f(q)-f(p_1)\|^2/(2\tau))}{\sum_{i=1}^{N}\exp(-\|f(q)-f(p_{i})\|^2/(2\tau))}$

  • Learn $Z=f(x)$: map the original data points into a space where semantic similarity is captured naturally.

    • Reproducing kernel Hilbert space: $k(f(x_1),f(x_2))=\langle\phi(f(x_1)),\phi(f(x_2))\rangle_H$; the inner product is a kernel function.
    • Usually $K_{Z,i,j}=k(Z_i-Z_j)$, where $k$ is Gaussian.
  • We are given a similarity matrix $\pi$ over the dataset, where $\pi_{i,j}$ is the similarity between data points $i$ and $j$. We want the similarity matrix $K_Z$ of the embeddings $f(x)$ to match the manually specified similarity of $x$: sampling $W_X\sim \pi$ and $W_Z\sim K_Z$, we want the two samples to agree.

  • Minimize the cross-entropy loss: $H_{\pi}^{k}(Z)=-\mathbb{E}_{W_X\sim P(\cdot ;\pi)}[\log P(W_Z=W_X;K_Z)]$

    • Equivalent to the InfoNCE loss: restricted to row $i$, the loss is $-\log P(W_{Z,i}=W_{X,i};K_Z)$. The given pair $q,p_1$ is sampled from the similarity matrix $\pi$, which corresponds to $W_X\sim P(\cdot;\pi)$.
    • Equivalent to spectral clustering: the objective reduces to $\arg \min_Z \operatorname{tr}(Z^\top L^*Z)$.
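
A minimal numpy sketch of the InfoNCE loss defined at the top of this section, for precomputed embeddings. The variable names and the temperature value are illustrative assumptions; `f_p` stacks the positive $p_1$ as its first row and the negatives $p_2,\ldots,p_N$ below it.

```python
import numpy as np

def info_nce(f_q, f_p, tau=0.5):
    """InfoNCE: -log softmax of -||f(q) - f(p_i)||^2 / (2 tau) at the positive index 0."""
    d2 = np.sum((f_p - f_q) ** 2, axis=1)    # ||f(q) - f(p_i)||^2 for all i
    logits = -d2 / (2 * tau)
    logits -= logits.max()                   # subtract the max for numerical stability
    return -(logits[0] - np.log(np.sum(np.exp(logits))))

# Usage: a nearby positive and far-away negatives give a small loss.
rng = np.random.default_rng(0)
q = rng.standard_normal(16)
f_p = np.vstack([q + 0.05 * rng.standard_normal(16),    # positive p_1
                 rng.standard_normal((7, 16))])          # negatives p_2..p_8
print(info_nce(q, f_p))
```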

t-SNE

  • Data visualization: map data into a low-dimensional space (e.g. 2D).

  • SNE: same idea as NCA; we want $q_{i,j}\sim \exp(-\|f(x_i)-f(x_j)\|^2/(2\sigma^2))$ in the low-dimensional space to be similar to $p_{i,j}\sim \exp (-\|x_i-x_j\|^2/(2\sigma_i^2))$ in the original space.

    • Cross-entropy loss $-p_{i,j}\cdot \log \frac{q_{i,j}}{p_{i,j}}$; summed over all pairs, this is the KL divergence $\mathrm{KL}(P\|Q)$.
  • Crowding problem

  • Solved by t-SNE: let $q_{i,j}\sim (1+\|y_j-y_i\|^2)^{-1}$ (Student t-distribution); see the sketch after this list.

    • The power $-1$ gives a heavier tail than the Gaussian, so moderate distances in the original space can map to larger distances in the embedding, which relieves the crowding problem.
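
A minimal numpy sketch of the affinities and the KL objective above. Fixing a single bandwidth $\sigma$ and normalizing the affinities jointly are simplifying assumptions; real t-SNE calibrates $\sigma_i$ per point via a perplexity search and optimizes $Y$ by gradient descent.

```python
import numpy as np

def p_affinities(X, sigma=1.0):
    """Gaussian affinities in the original space, p_{i,j} ~ exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    D = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    P = np.exp(-D / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()

def q_affinities(Y):
    """Heavy-tailed Student-t affinities in the embedding, q_{i,j} ~ (1 + ||y_i - y_j||^2)^{-1}."""
    D = np.sum((Y[:, None] - Y[None, :]) ** 2, axis=-1)
    Q = 1.0 / (1.0 + D)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_objective(P, Q, eps=1e-12):
    """sum_{i,j} p_{i,j} log(p_{i,j} / q_{i,j}), the quantity t-SNE minimizes over Y."""
    return np.sum(P * np.log((P + eps) / (Q + eps)))

# Usage: random 2D embedding of random high-dimensional data.
rng = np.random.default_rng(0)
X, Y = rng.standard_normal((30, 10)), rng.standard_normal((30, 2))
print(kl_objective(p_affinities(X), q_affinities(Y)))
```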
