Constrained Optimization: the Method of Lagrange Multipliers

The method of Lagrange multipliers is a technique for finding the extrema of a multivariate function subject to a set of constraints. By introducing Lagrange multipliers, an optimization problem over $d$ variables with $k$ constraints can be converted into an unconstrained optimization problem over $d + k$ variables.
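As a minimal concrete sketch of this reduction (assuming SymPy is available; the objective and constraint below are an invented toy example), consider minimizing $f(x, y) = x^2 + y^2$ subject to $x + y = 1$: the problem with $d = 2$ variables and $k = 1$ constraint becomes a stationarity problem in $d + k = 3$ unknowns $(x, y, \beta)$.

```python
# A minimal sketch, assuming SymPy: an equality-constrained toy problem
# solved by finding a stationary point of its Lagrangian.
import sympy as sp

x, y, beta = sp.symbols('x y beta', real=True)
f = x**2 + y**2        # objective (invented toy example)
g = x + y - 1          # equality constraint g(x, y) = 0
L = f + beta * g       # Lagrangian: d + k = 3 unknowns (x, y, beta)

# Unconstrained stationarity: all partial derivatives of L vanish.
sol = sp.solve([sp.diff(L, v) for v in (x, y, beta)], (x, y, beta), dict=True)
print(sol)  # [{x: 1/2, y: 1/2, beta: -1}]
```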

I. The Primal Problem

  1. Suppose $\mathbf x$ is a $d$-dimensional vector and we seek a value $\mathbf x^*$ that minimizes the objective function $f(\mathbf x)$, subject to $m$ inequality constraints and $n$ equality constraints, over a nonempty feasible region $\mathbb D \subset \mathbb R^d$:

    $$\begin{aligned} & \min_{\mathbf x}\ f(\mathbf x) \\ \text{s.t.}\ & h_i(\mathbf x) \leqslant 0 \quad (i = 1, 2, \dots, m) \\ & g_j(\mathbf x) = 0 \quad (j = 1, 2, \dots, n) \end{aligned} \tag{1}$$

  2. Introduce the Lagrange multipliers $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_m)^T$ and $\beta = (\beta_1, \beta_2, \dots, \beta_n)^T$; the corresponding Lagrangian is

    $$L(\mathbf x, \alpha, \beta) = f(\mathbf x) + \sum_{i=1}^{m} \alpha_i h_i(\mathbf x) + \sum_{j=1}^{n} \beta_j g_j(\mathbf x) \tag{2}$$

    Here $\mathbf x = (x_1, x_2, \dots, x_d)^T \in \mathbb R^d$ and $\alpha_i \geqslant 0$.

    Assume $f(\mathbf x)$, $h_i(\mathbf x)$, and $g_j(\mathbf x)$ are continuously differentiable functions on $\mathbb R^d$.

    $L(\mathbf x, \alpha, \beta)$ is a multivariate nonlinear function of $\mathbf x$, $\alpha$, and $\beta$; a small symbolic sketch of the construction follows.
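    The sketch below builds Eq. (2) symbolically, assuming SymPy; the concrete $f$, $h$, $g$ are an invented toy problem that the later sketches reuse.

    ```python
    # A sketch of the generalized Lagrangian of Eq. (2), assuming SymPy.
    # Toy problem (invented): f(x) = x1^2 + x2^2, one inequality constraint
    # h(x) = 1 - x1 - x2 <= 0, one equality constraint g(x) = x1 - x2 = 0.
    import sympy as sp

    x1, x2, alpha, beta = sp.symbols('x1 x2 alpha beta', real=True)
    f = x1**2 + x2**2
    h = 1 - x1 - x2               # h(x) <= 0
    g = x1 - x2                   # g(x) = 0
    L = f + alpha * h + beta * g  # L(x, alpha, beta) as in Eq. (2)
    print(sp.expand(L))           # a function of x1, x2, alpha, beta
    ```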

  3. Define the function:

    $$\theta_P(\mathbf x) = \max_{\alpha, \beta\,:\,\alpha_i \geqslant 0} L(\mathbf x, \alpha, \beta) \tag{3}$$

    where the subscript $P$ denotes the primal problem. Then:
    $$\theta_P(\mathbf x) = \begin{cases} f(\mathbf x), & \text{if } \mathbf x \text{ satisfies the primal constraints} \\ +\infty, & \text{otherwise} \end{cases} \tag{4}$$

    • If $\mathbf x$ satisfies the primal constraints, then the equality-constraint terms vanish ($g_j(\mathbf x) = 0$) and it is easy to show that $L(\mathbf x, \alpha, \beta) = f(\mathbf x) + \sum_{i=1}^{m} \alpha_i h_i(\mathbf x) \leqslant f(\mathbf x)$, with equality when every $\alpha_i = 0$.

    • If $\mathbf x$ violates the primal constraints (both cases are illustrated by the numeric sketch after this list):

      • If some inequality constraint is violated, i.e. $h_i(\mathbf x) > 0$ for some $i$, let $\alpha_i \to \infty$; then:

        $$L(\mathbf x, \alpha, \beta) = f(\mathbf x) + \sum_{i=1}^{m} \alpha_i h_i(\mathbf x) \to \infty$$

      • If some equality constraint is violated, i.e. $g_j(\mathbf x) \neq 0$ for some $j$, choose $\beta_j$ so that $\beta_j g_j(\mathbf x) \to \infty$; then:

        $$L(\mathbf x, \alpha, \beta) = f(\mathbf x) + \sum_{i=1}^{m} \alpha_i h_i(\mathbf x) + \sum_{j=1}^{n} \beta_j g_j(\mathbf x) \to \infty$$
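    The numeric sketch below (plain Python, reusing the toy problem from the earlier SymPy sketch) illustrates Eq. (4): at a feasible point the maximization over $\alpha, \beta$ leaves $L$ at $f(\mathbf x)$, while at an infeasible point $L$ grows without bound as $\alpha_i \to \infty$.

    ```python
    # A numeric sketch of Eq. (4), plain Python; toy problem as above.
    def f(x1, x2): return x1**2 + x2**2
    def h(x1, x2): return 1 - x1 - x2   # feasibility requires h <= 0
    def g(x1, x2): return x1 - x2       # feasibility requires g == 0

    def L(x1, x2, alpha, beta):
        return f(x1, x2) + alpha * h(x1, x2) + beta * g(x1, x2)

    feasible, infeasible = (0.5, 0.5), (0.0, 0.0)
    for alpha in (0.0, 1.0, 100.0, 1e6):
        print(alpha, L(*feasible, alpha, 0.0), L(*infeasible, alpha, 0.0))
    # Feasible point: h = 0 and g = 0, so L stays at f = 0.5 for every alpha,
    # hence theta_P = f there. Infeasible point: h = 1 > 0, so L -> infinity.
    ```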

  4. Consider the minimization problem:

    $$\min_{\mathbf x} \theta_P(\mathbf x) = \min_{\mathbf x} \max_{\alpha, \beta\,:\,\alpha_i \geqslant 0} L(\mathbf x, \alpha, \beta) \tag{5}$$

    This problem is equivalent to the original constrained problem: the two have the same solution and the same optimal value.

    • $\min_{\mathbf x} \max_{\alpha, \beta\,:\,\alpha_i \geqslant 0} L(\mathbf x, \alpha, \beta)$ is called the minimax problem of the generalized Lagrangian.
    • For convenience, define the optimal value of the primal problem as $p^* = \min_{\mathbf x} \theta_P(\mathbf x)$; a small solver sketch follows this list.
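    The sketch below solves the primal problem of Eq. (5) directly with an off-the-shelf solver, assuming SciPy; for the same toy problem the expected optimum is $\mathbf x^* = (0.5, 0.5)$ with $p^* = 0.5$.

    ```python
    # A sketch of the primal problem, assuming SciPy; toy problem as above.
    import numpy as np
    from scipy.optimize import minimize

    res = minimize(
        lambda x: x[0]**2 + x[1]**2,          # f(x)
        x0=np.array([1.0, 0.0]),
        constraints=[
            # SciPy's 'ineq' means fun(x) >= 0, i.e. -h(x) >= 0 here.
            {'type': 'ineq', 'fun': lambda x: x[0] + x[1] - 1},
            {'type': 'eq',   'fun': lambda x: x[0] - x[1]},   # g(x) = 0
        ],
    )
    print(res.x, res.fun)  # approx. [0.5, 0.5] and p* ~ 0.5
    ```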

II. The Dual Problem

  1. Define $\theta_D(\alpha, \beta) = \min_{\mathbf x} L(\mathbf x, \alpha, \beta)$ and consider maximizing $\theta_D(\alpha, \beta)$, i.e.:

    $$\max_{\alpha, \beta\,:\,\alpha_i \geqslant 0} \theta_D(\alpha, \beta) = \max_{\alpha, \beta\,:\,\alpha_i \geqslant 0} \min_{\mathbf x} L(\mathbf x, \alpha, \beta) \tag{6}$$

    The problem $\max_{\alpha, \beta\,:\,\alpha_i \geqslant 0} \min_{\mathbf x} L(\mathbf x, \alpha, \beta)$ is called the maximin problem of the generalized Lagrangian. It can be stated as the constrained optimization problem:

    $$\max_{\alpha, \beta}\ \theta_D(\alpha, \beta) = \max_{\alpha, \beta}\ \min_{\mathbf x} L(\mathbf x, \alpha, \beta) \\ \text{s.t.}\ \alpha_i \geqslant 0, \quad i = 1, 2, \dots, m \tag{7}$$

    This is called the dual problem of the primal problem.

    For convenience, define the optimal value of the dual problem as $d^* = \max_{\alpha, \beta\,:\,\alpha_i \geqslant 0} \theta_D(\alpha, \beta)$; a worked sketch of $\theta_D$ follows.
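    The sketch below, assuming SymPy and the same toy problem, performs the inner minimization over $\mathbf x$ to obtain $\theta_D$ in closed form, then maximizes it to get $d^*$.

    ```python
    # A sketch of the dual function theta_D, assuming SymPy; toy problem as above.
    import sympy as sp

    x1, x2, alpha, beta = sp.symbols('x1 x2 alpha beta', real=True)
    L = x1**2 + x2**2 + alpha * (1 - x1 - x2) + beta * (x1 - x2)

    # Inner minimization: L is convex in x, so solve grad_x L = 0.
    xs = sp.solve([sp.diff(L, x1), sp.diff(L, x2)], (x1, x2))
    theta_D = sp.simplify(L.subs(xs))
    print(theta_D)  # alpha - alpha**2/2 - beta**2/2

    # Outer maximization: theta_D is concave, and its stationary point has
    # alpha = 1 >= 0, so the constraint alpha_i >= 0 is inactive there.
    ab = sp.solve([sp.diff(theta_D, alpha), sp.diff(theta_D, beta)], (alpha, beta))
    print(ab, theta_D.subs(ab))  # {alpha: 1, beta: 0}, d* = 1/2
    ```

    Here $d^* = 1/2$ matches $p^*$ from the primal sketch: weak duality (Theorem 1 below) guarantees only $d^* \leqslant p^*$, and equality holds because this toy problem is convex and strictly feasible (Theorem 2).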

  2. Theorem 1: if both the primal problem and the dual problem have optimal values, then:

    $$d^* = \max_{\alpha, \beta\,:\,\alpha_i \geqslant 0} \min_{\mathbf x} L(\mathbf x, \alpha, \beta) \leqslant \min_{\mathbf x} \max_{\alpha, \beta\,:\,\alpha_i \geqslant 0} L(\mathbf x, \alpha, \beta) = p^* \tag{8}$$

    • Corollary 1: let $\mathbf x^*$ be a feasible solution of the primal problem with $\theta_P(\mathbf x^*) = p^*$, and let $\alpha^*, \beta^*$ be a feasible solution of the dual problem with $\theta_D(\alpha^*, \beta^*) = d^*$.

      If $p^* = d^*$, then $\mathbf x^*$ and $(\alpha^*, \beta^*)$ are optimal solutions of the primal and dual problems, respectively.

  3. Theorem 2: assume $f(\mathbf x)$ and $h_i(\mathbf x)$ are convex functions, $g_j(\mathbf x)$ is affine, and the inequality constraints are strictly feasible, i.e. there exists $\mathbf x$ with $h_i(\mathbf x) < 0$ for all $i$. Then there exist $\mathbf x^*, \alpha^*, \beta^*$ such that $\mathbf x^*$ solves the primal problem $\min_{\mathbf x} \theta_P(\mathbf x)$, $(\alpha^*, \beta^*)$ solves the dual problem $\max_{\alpha, \beta\,:\,\alpha_i \geqslant 0} \theta_D(\alpha, \beta)$, and $p^* = d^* = L(\mathbf x^*, \alpha^*, \beta^*)$.

  4. Theorem 3: under the same assumptions as Theorem 2 (convex $f(\mathbf x)$ and $h_i(\mathbf x)$, affine $g_j(\mathbf x)$, strictly feasible inequality constraints), $\mathbf x^*$ solves the primal problem $\min_{\mathbf x} \theta_P(\mathbf x)$ and $(\alpha^*, \beta^*)$ solves the dual problem $\max_{\alpha, \beta\,:\,\alpha_i \geqslant 0} \theta_D(\alpha, \beta)$ if and only if $\mathbf x^*, \alpha^*, \beta^*$ satisfy the following **Karush-Kuhn-Tucker (KKT)** conditions:
    $$\begin{aligned} & \nabla_{\mathbf x} L(\mathbf x^*, \alpha^*, \beta^*) = 0 \\ & \alpha_i^*\, h_i(\mathbf x^*) = 0, \quad i = 1, 2, \dots, m \\ & h_i(\mathbf x^*) \leqslant 0, \quad i = 1, 2, \dots, m \\ & \alpha_i^* \geqslant 0, \quad i = 1, 2, \dots, m \\ & g_j(\mathbf x^*) = 0, \quad j = 1, 2, \dots, n \end{aligned}$$
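    The sketch below, assuming SymPy, checks each KKT condition at the candidate point from the earlier sketches, $\mathbf x^* = (1/2, 1/2)$, $\alpha^* = 1$, $\beta^* = 0$.

    ```python
    # A sketch verifying the KKT conditions, assuming SymPy; toy problem as above.
    import sympy as sp

    x1, x2, alpha, beta = sp.symbols('x1 x2 alpha beta', real=True)
    L = x1**2 + x2**2 + alpha * (1 - x1 - x2) + beta * (x1 - x2)
    pt = {x1: sp.Rational(1, 2), x2: sp.Rational(1, 2), alpha: 1, beta: 0}

    print([sp.diff(L, v).subs(pt) for v in (x1, x2)])  # [0, 0]: stationarity
    print((alpha * (1 - x1 - x2)).subs(pt))  # 0: complementary slackness
    print((1 - x1 - x2).subs(pt))            # 0 <= 0: primal feasibility (h)
    print((x1 - x2).subs(pt))                # 0: primal feasibility (g)
    # alpha* = 1 >= 0: dual feasibility. All the conditions hold.
    ```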

  5. Affine function: an affine function is one given by a polynomial of degree at most 1.

    The general form is $f(\mathbf x) = \mathbf A \mathbf x + \mathbf b$, where $\mathbf A$ is an $m \times n$ matrix, $\mathbf x$ is an $n$-dimensional column vector, and $\mathbf b$ is an $m$-dimensional column vector; it represents a linear map plus an offset from $n$-dimensional space to $m$-dimensional space.
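    A minimal sketch, assuming NumPy, of an affine map from $n = 3$ dimensions to $m = 2$ (the particular $\mathbf A$ and $\mathbf b$ are invented):

    ```python
    # A sketch of an affine map f(x) = A x + b, assuming NumPy.
    import numpy as np

    A = np.array([[1.0, 2.0, 0.0],
                  [0.0, 1.0, -1.0]])  # m x n matrix with m = 2, n = 3
    b = np.array([0.5, -0.5])         # m-dimensional offset
    x = np.array([1.0, 1.0, 1.0])     # n-dimensional input
    print(A @ x + b)                  # [3.5, -0.5]: an m-dimensional output
    ```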

  6. Convex function: let $f$ be defined on a convex set $\mathcal X$. If for any two points $\mathbf x_1, \mathbf x_2 \in \mathcal X$ and any real $\lambda \in (0, 1)$ we always have $f(\lambda \mathbf x_1 + (1-\lambda) \mathbf x_2) \leqslant \lambda f(\mathbf x_1) + (1-\lambda) f(\mathbf x_2)$, then $f$ is called a convex function on $\mathcal X$.
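    A quick numeric sketch (plain Python) of this inequality for the convex function $f(x) = x^2$:

    ```python
    # A sketch of the convexity inequality for f(x) = x^2, plain Python.
    f = lambda x: x * x
    x1, x2 = -1.0, 3.0
    for lam in (0.1, 0.5, 0.9):
        lhs = f(lam * x1 + (1 - lam) * x2)     # f at the blended point
        rhs = lam * f(x1) + (1 - lam) * f(x2)  # blend of the values
        print(lam, lhs, rhs, lhs <= rhs)       # True for a convex f
    ```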

