DataWhale

Day two

  • LOGISTIC REGRESSION
    • Linear regression & Logistic regression
    • The principle of logistic regression
      • loss function
      • optimization
    • Regularization
    • Model evaluation index
    • Advantages
    • Disadvantages
    • Sample imbalance issue
    • sklearn parameters

LOGISTIC REGRESSION

Linear regression & Logistic regression

  • The purpose of linear regression is to predict a continuous variable $y$ from an input $x$.
  • The purpose of logistic regression is to classify an input $x$ into one of several discrete categories.
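
The contrast can be seen directly in scikit-learn (a minimal sketch on toy data; the data values are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# toy 1-D data: y grows with x; the class label is 1 when x > 2.5
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_continuous = np.array([0.1, 1.1, 2.0, 2.9, 4.2, 5.1])
y_class = (X.ravel() > 2.5).astype(int)

reg = LinearRegression().fit(X, y_continuous)   # predicts a real number
clf = LogisticRegression().fit(X, y_class)      # predicts a class label

print(reg.predict([[2.5]]))   # a continuous value near 2.5
print(clf.predict([[2.5]]))   # a discrete label, 0 or 1
```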

The principle of logistic regression

Like linear regression, logistic regression has a predictive function $h_\theta(x)$ (called the classification function; it may be linear or non-linear).

  • first, compute the predictive function

  • second, call the sigmoid function

  • finally, get the loss function and optimize parameters

    loss function

    $J(\theta)=\frac{1}{N}\sum_{i=1}^{N}\mathrm{Cost}(h_\theta(x),y)$

    optimization

    1. batch gradient descent
      $\theta_j = \theta_j + \alpha \cdot \frac{1}{N}\sum_{i=1}^{N}\left(y^{(i)}-h_\theta(x^{(i)})\right)x_j^{(i)}$
      (repeat until convergence)

    2. stochastic gradient descent
      for i = 1 to N:
      $\theta_j = \theta_j + \alpha \cdot \left(y^{(i)}-h_\theta(x^{(i)})\right)x_j^{(i)}$
      (repeat until convergence)
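
The steps above (hypothesis, sigmoid, gradient updates) can be sketched as a minimal NumPy implementation; the toy data and hyperparameters are illustrative assumptions, not the library code:

```python
import numpy as np

def sigmoid(z):
    """The logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_descent(X, y, alpha=0.5, iters=2000):
    """One vectorized batch update per iteration:
    theta += alpha/N * X^T (y - h_theta(X))."""
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        h = sigmoid(X @ theta)               # predictions for all N samples
        theta += alpha / N * (X.T @ (y - h))
    return theta

# toy separable data: label is 1 when the feature is positive
# (the first column is the intercept term)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = batch_gradient_descent(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(int)
print(preds)  # recovers the labels [0 0 1 1]
```

Stochastic gradient descent differs only in that it updates $\theta$ after each single sample instead of averaging over all $N$.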

Regularization

$J(\theta)=-\frac{1}{N}\sum_{i=1}^{N}\left[y^{(i)}\log h_\theta(x^{(i)})+\left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]+\frac{\lambda}{2N}\sum_{j=1}^{n}\theta_j^2$

where $n$ is the number of features. The L2 penalty term shrinks the weights and reduces overfitting.
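
As a sketch, the regularized loss can be computed directly with NumPy (assuming, as is conventional, that the intercept $\theta_0$ is not penalized):

```python
import numpy as np

def regularized_loss(theta, X, y, lam):
    """Cross-entropy loss plus the L2 penalty lam/(2N) * sum(theta_j^2).
    The intercept theta[0] is left unpenalized (a common convention,
    assumed here)."""
    N = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # sigmoid predictions
    ce = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    return ce + lam / (2 * N) * np.sum(theta[1:] ** 2)

theta = np.array([0.1, 2.0])
X = np.array([[1.0, -1.0], [1.0, 1.0]])   # first column is the intercept
y = np.array([0.0, 1.0])
print(regularized_loss(theta, X, y, lam=0.0))  # plain cross-entropy
print(regularized_loss(theta, X, y, lam=1.0))  # larger, due to the penalty
```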

Model evaluation index

  • receiver operating characteristic curve (ROC)
    the horizontal axis: false positive rate (FPR)
    the vertical axis: true positive rate (TPR)
    every point on the curve corresponds to a classification threshold

  • area under the curve (AUC)
    AUC is the area under the ROC curve; the closer it is to 1, the better the model ranks positive samples above negative ones


Advantages

  1. easy to compute, understand and implement
  2. efficient in time and memory
  3. robust to small amounts of noise in the data

Disadvantages

  1. prone to underfitting
  2. classification accuracy is often not high, since the decision boundary is linear
  3. does not work well with a very large feature space

Sample imbalance issue

When the classes are imbalanced, the logistic regression model tends to ignore the features of the minority class.

  • solutions
  1. oversampling, undersampling and combined sampling
  2. weight adjustment
  3. kernel function correction
  4. model correction
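
Weight adjustment (solution 2) is built into scikit-learn via `class_weight`; a minimal sketch on made-up imbalanced data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# imbalanced toy data: 95 majority samples near (0, 0),
# 5 minority samples near (2.5, 2.5)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(95, 2)),
               rng.normal(2.5, 1.0, size=(5, 2))])
y = np.array([0] * 95 + [1] * 5)

# class_weight='balanced' reweights each class by N / (n_classes * n_samples_c),
# so mistakes on the rare class cost proportionally more
clf = LogisticRegression(class_weight='balanced').fit(X, y)
print(clf.predict([[2.5, 2.5]]))
```

Oversampling and undersampling (solution 1) are typically done with a separate library such as imbalanced-learn before fitting.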

sklearn parameters

sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)
  1. regularization parameter: penalty ('l1' or 'l2')
  2. optimization: solver ('liblinear', 'lbfgs', 'newton-cg', 'sag')
  3. classification: multi_class ('ovr' or 'multinomial')
  4. class weight: class_weight
  5. sample weight: sample_weight (passed to fit(), not to the constructor)
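
A typical usage of these parameters (a sketch on the iris dataset; the specific parameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 'lbfgs' supports the 'l2' penalty; C is the inverse regularization strength
clf = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=200)
clf.fit(X_tr, y_tr)            # fit() is also where sample_weight would go
print(clf.score(X_te, y_te))   # accuracy on held-out data
```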

