Logistic Regression

Machine Learning Notes 04: Logistic Regression


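Keywords: machine learning, Python, scikit-learn, logistic regression, LaTeX

Abstract: These notes cover the types of classification tasks, the concept of logistic regression, and classification evaluation metrics. They also show how to use Python to build a classification model, make predictions, and evaluate the results, and how to tune hyperparameters with grid search.

Note: for common LaTeX commands, see Wikibooks; the author has uploaded its PDF to the CSDN download area.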


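  1. Summary of key points
    1. Know the concepts: classification tasks (binary classification, multi-class classification, multi-label classification);
    2. Understand the concept of generalized linear regression, and why logistic regression counts as a linear model;
    3. Understand the classification evaluation metrics: accuracy, precision, recall, fall-out, F1 measure, ROC and AUC, confusion matrix, Hamming loss, and Jaccard similarity;
    4. Be able to use Python to train a logistic regression model for binary and multi-class classification, predict results, and evaluate them with the metrics above; be able to tune hyperparameters with grid search;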
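  2. Basic concepts and theory
    1. Classification task: the goal is to find a function that maps observations to their classes or labels. Logistic regression can be used for classification tasks. Common types are binary classification, multi-class classification, and multi-label classification;
    2. Logistic function: $F(t)=\frac{1}{1+e^{-t}}$, which maps any real number $t$ into the interval (0, 1);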

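    3. Why logistic regression is considered a linear model (see Zhou Zhihua, "Machine Learning"):
      1. A linear model tries to learn a function that makes predictions through a linear combination of the attributes, i.e.

        $$f(x)=w_1x_1+w_2x_2+\dots+w_dx_d+b$$

        which in vector form is

        $$f(x)=\omega^{T}x+b$$

      2. From the logistic function, with $y=F(\omega^{T}x+b)$, one can derive

        $$\ln\frac{y}{1-y}=\omega^{T}x+b$$

        that is:

        $$\ln\frac{p(y=1\mid x)}{p(y=0\mid x)}=\omega^{T}x+b$$

        so the relationship between the class probabilities (their log-odds) is expressed by a set of linear parameters;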
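As a quick numerical check of the derivation above, the following sketch applies the logistic function to a linear score and confirms that the log-odds recover that score. The names `logistic`, `w`, `b`, and the sample values are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def logistic(t):
    """Logistic function F(t) = 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical weights, bias, and samples, chosen only for illustration.
w = np.array([0.8, -1.5])
b = 0.3
X = np.array([[1.0, 2.0], [0.0, 0.0], [-2.0, 1.0]])

z = X @ w + b                    # linear score w^T x + b for each sample
y = logistic(z)                  # predicted probability p(y=1 | x)
log_odds = np.log(y / (1 - y))   # ln(y / (1 - y))

# The log-odds equal the linear score, which is why the model is linear.
print(np.allclose(log_odds, z))
```

The probabilities themselves are not linear in `x`; only their log-odds are, which is the sense in which logistic regression is a (generalized) linear model.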
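    4. Evaluating binary classification
      1. Possible outcomes for a sample after binary classification: an outcome is "true" if the prediction is correct and "false" otherwise;
        1. True positive, TP
        2. True negative, TN
        3. False positive, FP
        4. False negative, FN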
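      2. Accuracy: the ratio of correctly predicted samples to all samples;

        $$ACC=\frac{TP+TN}{TP+TN+FP+FN}$$

      3. Precision: the ratio of true positives to all samples predicted as positive (true positives + false positives);

        $$P=\frac{TP}{TP+FP}$$

      4. Recall: the ratio of true positives to all actually positive samples (true positives + false negatives);

        $$R=\frac{TP}{TP+FN}$$

      5. Fall-out: equal to the false positive rate; the fraction of all actually negative samples that the classifier labels as positive;

        $$F=\frac{FP}{TN+FP}$$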

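      6. F1 measure: the harmonic mean of precision and recall;
        1. F1 formula

        $$\frac{1}{F_1}+\frac{1}{F_1}=\frac{1}{P}+\frac{1}{R}$$

        $$F_1=\frac{2PR}{P+R}$$

        $$F_1=\frac{2TP}{2TP+FN+FP}$$

        2. General formula

        $$F_\beta=\frac{(1+\beta^2)PR}{\beta^2P+R}$$

        $$F_\beta=\frac{(1+\beta^2)TP}{(1+\beta^2)TP+\beta^2FN+FP}$$

        3. $F_2$: recall (R) is weighted more heavily than precision (P); $F_{0.5}$: precision (P) is weighted more heavily than recall (R);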
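      7. ROC curve (receiver operating characteristic): plots the classifier's recall (Y axis) against its fall-out (X axis) as the decision threshold varies;
      8. AUC (area under the curve): the area under the ROC curve; it summarizes the curve as a single value describing the classifier's expected performance, where 0.5 corresponds to random guessing and 1 to a perfect classifier;
      9. Confusion matrix: also called a contingency table; it describes the relationship between true/false and positive/negative outcomes. Rows represent the actual class and columns the predicted class;
      10. Grid search:
        1. A method for finding the optimal hyperparameters: it runs the model repeatedly over candidate parameter values and keeps the combination with the best result;
        2. It is an exhaustive search, so the computational cost is very high;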
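The binary classification metrics defined above can all be computed from the four counts TP, TN, FP, and FN. The sketch below does this in plain Python; the function name `binary_metrics` and the sample labels are illustrative assumptions, and it does not guard against division by zero (for example, when nothing is predicted positive).

```python
def binary_metrics(y_true, y_pred, beta=1.0):
    """Compute the metrics above from lists of 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return {
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'precision': precision,
        'recall': recall,
        'fall_out': fp / (tn + fp),
        'f_beta': (1 + b2) * precision * recall / (b2 * precision + recall),
        # rows: actual 0/1, columns: predicted 0/1
        'confusion_matrix': [[tn, fp], [fn, tp]],
    }

# Made-up labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

m1 = binary_metrics(y_true, y_pred)            # beta = 1 gives F1
m2 = binary_metrics(y_true, y_pred, beta=2.0)  # beta = 2 weights recall more
print(m1)
print(m2['f_beta'])
```

Here recall (0.75) is higher than precision (0.6), so $F_2$ comes out higher than $F_1$, matching the note that $F_2$ favors recall.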
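    5. Multi-class classification: implemented with the one-vs-all (also called one-vs-the-rest) strategy, which treats each class as its own binary classification problem. Each classifier scores the sample for its class, and the class with the highest confidence is assigned to the sample;
    6. Multi-label classification: each label is handled as a binary classification problem, and the classifier for each label predicts whether the sample belongs to that label;
      1. Evaluation metrics for multi-label classification: Hamming loss and Jaccard similarity;
      2. Hamming loss: the average fraction of incorrectly predicted labels; it is a loss function, so its value is zero when every prediction is correct;
      3. Jaccard similarity: the size of the intersection of the predicted and true label sets divided by the size of their union; its value lies in the interval [0, 1];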
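The two multi-label metrics can be sketched in a few lines of NumPy. This is a minimal illustration, assuming labels are given as binary indicator rows (one column per label). The function names and the sample data are hypothetical, and treating a sample with no labels at all as a perfect match is an assumption made here.

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    """Fraction of label slots that are predicted incorrectly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

def jaccard_similarity(y_true, y_pred):
    """Mean over samples of |intersection| / |union| of the label sets."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    intersection = (y_true & y_pred).sum(axis=1)
    union = (y_true | y_pred).sum(axis=1)
    # Assumption: a sample with no true and no predicted labels scores 1.0.
    scores = np.where(union == 0, 1.0, intersection / np.maximum(union, 1))
    return float(scores.mean())

# Two samples with two possible labels each (illustrative data).
y_true = [[0, 1], [1, 1]]
y_pred = [[0, 1], [1, 0]]

print(hamming_loss(y_true, y_pred))        # 1 wrong slot out of 4 -> 0.25
print(jaccard_similarity(y_true, y_pred))  # (1/1 + 1/2) / 2 -> 0.75
```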
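  3. Python programming notes
    1. Basic Python commands
      1. Three ways to print text together with a number: comma-separated arguments, %d formatting, and {} with format()
        # df is the SMS DataFrame: column 0 holds the label, column 1 the message
        print('spam:',df[df[0]=='spam'][0].count())
        # spam: 747
        print('spam:%d' %(df[df[0]=='spam'][0].count()))
        # spam:747
        print('spam:{}' .format(df[df[0]=='spam'][0].count()))
        # spam:747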
      2. Convert an array or list of spam/ham labels into a list of 0/1 values
        
        # start from all zeros, then set the spam positions to 1
        y_test_binary=[0]*y_test.shape[0]
        predictions_binary=[0]*predictions.shape[0]
        for ind,val in enumerate(y_test):
            if val=='spam':
                y_test_binary[ind]=1
        for ind,val in enumerate(predictions):
            if val=='spam':
                predictions_binary[ind]=1
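The same spam/ham to 0/1 conversion can also be written without an explicit loop. The sketch below is an alternative, not the approach used in the notes, and the `labels` array is made-up data standing in for `y_test` or `predictions`.

```python
import numpy as np

# Illustrative labels standing in for y_test or predictions.
labels = np.array(['ham', 'spam', 'ham', 'spam', 'spam'])

# Comparing against 'spam' gives a boolean array; casting to int gives 0/1.
labels_binary = (labels == 'spam').astype(int).tolist()
print(labels_binary)  # [0, 1, 0, 1, 1]
```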
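    2. Matplotlib
      1. Draw a confusion matrix with matshow() and add a colorbar()
        from sklearn.metrics import confusion_matrix
        import matplotlib.pyplot as plt
        # cm is a confusion matrix, e.g. cm=confusion_matrix(y_test_binary,predictions_binary)
        plt.matshow(cm)
        plt.colorbar()
      2. Add a label to a line and show it in a legend
        import matplotlib.pyplot as plt
        plt.plot([0,1],[0,1],label='AUC')
        plt.legend(loc='lower right')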

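    3. NumPy
      1. Dividing an interval with arange and linspace: arange steps from the start value with the given step size and never includes the end value (the interval is half-open); linspace splits the range between start and end into evenly spaced points, and the end value is the last element;
        import numpy as np
        X=np.arange(-6,6,5);X
        
        # output: array([-6, -1,  4])
        
        X=np.arange(-6,6,1);X
        
        # output: array([-6, -5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5])
        
        X=np.linspace(-6,6,5);X
        
        # output: array([-6., -3.,  0.,  3.,  6.])
        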
        
    4. Pandas
      1. Use == to count the rows of a column that hold a given value
        print('Number of ham messages:', df[df[0] == 'ham'][0].count())
    5. scikit-learn
      1. Use train_size to change the ratio between the training set and the test set
        from sklearn.model_selection import train_test_split
        X_train,X_test,y_train,y_test=train_test_split(X,y,train_size=0.5)
      2. Use TfidfVectorizer to compute TF-IDF weights
        from sklearn.feature_extraction.text import TfidfVectorizer
        vectorizer = TfidfVectorizer()
        # fit the vocabulary on the training text only, then reuse it for the test text
        X_train = vectorizer.fit_transform(X_train_raw)
        X_test = vectorizer.transform(X_test_raw)
      3. Use the LogisticRegression classifier for training and classification
        from sklearn.linear_model import LogisticRegression
        classifier = LogisticRegression()
        classifier.fit(X_train, y_train)
        predictions = classifier.predict(X_test)
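The three scikit-learn snippets above can be chained into one small end-to-end run. The following is a minimal sketch on made-up messages, not the SMS data set. The messages, the 0/1 labels, and the `random_state` and `stratify` arguments are assumptions added here so the example is self-contained and repeatable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up messages and labels, for illustration only (1 = spam, 0 = ham).
messages = [
    'win a free prize now', 'are we meeting for lunch today',
    'free cash offer call now', 'see you at home tonight',
    'claim your free prize today', 'can you call me after work',
    'urgent offer win cash now', 'lunch at noon sounds good',
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Half of the samples go to training; stratify keeps both classes in each half.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    messages, labels, train_size=0.5, stratify=labels, random_state=0)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)  # learn vocabulary on training text only
X_test = vectorizer.transform(X_test_raw)        # reuse that vocabulary for the test text

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
probabilities = classifier.predict_proba(X_test)  # columns follow classifier.classes_

print(predictions.shape, probabilities.shape)
```

With only eight messages the predicted labels carry no real meaning; the point is the order of the calls and the shapes of the results.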

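  4. Python example code (from the course code)
    1. Binary classification
      This example classifies spam messages. The data is the SMS spam data set (SMS Spam Classification Data Set) from the UCI Machine Learning Repository;
      1. Load the .csv file with pandas and compute descriptive statistics
        import pandas as pd
        # the file is tab-separated with no header row: column 0 is the label, column 1 the message
        df=pd.read_csv('Desktop/data/SMSSpamCollection',delimiter='\t',header=None)
        print('count of spam message=\t',df[df[0]=='spam'][0].count())
        print('count of ham message=\t',df[df[0]=='ham'][0].count())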
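      2. Use scikit-learn's train_test_split to split the data into a training set (75%) and a test set (25%), which is the default ratio
        # sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
        from sklearn.model_selection import train_test_split
        X_train_raw,X_test_raw,y_train,y_test=train_test_split(df[1],df[0])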
      3. Create a TfidfVectorizer instance to compute the TF-IDF weights of the input messages;
        from sklearn.feature_extraction.text import TfidfVectorizer
        vectorizer=TfidfVectorizer()
        X_train=vectorizer.fit_transform(X_train_raw)
        X_test=vectorizer.transform(X_test_raw)
      4. Create a LogisticRegression classifier, train it with fit(), and predict with predict();
        from sklearn.linear_model import LogisticRegression
        classifier=LogisticRegression()
        classifier.fit(X_train,y_train)
        predictions = classifier.predict(X_test)
        # pair each of the last 10 predictions with its own message
        for prediction,message in zip(predictions[-10:],X_test_raw.iloc[-10:]):
            print('predict-type=\t %s. message=\t %s.'%(prediction,message))
      5. Convert y_test and predictions into lists of 0s and 1s;
        y_test_binary=[0]*y_test.shape[0]
        predictions_binary=[0]*predictions.shape[0]
        for ind,val in enumerate(y_test):
            if val=='spam':
                y_test_binary[ind]=1
        
        for ind,val in enumerate(predictions):
            if val=='spam':
                predictions_binary[ind]=1
        # compare the original labels with their 0/1 encoding
        print(y_test[-10:])
        print(predictions[-10:])
        print(y_test_binary[-10:])
        print(predictions_binary[-10:])
      6. Compute the confusion matrix, print it, and plot it
        from sklearn.metrics import confusion_matrix
        # Jupyter magic: show plots inline in the notebook
        %matplotlib inline
        import matplotlib.pyplot as plt
        cf_mtx=confusion_matrix(y_test_binary,predictions_binary)
        print(cf_mtx)
        plt.matshow(cf_mtx)
        plt.title('confusion matrix(spam=1)')
        plt.colorbar()
        plt.ylabel('actual type')
        plt.xlabel('predict type')
        plt.show()
      7. Compute the model's accuracy (accuracy_score), precision (precision_score), recall (recall_score), and F1 measure (f1_score) on the test set
        from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
        print('accuracy=\t',accuracy_score(y_test_binary,predictions_binary))
        print('precision_score=\t',precision_score(y_test_binary,predictions_binary))
        print('recall_score=\t',recall_score(y_test_binary,predictions_binary))
        print('f1_score=\t',f1_score(y_test_binary,predictions_binary))
        '''
        accuracy=        0.969849246231
        precision_score=         0.984375
        recall_score=    0.759036144578
        f1_score=        0.857142857143
        '''
      8. Compute accuracy, precision, recall, and F1 under cross-validation with cross_val_score
        
        #restart kernel ctrl+.
        
        import numpy as np
        import pandas as pd
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split,cross_val_score
        df=pd.read_csv('Desktop/data/sms.csv')
        X_train_raw,X_test_raw,y_train,y_test=train_test_split(df['message'],df['label'])
        vectorizer=TfidfVectorizer()
        X_train=vectorizer.fit_transform(X_train_raw)
        X_test=vectorizer.transform(X_test_raw)
        classifier=LogisticRegression()
        classifier.fit(X_train,y_train)
        # default scoring for a classifier is accuracy; the 'precision', 'recall'
        # and 'f1' scorers treat label 1 as the positive class
        accuracies=cross_val_score(classifier, X_train, y_train, cv=5)
        precisions=cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
        recalls=cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
        f1s=cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
        print('accuracy=\t',np.mean(accuracies),accuracies)
        print('precisions=\t',np.mean(precisions),precisions)
        print('recall=\t',np.mean(recalls),recalls)
        print('f1=\t',np.mean(f1s),f1s)
        
        '''
        accuracy=        0.955502340314 [ 0.96535245  0.94384707  0.95933014  0.9497006   0.95928144]
        precisions=      0.989871815161 [ 0.97752809  0.97183099  1.          1.          1.        ]
        recall=  0.67899394504 [ 0.76315789  0.60526316  0.69911504  0.62831858  0.69911504]
        f1=      0.804132253371 [ 0.85714286  0.74594595  0.82291667  0.77173913  0.82291667]
        '''
      9. Compute the ROC curve and AUC, and plot them
        import matplotlib.pyplot as plt
        from sklearn.metrics import roc_curve,auc
        # Jupyter magic: show plots inline in the notebook
        %matplotlib inline
        
        # predict_proba returns one column per class; column 1 is the positive class
        predictions=classifier.predict_proba(X_test)
        false_positive_rate,recall,thresholds=roc_curve(y_test,predictions[:,1])
        roc_auc=auc(false_positive_rate,recall)
        plt.title('Receiver Operating Characteristic')
        plt.plot(false_positive_rate,recall,'r',label='AUC=%0.2f' %roc_auc)
        plt.legend(loc='lower right')
        # dashed diagonal: the ROC curve of random guessing
        plt.plot([0,1],[0,1],'k--')
        plt.xlim([0.0,1.0])
        plt.ylim([0.0,1.0])
        plt.ylabel('Recall')
        plt.xlabel('Fall-out')
        plt.show()
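To make clearer what the ROC curve and AUC in the step above measure, here is a minimal NumPy sketch that builds the curve by sweeping the decision threshold and integrates it with the trapezoidal rule. It is an illustration of the idea, not scikit-learn's implementation. The function names and the four sample scores are assumptions chosen for the example.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fall_out, recall) arrays, one point per threshold, starting at (0, 0)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    fall_out, recall = [0.0], [0.0]
    # Sweep thresholds from the highest score down; predict positive when score >= threshold.
    for threshold in np.unique(scores)[::-1]:
        predicted_pos = scores >= threshold
        recall.append((predicted_pos & (y_true == 1)).sum() / n_pos)
        fall_out.append((predicted_pos & (y_true == 0)).sum() / n_neg)
    return np.array(fall_out), np.array(recall)

def area_under_curve(x, y):
    """Trapezoidal area under the points (x, y), with x in non-decreasing order."""
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

# Illustrative labels (1 = positive) and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fall_out, recall = roc_points(y_true, scores)
roc_auc = area_under_curve(fall_out, recall)
print(fall_out, recall, roc_auc)
```

Lowering the threshold can only add predicted positives, so both recall and fall-out move from 0 toward 1; a good classifier gains recall faster than fall-out, which pushes the curve above the diagonal.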
    2. Use GridSearchCV() to find the optimal hyperparameters
      1. Load the data and split it into a training set and a test set
        
        #RESTART KERNEL
        
        import pandas as pd
        from sklearn.model_selection import train_test_split
        df=pd.read_csv('Desktop/data/sms.csv')
        X,y=df['message'],df['label']
        X_train,X_test,y_train,y_test=train_test_split(X,y)
      2. Set up the GridSearchCV parameters and search for the best combination
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import GridSearchCV
        from sklearn.pipeline import Pipeline
        # solver='liblinear' supports both the 'l1' and 'l2' penalties searched below
        pipeline = Pipeline([ ('vect', TfidfVectorizer(stop_words='english')), ('clf', LogisticRegression(solver='liblinear'))])
        # keys use the form <pipeline step>__<parameter name>
        parameters = {'vect__max_df': (0.25, 0.5, 0.75), 'vect__stop_words': ('english', None),
                      'vect__max_features': (2500, 5000, 10000, None), 'vect__ngram_range': ((1, 1), (1, 2)),
                      'vect__use_idf': (True, False), 'vect__norm': ('l1', 'l2'),
                      'clf__penalty': ('l1', 'l2'), 'clf__C': (0.01, 0.1, 1, 10)}
        grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
        grid_search.fit(X_train, y_train)
        '''output:
        Fitting 3 folds for each of 1536 candidates, totalling 4608 fits
        [Parallel(n_jobs=-1)]: Done  52 tasks      | elapsed:    1.7s
        [Parallel(n_jobs=-1)]: Done 352 tasks      | elapsed:   11.2s
        [Parallel(n_jobs=-1)]: Done 852 tasks      | elapsed:   32.1s
        [Parallel(n_jobs=-1)]: Done 1552 tasks      | elapsed:  1.0min
        [Parallel(n_jobs=-1)]: Done 2156 tasks      | elapsed:  1.5min
        [Parallel(n_jobs=-1)]: Done 2706 tasks      | elapsed:  1.9min
        [Parallel(n_jobs=-1)]: Done 3356 tasks      | elapsed:  2.4min
        [Parallel(n_jobs=-1)]: Done 4106 tasks      | elapsed:  4.1min
        [Parallel(n_jobs=-1)]: Done 4608 out of 4608 | elapsed:  4.6min finished
        '''
      3. Print the best parameter combination, then the accuracy, precision, and recall on the test set
        from sklearn.metrics import accuracy_score,precision_score,recall_score
        print('best-accuracy=\t',grid_search.best_score_)
        print('best paras combination')
        best_paras=grid_search.best_estimator_.get_params()
        for para_name in sorted(parameters.keys()):
            print('\t%s=\t%r'%(para_name,best_paras[para_name]))
        # GridSearchCV predicts with the best estimator refitted on the whole training set
        predictions=grid_search.predict(X_test)
        print('accuracy=',accuracy_score(y_test,predictions))
        print('precision=',precision_score(y_test,predictions))
        print('recall=',recall_score(y_test,predictions))
        '''
        best-accuracy=   0.984210526316
        best paras combination
            clf__C= 10
            clf__penalty=   'l2'
            vect__max_df=   0.25
            vect__max_features=     None
            vect__ngram_range=      (1, 2)
            vect__norm=     'l2'
            vect__stop_words=       None
            vect__use_idf=  True
        accuracy= 0.982065997131
        precision= 0.977272727273
        recall= 0.891191709845
        '''
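The grid search log above reports 1536 candidates and 4608 fits. The sketch below shows where those numbers come from by enumerating the same parameter grid with `itertools.product`. It only counts combinations and does not train anything. The variable names `candidates` and `cv` are chosen here for illustration.

```python
from itertools import product

# The same grid as in the GridSearchCV step above.
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(parameters, values)) for values in product(*parameters.values())]

cv = 3  # each candidate is fitted once per cross-validation fold
print(len(candidates), len(candidates) * cv)  # 1536 4608
```

Because the candidate count is the product of the number of values per parameter, adding one more two-valued parameter would double the work, which is why exhaustive search becomes expensive quickly.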
    3. Multi-class classification
      1. Load and preview the data
        import pandas as pd
        df=pd.read_csv('Desktop/Sentiment Analysis on Movie Reviews/train.tsv',header=0,delimiter='\t')
        df.head()
        df.count()
        df.Phrase.head()
        df.Sentiment.describe()
        df.Sentiment.value_counts()
        df.Sentiment.value_counts()/df.Sentiment.count()
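      2. Split the data into a training set and a test set
        from sklearn.model_selection import train_test_split
        # as_matrix() was removed in pandas 1.0; to_numpy() returns the same array
        X,y=df['Phrase'],df['Sentiment'].to_numpy()
        X_train,X_test,y_train,y_test=train_test_split(X,y,train_size=0.5)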
      3. Set up the grid search parameters and run the search
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import Pipeline
        from sklearn.model_selection import GridSearchCV
        pipeline = Pipeline([('vect', TfidfVectorizer(stop_words='english')),('clf', LogisticRegression())])
        paras={'vect__max_df':(0.25,0.5),'vect__ngram_range':((1,1),(1,2)),'vect__use_idf':(True,False),'clf__C':(0.1,1,10)}
        grid_search=GridSearchCV(pipeline,paras,n_jobs=3,verbose=1,scoring='accuracy')
        grid_search.fit(X_train,y_train)
      4. Print the best parameter combination
        print('best accuracy:=',grid_search.best_score_)
        best_paras = grid_search.best_estimator_.get_params()
        print('best paras:')
        for para_name in sorted(paras.keys()):
            print('\t%s: %r' %(para_name,best_paras[para_name]))
        
        '''
        best accuracy:= 0.619735998975
        best paras:
                clf__C: 10
                vect__max_df: 0.25
                vect__ngram_range: (1, 2)
                vect__use_idf: False
        '''
      5. Print the multi-class evaluation results
        from sklearn.metrics import classification_report, accuracy_score,confusion_matrix
        predictions=grid_search.predict(X_test)
        print('accuracy=',accuracy_score(y_test,predictions))
        print('confusion matrix=',confusion_matrix(y_test,predictions))
        print('report=',classification_report(y_test,predictions))
        '''
        accuracy= 0.635101883891
        confusion matrix= [[ 1165  1680   682    67    10]
        [  894  5990  6175   557    39]
        [  199  3219 32596  3611   162]
        [   25   424  6562  8117  1248]
        [    2    33   502  2382  1689]]
        report=              precision    recall  f1-score   support
               0       0.51      0.32      0.40      3604
               1       0.53      0.44      0.48     13655
               2       0.70      0.82      0.76     39787
               3       0.55      0.50      0.52     16376
               4       0.54      0.37      0.44      4608
        avg / total       0.62      0.64      0.62     78030
        '''
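The per-class precision and recall in the report above can be recovered from the confusion matrix alone, which helps in reading both outputs together. The sketch below copies the matrix from the printed output and recomputes the figures with NumPy. The variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted).
cm = np.array([
    [1165, 1680,   682,   67,   10],
    [ 894, 5990,  6175,  557,   39],
    [ 199, 3219, 32596, 3611,  162],
    [  25,  424,  6562, 8117, 1248],
    [   2,   33,   502, 2382, 1689],
])

true_positives = np.diag(cm)                 # correctly classified samples per class
recall = true_positives / cm.sum(axis=1)     # divide by actual count per class (row sums)
precision = true_positives / cm.sum(axis=0)  # divide by predicted count per class (column sums)
accuracy = true_positives.sum() / cm.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(round(float(accuracy), 4))
```

The row sums reproduce the `support` column of the report, and the large off-diagonal entries next to the diagonal show that most errors are confusions between neighboring sentiment classes.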
