This article is a contributed post from a featured author of 数盟 (Dataunion). Reprints are welcome; please credit the source "数盟社区" and the author.
About the author: Duan Shishi (段石石) is a precision-recommendation algorithm engineer at Yihaodian (1号店), mainly responsible for building Yihaodian's user profiles. He enjoys digging into Machine Learning tricks, is interested in Deep Learning, and likes playing on Kaggle and following 9神. If you're interested in data and Machine Learning, feel free to get in touch. Personal blog: hacker.duanshishi.com
Building Classifiers with Spark
In this chapter we will look at the basic classifiers and how to use them in Spark, along with a workflow for evaluating and tuning models. MLlib is fairly strong here, but compared with sklearn it still lags well behind in both the variety of algorithms and the supporting tooling. That said, Spark is reportedly reworking its ml package along the lines of sklearn's pipeline framework, chaining all operations on the data into a single pipeline, which will make model selection, tuning, and evaluation much more convenient — just like sklearn. Below is some code from a Kaggle competition that uses a single Pipeline to gather every operation on the data flow, which makes tuning very convenient.
```python
clf = pipeline.Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('cst', cust_regression_vals()),
            ('txt1', pipeline.Pipeline([('s1', cust_txt_col(key='search_term')),
                                        ('tfidf1', tfidf)])),
            ('txt2', pipeline.Pipeline([('s2', cust_txt_col(key='product_title')),
                                        ('tfidf2', tfidf), ('tsvd2', tsvd)])),
            ('txt3', pipeline.Pipeline([('s3', cust_txt_col(key='product_description')),
                                        ('tfidf3', tfidf), ('tsvd3', tsvd)])),
            ('txt4', pipeline.Pipeline([('s4', cust_txt_col(key='brand')),
                                        ('tfidf4', tfidf), ('tsvd4', tsvd)]))
        ],
        transformer_weights={'cst': 1.0, 'txt1': 0.5, 'txt2': 0.25, 'txt3': 0.0, 'txt4': 0.5},
        n_jobs=1
    )),
    ('xgbr', xgbr)
])
```
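For comparison, the new pyspark.ml API follows the same pattern. A minimal sketch (assuming a DataFrame `df` with `text` and `label` columns — the stage names and parameter values here are illustrative, not taken from the code above):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

# each stage transforms the DataFrame and feeds the next one
tokenizer = Tokenizer(inputCol='text', outputCol='words')
hashingTF = HashingTF(inputCol='words', outputCol='features')
lr = LogisticRegression(maxIter=10, regParam=0.01)

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(df)  # fit the whole chain in one call
```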
Below we'll walk through the classifier module of Spark MLlib in the following parts:
- Get to know the basic classifier algorithms MLlib supports
- Extract features from raw data with Spark
- Train several representative models with MLlib
- Use the trained models to make predictions on the data
- Evaluate the classifier models with standard evaluation metrics
- Use data-processing techniques to improve model performance
- Explore hyperparameter tuning in Spark MLlib, and the use of cross-validation to pick the best parameters
Classifier algorithms supported by MLlib
Linear models
Linear models, as the name suggests, separate the data with a line (more generally, a hyperplane) in the feature space and classify points according to which side they fall on. The basic model is

$$y = f(w^T x)$$

where y is the target variable, w is the model's weight vector, and x is the input feature vector. Varying f changes the model.
Once f is fixed, it comes with a corresponding cost function, and we then search the weight parameter space for the set of weights that minimizes it. Common cost functions include the logistic loss (logistic regression), the hinge loss (linear support vector machines), and the familiar zero-one loss:

$$\text{logistic loss: } L(w; x, y) = \log(1 + e^{-y\, w^T x}), \qquad \text{hinge loss: } L(w; x, y) = \max(0,\, 1 - y\, w^T x)$$

with the zero-one loss simply counting each misclassified point as 1.
Logistic regression
In logistic regression, f is the sigmoid function:

$$f(z) = \frac{1}{1 + e^{-z}}$$
Linear Support Vector Machines
In linear support vector machines, f is the identity function — the model just outputs $w^T x$ itself:

$$f(w^T x) = w^T x$$

In linear support vector machines, the cost function we use is the hinge loss:

$$L(w; x, y) = \max(0,\, 1 - y\, w^T x)$$

A sketch of the decision boundaries found by logistic regression and support vector machines:
Naive Bayes Model
Naive Bayes requires the features to be conditionally independent of each other given the class; it is a classification method that gets a great deal of practical use.
Because the class-conditional probabilities of the features are treated as independent, the posterior factorizes as $P(C \mid x) \propto P(C) \prod_i P(x_i \mid C)$; we evaluate this for every class and assign the class C with the largest probability.
The decision regions of a simple binary classifier:
Decision trees
The basic idea of a decision tree is to use some metric to pick the most informative attribute as a node on which to split the data, and then to keep splitting recursively. Decision trees are a very popular algorithm, but also one that overfits easily; to curb overfitting there are more advanced ensemble variants such as Random Forests and GBDT, which improve the accuracy and robustness of a plain decision tree — see the sketch after the figure below.
A simple decision tree:
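As a taste of the ensemble variants just mentioned, MLlib ships a random forest trainer in the same pyspark.mllib.tree package. A minimal sketch, reusing the `data` RDD and `numClass` count built later in this post (the tree count and depth are illustrative, not tuned):

```python
from pyspark.mllib.tree import RandomForest

# 10 trees, each split drawn from a random feature subset; averaging the
# trees reduces the variance (overfitting) of any single deep tree
rfModel = RandomForest.trainClassifier(data, numClass, {}, numTrees=10,
                                       featureSubsetStrategy='auto',
                                       impurity='entropy', maxDepth=5)
print rfModel.predict(data.first().features)
```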
Extracting suitable features from the raw data
For supervised learning, MLlib provides the LabeledPoint data type:
```scala
case class LabeledPoint(label: Double, features: Vector)
```
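On the PySpark side the same type lives in pyspark.mllib.regression; constructing one by hand looks like this (a toy example, not taken from the dataset):

```python
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors

# label 1.0 paired with a three-dimensional dense feature vector
lp = LabeledPoint(1.0, Vectors.dense([0.5, 2.0, 0.0]))
print lp.label, lp.features
```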
Extracting features from the Kaggle StumbleUpon Evergreen dataset
```python
# strip the header row from train.tsv
!sed 1d ../data/evergreen_classification/train.tsv > ../data/evergreen_classification/train_noheader.tsv
# read the data in, splitting on tabs
rawData = sc.textFile('../data/evergreen_classification/train_noheader.tsv')
records = rawData.map(lambda x: x.split('\t'))
records.take(4)
```
The data looks like this:
Take the useful columns and do a first cleaning pass (replacing ? with 0.0):
```python
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors

trimmed = records.map(lambda x: [xx.replace('\\', ' ') for xx in x])
label = trimmed.map(lambda x: x[-1])
# missing values are stored in the file as the quoted string "?"
data = trimmed.map(lambda x: (x[-1], x[4:-1])) \
              .map(lambda (x, y): (x.replace("\"", ""),
                                   [0.0 if yy == '\"?\"' else yy.replace("\"", "") for yy in y])) \
              .map(lambda (x, y): (int(x), [float(yy) for yy in y])) \
              .map(lambda (x, y): LabeledPoint(x, Vectors.dense(y)))
data.take(5)
```
One small detail here is that the file stores "123" (with quotes) rather than 123, which the processing has to account for. The code above is a bit rough — take it as-is for now; in later, similar processing we strip those quotes first. The Scala version of this code didn't hit the problem (I'm not sure why), but it's a minor point: just watch out for it. With that we have LabeledPoint, the data structure the classifiers below consume. Simple, isn't it?
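To make the cleaning explicit, the per-cell logic above boils down to a small helper like this (clean_cell is a hypothetical name, not from the original code):

```python
def clean_cell(raw):
    # strip the surrounding quotes, map '?' to 0.0, otherwise parse as a float
    val = raw.replace('\"', '')
    return 0.0 if val == '?' else float(val)
```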
Next we prepare nbData for the Naive Bayes model later on, because NB does not allow negative feature values. That's easy to understand — probabilities can't be negative, right? But some of the data is negative. Without digging into what those values mean, we follow the book and simply clamp negatives to 0.0; in practice you would want to understand the data first, and perhaps turn the raw values into probability or frequency statistics before a Naive Bayes algorithm really applies.
```python
# Naive Bayes requires the features to be non-negative
nbdata = trimmed.map(lambda x: (x[-1], x[4:-1])) \
                .map(lambda (x, y): (int(x.replace("\"", "")),
                                     [0.0 if yy == '\"?\"' else float(yy.replace("\"", "")) for yy in y])) \
                .map(lambda (x, y): (x, [0.0 if yy < 0 else yy for yy in y])) \
                .map(lambda (x, y): LabeledPoint(x, Vectors.dense(y)))
print nbdata.take(5)
```
Training the models
In this part we simply call the classifier interfaces in Spark MLlib and train the corresponding LR, SVM, NB, and DT models.
```python
# training a classifier using logistic regression, SVM, naive Bayes, and a decision tree
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.tree import DecisionTree

numIteration = 10
maxTreeDepth = 5
numClass = label.distinct().count()
print numClass

lrModel = LogisticRegressionWithSGD.train(data, numIteration)
svmModel = SVMWithSGD.train(data, numIteration)
nbModel = NaiveBayes.train(nbdata)
dtModel = DecisionTree.trainClassifier(data, numClass, {}, impurity='entropy',
                                       maxDepth=maxTreeDepth)
print lrModel
print dtModel
```
Making predictions with the models
Just call predict on the data to get predictions — it's that simple. Straight to the code:
```python
# using these models on a single data point
dataPoint = data.first()
prediction = lrModel.predict(dataPoint.features)
trueLabel = dataPoint.label
print 'The true label is %s, and the predicted label is %s' % (trueLabel, prediction)
```
Evaluating the models
Accuracy and Prediction Error
```python
# evaluating the classifiers: accuracy over the training set
lrTotalCorrect = data.map(lambda lp: 1 if (lrModel.predict(lp.features) == lp.label) else 0).sum()
svmTotalCorrect = data.map(lambda lp: 1 if (svmModel.predict(lp.features) == lp.label) else 0).sum()
nbTotalCorrect = nbdata.map(lambda lp: 1 if (nbModel.predict(lp.features) == lp.label) else 0).sum()
# DecisionTreeModel.predict cannot be called inside an RDD transformation in
# PySpark (the model lives on the JVM side), so predict on an RDD of feature
# vectors and compare against the labels locally instead
predictList = dtModel.predict(data.map(lambda lp: lp.features)).collect()
trueLabel = data.map(lambda lp: lp.label).collect()
dtTotalCorrect = sum([1.0 if predictVal == trueLabel[i] else 0.0
                      for i, predictVal in enumerate(predictList)])

lrAccuracy = lrTotalCorrect / (data.count() * 1.0)
svmAccuracy = svmTotalCorrect / (data.count() * 1.0)
nbAccuracy = nbTotalCorrect / (1.0 * nbdata.count())
dtAccuracy = dtTotalCorrect / (1.0 * data.count())
print '------------data count: %s------------' % data.count()
print '------------lr Model Accuracy: %s------------' % lrAccuracy
print '------------svm Model Accuracy: %f------------' % svmAccuracy
print '------------nb Model Accuracy: %f------------' % nbAccuracy
print '------------dt Model Accuracy: %f------------' % dtAccuracy
print '-----------------------done-----------------------'
```
The models' accuracies:
Precision and Recall
Given that we already have accuracy, why add precision and recall? Evaluation metrics are an especially important part of machine learning; for details see my post on machine-learning model evaluation (机器学习模型评估). Note that precision and recall are discussed there as ranking metrics, but the principle is much the same: precision and recall have to be weighed together, and quoting precision on its own while ignoring recall is deeply unprofessional. The figure below gives a basic rundown of the various metrics in Spark:
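For reference, in terms of true positives (TP), false positives (FP), and false negatives (FN), the two metrics are defined as:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}$$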
The precision-recall curve of one classifier:
ROC curve and AUC
The ROC curve is similar to the PR curve; it shows the true positive rate achieved at a given false positive rate (I'll stick with the English terms here — the usual Chinese translations sound awkward). As an example, for a spam classifier the TPR is the number of messages correctly classified as spam divided by the total number of spam messages, while the FPR is the number of legitimate messages misjudged as spam divided by the total number of legitimate messages. Plot FPR on the x-axis and TPR on the y-axis, and you get the ROC curve.
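In the same notation (with TN for true negatives), the two rates plotted by the ROC curve are:

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$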
```python
# compute AUC (area under the ROC curve) and the area under the PR curve
from pyspark.mllib.evaluation import BinaryClassificationMetrics

all_models_metrics = []
for model in [lrModel, svmModel]:
    scoresAndLabels = data.map(lambda point: (model.predict(point.features), point.label)).collect()
    scoresAndLabels = [(float(i), j) for (i, j) in scoresAndLabels]
    scoresAndLabels_sc = sc.parallelize(scoresAndLabels)
    metrics = BinaryClassificationMetrics(scoresAndLabels_sc)
    all_models_metrics.append((model.__class__.__name__, metrics.areaUnderROC, metrics.areaUnderPR))
print all_models_metrics

for model in [nbModel]:
    # float(model.predict(point.features)) is important, otherwise we get:
    # 'DoubleType can not accept object in type <type 'numpy.float64'>'
    scoresAndLabels = nbdata.map(lambda point: (float(model.predict(point.features)), point.label)).collect()
    scoresAndLabels_sc = sc.parallelize(scoresAndLabels)
    nb_metrics = BinaryClassificationMetrics(scoresAndLabels_sc)
    all_models_metrics.append((model.__class__.__name__, nb_metrics.areaUnderROC, nb_metrics.areaUnderPR))
print all_models_metrics

for model in [dtModel]:
    # as before, the decision tree has to predict on an RDD of feature vectors
    predictList = dtModel.predict(data.map(lambda lp: lp.features)).collect()
    trueLabel = data.map(lambda lp: lp.label).collect()
    scoresAndLabels = [(predictList[i], true_val) for i, true_val in enumerate(trueLabel)]
    scoresAndLabels_sc = sc.parallelize(scoresAndLabels).map(lambda (x, y): (float(x), float(y)))
    dt_metrics = BinaryClassificationMetrics(scoresAndLabels_sc)
    all_models_metrics.append((model.__class__.__name__, dt_metrics.areaUnderROC, dt_metrics.areaUnderPR))
print all_models_metrics
```
Tuning the models and improving performance
Feature standardization
In machine learning, standardizing the features is particularly important work. What is standardization?! An example: Xiao Ming scored 82 in math and 90 in Chinese — can we conclude he did better in Chinese than in math? Obviously not; we have to know how the rest of the class scored before we can compare the two subjects. So why standardize? Here is a slide from Andrew Ng's course that illustrates it:
If the features are not on a common scale, you easily end up in the situation on the left, where the optimization path zigzags back and forth; on the right, by contrast, the search heads much more directly to the optimum, converging faster and, to some extent, improving the model's accuracy.
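Concretely, standardization rescales each feature $x_i$ by its mean $\mu_i$ and standard deviation $\sigma_i$, which is exactly what the StandardScaler below computes with withMean=True, withStd=True:

$$x_i' = \frac{x_i - \mu_i}{\sigma_i}$$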
```python
from pyspark.mllib.feature import StandardScalerModel, StandardScaler

labels = data.map(lambda lp: lp.label)
features = data.map(lambda lp: lp.features)
print features.take(5)
# fit a scaler that rescales each feature to zero mean and unit variance
scaler = StandardScaler(withMean=True, withStd=True).fit(features)
scaled_data = labels.zip(scaler.transform(features))
scaled_data = scaled_data.map(lambda (x, y): LabeledPoint(x, y))
print scaled_data.first().features
print data.first().features

# train an LR model on the standardized data
lrModelScaled = LogisticRegressionWithSGD.train(scaled_data, numIteration)
lrTotalCorrectScaled = scaled_data.map(
    lambda lp: 1 if (lrModelScaled.predict(lp.features) == lp.label) else 0).sum()
lrAccuracyScaled = lrTotalCorrectScaled / (1.0 * data.count())
print 'lrAccuracyScaled : %f' % lrAccuracyScaled

all_models_metrics = []
for model in [lrModelScaled]:
    scoresAndLabels = scaled_data.map(lambda point: (model.predict(point.features), point.label)).collect()
    scoresAndLabels = [(float(i), j) for (i, j) in scoresAndLabels]
    scoresAndLabels_sc = sc.parallelize(scoresAndLabels)
    metrics = BinaryClassificationMetrics(scoresAndLabels_sc)
    all_models_metrics.append((model.__class__.__name__, metrics.areaUnderROC, metrics.areaUnderPR))
print all_models_metrics
```
Accuracy: 0.620960
So, relative to the unstandardized data, the model shows a clear improvement in both accuracy and AUC. Why the area under the PR curve doesn't improve I'm not entirely sure — the book doesn't say either...
Adding more features
Here we encode column 4 of the raw data (a categorical variable) as K binary dummy variables (1-of-K encoding):
```python
categories = records.map(lambda x: x[3]).distinct().zipWithIndex().collect()
category_dict = {}
for (x, y) in [(key.replace('\"', ''), val) for (key, val) in categories]:
    category_dict[x] = y
num_categories = len(category_dict)

otherdata = trimmed.map(lambda x: (x[-1], x[4:-1])) \
                   .map(lambda (x, y): (x.replace("\"", ""),
                                        [0.0 if yy == '\"?\"' else yy.replace("\"", "") for yy in y])) \
                   .map(lambda (x, y): (int(x), [float(yy) for yy in y])) \
                   .map(lambda (x, y): LabeledPoint(x, Vectors.dense(y)))
otherdata.take(5)

def func1(x):
    # do the earlier cleaning in one place, then concatenate the 1-of-K
    # category features with the remaining numeric features
    import numpy as np
    label = x[-1].replace('\"', '')
    other_feature = [0.0 if yy == '?' else float(yy)
                     for yy in [y.replace('\"', '') for y in x[4:-1]]]
    category_Idx = category_dict[x[3].replace('\"', '')]
    category_feature = np.zeros(num_categories)
    category_feature[category_Idx] = 1
    return LabeledPoint(label, Vectors.dense(list(category_feature) + other_feature))

category_data = trimmed.map(lambda x: func1(x))
category_data.take(5)

# pull out the labels and features, then standardize the features
category_labels = category_data.map(lambda lp: lp.label)
category_features = category_data.map(lambda lp: lp.features)
scaler2 = StandardScaler(withMean=True, withStd=True).fit(category_features)
print category_features.take(5)
scaled_category_data = category_labels.zip(scaler2.transform(category_features))
scaled_category_data = scaled_category_data.map(lambda (x, y): LabeledPoint(x, y))
print scaled_category_data.take(5)

# fit on the data augmented with the category variable
lrModel_category_scaled = LogisticRegressionWithSGD.train(scaled_category_data, numIteration)
lr_totalCorrect_category_scaled = scaled_category_data.map(
    lambda lp: 1 if (lrModel_category_scaled.predict(lp.features) == lp.label) else 0).sum()
lr_accuracy_category_scaled = lr_totalCorrect_category_scaled / (1.0 * data.count())
print 'lrModel_category_scaled : %f' % lr_accuracy_category_scaled

all_models_metrics = []
for model in [lrModel_category_scaled]:
    scoresAndLabels = scaled_category_data.map(lambda point: (model.predict(point.features), point.label)).collect()
    scoresAndLabels = [(float(i), j) for (i, j) in scoresAndLabels]
    scoresAndLabels_sc = sc.parallelize(scoresAndLabels)
    metrics = BinaryClassificationMetrics(scoresAndLabels_sc)
    all_models_metrics.append((model.__class__.__name__, metrics.areaUnderROC, metrics.areaUnderPR))
print all_models_metrics
```
Accuracy: 0.665720
With the category variables added, classifier performance improves further: accuracy goes from 0.620960 to 0.665720 and AUC from 0.62 to 0.665, so adding these features clearly helps.
Hyperparameter tuning
Linear Models
Iterations
This is about finding the best parameter values. The code below is straightforward and sweeps the number of iterations, the step size, and the regularization parameter:
```python
def train_with_params(input, reg_param, num_iter, step_size):
    lr_model = LogisticRegressionWithSGD.train(input, iterations=num_iter,
                                               regParam=reg_param, step=step_size)
    return lr_model

def create_metrics(tag, data, model):
    # linear models predict fine inside an RDD map, so no collect() is needed here
    score_labels = data.map(lambda x: (model.predict(x.features) * 1.0, x.label * 1.0))
    metrics = BinaryClassificationMetrics(score_labels)
    return tag, metrics.areaUnderROC

# number of iterations
for i in [1, 5, 10, 50]:
    model = train_with_params(scaled_category_data, 0.0, i, 1.0)
    label, roc = create_metrics('%d iterations' % i, scaled_category_data, model)
    print '%s, AUC = %2.2f%%' % (label, roc * 100)

# step size
for s in [0.001, 0.01, 0.1, 1.0, 10.0]:
    model = train_with_params(scaled_category_data, 0.0, 10, s)
    label, roc = create_metrics('%f step size' % s, scaled_category_data, model)
    print '%s, AUC = %2.2f%%' % (label, roc * 100)

# regularization parameter
for r in [0.001, 0.01, 0.1, 1.0, 10.0]:
    model = train_with_params(scaled_category_data, r, 10, 1.0)
    label, roc = create_metrics('%f regularization parameter' % r, scaled_category_data, model)
    print '%s, AUC = %2.2f%%' % (label, roc * 100)
```
Decision trees
Depth and impurity
For decision trees, let's see how maxTreeDepth and the impurity measure affect the tree's final performance:
```python
def train_with_params_dt(input, impurity, maxTreeDepth):
    dt_model = DecisionTree.trainClassifier(input, numClass, {}, impurity,
                                            maxDepth=maxTreeDepth)
    return dt_model

def create_metrics_dt(tag, data, model):
    predictList = model.predict(data.map(lambda lp: lp.features)).collect()
    trueLabel = data.map(lambda lp: lp.label).collect()
    scoresAndLabels = [(predictList[i], true_val) for i, true_val in enumerate(trueLabel)]
    scoresAndLabels_sc = sc.parallelize(scoresAndLabels).map(lambda (x, y): (float(x), float(y)))
    dt_metrics = BinaryClassificationMetrics(scoresAndLabels_sc)
    return tag, dt_metrics.areaUnderROC

for dep in [1, 2, 3, 4, 5, 10, 20]:
    for im in ['entropy', 'gini']:
        # (the source post is truncated at this point; the loop body below
        # follows the same pattern as the linear-model sweeps above)
        model = train_with_params_dt(data, im, dep)
        tag, roc = create_metrics_dt('depth %d, impurity %s' % (dep, im), data, model)
        print '%s, AUC = %2.2f%%' % (tag, roc * 100)
```
|