Advanced Topics

Contents

Practical Model Techniques

Feature Enhancement

Feature Extraction

Feature Selection

Model Regularization

Underfitting and Overfitting

L1-Norm Regularization

L2-Norm Regularization

Model Validation

Leave-One-Out Validation

Cross-Validation

Hyperparameter Search

Grid Search

Parallel Search

Popular Libraries / Model Practice

Natural Language Processing Toolkit (NLTK)

Word Vectors (Word2Vec)

The XGBoost Model

The TensorFlow Framework


Practical Model Techniques

Feature Enhancement

[Figure 1]

Feature Extraction

[Figure 2]

DictVectorizer performs feature extraction and vectorization on data stored in Python dictionaries

#Define a list of dictionaries representing multiple data samples (each dictionary is one sample)
measurements = [{'city':'Dubai','temperature':33.},{'city':'London','temperature':12.},{'city':'San Fransisco','temperature':18.}]
#Import DictVectorizer from sklearn.feature_extraction
from sklearn.feature_extraction import DictVectorizer
#Initialize the DictVectorizer feature extractor
vec = DictVectorizer()
#Print the number of samples and the transformed feature matrix (sparse form, then dense form)
print(len(measurements))
print(vec.fit_transform(measurements))
print(vec.fit_transform(measurements).toarray())
#Print the meaning of each feature dimension
print(vec.get_feature_names())

Result:
3
  (0, 0)	1.0
  (0, 3)	33.0
  (1, 1)	1.0
  (1, 3)	12.0
  (2, 2)	1.0
  (2, 3)	18.0
[[ 1.  0.  0. 33.]
 [ 0.  1.  0. 12.]
 [ 0.  0.  1. 18.]]
['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
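As a small assumed variation on the code above (not part of the original example), DictVectorizer also accepts sparse=False, in which case fit_transform returns a dense NumPy array directly and the toarray() call becomes unnecessary:

#Initialize a DictVectorizer that returns dense arrays instead of a sparse matrix
vec_dense = DictVectorizer(sparse=False)
print(vec_dense.fit_transform(measurements))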

[Figure 3]

Naive Bayes classification performance test with text features quantized by CountVectorizer, without removing stop words

#Import the 20-newsgroups text data fetcher from sklearn.datasets
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

#Download the news samples from the internet; subset='all' fetches all of the roughly 20,000 documents into the variable news
news = fetch_20newsgroups(subset='all')
#train_test_split from sklearn.model_selection is used to split the dataset
#Split news.data: 25% of the documents are held out for testing, 75% are used for training
X_train,X_test,y_train,y_test = train_test_split(news.data,news.target,test_size=0.25,random_state=33)

#CountVectorizer is imported from sklearn.feature_extraction.text
#Initialize CountVectorizer with the default configuration (which does not remove English stop words) and assign it to count_vec
count_vec = CountVectorizer()

#Convert the raw training and test texts into feature vectors using only term-frequency counts
X_count_train = count_vec.fit_transform(X_train)
X_count_test = count_vec.transform(X_test)
#Import the multinomial naive Bayes classifier from sklearn.naive_bayes
#Initialize the classifier with the default configuration
mnb_count = MultinomialNB()
#Fit the naive Bayes classifier on the training samples vectorized by CountVectorizer (stop words not removed)
mnb_count.fit(X_count_train,y_train)

#Print the model accuracy on the test set
print("The accuracy of classifying 20newsgroups using Naive Bayes (CountVectorizer without filtering stopwords):",mnb_count.score(X_count_test,y_test))
#Store the predicted labels in the variable y_count_predict
y_count_predict = mnb_count.predict(X_count_test)
#classification_report is imported from sklearn.metrics
#Print more detailed classification performance metrics
print(classification_report(y_test,y_count_predict,target_names=news.target_names))

Result:
The accuracy of classifying 20newsgroups using Naive Bayes (CountVectorizer without filtering stopwords): 0.8397707979626485
                          precision    recall  f1-score   support

             alt.atheism       0.86      0.86      0.86       201
           comp.graphics       0.59      0.86      0.70       250
 comp.os.ms-windows.misc       0.89      0.10      0.17       248
comp.sys.ibm.pc.hardware       0.60      0.88      0.72       240
   comp.sys.mac.hardware       0.93      0.78      0.85       242
          comp.windows.x       0.82      0.84      0.83       263
            misc.forsale       0.91      0.70      0.79       257
               rec.autos       0.89      0.89      0.89       238
         rec.motorcycles       0.98      0.92      0.95       276
      rec.sport.baseball       0.98      0.91      0.95       251
        rec.sport.hockey       0.93      0.99      0.96       233
               sci.crypt       0.86      0.98      0.91       238
         sci.electronics       0.85      0.88      0.86       249
                 sci.med       0.92      0.94      0.93       245
               sci.space       0.89      0.96      0.92       221
  soc.religion.christian       0.78      0.96      0.86       232
      talk.politics.guns       0.88      0.96      0.92       251
   talk.politics.mideast       0.90      0.98      0.94       231
      talk.politics.misc       0.79      0.89      0.84       188
      talk.religion.misc       0.93      0.44      0.60       158

                accuracy                           0.84      4712
               macro avg       0.86      0.84      0.82      4712
            weighted avg       0.86      0.84      0.82      4712

[Figure 4]

Naive Bayes classification performance test with text features quantized by TfidfVectorizer, without removing stop words

#Import the 20-newsgroups text data fetcher from sklearn.datasets
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

#Download the news samples from the internet; subset='all' fetches all of the roughly 20,000 documents into the variable news
news = fetch_20newsgroups(subset='all')
#train_test_split from sklearn.model_selection is used to split the dataset
#Split news.data: 25% of the documents are held out for testing, 75% are used for training
X_train,X_test,y_train,y_test = train_test_split(news.data,news.target,test_size=0.25,random_state=33)

#TfidfVectorizer is imported from sklearn.feature_extraction.text
#Initialize TfidfVectorizer with the default configuration (which does not remove English stop words) and assign it to tfidf_vec
tfidf_vec = TfidfVectorizer()

#Convert the raw training and test texts into TF-IDF feature vectors
X_tfidf_train = tfidf_vec.fit_transform(X_train)
X_tfidf_test = tfidf_vec.transform(X_test)
#Import the multinomial naive Bayes classifier from sklearn.naive_bayes
#Initialize the classifier with the default configuration
mnb_tfidf = MultinomialNB()
#Fit the naive Bayes classifier on the training samples vectorized by TfidfVectorizer (stop words not removed)
mnb_tfidf.fit(X_tfidf_train,y_train)

#Print the model accuracy on the test set
print("The accuracy of classifying 20newsgroups using Naive Bayes (TfidfVectorizer without filtering stopwords):",mnb_tfidf.score(X_tfidf_test,y_test))
#Store the predicted labels in the variable y_tfidf_predict
y_tfidf_predict = mnb_tfidf.predict(X_tfidf_test)
#classification_report is imported from sklearn.metrics
#Print more detailed classification performance metrics
print(classification_report(y_test,y_tfidf_predict,target_names=news.target_names))

Result:
The accuracy of classifying 20newsgroups using Naive Bayes (TfidfVectorizer without filtering stopwords): 0.8463497453310697
                          precision    recall  f1-score   support

             alt.atheism       0.84      0.67      0.75       201
           comp.graphics       0.85      0.74      0.79       250
 comp.os.ms-windows.misc       0.82      0.85      0.83       248
comp.sys.ibm.pc.hardware       0.76      0.88      0.82       240
   comp.sys.mac.hardware       0.94      0.84      0.89       242
          comp.windows.x       0.96      0.84      0.89       263
            misc.forsale       0.93      0.69      0.79       257
               rec.autos       0.84      0.92      0.88       238
         rec.motorcycles       0.98      0.92      0.95       276
      rec.sport.baseball       0.96      0.91      0.94       251
        rec.sport.hockey       0.88      0.99      0.93       233
               sci.crypt       0.73      0.98      0.83       238
         sci.electronics       0.91      0.83      0.87       249
                 sci.med       0.97      0.92      0.95       245
               sci.space       0.89      0.96      0.93       221
  soc.religion.christian       0.51      0.97      0.67       232
      talk.politics.guns       0.83      0.96      0.89       251
   talk.politics.mideast       0.92      0.97      0.95       231
      talk.politics.misc       0.98      0.62      0.76       188
      talk.religion.misc       0.93      0.16      0.28       158

                accuracy                           0.85      4712
               macro avg       0.87      0.83      0.83      4712
            weighted avg       0.87      0.85      0.84      4712

[Figure 5]
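A natural follow-up (sketched here as an assumption; it is not run in this article and no results are reproduced) is to repeat both experiments with scikit-learn's built-in English stop-word list removed, by passing stop_words='english' to either vectorizer:

#Vectorizers configured to filter English stop words (illustrative sketch only)
count_filter_vec = CountVectorizer(analyzer='word', stop_words='english')
tfidf_filter_vec = TfidfVectorizer(analyzer='word', stop_words='english')
X_count_filter_train = count_filter_vec.fit_transform(X_train)
X_tfidf_filter_train = tfidf_filter_vec.fit_transform(X_train)
#The same MultinomialNB training and evaluation as above can then be rerun on these features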

Feature Selection

[Figure 6]

#Import pandas and rename it pd
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn import feature_selection
from sklearn.model_selection import cross_val_score
import numpy as np
import pylab as pl

#Read the Titanic data (originally fetched from the internet; a local copy is used here)
#titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')
titanic = pd.read_csv('train.csv')
#Separate the data features from the prediction target
y = titanic['Survived']
X = titanic.drop(['Name','Survived'],axis=1)

#Fill in the missing data
X['Age'].fillna(X['Age'].mean(),inplace = True)
X.fillna('UNKNOWN',inplace=True)

#Split the data, again holding out 25% for testing
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=33)

#Vectorize the categorical features
vec = DictVectorizer()
X_train = vec.fit_transform(X_train.to_dict(orient = 'records'))
X_test = vec.transform(X_test.to_dict(orient = 'records'))

#Print the dimensionality of the processed feature vectors
print("Dimensionality after processing:",len(vec.feature_names_))

#Use a decision tree model with all features for prediction and evaluate its performance
dt = DecisionTreeClassifier(criterion='entropy')
dt.fit(X_train,y_train)
print("Prediction with all features:",dt.score(X_test,y_test))

#Import the feature selection module from sklearn
#Select the top 20% of features, train a decision tree with the same configuration, and evaluate its performance
fs = feature_selection.SelectPercentile(feature_selection.chi2,percentile=20)
X_train_fs = fs.fit_transform(X_train,y_train)
dt.fit(X_train_fs,y_train)
X_test_fs = fs.transform(X_test)
print("选择前20%的特征进行预测:",dt.score(X_test_fs,y_test))

#Use cross-validation to screen features at fixed percentile intervals and plot performance against the selection percentile
percentiles = range(1,100,2)
results = []

for i in percentiles:
    fs = feature_selection.SelectPercentile(feature_selection.chi2,percentile=i)
    X_train_fs = fs.fit_transform(X_train,y_train)
    #dt is the estimator to cross-validate; pass the training samples, the labels, and cv, the number of folds (or an iterable of splits)
    scores = cross_val_score(dt,X_train_fs,y_train,cv=5)
    #results is the array being appended to; scores.mean() is the value appended
    results = np.append(results,scores.mean())
print("选择 从前百分之 1到99 间隔为2的特征分别进行预测")
print(results)

pl.plot(percentiles,results)
pl.xlabel("percentiles of features")
pl.ylabel('accuracy')
pl.show()

#Using the best-performing feature percentile, evaluate a model with the same configuration on the test set
fs = feature_selection.SelectPercentile(feature_selection.chi2,percentile=60)
X_train_fs = fs.fit_transform(X_train,y_train)
dt.fit(X_train_fs,y_train)
X_test_fs = fs.transform(X_test)
print(dt.score(X_test_fs,y_test))

Result:
Dimensionality after processing: 684
Prediction with all features: 0.820627802690583
Prediction with the top 20% of features: 0.8385650224215246
Prediction using the top 1% to 99% of features, in steps of 2%:
[0.74998317 0.74250926 0.74849063 0.74851307 0.74548311 0.76349456
 0.75897206 0.7574683  0.75597576 0.76196835 0.75898328 0.78443497
 0.76045337 0.7574683  0.75594209 0.77692739 0.75897206 0.7739311
 0.76794973 0.77993491 0.76645719 0.76948715 0.76498709 0.7769835
 0.7754573  0.76647963 0.77843115 0.77248345 0.78594995 0.78743126
 0.78146112 0.77847604 0.77696106 0.77846482 0.77841993 0.77847604
 0.78446863 0.7754573  0.77693862 0.77243856 0.78293121 0.77544608
 0.77840871 0.77550219 0.77999102 0.77399843 0.78296487 0.7784536
 0.7754573  0.78294243]
0.8116591928251121
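As a hedged follow-up to the sweep above (the names percentiles and results come from the script; this snippet itself is an addition), the best-performing percentile can also be read off programmatically instead of by eye:

#Locate the percentile with the highest mean cross-validation accuracy
import numpy as np
opt = np.argmax(results)
print("Optimal percentile of features:", list(percentiles)[opt])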

[Figure 7]

[Figure 8]

Model Regularization

Underfitting and Overfitting

[Figure 9]

from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pylab as plt

#Training sample features and target values, stored in X_train and y_train
X_train = [[6],[8],[10],[14],[18]]
y_train = [[7],[9],[13],[17.5],[18]]

#LinearRegression is imported from sklearn.linear_model
#Initialize the linear regression model with the default configuration
regressor = LinearRegression()
#Train the model using the pizza diameter alone as the feature
regressor.fit(X_train,y_train)
#Training complete

#numpy has been imported above and renamed np
#Sample 100 evenly spaced data points on the x-axis from 0 to 26
xx = np.linspace(0,26,100)
xx = xx.reshape(xx.shape[0],1) #reshape the array into a column vector
#Predict the regression line at these 100 points
yy = regressor.predict(xx)

#Plot the predicted regression line
#Scatter-plot the training data points
plt.scatter(X_train,y_train)
#Draw the fitted regression line on the same figure
plt1, = plt.plot(xx,yy,label ='Degree=1')

plt.axis([0,25,0,26]) #axis limits: x from 0 to 25, y from 0 to 26
plt.xlabel('Diameter of Pizza')
plt.ylabel('Price of Pizza')
plt.legend(handles = [plt1])
plt.show()

#Print the R-squared value of the linear regression model on the training samples
print("The R-squared value of Linear Regressor performing on the training data is",regressor.score(X_train,y_train))

Result:
The R-squared value of Linear Regressor performing on the training data is 0.9100015964240102

[Figure 10]

[Figure 11]

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

#Training sample features and target values, stored in X_train and y_train
X_train = [[6],[8],[10],[14],[18]]
y_train = [[7],[9],[13],[17.5],[18]]

#Initialize a regression model based on a linear regressor; although the feature dimensionality is raised, the underlying model is still linear
regressor_poly2 = LinearRegression()

#LinearRegression is imported from sklearn.linear_model
#Initialize the linear regression model with the default configuration
regressor = LinearRegression()

#The polynomial feature generator is imported from sklearn.preprocessing
#Use PolynomialFeatures(degree=2) to map out degree-2 polynomial features and store them in X_train_poly2
poly2 = PolynomialFeatures(degree=2)
X_train_poly2 = poly2.fit_transform(X_train)


#Train the degree-2 polynomial regression model
regressor_poly2.fit(X_train_poly2,y_train)
#Train the plain model using the pizza diameter alone as the feature
regressor.fit(X_train,y_train)

#Re-map the x-axis sample points used for plotting
xx = np.linspace(0,26,100)
xx = xx.reshape(xx.shape[0],1)
xx_poly2 = poly2.transform(xx)

#Predict the regression line at these 100 points
yy = regressor.predict(xx)
#Use the degree-2 polynomial regression model to predict at the same x-axis sample points
yy_poly2 = regressor_poly2.predict(xx_poly2)

#Plot the training data points, the linear regression line, and the degree-2 polynomial regression curve
plt.scatter(X_train,y_train)

plt1, = plt.plot(xx,yy,label = 'Degree=1')
plt2, = plt.plot(xx,yy_poly2,label = 'Degree=2')

plt.axis([0,25,0,25])
plt.xlabel("Diameter of Pizza")
plt.ylabel("Price of Pizza")
plt.legend(handles = [plt1,plt2])
plt.show()

#Print the R-squared value of the linear regression model on the training samples
print("The R-squared value of Linear Regressor performing on the training data is",regressor.score(X_train,y_train))
#Print the R-squared value of the degree-2 polynomial regression model on the training samples
print("The R-squared value of Polynomial Regressor (Degree=2) performing on the training data is",regressor_poly2.score(X_train_poly2,y_train))

Result:
The R-squared value of Linear Regressor performing on the training data is 0.9100015964240102
The R-squared value of Polynomial Regressor (Degree=2) performing on the training data is 0.9816421639597427

[Figure 12]

Next, raise the feature dimensionality further

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

#Training sample features and target values, stored in X_train and y_train
X_train = [[6],[8],[10],[14],[18]]
y_train = [[7],[9],[13],[17.5],[18]]

#Re-map the x-axis sample points used for plotting
xx = np.linspace(0,26,100)
xx = xx.reshape(xx.shape[0],1)

degrees = range(2,7)

plt.scatter(X_train,y_train)
plt.axis([0,25,0,25])
plt.xlabel("Diameter of Pizza")
plt.ylabel("Price of Pizza")

pltInfo=[]
for i in degrees:
    #Label used for this curve's legend entry
    Degree = "Degree = "+str(i)

    # The polynomial feature generator is imported from sklearn.preprocessing
    # Use PolynomialFeatures(degree=i) to map out degree-i polynomial features and store them in X_train_poly
    poly = PolynomialFeatures(degree=i)
    X_train_poly = poly.fit_transform(X_train)
    # Initialize a regression model based on a linear regressor; although the feature dimensionality is raised, the underlying model is still linear
    regressor_poly = LinearRegression()
    # Train the degree-i polynomial regression model
    regressor_poly.fit(X_train_poly,y_train)

    #Map the plotting data xx into the same polynomial feature space
    xx_poly = poly.transform(xx)
    #Predict on xx_poly and store the result in yy_poly
    yy_poly = regressor_poly.predict(xx_poly)

    #Plot this degree's curve and keep its handle for the legend
    plt_handle, = plt.plot(xx,yy_poly,label = Degree)
    pltInfo.append(plt_handle)
    plt.legend(handles=pltInfo)
    # Print the R-squared value of the degree-i polynomial regression model on the training samples
    print("The R-squared value of Polynomial Regressor (",Degree,") performing on the training data is",regressor_poly.score(X_train_poly, y_train))

plt.show()

Result:
The R-squared value of Polynomial Regressor ( Degree = 2 ) performing on the training data is 0.9816421639597427
The R-squared value of Polynomial Regressor ( Degree = 3 ) performing on the training data is 0.9947351016429964
The R-squared value of Polynomial Regressor ( Degree = 4 ) performing on the training data is 1.0
The R-squared value of Polynomial Regressor ( Degree = 5 ) performing on the training data is 1.0
The R-squared value of Polynomial Regressor ( Degree = 6 ) performing on the training data is 1.0

[Figure 13]

[Figure 14]

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

#Training sample features and target values, stored in X_train and y_train
X_train = [[6],[8],[10],[14],[18]]
y_train = [[7],[9],[13],[17.5],[18]]

#Prepare the test data
X_test = [[6],[8],[11],[16]]
y_test = [[8],[12],[15],[18]]

#Evaluate the linear regression model on the test data
regressor = LinearRegression()
regressor.fit(X_train,y_train)
print(regressor.score(X_test,y_test))

#Evaluate the degree-2 polynomial regression model on the test data
poly2 = PolynomialFeatures(degree=2)
X_train_poly2 = poly2.fit_transform(X_train)

regressor_poly2 = LinearRegression()
regressor_poly2.fit(X_train_poly2,y_train)

X_test_poly2 = poly2.transform(X_test)
print(regressor_poly2.score(X_test_poly2,y_test))

#Evaluate the degree-4 polynomial regression model on the test data
poly4 = PolynomialFeatures(degree=4)
X_train_poly4 = poly4.fit_transform(X_train)

regressor_poly4 = LinearRegression()
regressor_poly4.fit(X_train_poly4,y_train)

X_test_poly4 = poly4.transform(X_test)
print(regressor_poly4.score(X_test_poly4,y_test))

Result:
0.809726797707665
0.8675443656345054
0.8095880795856146

[Figure 15]

[Figure 16]

L1-Norm Regularization

[Figure 17]

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso

#Training sample features and target values, stored in X_train and y_train
X_train = [[6],[8],[10],[14],[18]]
y_train = [[7],[9],[13],[17.5],[18]]

#Prepare the test data
X_test = [[6],[8],[11],[16]]
y_test = [[8],[12],[15],[18]]

#Evaluate the plain degree-4 polynomial regression model on the test data
poly4 = PolynomialFeatures(degree=4)
X_train_poly4 = poly4.fit_transform(X_train)

regressor_poly4 = LinearRegression()
regressor_poly4.fit(X_train_poly4,y_train)

X_test_poly4 = poly4.transform(X_test)
print("普通4次多项式",regressor_poly4.score(X_test_poly4,y_test))
#普通4次多项式回归模型的参数列表
print(regressor_poly4.coef_)

#Lasso is imported from sklearn.linear_model
lasso_poly4 = Lasso()
#Fit the Lasso model on the degree-4 polynomial features
lasso_poly4.fit(X_train_poly4,y_train)

#Evaluate the Lasso model's regression performance (note: scored here on the training samples)
print('Lasso model:',lasso_poly4.score(X_train_poly4,y_train))
print(lasso_poly4.coef_)

Result:
Plain degree-4 polynomial: 0.8095880795856146
[[ 0.00000000e+00 -2.51739583e+01  3.68906250e+00 -2.12760417e-01
   4.29687500e-03]]
Lasso model: 0.9940832798380284
[ 0.00000000e+00  0.00000000e+00  1.17900534e-01  5.42646770e-05
 -2.23027128e-04]

[Figure 18]

L2-Norm Regularization

[Figure 19]

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

#Training sample features and target values, stored in X_train and y_train
X_train = [[6],[8],[10],[14],[18]]
y_train = [[7],[9],[13],[17.5],[18]]

#Prepare the test data
X_test = [[6],[8],[11],[16]]
y_test = [[8],[12],[15],[18]]

#Evaluate the plain degree-4 polynomial regression model on the test data
poly4 = PolynomialFeatures(degree=4)
X_train_poly4 = poly4.fit_transform(X_train)

regressor_poly4 = LinearRegression()
regressor_poly4.fit(X_train_poly4,y_train)

X_test_poly4 = poly4.transform(X_test)
print("普通4次多项式",regressor_poly4.score(X_test_poly4,y_test))
#普通4次多项式回归模型的参数列表
print(regressor_poly4.coef_)
#输出上述这些参数的平方和,验证参数之间的巨大差异
print("普通4次参数平方和:",np.sum(regressor_poly4.coef_ ** 2))
print("---"*20)
#Ridge is imported from sklearn.linear_model
ridge_poly4 = Ridge()
#Fit the Ridge model on the degree-4 polynomial features
ridge_poly4.fit(X_train_poly4,y_train)
#Print the Ridge model's regression performance on the test samples
print(ridge_poly4.score(X_test_poly4,y_test))
#Print the Ridge model's parameter list and observe the difference
print(ridge_poly4.coef_)
#Compute the sum of squares of the fitted Ridge parameters
print("Sum of squared parameters (Ridge):",np.sum(ridge_poly4.coef_ ** 2))
print("---"*20)
#Lasso is imported from sklearn.linear_model
lasso_poly4 = Lasso()
#Fit the Lasso model on the degree-4 polynomial features
lasso_poly4.fit(X_train_poly4,y_train)
#Evaluate the Lasso model's regression performance (note: scored here on the training samples)
print('Lasso model:',lasso_poly4.score(X_train_poly4,y_train))
print('Sum of squared parameters (Lasso):',np.sum(lasso_poly4.coef_ ** 2))

Result:
Plain degree-4 polynomial: 0.8095880795856146
[[ 0.00000000e+00 -2.51739583e+01  3.68906250e+00 -2.12760417e-01
   4.29687500e-03]]
Sum of squared parameters (plain degree-4): 647.3826456916975
------------------------------------------------------------
0.8374201759366509
[[ 0.         -0.00492536  0.12439632 -0.00046471 -0.00021205]]
Sum of squared parameters (Ridge): 0.015498965203562037
------------------------------------------------------------
Lasso model: 0.9940832798380284
Sum of squared parameters (Lasso): 0.01390058867089999
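Both Lasso and Ridge expose a regularization strength alpha (default 1.0); as a hedged extension of the script above (the alpha values below are chosen purely for illustration), sweeping it shows how strongly the penalty shrinks the coefficients:

#Sweep the Ridge regularization strength and watch the test score and coefficient magnitudes
for alpha in [0.01, 0.1, 1.0, 10.0]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_poly4, y_train)
    print(alpha, ridge.score(X_test_poly4, y_test), np.sum(ridge.coef_ ** 2))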

[Figure 20]

Model Validation

[Figure 21]

Leave-One-Out Validation

[Figure 22]
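Since this subsection is only illustrated by the figure above, here is a minimal sketch of leave-one-out validation with scikit-learn; the dataset and classifier are assumptions for demonstration, not taken from the text:

from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
#Each sample is used exactly once as a single-sample test set
loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)
print("Leave-one-out accuracy:", scores.mean())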

Cross-Validation

[Figure 23]
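Likewise, a minimal sketch of K-fold cross-validation (again with an assumed dataset and model, purely for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
#Split the data into 5 stratified folds; each fold serves once as the validation set
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=33)
scores = cross_val_score(SVC(), X, y, cv=skf)
print("5-fold accuracies:", scores, "mean:", scores.mean())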

Hyperparameter Search

[Figure 24]

Grid Search

[Figure 25]

from sklearn.datasets import fetch_20newsgroups
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from  sklearn.model_selection import GridSearchCV

#The 20-newsgroups text fetcher is imported from sklearn.datasets
#numpy is imported and renamed np

#Use the fetcher to download all of the data from the internet and store it in the variable news
news = fetch_20newsgroups(subset="all")
#train_test_split from sklearn.model_selection is used to split the data
#Split the first 3000 news documents, holding out 25% of the texts for testing
X_train,X_test,y_train,y_test = train_test_split(news.data[:3000],news.target[:3000],test_size=0.25,random_state=33)

#The support vector machine (classification) model is imported
#The TfidfVectorizer text feature extractor is imported
#Pipeline is imported

#Use a Pipeline to simplify the system: chain the text feature extractor and the classifier together
clf = Pipeline([('vect',TfidfVectorizer(stop_words='english',analyzer='word')),('svc',SVC())])

#The two hyperparameters take 4 and 3 values respectively: svc__gamma covers 10^-2, 10^-1, 10^0, 10^1 and svc__C covers 10^-1, 10^0, 10^1, giving 4x3 = 12 hyperparameter combinations, i.e. 12 differently configured models
parameters = {'svc__gamma':np.logspace(-2,1,4),'svc__C':np.logspace(-1,1,3)}

#The grid search module GridSearchCV is imported from sklearn.model_selection
#Hand the 12 parameter combinations, the initialized Pipeline, and the 3-fold cross-validation requirement to GridSearchCV; note the refit=True setting, which refits the best model on the full training set
gs = GridSearchCV(clf,parameters,verbose=2,refit=True,cv = 3)

#Run the single-threaded grid search
gs.fit(X_train,y_train)
print(gs.best_params_,gs.best_score_)

#Print the accuracy of the best model on the test set
print(gs.score(X_test,y_test))

Result:
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV] svc__C=0.1, svc__gamma=0.01 .....................................
[CV] ...................... svc__C=0.1, svc__gamma=0.01, total=   4.2s
[CV] svc__C=0.1, svc__gamma=0.01 .....................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.1s remaining:    0.0s
[CV] ...................... svc__C=0.1, svc__gamma=0.01, total=   4.2s
[CV] svc__C=0.1, svc__gamma=0.01 .....................................
[CV] ...................... svc__C=0.1, svc__gamma=0.01, total=   4.2s
[CV] svc__C=0.1, svc__gamma=0.1 ......................................
[CV] ....................... svc__C=0.1, svc__gamma=0.1, total=   4.2s
[CV] svc__C=0.1, svc__gamma=0.1 ......................................
[CV] ....................... svc__C=0.1, svc__gamma=0.1, total=   4.2s
[CV] svc__C=0.1, svc__gamma=0.1 ......................................
[CV] ....................... svc__C=0.1, svc__gamma=0.1, total=   4.2s
[CV] svc__C=0.1, svc__gamma=1.0 ......................................
[CV] ....................... svc__C=0.1, svc__gamma=1.0, total=   4.2s
[CV] svc__C=0.1, svc__gamma=1.0 ......................................
[CV] ....................... svc__C=0.1, svc__gamma=1.0, total=   4.3s
[CV] svc__C=0.1, svc__gamma=1.0 ......................................
[CV] ....................... svc__C=0.1, svc__gamma=1.0, total=   4.3s
[CV] svc__C=0.1, svc__gamma=10.0 .....................................
[CV] ...................... svc__C=0.1, svc__gamma=10.0, total=   4.2s
[CV] svc__C=0.1, svc__gamma=10.0 .....................................
[CV] ...................... svc__C=0.1, svc__gamma=10.0, total=   4.3s
[CV] svc__C=0.1, svc__gamma=10.0 .....................................
[CV] ...................... svc__C=0.1, svc__gamma=10.0, total=   4.3s
[CV] svc__C=1.0, svc__gamma=0.01 .....................................
[CV] ...................... svc__C=1.0, svc__gamma=0.01, total=   4.2s
[CV] svc__C=1.0, svc__gamma=0.01 .....................................
[CV] ...................... svc__C=1.0, svc__gamma=0.01, total=   4.2s
[CV] svc__C=1.0, svc__gamma=0.01 .....................................
[CV] ...................... svc__C=1.0, svc__gamma=0.01, total=   4.2s
[CV] svc__C=1.0, svc__gamma=0.1 ......................................
[CV] ....................... svc__C=1.0, svc__gamma=0.1, total=   4.2s
[CV] svc__C=1.0, svc__gamma=0.1 ......................................
[CV] ....................... svc__C=1.0, svc__gamma=0.1, total=   4.2s
[CV] svc__C=1.0, svc__gamma=0.1 ......................................
[CV] ....................... svc__C=1.0, svc__gamma=0.1, total=   4.2s
[CV] svc__C=1.0, svc__gamma=1.0 ......................................
[CV] ....................... svc__C=1.0, svc__gamma=1.0, total=   4.3s
[CV] svc__C=1.0, svc__gamma=1.0 ......................................
[CV] ....................... svc__C=1.0, svc__gamma=1.0, total=   4.3s
[CV] svc__C=1.0, svc__gamma=1.0 ......................................
[CV] ....................... svc__C=1.0, svc__gamma=1.0, total=   4.3s
[CV] svc__C=1.0, svc__gamma=10.0 .....................................
[CV] ...................... svc__C=1.0, svc__gamma=10.0, total=   4.3s
[CV] svc__C=1.0, svc__gamma=10.0 .....................................
[CV] ...................... svc__C=1.0, svc__gamma=10.0, total=   4.3s
[CV] svc__C=1.0, svc__gamma=10.0 .....................................
[CV] ...................... svc__C=1.0, svc__gamma=10.0, total=   4.4s
[CV] svc__C=10.0, svc__gamma=0.01 ....................................
[CV] ..................... svc__C=10.0, svc__gamma=0.01, total=   4.1s
[CV] svc__C=10.0, svc__gamma=0.01 ....................................
[CV] ..................... svc__C=10.0, svc__gamma=0.01, total=   4.2s
[CV] svc__C=10.0, svc__gamma=0.01 ....................................
[CV] ..................... svc__C=10.0, svc__gamma=0.01, total=   4.2s
[CV] svc__C=10.0, svc__gamma=0.1 .....................................
[CV] ...................... svc__C=10.0, svc__gamma=0.1, total=   4.2s
[CV] svc__C=10.0, svc__gamma=0.1 .....................................
[CV] ...................... svc__C=10.0, svc__gamma=0.1, total=   4.2s
[CV] svc__C=10.0, svc__gamma=0.1 .....................................
[CV] ...................... svc__C=10.0, svc__gamma=0.1, total=   4.3s
[CV] svc__C=10.0, svc__gamma=1.0 .....................................
[CV] ...................... svc__C=10.0, svc__gamma=1.0, total=   4.3s
[CV] svc__C=10.0, svc__gamma=1.0 .....................................
[CV] ...................... svc__C=10.0, svc__gamma=1.0, total=   4.3s
[CV] svc__C=10.0, svc__gamma=1.0 .....................................
[CV] ...................... svc__C=10.0, svc__gamma=1.0, total=   4.3s
[CV] svc__C=10.0, svc__gamma=10.0 ....................................
[CV] ..................... svc__C=10.0, svc__gamma=10.0, total=   4.3s
[CV] svc__C=10.0, svc__gamma=10.0 ....................................
[CV] ..................... svc__C=10.0, svc__gamma=10.0, total=   4.4s
[CV] svc__C=10.0, svc__gamma=10.0 ....................................
[CV] ..................... svc__C=10.0, svc__gamma=10.0, total=   4.4s
[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  2.6min finished
{'svc__C': 10.0, 'svc__gamma': 0.1} 0.7888888888888889
0.8226666666666667
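After fitting, GridSearchCV also keeps the full search record in its cv_results_ attribute; a small hedged sketch of inspecting it (the DataFrame columns shown are the standard ones generated from the parameter grid above):

#Tabulate the mean cross-validation score of every hyperparameter combination
import pandas as pd
cv_results = pd.DataFrame(gs.cv_results_)
print(cv_results[['param_svc__C', 'param_svc__gamma', 'mean_test_score']])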

Parallel Search

[Figure 26]

from sklearn.datasets import fetch_20newsgroups
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from  sklearn.model_selection import GridSearchCV

#The 20-newsgroups text fetcher is imported from sklearn.datasets
#numpy is imported and renamed np

#Use the fetcher to download all of the data from the internet and store it in the variable news
news = fetch_20newsgroups(subset="all")
#train_test_split from sklearn.model_selection is used to split the data
#Split the first 3000 news documents, holding out 25% of the texts for testing
X_train,X_test,y_train,y_test = train_test_split(news.data[:3000],news.target[:3000],test_size=0.25,random_state=33)

#The support vector machine (classification) model is imported
#The TfidfVectorizer text feature extractor is imported
#Pipeline is imported

#Use a Pipeline to simplify the system: chain the text feature extractor and the classifier together
clf = Pipeline([('vect',TfidfVectorizer(stop_words='english',analyzer='word')),('svc',SVC())])

#The two hyperparameters take 4 and 3 values respectively: svc__gamma covers 10^-2, 10^-1, 10^0, 10^1 and svc__C covers 10^-1, 10^0, 10^1, giving 4x3 = 12 hyperparameter combinations, i.e. 12 differently configured models
#np.logspace generates a geometric sequence with base 10 by default: logspace(-2,1,4) gives 4 values with exponents -2, -1, 0, 1
parameters = {'svc__gamma':np.logspace(-2,1,4),'svc__C':np.logspace(-1,1,3)}

#The grid search module GridSearchCV is imported from sklearn.model_selection
#Hand the 12 parameter combinations, the initialized Pipeline, and the 3-fold cross-validation requirement to GridSearchCV; note the refit=True setting
#Initialize and configure the parallel grid search; n_jobs=-1 uses all of the machine's CPU cores
gs = GridSearchCV(clf,parameters,verbose=2,refit=True,cv = 3,n_jobs=-1)

#Run the multi-process parallel grid search
gs.fit(X_train,y_train)
print(gs.best_params_,gs.best_score_)

#Print the accuracy of the best model on the test set
print(gs.score(X_test,y_test))

Result:
Fitting 3 folds for each of 12 candidates, totalling 36 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[CV] svc__C=0.1, svc__gamma=0.01 .....................................
[CV] svc__C=0.1, svc__gamma=0.01 .....................................
[CV] svc__C=0.1, svc__gamma=0.01 .....................................
[CV] svc__C=0.1, svc__gamma=0.1 ......................................
[CV] svc__C=0.1, svc__gamma=0.1 ......................................
[CV] svc__C=0.1, svc__gamma=0.1 ......................................
[CV] svc__C=0.1, svc__gamma=1.0 ......................................
[CV] svc__C=0.1, svc__gamma=10.0 .....................................
[CV] svc__C=0.1, svc__gamma=10.0 .....................................
[CV] svc__C=0.1, svc__gamma=1.0 ......................................
[CV] svc__C=0.1, svc__gamma=1.0 ......................................
[CV] svc__C=0.1, svc__gamma=10.0 .....................................
[CV] ...................... svc__C=0.1, svc__gamma=0.01, total=   5.9s
[CV] ...................... svc__C=0.1, svc__gamma=0.01, total=   5.9s
[CV] svc__C=1.0, svc__gamma=0.01 .....................................
[CV] svc__C=1.0, svc__gamma=0.01 .....................................
[CV] ....................... svc__C=0.1, svc__gamma=0.1, total=   5.9s
[CV] ...................... svc__C=0.1, svc__gamma=0.01, total=   6.0s
[CV] ....................... svc__C=0.1, svc__gamma=0.1, total=   5.9s
[CV] svc__C=1.0, svc__gamma=0.01 .....................................
[CV] svc__C=1.0, svc__gamma=0.1 ......................................
[CV] svc__C=1.0, svc__gamma=0.1 ......................................
[CV] ....................... svc__C=0.1, svc__gamma=0.1, total=   6.0s
[CV] ....................... svc__C=0.1, svc__gamma=1.0, total=   6.0s
[CV] svc__C=1.0, svc__gamma=0.1 ......................................
[CV] svc__C=1.0, svc__gamma=1.0 ......................................
[CV] ...................... svc__C=0.1, svc__gamma=10.0, total=   6.0s
[CV] svc__C=1.0, svc__gamma=1.0 ......................................
[CV] ...................... svc__C=0.1, svc__gamma=10.0, total=   7.7s
[CV] svc__C=1.0, svc__gamma=1.0 ......................................
[CV] ...................... svc__C=0.1, svc__gamma=10.0, total=   7.8s
[CV] svc__C=1.0, svc__gamma=10.0 .....................................
[CV] ....................... svc__C=0.1, svc__gamma=1.0, total=   8.1s
[CV] svc__C=1.0, svc__gamma=10.0 .....................................
[CV] ....................... svc__C=0.1, svc__gamma=1.0, total=   8.9s
[CV] svc__C=1.0, svc__gamma=10.0 .....................................
[CV] ...................... svc__C=1.0, svc__gamma=0.01, total=   5.9s
[CV] svc__C=10.0, svc__gamma=0.01 ....................................
[CV] ....................... svc__C=1.0, svc__gamma=0.1, total=   6.3s
[CV] svc__C=10.0, svc__gamma=0.01 ....................................
[CV] ....................... svc__C=1.0, svc__gamma=1.0, total=   6.3s
[CV] svc__C=10.0, svc__gamma=0.01 ....................................
[CV] ....................... svc__C=1.0, svc__gamma=1.0, total=   6.4s
[CV] svc__C=10.0, svc__gamma=0.1 .....................................
[CV] ...................... svc__C=1.0, svc__gamma=0.01, total=   7.5s
[CV] svc__C=10.0, svc__gamma=0.1 .....................................
[CV] ....................... svc__C=1.0, svc__gamma=0.1, total=   7.6s
[CV] ...................... svc__C=1.0, svc__gamma=0.01, total=   7.7s
[CV] ....................... svc__C=1.0, svc__gamma=0.1, total=   7.5s
[CV] svc__C=10.0, svc__gamma=0.1 .....................................
[CV] svc__C=10.0, svc__gamma=1.0 .....................................
[CV] svc__C=10.0, svc__gamma=1.0 .....................................
[CV] ...................... svc__C=1.0, svc__gamma=10.0, total=   6.0s
[CV] svc__C=10.0, svc__gamma=1.0 .....................................
[CV] ....................... svc__C=1.0, svc__gamma=1.0, total=   6.7s
[CV] svc__C=10.0, svc__gamma=10.0 ....................................
[CV] ...................... svc__C=1.0, svc__gamma=10.0, total=   7.6s
[CV] svc__C=10.0, svc__gamma=10.0 ....................................
[CV] ...................... svc__C=1.0, svc__gamma=10.0, total=   7.3s
[CV] svc__C=10.0, svc__gamma=10.0 ....................................
[CV] ..................... svc__C=10.0, svc__gamma=0.01, total=   6.4s
[CV] ..................... svc__C=10.0, svc__gamma=0.01, total=   6.0s
[CV] ..................... svc__C=10.0, svc__gamma=0.01, total=   6.4s
[CV] ...................... svc__C=10.0, svc__gamma=0.1, total=   6.4s
[CV] ...................... svc__C=10.0, svc__gamma=1.0, total=   6.0s
[CV] ...................... svc__C=10.0, svc__gamma=0.1, total=   6.9s
[CV] ...................... svc__C=10.0, svc__gamma=1.0, total=   6.8s
[Parallel(n_jobs=-1)]: Done  32 out of  36 | elapsed:   22.9s remaining:    2.8s
[CV] ...................... svc__C=10.0, svc__gamma=0.1, total=   6.9s
[CV] ...................... svc__C=10.0, svc__gamma=1.0, total=   6.9s
[CV] ..................... svc__C=10.0, svc__gamma=10.0, total=   5.9s
[CV] ..................... svc__C=10.0, svc__gamma=10.0, total=   5.7s
[CV] ..................... svc__C=10.0, svc__gamma=10.0, total=   5.5s
[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed:   24.4s finished
{'svc__C': 10.0, 'svc__gamma': 0.1} 0.7888888888888889
0.8226666666666667


Popular Libraries / Model Practice

[Figure 27]

Natural Language Processing Toolkit (NLTK)

[Figure 28]

from sklearn.feature_extraction.text import CountVectorizer

#Store the two example sentences as strings in the variables sent1 and sent2
sent1 = "The cat is walking in the bedroom."
sent2 = "A dog was running access the kitchen."

#CountVectorizer is imported from sklearn.feature_extraction.text
count_vec = CountVectorizer()

sentences = [sent1,sent2]

#Print the vectorized feature representation
print(count_vec.fit_transform(sentences).toarray())
#Print the meaning of each feature dimension
print(count_vec.get_feature_names())

Result:
[[0 1 1 0 1 1 0 0 2 1 0]
 [1 0 0 1 0 0 1 1 1 0 1]]
['access', 'bedroom', 'cat', 'dog', 'in', 'is', 'kitchen', 'running', 'the', 'walking', 'was']

from sklearn.feature_extraction.text import CountVectorizer
import nltk

#Store the two example sentences as strings in the variables sent1 and sent2
sent1 = "The cat is walking in the bedroom."
sent2 = "A dog was running access the kitchen."
sentences = [sent1,sent2]

#CountVectorizer is imported from sklearn.feature_extraction.text
count_vec = CountVectorizer()
#Print the vectorized feature representation
print(count_vec.fit_transform(sentences).toarray())
#Print the meaning of each feature dimension
print(count_vec.get_feature_names())
print("--"*40)

#nltk has been imported above
#Tokenize and normalize the sentences; in some cases aren't should be split into are and n't, or I'm into I and 'm
tokens_1 = nltk.word_tokenize(sent1)
print(tokens_1)
tokens_2 = nltk.word_tokenize(sent2)
print(tokens_2)
print("--"*40)

#Collect each sentence's vocabulary and print it sorted in ASCII order
#set() removes duplicates
vocab_1 = sorted(set(tokens_1))
print(vocab_1)
vocab_2 = sorted(set(tokens_2))
print(vocab_2)
print("--"*40)

#Initialize a stemmer to find the root form of each token
stemmer = nltk.stem.PorterStemmer()
stem_1 = [stemmer.stem(t) for t in tokens_1]
print(stem_1)
stem_2 = [stemmer.stem(t) for t in tokens_2]
print(stem_2)
print("--"*40)

#Initialize a part-of-speech tagger and tag each token
pos_tag_1 = nltk.tag.pos_tag(tokens_1)
print(pos_tag_1)
pos_tag_2 = nltk.tag.pos_tag(tokens_2)
print(pos_tag_2)

Result:
[[0 1 1 0 1 1 0 0 2 1 0]
 [1 0 0 1 0 0 1 1 1 0 1]]
['access', 'bedroom', 'cat', 'dog', 'in', 'is', 'kitchen', 'running', 'the', 'walking', 'was']
--------------------------------------------------------------------------------
['The', 'cat', 'is', 'walking', 'in', 'the', 'bedroom', '.']
['A', 'dog', 'was', 'running', 'access', 'the', 'kitchen', '.']
--------------------------------------------------------------------------------
['.', 'The', 'bedroom', 'cat', 'in', 'is', 'the', 'walking']
['.', 'A', 'access', 'dog', 'kitchen', 'running', 'the', 'was']
--------------------------------------------------------------------------------
['the', 'cat', 'is', 'walk', 'in', 'the', 'bedroom', '.']
['A', 'dog', 'wa', 'run', 'access', 'the', 'kitchen', '.']
--------------------------------------------------------------------------------
[('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('walking', 'VBG'), ('in', 'IN'), ('the', 'DT'), ('bedroom', 'NN'), ('.', '.')]
[('A', 'DT'), ('dog', 'NN'), ('was', 'VBD'), ('running', 'VBG'), ('access', 'NN'), ('the', 'DT'), ('kitchen', 'NN'), ('.', '.')]
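Beyond the Porter stemmer shown above, NLTK also provides WordNet-based lemmatization; a hedged sketch (it assumes the 'wordnet' corpus has been downloaded via nltk.download('wordnet'), and reuses tokens_1 from the script above):

#Reduce verbs to their dictionary base form instead of a crude stem
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t, pos='v') for t in tokens_1])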

Word Vectors (Word2Vec)

[Figure 29]

[Figure 30]

from sklearn.datasets import fetch_20newsgroups
from bs4 import BeautifulSoup
import nltk, re
from gensim.models import word2vec

#Define a function named news_to_sentences that strips the sentences
#out of each news article one by one and returns them as a list of sentences.
def news_to_sentences(news):
    news_text = BeautifulSoup(news).get_text()
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sentences = tokenizer.tokenize(news_text)
    sentences = []
    for sent in raw_sentences:
        sentences.append(re.sub('[^a-zA-Z]', ' ', sent.lower().strip()).split())
    return sentences


news = fetch_20newsgroups(subset='all')
X, y = news.data, news.target

sentences = []
#Strip the sentences out of the long news texts for training.
for x in X:
    sentences += news_to_sentences(x)

#Dimensionality of the word vectors.
num_features = 300
#Minimum frequency a word must have to be considered.
min_word_count = 20
#Number of CPU cores used for parallel training; more can be used on a multi-core machine.
num_workers = 2
#Context window size used when training the word vectors.
context = 5
downsampling = 1e-3

model = word2vec.Word2Vec(sentences, workers=num_workers,\
                          size=num_features, min_count=min_word_count,\
                          window=context, sample=downsampling)
#This call marks the trained word vectors as final, which also speeds up subsequent use of the model.
model.init_sims(replace=True)
#Use the trained model to find the 10 words most similar to morning in the training texts.
print(model.most_similar('morning'))
#Use the trained model to find the 10 words most similar to email in the training texts.
print(model.most_similar('email'))

Result:
[('afternoon', 0.8125509023666382), ('weekend', 0.7887392044067383), ('evening', 0.7872912883758545), ('night', 0.7082017064094543), ('saturday', 0.6989781260490417), ('friday', 0.6907871961593628), ('sunday', 0.6815508604049683), ('newspaper', 0.6693010926246643), ('week', 0.6428839564323425), ('summer', 0.6331900358200073)]
[('mail', 0.7351820468902588), ('contact', 0.6784430742263794), ('request', 0.660675048828125), ('address', 0.6446553468704224), ('send', 0.62925785779953), ('subscription', 0.627805769443512), ('listserv', 0.6215932369232178), ('snail', 0.6207793951034546), ('mailed', 0.6121481657028198), ('replies', 0.6098143458366394)]
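Note that the script above targets an older gensim (pre-4.0). If you run a gensim 4.x release, a hedged equivalent looks roughly as follows: the size parameter was renamed vector_size, the init_sims call is no longer needed, and similarity queries live on the model's wv attribute. Treat this as a sketch, not a drop-in replacement:

#gensim >= 4.0 variant of the training and query calls above
model = word2vec.Word2Vec(sentences, workers=num_workers,
                          vector_size=num_features, min_count=min_word_count,
                          window=context, sample=downsampling)
print(model.wv.most_similar('morning'))
print(model.wv.most_similar('email'))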

[Figure 31]

The XGBoost Model

[Figure 32]

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import  RandomForestClassifier
from xgboost import XGBClassifier

#pandas is imported above for data analysis

#Download the Titanic data via URL (a local copy is used here instead)
#titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')
titanic = pd.read_csv('train.csv')

#Select Pclass, Age and Sex as the training features
X = titanic[['Pclass','Age','Sex']]
y = titanic['Survived']

#Fill missing Age values with the mean of the known ages in the Age column
X['Age'].fillna(X['Age'].mean(),inplace=True)

#Split the original data, randomly sampling 25% as the test set
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=33)
#DictVectorizer is imported from sklearn.feature_extraction
vec = DictVectorizer(sparse=False)

#Vectorize the features of the original data
X_train = vec.fit_transform(X_train.to_dict(orient = 'records'))
X_test = vec.transform(X_test.to_dict(orient = 'records'))

#Use a random forest classifier with the default configuration to predict on the test set
rfc = RandomForestClassifier()
rfc.fit(X_train,y_train)
print("The accuracy if Random Forest Classifier on testing set:",rfc.score(X_test,y_test))

#Use an XGBoost model with the default configuration to predict on the same test set
xgbc = XGBClassifier()
xgbc.fit(X_train,y_train)
print("The accuracy of eXtrem Gradient Boosting Classifier on testing set:",xgbc.score(X_test,y_test))

Result:
The accuracy of Random Forest Classifier on testing set: 0.8295964125560538
The accuracy of eXtreme Gradient Boosting Classifier on testing set: 0.8385650224215246
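As a hedged follow-up (an addition, not part of the original example), the fitted XGBoost model exposes per-feature importances via the standard feature_importances_ attribute, whose columns line up with vec.feature_names_ from the vectorization step above:

#Rank the vectorized features by importance and print the top five
import numpy as np
order = np.argsort(xgbc.feature_importances_)[::-1]
for idx in order[:5]:
    print(vec.feature_names_[idx], xgbc.feature_importances_[idx])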

[Figure 33]

The TensorFlow Framework

Using TensorFlow to print a sentence

import tensorflow as tf
import numpy as np

#tensorflow is imported above and renamed tf
#numpy is imported and renamed np

#Initialize a TensorFlow constant holding the string "Hello Google Tensorflow!", named greeting, as a node of the computation
greeting = tf.constant("Hello Google Tensorflow!")

#Start a session
sess = tf.Session()
#Run the greeting node in the session
result = sess.run(greeting)
#Print the result of the session run
print(result)
#Close the session; this is the explicit way of closing a session
sess.close()

Result:
b'Hello Google Tensorflow!'
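Note that the TensorFlow examples in this section use the 1.x Session API. Under TensorFlow 2.x the same constant can be evaluated eagerly, or the 1.x behaviour can be restored through the compatibility module; a hedged sketch of both options:

#TensorFlow 2.x: eager execution, no session needed
import tensorflow as tf
greeting = tf.constant("Hello Google Tensorflow!")
print(greeting.numpy())
#Alternatively: tf.compat.v1.disable_eager_execution() and tf.compat.v1.Session() reproduce the 1.x style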

Using TensorFlow to compute a linear function

import tensorflow as tf

#tensorflow is imported above and renamed tf

#Start a session
sess = tf.Session()

#Declare matrix1 as a TensorFlow 1x2 row vector
matrix1 = tf.constant([[3.,3.]])
#Declare matrix2 as a TensorFlow 2x1 column vector
matrix2 = tf.constant([[2.],[2.]])

#product multiplies the two operands above as a new node
product = tf.matmul(matrix1,matrix2) #result: 12
#Then add the scalar constant 2.0 to product as the final linear node
linear = tf.add(product,tf.constant(2.0))

#Run the linear node directly in a session, which is equivalent to stitching all the individual nodes above into a computation graph and executing it
with tf.Session() as sess:
    result = sess.run(linear)
    print(result)

Result:
[[14.]]

[Figure 34]

import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


#tensorflow is imported above and renamed tf
#numpy is imported
#pandas is imported

#Read the breast-cancer tumor training and test data locally with pandas
#https://note.youdao.com/coshare/index.html?token=C6B145FA919F41F8ACAAC39EE775441C&gid=93772390
train = pd.read_csv('breast-cancer-train.csv')
test = pd.read_csv('breast-cancer-test.csv')
#Separate the features from the classification target
X_train = np.float32(train[['Clump Thickness','Cell Size']].T)
y_train = np.float32(train['Type'].T)
X_test = np.float32(test[['Clump Thickness','Cell Size']].T)
y_test = np.float32(test['Type'].T)

#Define a TensorFlow variable b as the intercept of the linear model, initialized to 0
b = tf.Variable(tf.zeros([1]))
#Define a TensorFlow variable W as the coefficients of the linear model, initialized with values uniformly distributed between -1.0 and 1.0
W = tf.Variable(tf.random_uniform([1,2],-1.0,1.0))
#Explicitly define the linear function
y = tf.matmul(W,X_train) + b

#Use reduce_mean to compute the mean squared error on the training set
loss = tf.reduce_mean(tf.square(y-y_train))

#Use gradient descent to estimate the parameters W and b, with a learning rate of 0.01; this is similar to SGDRegressor in scikit-learn
optimizer= tf.train.GradientDescentOptimizer(0.01)

#Minimize the squared loss as the optimization objective
train = optimizer.minimize(loss)

#Initialize all variables
init = tf.global_variables_initializer()

#Open a TensorFlow session
sess = tf.Session()

#Run the variable initialization
sess.run(init)

#Train the parameters for 2000 iterations
for step in range(1,2001):
    sess.run(train)
    if step%200==0:
        print(step,sess.run(W),sess.run(b))

#Prepare the test samples
test_negative = test.loc[test['Type'] == 0][['Clump Thickness','Cell Size']]
test_positive = test.loc[test['Type'] == 1][['Clump Thickness','Cell Size']]

#Plot using the final updated parameters
plt.scatter(test_negative['Clump Thickness'],test_negative['Cell Size'],marker='o',s=200,c='red')
plt.scatter(test_positive['Clump Thickness'],test_positive['Cell Size'],marker='x',s=150,c='black')
plt.xlabel('Clump Thickness')
plt.ylabel("Cell Size")

#Take the integers 0 to 11 as x values
lx = np.arange(0,12)
#Compute the y value corresponding to each of the 12 x values
#Note: 0.5 is used as the decision boundary, so the line is computed as (0.5 - b - w0*x) / w1
ly = (0.5 - sess.run(b) - lx * sess.run(W)[0][0]) / sess.run(W)[0][1]
print(sess.run(b))
print(sess.run(W))
plt.plot(lx,ly,color = 'green')
plt.show()


Result:
200 [[0.01551242 0.12061065]] [-0.09584526]
400 [[0.05568662 0.08001342]] [-0.08955325]
600 [[0.05773812 0.07763022]] [-0.08739269]
800 [[0.05785108 0.07744841]] [-0.08697367]
1000 [[0.05785864 0.07742859]] [-0.08690023]
1200 [[0.05785936 0.07742577]] [-0.08688771]
1400 [[0.05785941 0.07742535]] [-0.08688553]
1600 [[0.05785941 0.07742535]] [-0.08688553]
1800 [[0.05785941 0.07742535]] [-0.08688553]
2000 [[0.05785941 0.07742535]] [-0.08688553]
[-0.08688553]
[[0.05785941 0.07742535]]

[Figure 35]

[Figure 36]

[Figure 37]

[Figure 38]

[Figure 39]

[Figure 40]

 
