Text Feature Extraction and Feature Selection with scikit-learn

1. References

For an overview of scikit-learn's strengths, see https://www.leiphone.com/news/201701/ZJMTak4Y8ch3Nwd0.html. The documentation comes in several forms: the online scikit-learn homepage, the official PDF, and the Chinese translation of the scikit-learn docs by ApacheCN. The six panels on the scikit-learn homepage correspond to its six core capabilities: data preprocessing, dimensionality reduction, classification, regression, clustering, and model selection; each panel links directly to the corresponding algorithms. The complete algorithm reference lives in the User Guide under Documentation, organized into supervised learning, unsupervised learning, model selection, data transformations, data loading, and so on; the Tutorial section under Documentation gives worked examples, and the API section documents the code. In practice, if you already know which class or function you need, look it up directly in the API reference; to see how it is used in an example, follow the "Read more in the User Guide" link in its API entry.

For feature extraction and feature selection on text, go to the scikit-learn homepage, choose Documentation -> API, and on the API Reference page locate the sklearn.feature_extraction and sklearn.feature_selection modules; the Feature extraction and Feature selection section links there lead to the usage notes in the User Guide. The difference between the two:

  • Feature extraction: turns raw data of any kind, such as text or images, into numeric features a machine can work with;
  • Feature selection: applies machine-learning methods appropriate to the data's domain to further process the extracted features, strengthening their ability to represent the data (see the sketch right after this list).
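
As a minimal sketch of how the two stages fit together (the corpus, labels, and k value here are illustrative assumptions, not part of the original example), scikit-learn lets you chain extraction and selection in a Pipeline:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# extraction: raw text -> TF-IDF features; selection: keep the k most informative columns
pipe = Pipeline([
    ("extract", TfidfVectorizer()),
    ("select", SelectKBest(mutual_info_classif, k=5)),
])
docs = ["good movie", "bad movie", "great film", "terrible film"]  # toy corpus
labels = [1, 0, 1, 0]                                              # toy class labels
X = pipe.fit_transform(docs, labels)
print(X.shape)  # (4, 5): 4 documents, 5 selected feature columns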

2. Feature Extraction API

[Figure 1: the feature_extraction module in the API reference]

feature_extraction contains 2 general-purpose feature extraction classes, 5 image-specific classes, and 4 text-specific classes. To use one, first instantiate the class, then call its methods on that instance. For text you can either chain feature_extraction.text.CountVectorizer() and feature_extraction.text.TfidfTransformer() to turn raw documents into a term-count matrix and then into a TF-IDF matrix, or use feature_extraction.text.TfidfVectorizer() to go from raw documents to a TF-IDF matrix in a single step.
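
As a quick sanity check (a sketch using two documents borrowed from the example below), the two routes produce the same matrix under default settings:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
import numpy as np

docs = ["I come to China to travel", "This is a car popular in China"]

# route 1: raw text -> term counts -> TF-IDF
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts).toarray()

# route 2: raw text -> TF-IDF in one step
one_step = TfidfVectorizer().fit_transform(docs).toarray()

print(np.allclose(two_step, one_step))  # True: the routes are equivalent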

3. Feature Selection

Mutual information is used here to select features; the full feature set can be either the count matrix or the TF-IDF matrix. The candidate features are the words in the vocabulary, and a word is kept only when its mutual information exceeds a threshold. The mutual information measures how strongly "the vector of the word's TF-IDF values across the documents" correlates with "the class labels of those documents"; the larger, the better. The filtering below uses sklearn.metrics.mutual_info_score(labels, x); alternatively, sklearn.feature_selection.mutual_info_classif(X, y) scores all features in one call. (Note that mutual_info_score treats each distinct TF-IDF value as a discrete category, while mutual_info_classif estimates mutual information for continuous features.)
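
For comparison, here is a minimal sketch of the mutual_info_classif route (the toy matrix and labels are made-up values, not taken from the example below); it scores every feature column in one call:

import numpy as np
from sklearn.feature_selection import mutual_info_classif

X = np.array([[0.7, 0.0, 0.3],
              [0.6, 0.1, 0.0],
              [0.8, 0.0, 0.1],
              [0.0, 0.9, 0.2],
              [0.1, 0.8, 0.0],
              [0.0, 0.7, 0.4]])   # e.g. a 6-document, 3-word TF-IDF matrix
y = np.array([1, 1, 1, 2, 2, 2])  # document class labels
scores = mutual_info_classif(X, y, random_state=0)
print(scores)  # one MI estimate per feature column; larger means more informative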

4. Code

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

corpus=["I come to China to travel",
    "This is a car polupar in China",
    "I love tea and Apple ",
    "The work is to write some papers in science"]

# Bag-of-words model
vectorizer = CountVectorizer()
count = vectorizer.fit_transform(corpus)
count_matrix = count.toarray()  # densify the sparse result into an ordinary matrix
print(count_matrix)  # bag-of-words features: one row per document, one column per vocabulary word; each entry is the word's count in that document
vocab = vectorizer.get_feature_names_out()  # the vocabulary (on scikit-learn < 1.0 use get_feature_names())
for i in range(len(count_matrix)):  # print every word's count per document: the outer loop walks the documents, the inner loop walks the vocabulary
    print("------- Word counts over the vocabulary for document", i, "-------")
    for j in range(len(vocab)):
        print(vocab[j], count_matrix[i][j])

# TF-IDF features
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(count_matrix)
tfidf_matrix = tfidf.toarray()  # TF-IDF features: one row per document, one column per vocabulary word; each entry is the word's TF-IDF weight in that document
print(tfidf_matrix)
for i in range(len(tfidf_matrix)):
    print("------- TF-IDF weights over the vocabulary for document", i, "-------")
    for j in range(len(vocab)):
        print(vocab[j], tfidf_matrix[i][j])

# Feature selection via mutual information
from sklearn import metrics as mr
import numpy as np

X_MI = {}  # maps each feature (column of tfidf_matrix) to its mutual information with the labels
X_labels = np.array([1, 1, 2, 3])  # assume the four documents fall into these three classes
tfidf_matrix_T = tfidf_matrix.T  # each row now holds one vocabulary word's TF-IDF values across the documents; the MI between that row (4 values) and the labels (4 values) decides whether the word (i.e. the feature) is kept
for i in range(tfidf_matrix_T.shape[0]):
    X_MI[i] = mr.mutual_info_score(X_labels, tfidf_matrix_T[i])
# Filter the features: keep every word whose MI is at least the 11th-largest value
# (sorted descending); ties mean more than 10 words can survive, and 14 do here
X_MI_filtered = [word for word in X_MI if X_MI[word] >= sorted(X_MI.values(), reverse=True)[10]]
print('Column indices of the selected features:', X_MI_filtered)
tfidf_matrix_filtered_by_mutualinfo_T = [tfidf_matrix_T[word] for word in X_MI_filtered]
tfidf_matrix_filtered_by_mutualinfo = np.array(tfidf_matrix_filtered_by_mutualinfo_T).T
print("TF-IDF matrix of the words kept by the mutual-information filter:\n", tfidf_matrix_filtered_by_mutualinfo)

Output:

[[0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 2 1 0 0]
 [0 0 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0]
 [1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 1 1 0 1 0 1 1 0 1 0 1 0 1 1]]
------- Word counts over the vocabulary for document 0 -------
and 0
apple 0
car 0
china 1
come 1
in 0
is 0
love 0
papers 0
popular 0
science 0
some 0
tea 0
the 0
this 0
to 2
travel 1
work 0
write 0
------- Word counts over the vocabulary for document 1 -------
and 0
apple 0
car 1
china 1
come 0
in 1
is 1
love 0
papers 0
popular 1
science 0
some 0
tea 0
the 0
this 1
to 0
travel 0
work 0
write 0
------- Word counts over the vocabulary for document 2 -------
and 1
apple 1
car 0
china 0
come 0
in 0
is 0
love 1
papers 0
popular 0
science 0
some 0
tea 1
the 0
this 0
to 0
travel 0
work 0
write 0
------- Word counts over the vocabulary for document 3 -------
and 0
apple 0
car 0
china 0
come 0
in 1
is 1
love 0
papers 1
popular 0
science 1
some 1
tea 0
the 1
this 0
to 1
travel 0
work 1
write 1
[[0.         0.         0.         0.34884223 0.44246214 0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.69768446 0.44246214 0.
  0.        ]
 [0.         0.         0.4533864  0.35745504 0.         0.35745504
  0.35745504 0.         0.         0.4533864  0.         0.
  0.         0.         0.4533864  0.         0.         0.
  0.        ]
 [0.5        0.5        0.         0.         0.         0.
  0.         0.5        0.         0.         0.         0.
  0.5        0.         0.         0.         0.         0.
  0.        ]
 [0.         0.         0.         0.         0.         0.28113163
  0.28113163 0.         0.35657982 0.         0.35657982 0.35657982
  0.         0.35657982 0.         0.28113163 0.         0.35657982
  0.35657982]]
------- TF-IDF weights over the vocabulary for document 0 -------
and 0.0
apple 0.0
car 0.0
china 0.348842231691988
come 0.4424621378947393
in 0.0
is 0.0
love 0.0
papers 0.0
popular 0.0
science 0.0
some 0.0
tea 0.0
the 0.0
this 0.0
to 0.697684463383976
travel 0.4424621378947393
work 0.0
write 0.0
------- TF-IDF weights over the vocabulary for document 1 -------
and 0.0
apple 0.0
car 0.45338639737285463
china 0.3574550433419527
come 0.0
in 0.3574550433419527
is 0.3574550433419527
love 0.0
papers 0.0
popular 0.45338639737285463
science 0.0
some 0.0
tea 0.0
the 0.0
this 0.45338639737285463
to 0.0
travel 0.0
work 0.0
write 0.0
------- TF-IDF weights over the vocabulary for document 2 -------
and 0.5
apple 0.5
car 0.0
china 0.0
come 0.0
in 0.0
is 0.0
love 0.5
papers 0.0
popular 0.0
science 0.0
some 0.0
tea 0.5
the 0.0
this 0.0
to 0.0
travel 0.0
work 0.0
write 0.0
------- TF-IDF weights over the vocabulary for document 3 -------
and 0.0
apple 0.0
car 0.0
china 0.0
come 0.0
in 0.2811316284405006
is 0.2811316284405006
love 0.0
papers 0.3565798233381452
popular 0.0
science 0.3565798233381452
some 0.3565798233381452
tea 0.0
the 0.3565798233381452
this 0.0
to 0.2811316284405006
travel 0.0
work 0.3565798233381452
write 0.3565798233381452
Column indices of the selected features: [0, 1, 3, 5, 6, 7, 8, 10, 11, 12, 13, 15, 17, 18]
TF-IDF matrix of the words kept by the mutual-information filter:
 [[0.         0.         0.34884223 0.         0.         0.
  0.         0.         0.         0.         0.         0.69768446
  0.         0.        ]
 [0.         0.         0.35745504 0.35745504 0.35745504 0.
  0.         0.         0.         0.         0.         0.
  0.         0.        ]
 [0.5        0.5        0.         0.         0.         0.5
  0.         0.         0.         0.5        0.         0.
  0.         0.        ]
 [0.         0.         0.         0.28113163 0.28113163 0.
  0.35657982 0.35657982 0.35657982 0.         0.35657982 0.28113163
  0.35657982 0.35657982]]


 
