Let's start with the basics. TF-IDF can be used to extract text features, and in Python the TfidfVectorizer class makes this convenient: it tokenizes all samples, builds features according to the configured n-gram setting, vectorizes each sample along those word dimensions, and the result can then be fed to a model for training. This article focuses on TfidfVectorizer and how to use it.
Let's look at the tool through Python code; the listing below is commented:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import pandas as pd

# Corpus
document = ["I have a pen.",
            "I have an apple.",
            "you have an apple.",
            "an apple."]

# Build the vectorizer: word-level features, each feature spanning at most
# two words and at least one word
tf = TfidfVectorizer(ngram_range=(1, 2), analyzer='word', smooth_idf=True)
discuss_tf = tf.fit_transform(document)

# Dense matrix of the vectorized input set
X = discuss_tf.toarray()
size_matrix = X.shape[0]
print('size_matrix = ', X.shape)

n_fea = 2
# The number of components after reduction must not exceed the number of samples
svd_tag_tmp = TruncatedSVD(n_components=n_fea, n_iter=20, random_state=2019)
tag_svd_tmp = svd_tag_tmp.fit_transform(discuss_tf)
tag_svd_tmp = pd.DataFrame(tag_svd_tmp)
print(tag_svd_tmp)
tag_svd_tmp.columns = [f'b_svd_{i}' for i in range(n_fea)]
print('- ' * 10)

words = tf.get_feature_names()  # on scikit-learn >= 1.0 use get_feature_names_out()
print(words)
print(discuss_tf)
num = len(words)
row_num = len(document)
for i in range(row_num):
    print('----Document %d----' % i)
    for j in range(num):
        print(words[j], discuss_tf[i, j])
Output:
size_matrix = (4, 9)
0 1
0 0.200471 9.729345e-01
1 0.938638 1.042670e-16
2 0.852996 1.825470e-16
3 0.844057 -2.310809e-01
- - - - - - - - - -
['an', 'an apple', 'apple', 'have', 'have an', 'have pen', 'pen', 'you', 'you have']
(0, 5) 0.6445029922609534
(0, 6) 0.6445029922609534
(0, 3) 0.41137791133379387
(1, 1) 0.42540804913721164
(1, 4) 0.5254635733493682
(1, 2) 0.42540804913721164
(1, 0) 0.42540804913721164
(1, 3) 0.42540804913721164
(2, 8) 0.48500083957081014
(2, 7) 0.48500083957081014
(2, 1) 0.30956975339688264
(2, 4) 0.38238023269828086
(2, 2) 0.30956975339688264
(2, 0) 0.30956975339688264
(2, 3) 0.30956975339688264
(3, 1) 0.5773502691896257
(3, 2) 0.5773502691896257
(3, 0) 0.5773502691896257
----Document 0----
an 0.0
an apple 0.0
apple 0.0
have 0.41137791133379387
have an 0.0
have pen 0.6445029922609534
pen 0.6445029922609534
you 0.0
you have 0.0
----Document 1----
an 0.42540804913721164
an apple 0.42540804913721164
apple 0.42540804913721164
have 0.42540804913721164
have an 0.5254635733493682
have pen 0.0
pen 0.0
you 0.0
you have 0.0
----Document 2----
an 0.30956975339688264
an apple 0.30956975339688264
apple 0.30956975339688264
have 0.30956975339688264
have an 0.38238023269828086
have pen 0.0
pen 0.0
you 0.48500083957081014
you have 0.48500083957081014
----Document 3----
an 0.5773502691896257
an apple 0.5773502691896257
apple 0.5773502691896257
have 0.0
have an 0.0
have pen 0.0
pen 0.0
you 0.0
you have 0.0
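The numbers printed above can be reproduced by hand. As a sketch (assuming scikit-learn's default smoothed idf, idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, and noting that the default tokenizer drops single-character tokens such as "I" and "a"), the weight of "pen" in Document 0 works out as:

```python
import math

n = 4  # number of documents in the corpus
# scikit-learn's smoothed idf: ln((1 + n) / (1 + df)) + 1
def idf(df):
    return math.log((1 + n) / (1 + df)) + 1

# Document 0 ("I have a pen.") after tokenization keeps only
# "have" (df=3), "pen" (df=1) and the bigram "have pen" (df=1);
# single-character tokens like "I" and "a" are discarded.
raw = {'have': idf(3), 'pen': idf(1), 'have pen': idf(1)}
norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 norm of the raw vector
print(round(raw['pen'] / norm, 4))   # 0.6445, matching "pen" in Document 0 above
print(round(raw['have'] / norm, 4))  # 0.4114, matching "have" in Document 0 above
```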
Running the code directly will show this output. To explain: we extract all word-level features from the text, compute each feature's TF-IDF value, and vectorize every sentence in the input corpus so that all sentences share the same dimensionality. For a feature the sentence does not contain, its TF-IDF value is 0; for a feature it does contain, the value is the computed TF-IDF score. This brings the whole corpus to a uniform dimension. Here ngram_range=(1, 2) means at least one and at most two consecutive units combine into a feature; since this code works at the word level, that means runs of one or two consecutive words. This extracts the text's word-level features, but the approach has clear limitations: many tasks hinge on semantics. In entity extraction, for example, an entity often never appears verbatim in the text, yet a human reader immediately knows which entity is being described, so word-level feature extraction alone is not suitable for many tasks.
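To see concretely what ngram_range controls, here is a small sketch (using a toy corpus of its own, not the listing above) comparing the vocabularies produced by (1, 1) and (1, 2):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I have a pen.", "I have an apple."]
for rng in [(1, 1), (1, 2)]:
    # The default tokenizer drops single-character tokens like "I" and "a"
    vec = TfidfVectorizer(ngram_range=rng, analyzer='word')
    vec.fit(docs)
    print(rng, sorted(vec.vocabulary_))
# (1, 1) yields only unigrams: ['an', 'apple', 'have', 'pen']
# (1, 2) adds the bigrams 'an apple', 'have an', 'have pen'
```

With (1, 1) every feature is a single word; with (1, 2) each pair of adjacent words also becomes a feature, which is exactly what produces columns like 'have an' and 'have pen' in the output above.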