This post introduces TF-IDF, one of the most common and effective feature-extraction methods for text processing.
TF, or term frequency, is one of the simplest text statistics: it counts how frequently each word appears in a given document.
def computeTF(wordDict, bow):
    """Term frequency: each word's count divided by the total number of words."""
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bowCount)
    return tfDict
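As a quick sanity check (the toy sentence below is my own example), the function can be run on a hand-built bag of words, repeating the definition so the snippet runs on its own:

```python
def computeTF(wordDict, bow):
    # TF = occurrences of the word / total number of words in the document
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bowCount)
    return tfDict

bow = 'the cat sat on the mat'.split()          # 6 tokens
wordDict = {w: bow.count(w) for w in set(bow)}  # raw counts per word
tf = computeTF(wordDict, bow)
print(tf['the'])  # 2 occurrences out of 6 words -> 0.333...
```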
IDF, or inverse document frequency, measures how informative a word is across the corpus. It is usually computed as the total number of documents divided by the number of documents containing the word, with the logarithm taken of that ratio.
import math

def computeIDF(docList):
    """IDF: log10(total documents / number of documents containing the word)."""
    idfDict = {}
    N = len(docList)
    # Count, for every word, the number of documents it appears in.
    # Only words with a positive count are recorded, which also avoids
    # a division by zero in the log step below.
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] = idfDict.get(word, 0) + 1
    # Turn document frequencies into idf values.
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / float(val))
    return idfDict
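To see the document-frequency logic in action (the toy counts are invented for illustration), restating it in condensed form so the snippet is self-contained:

```python
import math

def computeIDF(docList):
    # Document frequency: in how many documents does each word appear?
    idfDict = {}
    N = len(docList)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] = idfDict.get(word, 0) + 1
    # Convert counts into log10(N / document frequency).
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / float(val))
    return idfDict

docA = {'cat': 2, 'sat': 1}
docB = {'cat': 1, 'ran': 1}
idfs = computeIDF([docA, docB])
print(idfs['cat'])  # appears in both documents: log10(2/2) = 0.0
print(idfs['sat'])  # appears in one document:  log10(2/1) ≈ 0.301
```

A word that appears in every document gets idf 0 — it carries no discriminating information.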
TF-IDF is simply tf * idf; it measures how much information a word carries about a document relative to the whole corpus. Suppose document A contains n words in total, a given word occurs t times in it, and that word appears in w of the x documents in the corpus. Then its score is tf-idf = (t / n) * log(x / w).
def computeTFIDF(tfBow, idfs):
    """Multiply each word's term frequency by its inverse document frequency."""
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val * idfs[word]
    return tfidf
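Putting the three steps together on two toy documents (all counts invented for illustration), inlined here so the snippet runs on its own:

```python
import math

docs = [
    {'cat': 2, 'sat': 1},  # document A: 3 words total
    {'cat': 1, 'ran': 1},  # document B: 2 words total
]
N = len(docs)

def tfidf(doc):
    # tf = t / n, idf = log10(x / w), score = tf * idf
    total = sum(doc.values())
    scores = {}
    for word, count in doc.items():
        df = sum(1 for d in docs if d.get(word, 0) > 0)
        scores[word] = (count / total) * math.log10(N / df)
    return scores

scores = tfidf(docs[0])
print(scores)
# 'cat' occurs in both documents, so its idf (hence tf-idf) is 0;
# 'sat' is unique to document A, so it gets a positive weight.
```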
from sklearn.feature_extraction.text import TfidfVectorizer
count_vec = TfidfVectorizer(binary=False, decode_error='ignore', stop_words='english')
Fit the vectorizer on the data and transform the documents into word vectors:
s1 = 'I love you so much'
s2 = 'I hate you! shit!'
s3 = 'I like you, but just like you'
response = count_vec.fit_transform([s1, s2, s3]) # s must be string
print(count_vec.get_feature_names())  # get_feature_names_out() in scikit-learn >= 1.0
print(response.toarray())
After removing English stop words, the output is as follows (scikit-learn orders features alphabetically):
['hate', 'just', 'like', 'love', 'shit']
[[0.         0.         0.         1.         0.        ]
 [0.70710678 0.         0.         0.         0.70710678]
 [0.         0.4472136  0.89442719 0.         0.        ]]
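The row for s3 can be reproduced by hand, assuming TfidfVectorizer's defaults (l2 normalization and smoothed idf): after stop-word removal s3 contains 'just' once and 'like' twice, both words occur in only one of the three documents, so they share the same idf and the normalization cancels it out:

```python
import math

# Raw counts in s3 after stop-word removal: just=1, like=2.
# Equal idfs, so after l2 normalization the row reduces to
# (1, 2) / sqrt(1**2 + 2**2).
just_w = 1 / math.sqrt(5)
like_w = 2 / math.sqrt(5)
print(round(just_w, 7), round(like_w, 8))  # 0.4472136 0.89442719
```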
Reference blog:
"Understanding TF (term frequency) and TF-IDF (term frequency–inverse document frequency)"