1. Overview
TF-IDF (term frequency-inverse document frequency) is, to quote Baidu Baike, "a weighting technique commonly used in information retrieval and data mining".
TF means Term Frequency: in a corpus made of sentences, it measures how often a character or word occurs within a piece of text.
A common formulation: TF = (occurrences of the character/word in the sentence) / (total number of characters/words in the sentence)
IDF means Inverse Document Frequency; it is derived from the number of sentences that contain the character/word (the fewer sentences contain it, the larger its IDF).
A common formulation: IDF = log( total number of sentences in the corpus / (number of sentences containing the character/word + 1) )
TF-IDF = TF * IDF
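To make the formulas concrete, here is a tiny hand-rolled sketch (plain Python; the three-sentence toy corpus is made up purely for illustration):

import math

corpus = [['大漠', '帝国'], ['大漠', '孤烟'], ['长河', '落日']]  # three toy "sentences"

def tf(word, sentence):
    # occurrences of the word in the sentence / number of words in the sentence
    return sentence.count(word) / len(sentence)

def idf(word, corpus):
    # log( number of sentences / (number of sentences containing the word + 1) )
    contain = sum(1 for sent in corpus if word in sent)
    return math.log(len(corpus) / (contain + 1))

print(tf('帝国', corpus[0]) * idf('帝国', corpus))  # tf = 1/2, idf = log(3/2) ≈ 0.405, tf-idf ≈ 0.203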
That is how I described TF-IDF in an earlier post, and I also happened to hand-write the algorithm in an interview. Is that really how it works in practice? Let's find out.
github address:
https://github.com/yongzhuo/Tookit-Sihui/tree/master/tookit_sample/tf_idf_compare
This article walks through four ways to implement tf-idf, each with its own strengths:
1. gensim
2. jieba
3. sklearn
4. by_hand (hand-rolled)
2. Pros and cons (sklearn recommended)
a. gensim: corpora builds the token dictionary, doc2bow builds the bag-of-words representation, tfidf_model computes tf-idf, and the idf values can be read out.
Words never seen in training get no idf and are simply dropped. A middle-of-the-road option; the input can be a list or a file path.
b. jieba: ships with idf.txt, i.e. precomputed idf values; unseen words fall back to the median idf (which is unfriendly for short sentences, especially single-word ones).
c. sklearn: CountVectorizer counts term frequencies, TfidfTransformer computes tf-idf, and the result is stored as a compressed csr_matrix.
It offers n-gram features, smoothing, max_features and many other options; this is the one I recommend.
d. by_hand: a configurable hand-rolled version; term-frequency dictionaries can be computed in batches and merged, so large corpora such as a wiki corpus can be processed with limited memory.
3. Implementations and code notes
3.1 gensim
# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time :2019/7/31 21:20
# @author :Mo
# @function :
from gensim import corpora, models
import jieba
def tfidf_from_questions(corpora_documents):
    """
    Compute tf-idf from already tokenized documents
    :param corpora_documents: list of token lists
    :return: dictionary, tfidf_model
    """
    dictionary = corpora.Dictionary(corpora_documents)
    corpus = [dictionary.doc2bow(text) for text in corpora_documents]
    tfidf_model = models.TfidfModel(corpus)
    return dictionary, tfidf_model
def tfidf_from_corpora(sources_path):
    """
    Read a corpus from file and compute tf-idf
    :param sources_path: path of the corpus file, one document per line
    :return: dictionary, tfidf_model
    """
    from tookit_sihui.utils.file_utils import txt_read, txt_write
    questions = txt_read(sources_path)
    corpora_documents = []
    for item_text in questions:
        item_seg = list(jieba.cut(str(item_text).strip()))
        corpora_documents.append(item_seg)
    dictionary = corpora.Dictionary(corpora_documents)
    corpus = [dictionary.doc2bow(text) for text in corpora_documents]
    tfidf_model = models.TfidfModel(corpus)
    return dictionary, tfidf_model
if __name__ == '__main__':
    # test 1: from tokenized questions
    corpora_documents = [['大漠', '帝国'], ['紫色', 'Angle'], ['花落', '惊', '飞羽'],
                         ['我', 'm', 'o'], ['你', 'the', 'a', 'it', 'this']]
    dictionary, tfidf_model = tfidf_from_questions(corpora_documents)

    sentence = '大漠 大漠 大漠'
    seg = list(jieba.cut(sentence))
    bow = dictionary.doc2bow(seg)
    tfidf_vec = tfidf_model[bow]
    print(bow)
    print(tfidf_vec)

    bow = dictionary.doc2bow(['i', 'i', '大漠', '大漠', '大漠'])
    tfidf_vec = tfidf_model[bow]
    print(bow)
    print(tfidf_vec)

    # test 2: from a text file
    from tookit_sihui.conf.path_config import path_tf_idf_corpus
    dictionary, tfidf_model = tfidf_from_corpora(path_tf_idf_corpus)
    sentence = '大漠帝国'
    seg = list(jieba.cut(sentence))
    bow = dictionary.doc2bow(seg)
    tfidf_vec = tfidf_model[bow]
    print(bow)
    print(tfidf_vec)

    bow = dictionary.doc2bow(['sihui'])
    tfidf_vec = tfidf_model[bow]
    print(bow)
    print(tfidf_vec)
    gg = 0
# results
# [(12, 1)]
# [(12, 1.0)]
# []
# []
# [(172, 1), (173, 1)]
# [(172, 0.7071067811865475), (173, 0.7071067811865475)]
# []
# []
# Notes:
# 1. The left number is the dictionary id, the right one is the token's tf-idf.
# 2. Stopwords (e.g. 'the') and single letters (e.g. 'i') are not removed.
# 3. Tokens never seen in training, e.g. 'sihui', are dropped and not scored.
# 4. Calculation details
# 4.1 idf = add + log_{log_base}(totaldocs / docfreq), implemented in gensim as follows
#     (eps = 1e-12; only terms with |idf| > eps are kept):
def df2idf(docfreq, totaldocs, log_base=2.0, add=0.0):
    import numpy as np
    # np.log() is the natural logarithm; by the change-of-base rule log_a(b) = log_c(b) / log_c(a),
    # the expression below equals log_2(totaldocs / docfreq).
    # Stepping through with a debugger shows there is no smoothing, i.e. log_2(#documents / #documents containing the term).
    # That makes sense: every term in the dictionary occurs in at least one document, so the usual +1 is unnecessary.
    return add + np.log(float(totaldocs) / docfreq) / np.log(log_base)
# see also the self.initialize(corpus) method

# 4.2 tf: from the snippet below (and from debugging) gensim's tf is the raw count,
#     i.e. for the sentence '大漠 大漠 大漠' the tf of '大漠' is 3.
# termid_array, tf_array = [], []
# for termid, tf in bow:
#     termid_array.append(termid)
#     tf_array.append(tf)
#
# tf_array = self.wlocal(np.array(tf_array))
#
# vector = [
#     (termid, tf * self.idfs.get(termid))
#     for termid, tf in zip(termid_array, tf_array)
#     if abs(self.idfs.get(termid, 0.0)) > self.eps
# ]
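To check these details on a live model, the idf values gensim stores can be compared against log2(N / df) directly. A small sketch (assuming Dictionary.token2id, Dictionary.dfs and TfidfModel.idfs behave as in recent gensim versions; the toy corpus is illustrative):

import math
from gensim import corpora, models

docs = [['大漠', '帝国'], ['紫色', 'Angle'], ['花落', '惊', '飞羽']]
dictionary = corpora.Dictionary(docs)
tfidf_model = models.TfidfModel([dictionary.doc2bow(d) for d in docs])

term_id = dictionary.token2id['大漠']
df = dictionary.dfs[term_id]          # number of documents containing the term
print(tfidf_model.idfs[term_id])      # idf stored by gensim
print(math.log(len(docs) / df, 2))    # log2(N / df); should agree with the line above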
3.2 jieba
# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time :2019/7/31 21:21
# @author :Mo
# @function :
import jieba.analyse
import jieba
sentence = '大漠 帝国 和 紫色 Angle'
seg = jieba.cut(sentence)
print(list(seg))  # jieba.cut returns a generator, so materialize it before printing
tf_idf = jieba.analyse.extract_tags(sentence, withWeight=True)
print(tf_idf)
# results
# [('Angle', 2.988691875725), ('大漠', 2.36158258893), ('紫色', 2.10190405216), ('帝国', 1.605909794915)]
# Notes:
# 1.1 idf: jieba's idf values come from its bundled idf.txt,
#     where each line of the original corpus was treated as one document;
#     words not found in idf.txt fall back to the median idf (self.median_idf, roughly 11.x).
#
# 1.2 tf: tf is the count of the word in the current sentence divided by the total number of words,
#     e.g. for '大漠 帝国 和 紫色 Angle' the tf of '大漠' is 1/5.
#     The stopword '和' is removed from the tf-idf output.
# tf computation (jieba source):
# freq[w] = freq.get(w, 0.0) + 1.0
# total = sum(freq.values())
# for k in freq:
#     kw = k.word if allowPOS and withFlag else k
#     freq[k] *= self.idf_freq.get(kw, self.median_idf) / total
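If the bundled idf.txt is a poor fit for your domain, jieba lets you swap in your own idf file and stopword list before calling extract_tags. A minimal sketch ('my_idf.txt' and 'my_stop_words.txt' are placeholder paths you would provide yourself):

import jieba.analyse

# hypothetical files: the idf file has one "word idf" pair per line,
# the stopword file has one word per line
jieba.analyse.set_idf_path('my_idf.txt')
jieba.analyse.set_stop_words('my_stop_words.txt')
print(jieba.analyse.extract_tags('大漠 帝国 和 紫色 Angle', topK=5, withWeight=True))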
3.3 sklearn
# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time :2019/7/31 21:21
# @author :Mo
# @function :
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer


def tfidf_from_ngram(questions):
    """
    Compute n-gram tf-idf with TfidfVectorizer
    :param questions: list, e.g. ['孩子气', '大漠帝国']
    :return: fitted TfidfVectorizer
    """
    from sklearn.feature_extraction.text import TfidfVectorizer
    import jieba

    def jieba_cut(x):
        x = list(jieba.cut(x))
        return ' '.join(x)

    questions = [jieba_cut(''.join(ques)) for ques in questions]
    tfidf_model = TfidfVectorizer(ngram_range=(1, 2),  # n-gram features, default is (1, 1)
                                  max_features=10000,
                                  token_pattern=r"(?u)\b\w+\b",  # keep single-character tokens (the default pattern drops them)
                                  min_df=1,
                                  max_df=0.9,
                                  use_idf=1,
                                  smooth_idf=1,
                                  sublinear_tf=1)
    tfidf_model.fit(questions)
    print(tfidf_model.transform(['紫色 ANGEL 是 虾米 回事']))
    return tfidf_model
if __name__ == "__main__":
    # test 1: CountVectorizer + TfidfTransformer
    corpora_documents = [['大漠', '帝国'], ['紫色', 'Angle'], ['花落', '惊', '飞羽'],
                         ['我', 'm', 'o'], ['你', 'the', 'a', 'it', 'this'], ['大漠', '大漠']]
    corpora_documents = [''.join(ques) for ques in corpora_documents]
    # count term frequencies
    vectorizer = CountVectorizer()
    # tf-idf transformer
    transformer = TfidfTransformer()
    # the inner fit_transform builds the term-frequency matrix, the outer one turns it into tf-idf
    tfidf = transformer.fit_transform(vectorizer.fit_transform(corpora_documents))
    print(tfidf)
    # vocabulary of the model (use get_feature_names_out() on newer scikit-learn)
    word = vectorizer.get_feature_names()
    print(word)
    weight = tfidf.toarray()
    print(weight)

    # test 2: n-gram features with TfidfVectorizer
    tf_idf_model = tfidf_from_ngram(corpora_documents)
    print(tf_idf_model.transform(['你 谁 呀, 小老弟']))
# Notes:
# sklearn also offers TfidfVectorizer (a subclass of CountVectorizer), which extracts n-gram features
# and computes tf-idf in one step.
# Relevant lines from the sklearn source (with smooth_idf=True):
# df += int(self.smooth_idf)          # smoothing
# n_samples += int(self.smooth_idf)   # smoothing
# idf = np.log(n_samples / df) + 1    # plus one
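The smoothed formula can be checked against the idf_ attribute of a fitted TfidfVectorizer; a short sketch with a purely illustrative toy corpus:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['大漠 帝国', '紫色 Angle', '大漠 大漠']
vec = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", smooth_idf=True)
vec.fit(docs)

n_samples = len(docs)
for word, idx in vec.vocabulary_.items():
    df = sum(1 for d in docs if word in d.lower().split())  # document frequency of the token
    manual_idf = np.log((n_samples + 1) / (df + 1)) + 1     # idf = log((n+1)/(df+1)) + 1
    print(word, round(vec.idf_[idx], 6), round(manual_idf, 6))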
3.4 by_hand
# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time :2019/6/19 21:32
# @author :Mo
# @function :tf-idf
from tookit_sihui.utils.file_utils import save_json
from tookit_sihui.utils.file_utils import load_json
from tookit_sihui.utils.file_utils import txt_write
from tookit_sihui.utils.file_utils import txt_read
import jieba
import json
import math
import os
from tookit_sihui.conf.logger_config import get_logger_root
logger = get_logger_root()
def count_tf(questions):
    """
    Count character or word frequencies (raw tf counts)
    :param questions: list, input corpus, char-level example: [['我', '爱', '你'], ['爱', '护', '士']]
    :return: dict, raw counts, e.g. {'我': 1, '爱': 2}
    """
    tf_char = {}
    for question in questions:
        for char in question:
            if char.strip():  # skip empty tokens
                char = str(char).encode('utf-8', 'ignore').decode('utf-8')
                tf_char[char] = tf_char.get(char, 0) + 1
    tf_char['[LENS]'] = sum(v for k, v in tf_char.items())  # total token count
    return tf_char
def count_idf(questions):
    """
    Count document frequencies (raw counts used for idf)
    :param questions: list, input corpus, char-level example: [['我', '爱', '你'], ['爱', '护', '士']]
    :return: dict, document frequencies, e.g. {'我': 1, '爱': 2}
    """
    idf_char = {}
    for question in questions:
        question_set = set(question)  # duplicates within one sentence are counted once
        for char in question_set:
            if char.strip():  # skip empty tokens
                idf_char[char] = idf_char.get(char, 0) + 1
    idf_char['[LENS]'] = len(questions)  # total number of sentences
    return idf_char
def count_tf_idf(freq_char, freq_document, ndigits=12, smooth=0):
    """
    Compute tf, idf and tf-idf from the frequency dictionaries
    :param freq_char: dict, raw term frequencies (from count_tf)
    :param freq_document: dict, raw document frequencies (from count_idf)
    :return: (tf dict, idf dict, tf-idf dict)
    """
    len_tf = freq_char['[LENS]']
    len_tf_mid = int(len(freq_char) / 2)
    len_idf = freq_document['[LENS]']
    len_idf_mid = int(len(freq_document) / 2)
    # tf
    tf_char = {}
    for k2, v2 in freq_char.items():
        tf_char[k2] = round((v2 + smooth) / (len_tf + smooth), ndigits)
    # idf
    idf_char = {}
    for ki, vi in freq_document.items():
        idf_char[ki] = round(math.log((len_idf + smooth) / (vi + smooth), 2), ndigits)
    # tf-idf
    tf_idf_char = {}
    for kti, vti in freq_char.items():
        tf_idf_char[kti] = round(tf_char[kti] * idf_char[kti], ndigits)
    # drop the bookkeeping entries
    tf_char.pop('[LENS]')
    idf_char.pop('[LENS]')
    tf_idf_char.pop('[LENS]')
    # average / max / min / median statistics (materialize the values first,
    # so the statistics are not polluted by the entries added below)
    tf_char_values = list(tf_char.values())
    idf_char_values = list(idf_char.values())
    tf_idf_char_values = list(tf_idf_char.values())
    tf_char['[AVG]'] = round(sum(tf_char_values) / len_tf, ndigits)
    idf_char['[AVG]'] = round(sum(idf_char_values) / len_idf, ndigits)
    tf_idf_char['[AVG]'] = round(sum(tf_idf_char_values) / len_idf, ndigits)
    tf_char['[MAX]'] = max(tf_char_values)
    idf_char['[MAX]'] = max(idf_char_values)
    tf_idf_char['[MAX]'] = max(tf_idf_char_values)
    tf_char['[MIN]'] = min(tf_char_values)
    idf_char['[MIN]'] = min(idf_char_values)
    tf_idf_char['[MIN]'] = min(tf_idf_char_values)
    tf_char['[MID]'] = sorted(tf_char_values)[len_tf_mid]
    idf_char['[MID]'] = sorted(idf_char_values)[len_idf_mid]
    tf_idf_char['[MID]'] = sorted(tf_idf_char_values)[len_idf_mid]
    return tf_char, idf_char, tf_idf_char
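# ---------------------------------------------------------------------------
# A minimal usage sketch of the three helpers above (not executed by this
# module; the char-level toy corpus is purely illustrative):
#   questions_demo = [['我', '爱', '你'], ['爱', '护', '士']]
#   tf_freq_demo = count_tf(questions_demo)    # {'我': 1, '爱': 2, ..., '[LENS]': 6}
#   idf_freq_demo = count_idf(questions_demo)  # '爱' occurs in both sentences -> 2
#   tf_demo, idf_demo, tfidf_demo = count_tf_idf(tf_freq_demo, idf_freq_demo)
#   print(tf_demo['爱'], idf_demo['爱'], tfidf_demo['爱'])  # 2/6, log2(2/2) = 0, 0.0
# ---------------------------------------------------------------------------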
def save_tf_idf_dict(path_dir, tf_char, idf_char, tf_idf_char):
    """
    Sort the dictionaries and save them as text files
    :param path_dir: str, directory to save into
    :param tf_char: dict, tf
    :param idf_char: dict, idf
    :param tf_idf_char: dict, tf-idf
    :return: None
    """
    if not os.path.exists(path_dir):
        os.mkdir(path_dir)
    # sort and save
    tf_char_sorted = sorted(tf_char.items(), key=lambda d: d[1], reverse=True)
    tf_char_sorted = [tf[0] + '\t' + str(tf[1]) + '\n' for tf in tf_char_sorted]
    txt_write(tf_char_sorted, path_dir + 'tf.txt')
    idf_char_sorted = sorted(idf_char.items(), key=lambda d: d[1], reverse=True)
    idf_char_sorted = [idf[0] + '\t' + str(idf[1]) + '\n' for idf in idf_char_sorted]
    txt_write(idf_char_sorted, path_dir + 'idf.txt')
    tf_idf_char_sorted = sorted(tf_idf_char.items(), key=lambda d: d[1], reverse=True)
    tf_idf_char_sorted = [tf_idf[0] + '\t' + str(tf_idf[1]) + '\n' for tf_idf in tf_idf_char_sorted]
    txt_write(tf_idf_char_sorted, path_dir + 'tf_idf.txt')
def save_tf_idf_json(path_dir, tf_freq, idf_freq, tf_char, idf_char, tf_idf_char):
    """
    Save the raw frequency and tf/idf/tf-idf dictionaries as json
    :param path_dir: str, directory to save into
    :param tf_freq: dict, raw term-frequency counts
    :param idf_freq: dict, raw document-frequency counts
    :param tf_char: dict, tf
    :param idf_char: dict, idf
    :param tf_idf_char: dict, tf-idf
    :return: None
    """
    if not os.path.exists(path_dir):
        os.mkdir(path_dir)
    # raw frequency dictionaries
    save_json([tf_freq], path_dir + '/tf_freq.json')
    save_json([idf_freq], path_dir + '/idf_freq.json')
    # tf, idf, tf-idf
    save_json([tf_char], path_dir + '/tf.json')
    save_json([idf_char], path_dir + '/idf.json')
    save_json([tf_idf_char], path_dir + '/tf_idf.json')
def load_tf_idf_json(path_tf_freq=None, path_idf_freq=None, path_tf=None, path_idf=None, path_tf_idf=None):
    """
    Load tf_freq, idf_freq, tf, idf and tf_idf from json files
    :param path_tf_freq: path of tf_freq.json
    :param path_idf_freq: path of idf_freq.json
    :param path_tf: path of tf.json
    :param path_idf: path of idf.json
    :param path_tf_idf: path of tf_idf.json
    :return: the five dictionaries
    """
    json_tf_freq = load_json(path_tf_freq)
    json_idf_freq = load_json(path_idf_freq)
    json_tf = load_json(path_tf)
    json_idf = load_json(path_idf)
    json_tf_idf = load_json(path_tf_idf)
    return json_tf_freq[0], json_idf_freq[0], json_tf[0], json_idf[0], json_tf_idf[0]
def dict_add(dict1, dict2):
    """
    Merge two frequency dictionaries, summing the values of shared keys
    :param dict1: dict, merged into (updated in place)
    :param dict2: dict, merged from
    :return: dict, the merged dictionary
    """
    for i, j in dict2.items():
        if i in dict1:
            dict1[i] += j
        else:
            dict1[i] = j
    return dict1
class TFIDF:
    def __init__(self, questions=None, path_tf=None,
                 path_idf=None, path_tf_idf=None,
                 path_tf_freq=None, path_idf_freq=None,
                 ndigits=12, smooth=0):
        """
        Build (or load) the tf / idf / tf-idf dictionaries
        :param questions: list, input corpus, char-level example: [['我', '爱', '你'], ['爱', '护', '士']]
        """
        self.epsilon = 1e-16
        self.questions = questions
        self.path_tf_freq = path_tf_freq
        self.path_idf_freq = path_idf_freq
        self.path_tf = path_tf
        self.path_idf = path_idf
        self.path_tf_idf = path_tf_idf
        self.ndigits = ndigits
        self.smooth = smooth
        self.create_tfidf()

    def create_tfidf(self):
        if self.questions is not None:  # a questions list (corpus) was passed in: train from scratch
            self.tf_freq = count_tf(self.questions)
            self.idf_freq = count_idf(self.questions)
            self.tf, self.idf, self.tfidf = count_tf_idf(self.tf_freq,
                                                         self.idf_freq,
                                                         ndigits=self.ndigits,
                                                         smooth=self.smooth)
        else:  # otherwise load a previously trained model from json
            self.tf_freq, self.idf_freq, \
            self.tf, self.idf, self.tfidf = load_tf_idf_json(path_tf_freq=self.path_tf_freq,
                                                             path_idf_freq=self.path_idf_freq,
                                                             path_tf=self.path_tf,
                                                             path_idf=self.path_idf,
                                                             path_tf_idf=self.path_tf_idf)
        self.chars = [idf for idf in self.idf.keys()]
    def extract_tfidf_of_sentence(self, ques):
        """
        Average tf-idf score of a sentence
        :param ques: str
        :return: float
        """
        assert type(ques) == str
        if not ques.strip():
            return None
        ques_list = list(jieba.cut(ques.replace(' ', '').strip()))
        logger.info(ques_list)
        score = 0.0
        score_list = {}
        for char in ques_list:
            if char in self.chars:
                score = score + self.tfidf[char]
                score_list[char] = self.tfidf[char]
            else:  # unseen words contribute only a tiny epsilon
                score = score + self.epsilon
                score_list[char] = self.epsilon
        score = score / len(ques_list)  # average, so that sentence length does not dominate
        logger.info(score_list)
        logger.info({ques: score})
        return score
    def extract_tf_of_sentence(self, ques):
        """
        Average tf score of a sentence
        :param ques: str
        :return: float
        """
        assert type(ques) == str
        if not ques.strip():
            return None
        ques_list = list(jieba.cut(ques.replace(' ', '').strip()))
        logger.info(ques_list)
        score = 0.0
        score_list = {}
        for char in ques_list:
            if char in self.chars:
                score = score + self.tf[char]
                score_list[char] = self.tf[char]
            else:  # unseen words contribute only a tiny epsilon
                score = score + self.epsilon
                score_list[char] = self.epsilon
        score = score / len(ques_list)  # average, so that sentence length does not dominate
        logger.info(score_list)
        logger.info({ques: score})
        return score
    def extract_idf_of_sentence(self, ques):
        """
        Average idf score of a sentence
        :param ques: str
        :return: float
        """
        assert type(ques) == str
        if not ques.strip():
            return None
        ques_list = list(jieba.cut(ques.replace(' ', '').strip()))
        logger.info(ques_list)
        score = 0.0
        score_list = {}
        for char in ques_list:
            if char in self.chars:
                score = score + self.idf[char]
                score_list[char] = self.idf[char]
            else:  # unseen words contribute only a tiny epsilon
                score = score + self.epsilon
                score_list[char] = self.epsilon
        score = score / len(ques_list)  # average, so that sentence length does not dominate
        logger.info(score_list)
        logger.info({ques: score})
        return score
def create_TFIDF(path):
    # Test 1: build tf-idf from a corpus file, then save it for later use
    import time
    time_start = time.time()
    # first build tf-idf over the whole corpus, then use it for scoring
    from tookit_sihui.conf.path_config import path_tf_idf_corpus
    from tookit_sihui.utils.file_utils import txt_write, txt_read

    path_wiki = path if path else path_tf_idf_corpus
    path_dir = 'tf_idf_freq/'
    # ques = ['大漠帝国最强', '花落惊飞羽最漂亮', '紫色Angle最有气质', '孩子气最活泼', '口袋巧克力和过路蜻蜓最好最可爱啦', '历历在目最烦恼']
    # questions = [list(q.strip()) for q in ques]
    # questions = [list(jieba.cut(que)) for que in ques]
    questions = txt_read(path_wiki)
    len_questions = len(questions)
    batch_size = 1000000
    size_trade = len_questions // batch_size
    print(size_trade)
    size_end = size_trade * batch_size

    # accumulate tf-freq and idf-freq batch by batch
    ques_tf_all, ques_idf_all = {}, {}
    for i, (start, end) in enumerate(zip(range(0, size_end, batch_size),
                                         range(batch_size, size_end + batch_size, batch_size))):
        print("batch {}".format(i))
        question = questions[start: end]
        questionss = [ques.strip().split(' ') for ques in question]
        ques_idf = count_idf(questionss)
        ques_tf = count_tf(questionss)
        print('tf_idf_{}: '.format(i) + str(time.time() - time_start))
        # merge the dictionaries, summing the values
        ques_tf_all = dict_add(ques_tf_all, ques_tf)
        ques_idf_all = dict_add(ques_idf_all, ques_idf)
        print('dict_add_{}: '.format(i) + str(time.time() - time_start))
        print('tf of 的: {}'.format(ques_tf_all['的']))
        print('idf of 的: {}'.format(ques_idf_all['的']))

    # the remainder that does not fill a whole batch
    if len_questions - size_end > 0:
        print("batch {}".format('last'))
        question = questions[size_end: len_questions]
        questionss = [ques.strip().split(' ') for ques in question]
        ques_tf = count_tf(questionss)
        ques_idf = count_idf(questionss)
        ques_tf_all = dict_add(ques_tf_all, ques_tf)
        ques_idf_all = dict_add(ques_idf_all, ques_idf)
        print('{}: '.format('last') + str(time.time() - time_start))
        print('tf of 的: {}'.format(ques_tf_all['的']))
        print('idf of 的: {}'.format(ques_idf_all['的']))

    # compute tf-idf from the merged frequency dictionaries
    tf_char, idf_char, tf_idf_char = count_tf_idf(ques_tf_all, ques_idf_all)
    print(len(tf_char))
    print('tf-idf ' + str(time.time() - time_start))
    print('tf-idf ok!')
    # save tf, idf, tf-idf
    save_tf_idf_json(path_dir, ques_tf_all, ques_idf_all, tf_char, idf_char, tf_idf_char)
    gg = 0
if __name__ == "__main__":
    # Test 1: build tf-idf from a corpus file
    path = None  # corpus path; each line is a pre-tokenized sentence, e.g. '孩子 气 和 紫色 angle'
    create_TFIDF(path)

    # # Test 2: load the saved json files with the TFIDF class and score user input
    # path_dir = 'tf_idf_freq/'
    # path_tf = path_dir + '/tf.json'
    # path_idf = path_dir + '/idf.json'
    # path_tf_idf = path_dir + '/tf_idf.json'
    #
    # tfidf = TFIDF(path_tf=path_tf, path_idf=path_idf, path_tf_idf=path_tf_idf)
    # score1 = tfidf.extract_tf_of_sentence('大漠帝国')
    # score2 = tfidf.extract_idf_of_sentence('大漠帝国')
    # score3 = tfidf.extract_tfidf_of_sentence('大漠帝国')
    # print('tf: ' + str(score1))
    # print('idf: ' + str(score2))
    # print('tfidf: ' + str(score3))
    # while True:
    #     print("please input: ")
    #     ques = input()
    #     tfidf_score = tfidf.extract_tfidf_of_sentence(ques)
    #     print('tfidf:' + str(tfidf_score))
Hope this helps!