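The previous three posts (see the other posts on this blog) covered the background; this post walks through the implementation of the TF-IDF algorithm.
The keyword extraction code is as follows: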
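Before turning to the jieba scripts, it may help to see the scoring idea in isolation. The following is a minimal sketch, not jieba's own code: the function name `tfidf_keywords`, the toy corpus, and the smoothed IDF formula are all assumptions chosen for illustration. jieba itself looks IDF values up in a prebuilt table rather than computing them from a corpus at run time.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, top_k=3):
    """Rank the tokens of one document by TF-IDF against a small corpus."""
    n_docs = len(corpus)
    # Document frequency: in how many corpus documents each word appears
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))

    counts = Counter(doc_tokens)
    total = sum(counts.values())
    scores = {}
    for word, count in counts.items():
        tf = count / total
        # Smoothed IDF: adding 1 to both counts avoids division by zero
        # for words that never occur in the corpus
        idf = math.log((1 + n_docs) / (1 + doc_freq[word]))
        scores[word] = tf * idf
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Hypothetical, already-segmented documents (illustration only)
corpus = [
    ["monkey", "king", "journey", "west"],
    ["king", "palace", "river"],
    ["river", "mountain", "journey"],
    ["palace", "mountain", "king"],
]
doc = ["monkey", "monkey", "king", "journey"]
keywords = tfidf_keywords(doc, corpus, top_k=2)
print(keywords)
```

A word scores highly when it is frequent in the document but rare across the corpus, which is why "monkey" outranks the more widespread "king" here.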
import sys
sys.path.append('../')  # make a local jieba checkout importable

import jieba
import jieba.analyse
from optparse import OptionParser

USAGE = "usage: python extract_tags.py [file name] -k [top k]"

parser = OptionParser(USAGE)
parser.add_option("-k", dest="topK")
opt, args = parser.parse_args()

# The input file name is a required positional argument
if len(args) < 1:
    print(USAGE)
    sys.exit(1)

file_name = args[0]

# Default to the top 10 keywords when -k is not given
if opt.topK is None:
    topK = 10
else:
    topK = int(opt.topK)

# Read the raw bytes; jieba decodes them internally
with open(file_name, 'rb') as f:
    content = f.read()

tags = jieba.analyse.extract_tags(content, topK=topK)
print(",".join(tags))
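The test sample is a TXT file of Journey to the West (all later scripts in this post are tested on the same text, so this will not be repeated). Enter the following on the command line:
python extract_tags.py xiyouji.txt -k 10
The output is as follows:
Usage: jieba.analyse.set_idf_path(file_name) # file_name is the path to a custom IDF corpus file
Example of a custom corpus: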
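As a hedged illustration (not the author's original example), the sketch below builds a small IDF file and reads it back. It assumes the format is one `word idf` pair per line, separated by a single space and encoded as UTF-8, as in the idf.txt that ships with jieba. The helper names `write_idf_file` and `read_idf_file`, the toy corpus, and the plain `log(N / df)` formula are hypothetical choices for this sketch.

```python
import math
import os
import tempfile
from collections import Counter


def write_idf_file(corpus, path):
    """Write one 'word idf' pair per line, UTF-8 encoded."""
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))
    n_docs = len(corpus)
    with open(path, "w", encoding="utf-8") as f:
        for word, df in sorted(doc_freq.items()):
            # Every word written here occurs in at least one document,
            # so df >= 1 and the division is safe
            idf = math.log(n_docs / df)
            f.write("%s %.6f\n" % (word, idf))


def read_idf_file(path):
    """Parse the file back into a {word: idf} dictionary."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, value = line.strip().split(" ")
            table[word] = float(value)
    return table


# Hypothetical, already-segmented documents (illustration only)
corpus = [["悟空", "师父"], ["师父", "八戒"], ["师父", "妖怪"]]
idf_path = os.path.join(tempfile.mkdtemp(), "my_idf.txt")
write_idf_file(corpus, idf_path)
idf_table = read_idf_file(idf_path)
print(idf_table)
```

A file produced this way could then be passed to jieba.analyse.set_idf_path(idf_path). A word that appears in every document, such as 师父 here, gets an IDF of 0 and therefore never ranks as a keyword.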
The code is as follows:
import sys
sys.path.append('../')  # make a local jieba checkout importable

import jieba
import jieba.analyse
from optparse import OptionParser

USAGE = "usage: python extract_tags_idfpath.py [file name] -k [top k]"

parser = OptionParser(USAGE)
parser.add_option("-k", dest="topK")
opt, args = parser.parse_args()

# The input file name is a required positional argument
if len(args) < 1:
    print(USAGE)
    sys.exit(1)

file_name = args[0]

# Default to the top 10 keywords when -k is not given
if opt.topK is None:
    topK = 10
else:
    topK = int(opt.topK)

# Read the raw bytes; jieba decodes them internally
with open(file_name, 'rb') as f:
    content = f.read()

# Switch to the custom IDF corpus before extracting keywords
jieba.analyse.set_idf_path("../jieba-基于 TF-IDF 算法的关键词抽取/idf.txt.big")

tags = jieba.analyse.extract_tags(content, topK=topK)
print(",".join(tags))
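Enter the following on the command line:
python extract_tags_idfpath.py xiyouji.txt -k 10
The output is as follows:
Usage: jieba.analyse.set_stop_words(file_name) # file_name is the path to a custom stop-word file
Example of a custom stop-word file (note: the file must be UTF-8 encoded):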
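As a hedged illustration (not the author's original example), the sketch below writes a stop-word file and shows what filtering with it amounts to. It assumes the file holds one stop word per line in UTF-8. The helper names and the sample words are hypothetical, and the filtering function only mimics the effect of stop words; it is not jieba's internal code.

```python
import os
import tempfile


def write_stop_words(words, path):
    """Write one stop word per line, explicitly encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for word in words:
            f.write(word + "\n")


def load_stop_words(path):
    """Read the file back into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def drop_stop_words(tokens, stop_words):
    """Keep only the tokens that are not stop words."""
    return [t for t in tokens if t not in stop_words]


stop_path = os.path.join(tempfile.mkdtemp(), "my_stop_words.txt")
write_stop_words(["的", "了", "一个"], stop_path)
stop_words = load_stop_words(stop_path)
filtered = drop_stop_words(["悟空", "的", "师父", "了"], stop_words)
print(filtered)
```

Passing `encoding="utf-8"` explicitly matters on platforms whose default encoding is something else, since the file would otherwise be written in the wrong encoding. A file written this way could then be passed to jieba.analyse.set_stop_words(stop_path).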
The code is as follows:
import sys
sys.path.append('../')  # make a local jieba checkout importable

import jieba
import jieba.analyse
from optparse import OptionParser

USAGE = "usage: python extract_tags_stop_words.py [file name] -k [top k]"

parser = OptionParser(USAGE)
parser.add_option("-k", dest="topK")
opt, args = parser.parse_args()

# The input file name is a required positional argument
if len(args) < 1:
    print(USAGE)
    sys.exit(1)

file_name = args[0]

# Default to the top 10 keywords when -k is not given
if opt.topK is None:
    topK = 10
else:
    topK = int(opt.topK)

# Read the raw bytes; jieba decodes them internally
with open(file_name, 'rb') as f:
    content = f.read()

# Load the custom stop words and the custom IDF corpus before extracting
jieba.analyse.set_stop_words("../extra_dict/stop_words.txt")
jieba.analyse.set_idf_path("../extra_dict/idf.txt.big")

tags = jieba.analyse.extract_tags(content, topK=topK)
print(",".join(tags))
Enter the following on the command line:
python extract_tags_stop_words.py xiyouji.txt -k 10
The output is as follows:
The code for returning each keyword together with its weight is as follows:
import sys
sys.path.append('../')  # make a local jieba checkout importable

import jieba
import jieba.analyse
from optparse import OptionParser

USAGE = "usage: python extract_tags_with_weight.py [file name] -k [top k] -w [with weight=1 or 0]"

parser = OptionParser(USAGE)
parser.add_option("-k", dest="topK")
parser.add_option("-w", dest="withWeight")
opt, args = parser.parse_args()

# The input file name is a required positional argument
if len(args) < 1:
    print(USAGE)
    sys.exit(1)

file_name = args[0]

# Default to the top 10 keywords when -k is not given
if opt.topK is None:
    topK = 10
else:
    topK = int(opt.topK)

# Weights are shown only for "-w 1"; compare with ==, not "is"
if opt.withWeight is None:
    withWeight = False
else:
    withWeight = int(opt.withWeight) == 1

# Read the raw bytes; jieba decodes them internally
with open(file_name, 'rb') as f:
    content = f.read()

tags = jieba.analyse.extract_tags(content, topK=topK, withWeight=withWeight)

if withWeight:
    # With weights, each item is a (keyword, weight) pair
    for tag in tags:
        print("tag: %s\t\t weight: %f" % (tag[0], tag[1]))
else:
    print(",".join(tags))
Enter the following on the command line:
python extract_tags_with_weight.py xiyouji.txt -k 10 -w 1
(1 shows the weights, 0 hides them)
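The 1/0 convention for -w can also be expressed as a plain on/off switch. The following is a minimal sketch using argparse from the standard library instead of optparse. It is an alternative, not the script used in this post, and it changes the command line: -w takes no value and -k is converted to an integer by the parser. The `build_parser` helper is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser with the same options as the script, -w as a flag."""
    parser = argparse.ArgumentParser(
        description="Extract keywords from a text file.")
    parser.add_argument("file_name", help="path to the input text file")
    parser.add_argument("-k", dest="topK", type=int, default=10,
                        help="number of keywords to return")
    parser.add_argument("-w", dest="withWeight", action="store_true",
                        help="also print each keyword's weight")
    return parser


# Parse explicit argument lists so the sketch runs without a real CLI call
opts = build_parser().parse_args(["xiyouji.txt", "-k", "5", "-w"])
print(opts.file_name, opts.topK, opts.withWeight)

defaults = build_parser().parse_args(["xiyouji.txt"])
print(defaults.topK, defaults.withWeight)
```

With this layout the default handling and the integer conversion live in the parser, so the `if ... is None` branches are no longer needed.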
The output is as follows: