TF-IDF Values and Text Vectorization

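Using the extracted feature words, compute each feature's value, namely its TF-IDF weight. Each document is represented as a vector under the vector space model (VSM), and the documents are written out in the .arff format that WEKA can read.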
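As a minimal sketch of the weighting itself: the script in this post divides a word's count by the document's tmax value (read from the first line of each frequency file, presumably the highest count in that document) and multiplies the result by a base-10 IDF. The function name `tf_idf`, its parameter names, and the numbers below are assumptions made for illustration; they are not part of the original script.

```python
import math


def tf_idf(term_count, max_count, doc_freq, total_docs):
    """TF-IDF with max-normalized term frequency and base-10 IDF."""
    # TF: the word's count relative to the document's reference count (tmax)
    tf = float(term_count) / float(max_count)
    # IDF: log10 of (total number of documents / documents containing the word)
    idf = math.log(float(total_docs) / float(doc_freq), 10)
    return tf * idf


# Hypothetical numbers: the word occurs 5 times, the reference count is 20,
# and the word appears in 100 of 1000 documents.
weight = tf_idf(5, 20, 100, 1000)
print(weight)
```

With these toy numbers the TF is 0.25 and the IDF is 1, so the weight comes out at 0.25. A word that is absent from a document has a count of 0 and therefore a weight of 0.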

Here is the code:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import codecs
import math

# Feature word list
feture_word = []  # feature words
feture_word_dic = {}  # document frequency (DF) of each feature word
feture_word_dic2 = {}  # computed IDF of each feature word

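# Each line of ni.txt holds a feature word followed by its document frequency
f = codecs.open('/Users/Administrator/Desktop/ni.txt','rb',encoding='utf-8')
for line in f:
    line = line.split()
    # IDF = log10(N / DF); 4205 is presumably the total number of documents
    IDF = math.log(4205/float(line[1]),10)
    feture_word.append(line[0])
    feture_word_dic[line[0]] = line[1]
    feture_word_dic2[line[0]] = IDF
f.close()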

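alltext = []  # one {word: count} dict per document
for j in range(1,10):  # category folders 1..9
    for i in range(10,510):  # document files 10.txt .. 509.txt
        dic = {}
        try:
            f = codecs.open('/Users/Administrator/Desktop/wordsfrequence2/%d/%d.txt' % (j,i), 'rb',encoding='utf-8')
            # The first line is read separately; its count is stored as tmax,
            # the denominator of the term frequency
            p = f.readline().split()
            dic['【tmax】'] = p[1]
            for line in f:
                line = line.split()
                dic[line[0]] = line[1]
            f.close()
            dic['【type】'] = j  # class label = category folder number
            alltext.append(dic)
        except (IOError, IndexError, UnicodeError):
            # Missing, malformed, or undecodable file: report it and move on
            print(u'Problem document: %d/%d' % (j, i))
            continue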

alltext_vector = []  # one TF-IDF vector per document, class label last
for dic in alltext:
    vector = []
    for word in feture_word:
        t = dic.get(word, 0)  # count is 0 when the word is absent from the document
        # TF (count divided by tmax) times IDF
        tf_idf = (float(t)/float(dic['【tmax】']))*feture_word_dic2[word]
        vector.append(tf_idf)
    vector.append(dic['【type】'])  # append the class label
    alltext_vector.append(vector)
    
# Open the output once in 'w' mode so that any existing file is overwritten
data = codecs.open('/Users/Administrator/Desktop/data.arff','w',encoding='utf-8')

# ARFF header: relation name, one numeric attribute per feature word, nominal class
data.write(u'@relation sougoucorpus\n\n')
for everyword in feture_word:
    data.write(u'@attribute ' + everyword + u' numeric\n')
data.write(u'@attribute 【type】 {1,2,3,4,5,6,7,8,9}\n\n@data\n')
# One comma-separated row per document, class label last
for vector in alltext_vector:
    data.write(u','.join([str(value) for value in vector]) + u'\n')
data.close()

Output:

[Image 1: TF-IDF values and text vectorization]
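For reference, the ARFF layout that the script writes can be sketched on a toy example. The helper `to_arff`, the relation name `toy`, the attribute names `w1`, `w2`, and `class`, and the sample values are all hypothetical and chosen only for this sketch; the text is built in memory so nothing is written to disk.

```python
def to_arff(relation, feature_names, rows, class_values):
    """Build ARFF text: numeric attributes plus one nominal class attribute.

    Each entry of `rows` is a vector whose last element is the class label.
    """
    lines = ['@relation %s' % relation, '']
    # One numeric attribute per feature word
    for name in feature_names:
        lines.append('@attribute %s numeric' % name)
    # Nominal class attribute listing every allowed label
    labels = ','.join(str(v) for v in class_values)
    lines.append('@attribute class {%s}' % labels)
    lines.append('')
    lines.append('@data')
    # One comma-separated row per document
    for row in rows:
        lines.append(','.join(str(v) for v in row))
    return '\n'.join(lines) + '\n'


# Two toy documents with two features each; the last value is the class label
arff_text = to_arff('toy', ['w1', 'w2'],
                    [[0.25, 0.0, 1], [0.0, 0.5, 2]], [1, 2])
print(arff_text)
```

The class label sits in the last column of each data row, matching the order in which the attributes are declared in the header.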
