Text Vectorization

Background Concepts

  • Inverse document frequency (IDF): the weight of each word, inversely proportional to how many documents contain it.
  IDF = log(total number of documents / (number of documents containing the word + 1))
  • TF-IDF: a measure of how likely a segment is to be a keyword; the larger the value, the more likely the segment is a keyword.
  TF-IDF = TF * IDF
  TF: the term frequency, i.e. how many times the segment appears in the document
  • Text vectorization: suppose there are m articles d1, d2, ..., dm; segmenting them yields n segments w1, w2, ..., wn. Let Fij be the number of times segment wj appears in article di; the articles can then be represented by the m×n matrix F = (Fij). A toy example is sketched right after this list.
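A minimal sketch of these definitions on a made-up 3×4 count matrix (the numbers are purely illustrative, not from the corpus used below); it mirrors the log2-based IDF handler used in the example code:

import numpy

# Made-up document-term count matrix F: 3 documents (rows) x 4 segments (columns)
F = numpy.array([
    [2, 0, 1, 0],
    [0, 3, 1, 0],
    [1, 0, 0, 4],
])

TF = F                                # term frequency: raw counts per document
df = numpy.sum(F > 0, axis=0)         # number of documents containing each segment
IDF = numpy.log2(len(F) / (df + 1))   # len(F) = total number of documents
TF_IDF = TF * IDF                     # broadcast IDF across each document row

print(TF_IDF)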

Notes
zhPattern = re.compile(u'[\u4e00-\u9fa5]+')  # matches segments made up of Chinese characters

  • IDF calculation
def hanlder(x):
    return (numpy.log2(len(corpos)/(numpy.sum(x>0)+1)))
IDF = TF.apply(hanlder)
  • Cross-tabulation (pivot) function; a toy example follows this list
  pivot_table(values, index, columns, aggfunc, fill_value)
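A minimal sketch of pivot_table on a made-up (filePath, segment, count) table rather than the real corpus, showing how the per-file word counts below are spread into a document-term matrix:

import pandas

# Made-up per-(file, segment) counts, purely for illustration
demo = pandas.DataFrame({
    'filePath': ['a.txt', 'a.txt', 'b.txt'],
    'segment':  ['中国',  '北京',  '中国'],
    '计数':      [2,       1,       3]
})

# Rows become files, columns become segments, missing cells are filled with 0
demoTF = demo.pivot_table(
    index='filePath',
    columns='segment',
    values='计数',
    fill_value=0
)
print(demoTF)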

Example code

# -*- coding: utf-8 -*-

import numpy

# Build the corpus
import os
import os.path
import codecs

filePaths = []
fileContents = []
for root, dirs, files in os.walk(
    "D:\\PDM\\2.7\\SogouC.mini\\Sample"
):
    for name in files:
        filePath = os.path.join(root, name)
        filePaths.append(filePath)
        f = codecs.open(filePath, 'r', 'utf-8')
        fileContent = f.read()
        f.close()
        fileContents.append(fileContent)

import pandas
corpos = pandas.DataFrame({
    'filePath': filePaths,
    'fileContent': fileContents
})

import re
# Regex that matches Chinese-character segments
zhPattern = re.compile(u'[\u4e00-\u9fa5]+')

import jieba

segments = []
filePaths = []
for index, row in corpos.iterrows():
    filePath = row['filePath']
    fileContent = row['fileContent']
    segs = jieba.cut(fileContent)
    for seg in segs:
        if zhPattern.search(seg):
            segments.append(seg)
            filePaths.append(filePath)

segmentDF = pandas.DataFrame({
    'filePath':filePaths, 
    'segment':segments
})

# Remove stop words
stopwords = pandas.read_csv(
    "D:\\PDM\\2.7\\StopwordsCN.txt", 
    encoding='utf8', 
    index_col=False,
    quoting=3,
    sep="\t"
)

segmentDF = segmentDF[
    ~segmentDF.segment.isin(
        stopwords.stopword
    )
]

# Count how often each segment appears in each article
segStat = segmentDF.groupby(
    by=["filePath", "segment"]
).size().reset_index(name="计数").sort_values(
    by="计数",
    ascending=False
)

# Drop segments that appear only once in an article
segStat = segStat[segStat.计数 > 1]

# Build the document-term matrix (text vectorization)
TF = segStat.pivot_table(
    index='filePath', 
    columns='segment', 
    values='计数',
    fill_value=0
)

TF.index      # one row per document (file path)
TF.columns    # one column per segment (the vocabulary)

# IDF per segment: log2(number of documents / (number of documents containing it + 1))
def hanlder(x): 
    return (numpy.log2(len(corpos)/(numpy.sum(x>0)+1)))

IDF = TF.apply(hanlder)   # one IDF value per TF column (segment)

TF_IDF = pandas.DataFrame(TF*IDF)   # scale each TF column by its segment's IDF weight

tag1s = []
tag2s = []
tag3s = []
tag4s = []
tag5s = []

for filePath in TF_IDF.index:
    # Top five segments by TF-IDF weight for this document
    tagis = TF_IDF.loc[filePath].sort_values(
        ascending=False
    )[:5].index
    tag1s.append(tagis[0])
    tag2s.append(tagis[1])
    tag3s.append(tagis[2])
    tag4s.append(tagis[3])
    tag5s.append(tagis[4])

# The tag lists follow TF_IDF.index order (pivot_table sorts by filePath),
# so build the frame on that index and merge the original texts back in
tagDF = pandas.DataFrame({
    'filePath': TF_IDF.index.values, 
    'tag1': tag1s, 
    'tag2': tag2s, 
    'tag3': tag3s, 
    'tag4': tag4s, 
    'tag5': tag5s
}).merge(corpos, on='filePath')

Document Vectorization with sklearn

For background on data mining with sklearn, see: 使用sklearn进行数据挖掘 (Data Mining with sklearn).

Concepts used in this example

  • Document vectorization: sklearn.feature_extraction.text.CountVectorizer
  • TF-IDF computation: sklearn.feature_extraction.text.TfidfTransformer (a one-step alternative, TfidfVectorizer, is sketched just below)
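
As a side note not in the original example: sklearn also ships sklearn.feature_extraction.text.TfidfVectorizer, which combines the two steps above. A minimal sketch, assuming the same kind of whitespace-segmented input as the example code below:

from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer = CountVectorizer + TfidfTransformer in one step
tfidfVectorizer = TfidfVectorizer(token_pattern=r"\b\w+\b")
tfidfMatrix = tfidfVectorizer.fit_transform([
    '我 是 中国人。',
    '你 是 美国人。'
])
print(tfidfVectorizer.vocabulary_)   # token -> column index
print(tfidfMatrix.toarray())         # TF-IDF weights, one row per document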

Example code

#!/usr/bin/env python
# coding=utf-8


contents = [
    '我 是 中国人。',
    '你 是 美国人。',
    '你 叫 什么 名字?',
    '她 是 谁 啊?'
]
from sklearn.feature_extraction.text import CountVectorizer
countVectorizer = CountVectorizer()
textVector = countVectorizer.fit_transform(contents)

textVector.todense()           # dense document-term count matrix
countVectorizer.vocabulary_    # mapping from token to column index


# The default token_pattern keeps only tokens of two or more characters, so
# single-character words such as 我/你/是 would be dropped; this pattern keeps them
countVectorizer = CountVectorizer(
    min_df=0,
    token_pattern=r"\b\w+\b"
)
textVector = countVectorizer.fit_transform(contents)
textVector.todense()


countVectorizer.vocabulary_

from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()

tfid = transformer.fit_transform(textVector)   # TF-IDF weighted sparse matrix


import pandas as pd 

TFIDFDataFrame = pd.DataFrame(tfid.toarray())
# get_feature_names_out() in recent sklearn; use get_feature_names() on older versions
TFIDFDataFrame.columns = countVectorizer.get_feature_names_out()

import numpy as np
# Column positions of the two largest TF-IDF values in each document
TFIDFSorted = np.argsort(tfid.toarray(), axis=1)[:, -2:]

TFIDFDataFrame.columns[TFIDFSorted].values   # the corresponding feature names
# print (TFIDFDataFrame)
