Python: Principles and Applications of the Chi-Square Test

The chi-square test is also written as the χ² test.

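The independence hypothesis:
Suppose we have a collection of news items or comments and want to know whether containing a particular word (say the slang term 6得很) is related to an item's sentiment class (say positive). A simple count gives a four-cell (2×2) table like this: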

Group    Positive    Not positive    Total
Does not contain 6得很    19    24    43
Contains 6得很    34    10    44
Total    53    34    87

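The first thing this table tells us is that, in this sample, the positive share differs depending on whether an item contains a word such as 6得很: items containing 6得很 are positive more often (34/44 ≈ 77.3% versus 19/43 ≈ 44.2%). However, we cannot yet rule out that this difference is due to sampling error. So we first assume that containing 6得很 and being positive are independent. Under that assumption, the probability that a randomly drawn news headline is positive is: (19 + 34) / (19 + 34 + 24 + 10) = 60.9%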

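The table of expected values:
Step two: under the independence hypothesis, build a new four-cell table of expected (theoretical) values:

Group    Positive    Not positive    Total
Does not contain 6得很    43 * 0.609 = 26.2    43 * 0.391 = 16.8    43
Contains 6得很    44 * 0.609 = 26.8    44 * 0.391 = 17.2    44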
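The expected-value table can be reproduced with a few lines of Python. This is a minimal sketch using only the counts from the first table; the helper name `expected_counts` is chosen here for illustration. It uses row total × column total / grand total, which is the same as multiplying each row total by the overall positive (or not positive) share.

```python
# Observed counts from the first table.
# Rows: (does not contain the word, contains the word)
# Columns: (positive, not positive)
observed = [[19, 24],
            [34, 10]]

def expected_counts(table):
    """Expected cell counts under independence: row_total * col_total / grand_total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = float(sum(row_totals))
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

expected = expected_counts(observed)
for row in expected:
    print([round(value, 1) for value in row])
# [26.2, 16.8]
# [26.8, 17.2]
```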

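If the two variables really are independent, the expected values in the four-cell table should differ only slightly from the observed values.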

Computing the χ² value
$$\chi^2 = \sum \frac{(A - T)^2}{T}$$

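Here A is the actual (observed) value, i.e. the four numbers in the first four-cell table, and T is the theoretical (expected) value, i.e. the four numbers in the expected-value table.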

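χ² measures how far the observed values deviate from the expected values (this is the core idea of the chi-square test). It captures two things:

The absolute size of the deviation between observed and expected values (squaring amplifies large deviations)
The size of the deviation relative to the expected value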

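For the scenario above, the χ² value comes out to 10.01 when the rounded expected values are used (about 10.00 with unrounded expected values).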
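As a quick check of that number, the statistic is just the sum of (A - T)² / T over the four cells. The sketch below is illustrative only (the function name `chi_square` is an assumption of this example); it evaluates the sum once with the rounded expected values from the table and once with unrounded ones.

```python
def chi_square(observed, expected):
    """Sum of (A - T)^2 / T over all cells of two same-shaped tables."""
    return sum((a - t) ** 2 / float(t)
               for obs_row, exp_row in zip(observed, expected)
               for a, t in zip(obs_row, exp_row))

observed = [[19, 24], [34, 10]]

# Expected values as rounded in the expected-value table.
rounded_expected = [[26.2, 16.8], [26.8, 17.2]]
chi2_rounded = chi_square(observed, rounded_expected)

# Unrounded expected values: row_total * col_total / grand_total.
exact_expected = [[43 * 53 / 87.0, 43 * 34 / 87.0],
                  [44 * 53 / 87.0, 44 * 34 / 87.0]]
chi2_exact = chi_square(observed, exact_expected)

print(round(chi2_rounded, 2))  # 10.01
print(round(chi2_exact, 2))    # 10.0
```

The small gap between the two results comes only from rounding the expected values to one decimal place.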

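Critical values of the chi-square distribution
Now that we have the χ² value, how do we judge whether it is plausible, that is, whether the independence hypothesis holds up? The answer is to look it up in a table of chi-square critical values.
This requires the notion of degrees of freedom: V = (number of rows - 1) * (number of columns - 1). For a four-cell table, V = 1.
For V = 1, the critical values of the chi-square distribution are: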

Upper-tail probability P    0.10    0.05    0.025    0.01    0.005
Critical value (V = 1)    2.71    3.84    5.02    6.63    7.88

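Since 10.01 > 7.88, a χ² value this large would occur with probability below 0.5% if containing 6得很 and being positive were independent. We therefore reject the independence hypothesis at the 0.5% significance level and conclude that the two are related. Note that this is a statement about how unlikely the data are under independence, not a 99.5% probability that the two are related.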
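Instead of a lookup table, the tail probability for V = 1 can be computed directly. A chi-square variable with one degree of freedom is the square of a standard normal variable, so its upper-tail probability is erfc(sqrt(x / 2)). The sketch below relies on that identity and on the standard library only; the function name `chi2_sf_df1` is made up for this example, and the formula does not apply to other degrees of freedom.

```python
import math

def chi2_sf_df1(x):
    """Upper-tail probability P(X >= x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

p_value = chi2_sf_df1(10.01)
print(p_value < 0.005)   # True: 10.01 lies beyond the 0.005 critical value

# Sanity check against two well-known critical values for V = 1.
print(round(chi2_sf_df1(3.841), 3))  # 0.05
print(round(chi2_sf_df1(7.879), 3))  # 0.005
```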

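Application scenarios
A typical use of the chi-square test is to check whether a distribution observed under specific conditions matches a theoretical distribution. For example: does the distribution of some metric for a particular user differ strongly from the overall (site-wide) distribution? The critical probabilities then give a principled way to flag anomalous users.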
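A rough sketch of that anomaly check follows. All numbers are hypothetical: `overall_shares` stands for the site-wide share of a metric across three buckets, and the count lists stand for individual users. With three buckets there are 3 - 1 = 2 degrees of freedom, and for exactly two degrees of freedom the upper-tail probability has the closed form exp(-x / 2), which keeps the example free of external libraries.

```python
import math

def goodness_of_fit(observed_counts, reference_shares):
    """Chi-square statistic comparing observed counts with a reference distribution."""
    total = float(sum(observed_counts))
    expected = [total * share for share in reference_shares]
    return sum((a - t) ** 2 / t for a, t in zip(observed_counts, expected))

def is_anomalous(observed_counts, reference_shares, alpha=0.005):
    """Flag a user whose counts are unlikely under the reference shares (3 buckets only)."""
    statistic = goodness_of_fit(observed_counts, reference_shares)
    p_value = math.exp(-statistic / 2.0)  # valid only for 2 degrees of freedom
    return p_value < alpha

overall_shares = [0.5, 0.3, 0.2]   # hypothetical site-wide distribution
typical_user = [52, 29, 19]        # hypothetical counts close to the overall shares
unusual_user = [20, 30, 50]        # hypothetical counts far from the overall shares

print(is_anomalous(typical_user, overall_shares))   # False
print(is_anomalous(unusual_user, overall_shares))   # True
```

For a different number of buckets, the degrees of freedom change and the tail probability would have to come from a critical-value table or a statistics library instead.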

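In addition, the χ² value reflects how strongly an independent variable is associated with the dependent variable: for a fixed sample size, the larger the χ² value, the stronger the evidence of association. So it is natural to use χ² for dimensionality reduction, keeping the variables with the strongest association. Returning to the news sentiment example: suppose we want the 100 words most strongly associated with the positive class, and then decide whether an item is positive based on whether it contains those words. How? Compute the χ² value for every word that appears in the positive class following the steps above, sort by χ², and take the 100 words with the largest values.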
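The selection procedure can be sketched in plain Python. The toy corpus, labels, and function names below are invented for illustration; each document is assumed to be an already segmented list of words. For a 2×2 table [[a, b], [c, d]] the statistic equals N(ad - bc)² / ((a+b)(c+d)(a+c)(b+d)), which is what `chi_square_2x2` computes.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for the 2x2 table [[a, b], [c, d]], without continuity correction."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a row or column is empty, so the word carries no information
    return n * (a * d - b * c) ** 2 / float(denominator)

def top_words_by_chi_square(documents, labels, target_label, k):
    """Rank words by chi-square between 'document contains word' and 'label is target'."""
    vocabulary = set(word for doc in documents for word in doc)
    scores = {}
    for word in vocabulary:
        a = b = c = d = 0
        for doc, label in zip(documents, labels):
            contains = word in doc
            is_target = (label == target_label)
            if contains and is_target:
                a += 1
            elif contains:
                b += 1
            elif is_target:
                c += 1
            else:
                d += 1
        scores[word] = chi_square_2x2(a, b, c, d)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical, already segmented toy corpus.
documents = [["great", "phone"], ["great", "screen"],
             ["bad", "phone"], ["bad", "screen"]]
labels = ["pos", "pos", "neg", "neg"]

best = top_words_by_chi_square(documents, labels, "pos", 2)
print(sorted(best))  # ['bad', 'great']
```

Note that "bad" scores as high as "great": χ² measures the strength of an association, not its direction, so a word that strongly signals the absence of the positive class ranks just as high as one that signals its presence.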

#! /usr/bin/env python2.7
#coding=utf-8

"""
Use positive and negative review sets as a corpus to train a sentiment classifier.
This module uses labeled positive and negative reviews as the training set, then uses the nltk scikit-learn API for the classification task.
The aim is to train a classifier that automatically identifies a review's positive or negative sentiment, and to use the probability as a review helpfulness feature.

"""

from Preprocessing_module import textprocessing as tp
import pickle
import itertools
from random import shuffle

import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

import sklearn
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.metrics import accuracy_score


# 1. Load positive and negative review data
pos_review = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\pos.xlsx", 1, 1)
neg = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\neg.xlsx", 1, 1)
zhong = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\zhong.xlsx",1,1)
pos = pos_review  # neg and zhong are already bound above

"""
# Trim the positive reviews to match the number of negative reviews (optional)

shuffle(pos_review)
size = int(len(pos_review)/2 - 18)

pos = pos_review[:size]
neg = neg

"""


# 2. Feature extraction function
# 2.1 Use all words as features
def bag_of_words(words):
    return dict([(word, True) for word in words])


# 2.2 Use bigrams as features (use chi-square to choose the top 200 bigrams)
def bigrams(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    top_bigrams = bigram_finder.nbest(score_fn, n)  # local name no longer shadows the function
    return bag_of_words(top_bigrams)


# 2.3 Use words and bigrams as features (use chi-square to choose the top 200 bigrams)
def bigram_words(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    top_bigrams = bigram_finder.nbest(score_fn, n)  # local name no longer shadows bigrams()
    return bag_of_words(words + top_bigrams)


# 2.4 Use chi_sq to find most informative features of the review
# 2.4.1 First we should compute words or bigrams information score
def create_word_scores():
    posdata = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\pos.xlsx", 1, 1)
    negdata = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\neg.xlsx", 1, 1)
    # Bug fix: the original chained assignment also overwrote negdata with the neutral set
    zhongdata = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\zhong.xlsx", 1, 1)

    posWords = list(itertools.chain(*posdata))
    negWords = list(itertools.chain(*negdata))
    zhongWords = list(itertools.chain(*zhongdata))

    word_fd = FreqDist()
    cond_word_fd = ConditionalFreqDist()
    for word in posWords:
        word_fd.inc(word)
        cond_word_fd['pos'].inc(word)
    for word in negWords:
        word_fd.inc(word)
        cond_word_fd['neg'].inc(word)
    for word in zhongWords:
        word_fd.inc(word)
        cond_word_fd['zhong'].inc(word)
    pos_word_count = cond_word_fd['pos'].N()
    neg_word_count = cond_word_fd['neg'].N()
    zhong_word_count = cond_word_fd['zhong'].N()
    #print zhong_word_count
    total_word_count = pos_word_count + neg_word_count + zhong_word_count

    word_scores = {}
    for word, freq in word_fd.iteritems():
        pos_score = BigramAssocMeasures.chi_sq(cond_word_fd['pos'][word], (freq, pos_word_count), total_word_count)
        neg_score = BigramAssocMeasures.chi_sq(cond_word_fd['neg'][word], (freq, neg_word_count), total_word_count)
        zhong_score = BigramAssocMeasures.chi_sq(cond_word_fd['zhong'][word], (freq, zhong_word_count), total_word_count)
        word_scores[word] = pos_score + neg_score +zhong_score

    return word_scores

def create_bigram_scores():
    posdata = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\pos.xlsx", 1, 1)
    negdata = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\neg.xlsx", 1, 1)
    zhongdata = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\zhong.xlsx", 1, 1)

    posWords = list(itertools.chain(*posdata))
    negWords = list(itertools.chain(*negdata))
    zhongWords = list(itertools.chain(*zhongdata))

    # Bug fix: build one finder per class; reusing a single variable scored
    # all three classes on the neutral corpus only
    posBigrams = BigramCollocationFinder.from_words(posWords).nbest(BigramAssocMeasures.chi_sq, 8000)
    negBigrams = BigramCollocationFinder.from_words(negWords).nbest(BigramAssocMeasures.chi_sq, 8000)
    zhongBigrams = BigramCollocationFinder.from_words(zhongWords).nbest(BigramAssocMeasures.chi_sq, 8000)
    pos = posBigrams
    neg = negBigrams
    zhong = zhongBigrams
    word_fd = FreqDist()
    cond_word_fd = ConditionalFreqDist()
    for word in pos:
        word_fd.inc(word)
        cond_word_fd['pos'].inc(word)
    for word in neg:
        word_fd.inc(word)
        cond_word_fd['neg'].inc(word)
    for word in zhong:  # bug fix: the original iterated over neg a second time
        word_fd.inc(word)
        cond_word_fd['zhong'].inc(word)
    pos_word_count = cond_word_fd['pos'].N()
    neg_word_count = cond_word_fd['neg'].N()
    zhong_word_count = cond_word_fd['zhong'].N()
    total_word_count = pos_word_count + neg_word_count + zhong_word_count

    word_scores = {}
    for word, freq in word_fd.iteritems():
        pos_score = BigramAssocMeasures.chi_sq(cond_word_fd['pos'][word], (freq, pos_word_count), total_word_count)
        neg_score = BigramAssocMeasures.chi_sq(cond_word_fd['neg'][word], (freq, neg_word_count), total_word_count)
        # Bug fix: use the neutral class total, not neg_word_count
        zhong_score = BigramAssocMeasures.chi_sq(cond_word_fd['zhong'][word], (freq, zhong_word_count), total_word_count)
        word_scores[word] = pos_score + neg_score + zhong_score

    return word_scores

# Combine words and bigrams and compute words and bigrams information scores
def create_word_bigram_scores():
    posdata = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\pos.xlsx", 1, 1)
    negdata = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\neg.xlsx", 1, 1)

    posWords = list(itertools.chain(*posdata))
    negWords = list(itertools.chain(*negdata))

    # Bug fix: use a separate finder per class so each nbest() call scores its own corpus
    posBigrams = BigramCollocationFinder.from_words(posWords).nbest(BigramAssocMeasures.chi_sq, 5000)
    negBigrams = BigramCollocationFinder.from_words(negWords).nbest(BigramAssocMeasures.chi_sq, 5000)

    pos = posWords + posBigrams
    neg = negWords + negBigrams

    word_fd = FreqDist()
    cond_word_fd = ConditionalFreqDist()
    for word in pos:
        word_fd.inc(word)
        cond_word_fd['pos'].inc(word)
    for word in neg:
        word_fd.inc(word)
        cond_word_fd['neg'].inc(word)

    pos_word_count = cond_word_fd['pos'].N()
    neg_word_count = cond_word_fd['neg'].N()
    total_word_count = pos_word_count + neg_word_count

    word_scores = {}
    for word, freq in word_fd.iteritems():
        pos_score = BigramAssocMeasures.chi_sq(cond_word_fd['pos'][word], (freq, pos_word_count), total_word_count)
        neg_score = BigramAssocMeasures.chi_sq(cond_word_fd['neg'][word], (freq, neg_word_count), total_word_count)
        word_scores[word] = pos_score + neg_score

    return word_scores

# Choose the word_scores extraction method
word_scores = create_word_scores()
#word_scores = create_bigram_scores()
# word_scores = create_word_bigram_scores()


# 2.4.2 Second we should extract the most informative words or bigrams based on the information score
def find_best_words(word_scores, number):
    # Sort (word, score) pairs by score, highest first, and keep the top `number`
    best_vals = sorted(word_scores.iteritems(), key=lambda ws: ws[1], reverse=True)[:number]
    best_words = set([w for w, s in best_vals])
    return best_words

# 2.4.3 Third we could use the most informative words and bigrams as machine learning features
# Use chi_sq to find most informative words of the review
def best_word_features(words):
    return dict([(word, True) for word in words if word in best_words])

# Use chi_sq to find most informative bigrams of the review
def best_word_features_bi(words):
    return dict([(word, True) for word in nltk.bigrams(words) if word in best_words])

# Use chi_sq to find most informative words and bigrams of the review
def best_word_features_com(words):
    d1 = dict([(word, True) for word in words if word in best_words])
    d2 = dict([(word, True) for word in nltk.bigrams(words) if word in best_words])
    d3 = dict(d1, **d2)
    return d3



# 3. Transform review to features by setting labels to words in review
def pos_features(feature_extraction_method):
    posFeatures = []
    #print "pos"
    for i in pos:
        #for key in feature_extraction_method(i):
            #print key
        posWords = [feature_extraction_method(i),'pos']
        posFeatures.append(posWords)
    return posFeatures

def neg_features(feature_extraction_method):
    negFeatures = []
    #print "neg"
    for j in neg:
        #for key in feature_extraction_method(j):
          # print key
        negWords = [feature_extraction_method(j),'neg']
        negFeatures.append(negWords)
    return negFeatures

def zhong_Features(feature_extraction_method):
    zhongFeatures = []
    print "zhong"
    for j in zhong:
        for key in feature_extraction_method(j):
            print key
        zhongWords = [feature_extraction_method(j),'zhong']
        zhongFeatures.append(zhongWords)
    return zhongFeatures

best_words = find_best_words(word_scores, 1000) # Set dimension and initialize most informative words

posFeatures = pos_features(best_word_features_com)
negFeatures = neg_features(best_word_features_com)
zhongFeatures = zhong_Features(best_word_features_com)
# posFeatures = pos_features(bigram_words)
# negFeatures = neg_features(bigram_words)

#posFeatures = pos_features(best_word_features)
#print type(posFeatures)

#negFeatures = neg_features(best_word_features)
#zhongFeatures = zhong_Features(best_word_features)
# posFeatures = pos_features(best_word_features_com)
# negFeatures = neg_features(best_word_features_com)

The results are as follows:

Neutral sentiment words:
普及
成为
家庭
信息
成
事件
谋求
事情
产生
舆论
引发
发生
当而
方
场
人们
相信
世界
翻转
反
公众
罗尔
网络时代
规则
对称
应该
面对
更要
对立
不要
媒体
微信
一轮
网络
号
换位
阵营
网友
都
技能
不明智
新闻
很
激烈
不
蕴藏
聪明反被聪明误
相关
社交
大
刷屏
二小
学会
感到
越来越
人
清晰
中关村
破坏
微博
影响
汇集
屡有
不同
真实性

Positive sentiment words:
荣誉证书
去年
第二
名叫
年前
写
会
节省
患
同意
收下
申请
爬
李庆国
当下
无疑
抢救
8
买药
但确
凑
健康
难以
医药费
好心人
家里
感谢
19
仍然
年
一年
广怀
呼吸困难
公里
二
江苏
学校
母亲
长大
小时
状态
看起来
常态
成功
晕倒
爬楼梯
3
出发
一拖再拖
无法
患有
想法
期限
赶到
减轻
出去
出现
火车
最早
父母



Negative sentiment words:
改革
往往
率
取消
月
说法
中止执行
会
搞
存在
事情
案件
工作
执行
数据
司法
视频
后续
回应
刑事拘留
指标
不合理
法
不
应该
法院
会议
日
这次
澎湃
之前
数
不算
考核
却
告别
坚决
其实
屡屡
提出
必要
任务
手段
结案率
进行
执法
法官
撤案
审理
新
罚款
法律
不停
年
年底
项目
不是
杜绝
承德县
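
The listing above targets Python 2.7 and the older NLTK API (`FreqDist.inc` is no longer available in NLTK 3). For readers on Python 3, the word-scoring step can be approximated with the standard library alone. This is a sketch under stated assumptions, not the author's code: `tokens_by_class` is a hypothetical mapping from class label to a flat token list, and `chi_square_from_marginals` is a helper written for this example that takes the same four quantities the listing passes to `BigramAssocMeasures.chi_sq` (count of the word in the class, total count of the word, total tokens in the class, total tokens overall).

```python
from collections import Counter

def chi_square_from_marginals(n_ii, n_ix, n_xi, n_xx):
    """2x2 chi-square from a joint count and its marginals."""
    a = n_ii                       # word, in class
    b = n_ix - n_ii                # word, other classes
    c = n_xi - n_ii                # other words, in class
    d = n_xx - n_ix - n_xi + n_ii  # other words, other classes
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n_xx * (a * d - b * c) ** 2 / float(denominator)

def create_word_scores(tokens_by_class):
    """Score each word as the sum of its per-class chi-square values."""
    class_counts = {label: Counter(tokens) for label, tokens in tokens_by_class.items()}
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    class_sizes = {label: sum(counts.values()) for label, counts in class_counts.items()}
    n_xx = sum(class_sizes.values())
    return {
        word: sum(chi_square_from_marginals(class_counts[label][word], n_ix,
                                            class_sizes[label], n_xx)
                  for label in class_counts)
        for word, n_ix in total_counts.items()
    }

def find_best_words(word_scores, number):
    """Return the `number` highest-scoring words as a set."""
    ranked = sorted(word_scores.items(), key=lambda item: item[1], reverse=True)
    return set(word for word, _ in ranked[:number])

# Hypothetical toy data standing in for the segmented review sets.
tokens_by_class = {
    "pos": ["good", "good", "phone"],
    "neg": ["bad", "bad", "phone"],
    "zhong": ["phone", "phone", "screen"],
}
scores = create_word_scores(tokens_by_class)
print(sorted(find_best_words(scores, 2)))  # ['bad', 'good']
```

Summing the per-class scores, as the listing does, rewards words that are concentrated in any one class; it does not say which class a word points to.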
