法研杯比赛总结

比赛已经结束好久了,准备来总结一下。

法研杯比赛总结_第1张图片

比赛结果:

队名shelley,做了任务一和任务二,排名如下:

任务一和任务二差不多,任务三有离散标签和连续标签,效果不太好,期待大牛的分享。

自己的模型使用attention和Textcnn。

整个处理流程如下:

法研杯比赛总结_第2张图片

1. 文本预处理

去除地名,人名,一般停用词。

人名

法研杯比赛总结_第3张图片

地名,分词好后,根据特定的字过滤。

fact_str = re.sub(r'\r', '', fact_str)
        fact_str = re.sub(r'\t', '', fact_str)
        fact_str = re.sub(r'\n', '', fact_str)
        fact_str = re.sub(r'([0-9]{4}年)?[0-9]{1,2}月([0-9]{1,2}日)?', '', fact_str)
        fact_str = re.sub(r'[0-9]{1,2}时([0-9]{1,2}分)?许?', '', fact_str)

        cut_list = jieba.cut(fact_str)
        for w in cut_list:
            if w in stopwords:
                continue
            elif '省' in w:
                continue
            elif '市' in w:
                continue
            elif '镇' in w:
                continue
            elif '村' in w:
                continue
            elif '路' in w:
                continue
            elif '县' in w:
                continue
            elif '区' in w:
                continue
            elif '城' in w:
                continue
            elif '府' in w:
                continue
            elif '庄' in w:
                continue
            elif '道' in w:
                continue
            elif '车' in w:
                continue
            elif '店' in w:
                continue
            elif '某' in w:
                continue
            elif '辆' in w:
                continue
            elif '房' in w:
                continue
            elif '馆' in w:
                continue
            elif '场' in w:
                continue
            elif '街' in w:
                continue
            elif '墙' in w:
                continue
            elif '牌' in w:
                continue
            else:
                fact_list.append(w)

2. 模型

2.1 attention

等有空再写

 

2.2 TextCNN

等有空再写

 

3. 模型评价函数

# -*- coding:utf-8 -*-

from __future__ import division
import numpy as np

def sigmoid(X):
    sig = [1.0 / float(1.0 + np.exp(-x)) for x in X]
    return sig

def to_categorical_single_class(cl):
    y = np.zeros(183)

    for i in range(len(cl)):
        y[cl[i]] = 1
    return y

def cail_evaluator(predict_labels_list, marked_labels_list):
    # predict labels category
    predict_labels_category = []
    samples = len(predict_labels_list)
    print('num of samples: ', samples)
    for i in range(samples):  # number of samples
        predict_norm = sigmoid(predict_labels_list[i])
        predict_category = [1 if i > 0.5 else 0 for i in predict_norm]
        predict_labels_category.append(predict_category)

    # marked labels category
    marked_labels_category = []
    num_class = len(predict_labels_category[0])
    print('num of classes: ', num_class)
    for i in range(samples):
        marked_category = to_categorical_single_class(marked_labels_list[i])
        marked_labels_category.append(marked_category)

    tp_list = []
    fp_list = []
    fn_list = []
    f1_list = []
    for i in range(num_class):  # 类别个数

        tp = 0.0  # predict=1, truth=1
        fp = 0.0  # predict=1, truth=0
        fn = 0.0  # predict=0, truth=1
        # 样本个数
        pre = [p[i] for p in predict_labels_category]
        mar = [p[i] for p in marked_labels_category]
        pre = np.asarray(pre)
        mar = np.asarray(mar)

        for i in range(len(pre)):
            if pre[i] == 1 and mar[i] == 1:
                tp += 1
            elif pre[i] == 1 and mar[i] == 0:
                fp += 1
            elif pre[i] == 0 and mar[i] == 1:
                fn += 1
        # print('tp: %s, fp: %s, fn:%s ' %(tp, fp, fn))
        precision = 0.0
        if tp + fp > 0:
            precision = tp / (tp + fp)
        recall = 0.0
        if tp + fn > 0:
            recall = tp / (tp + fn)
        f1 = 0.0
        if precision + recall > 0:
            f1 = 2.0 * precision * recall / (precision + recall)
        tp_list.append(tp)
        fp_list.append(fp)
        fn_list.append(fn)
        f1_list.append(f1)

    # micro level
    f1_micro = 0.0
    if sum(tp_list) + sum(fp_list) > 0:
        f1_micro = sum(tp_list) / (sum(tp_list) + sum(fp_list))
    # macro level
    f1_macro = sum(f1_list) / len(f1_list)
    score12 = (f1_macro + f1_micro) / 2.0

    return f1_micro, f1_macro, score12

部分代码可以参考我的github:https://github.com/shelleyHLX/cail

参考文献

(1)TextCNN:

Kim Y. Convolutional Neural Networks for Sentence Classification[J]. Eprint Arxiv, 2014.

Conneau A, Schwenk H, Barrault L, et al. Very Deep Convolutional Networks for Text Classification[J]. 2017:1107-1116.

Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[J]. 2014:1-9.

2Attention

Yang Z, Yang D, Dyer C, et al. Hierarchical Attention Networks for Document Classification[C]// Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2017:1480-1489.

 

 

 

你可能感兴趣的:(自然语言处理,文本挖掘,比赛,文本分类)