Text Sentiment Analysis: Data Preprocessing

Data preprocessing code:
Source: Text Sentiment Analysis

import pickle

import numpy as np
import pandas as pd
# Assumed import paths for an older Keras 2.x release; adjust for your version.
from keras.preprocessing.sequence import pad_sequences
from keras.utils import np_utils


def load_data(filepath, input_shape=20):
    df = pd.read_csv(filepath)

    # Unique labels and unique review texts
    labels, vocabulary = list(df['label'].unique()), list(df['evaluation'].unique())

    # Build character-level features: collect every distinct character
    string = ''
    for word in vocabulary:
        string += word
    vocabulary = set(string)

    # Lookup dictionaries (character IDs start at 1; 0 is reserved for padding)
    word_dictionary = {word: i+1 for i, word in enumerate(vocabulary)}
    with open('word_dict.pk', 'wb') as f:
        pickle.dump(word_dictionary, f)
    inverse_word_dictionary = {i+1: word for i, word in enumerate(vocabulary)}
    label_dictionary = {label: i for i, label in enumerate(labels)}
    with open('label_dict.pk', 'wb') as f:
        pickle.dump(label_dictionary, f)
    output_dictionary = {i: label for i, label in enumerate(labels)}

    vocab_size = len(word_dictionary.keys())  # vocabulary size
    label_size = len(label_dictionary.keys())  # number of label classes

    # Encode each text as character IDs, then pad or truncate to input_shape;
    # shorter sequences are padded with 0 at the end
    x = [[word_dictionary[word] for word in sent] for sent in df['evaluation']]
    x = pad_sequences(maxlen=input_shape, sequences=x, padding='post', value=0)
    y = [[label_dictionary[sent]] for sent in df['label']]
    y = [np_utils.to_categorical(label, num_classes=label_size) for label in y]
    y = np.array([list(_[0]) for _ in y])

    return x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary

Statement 1:

labels, vocabulary = list(df['label'].unique()), list(df['evaluation'].unique())

Example output (the value of labels):

['正面', '负面']

Purpose: extract the distinct labels and the distinct review texts from the dataset.
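As a minimal sketch of what this statement produces, assume a tiny hypothetical DataFrame with the same two columns (`label` and `evaluation`). The rows below are invented for illustration and do not come from the original dataset. `Series.unique()` returns the distinct values in order of first appearance.

```python
import pandas as pd

# Hypothetical toy data; the real CSV is not shown in the article.
df = pd.DataFrame({
    "label": ["正面", "负面", "正面"],
    "evaluation": ["我喜欢你", "我不喜欢你", "我喜欢你"],
})

labels = list(df["label"].unique())
vocabulary = list(df["evaluation"].unique())

print(labels)      # ['正面', '负面']
print(vocabulary)  # ['我喜欢你', '我不喜欢你']
```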

Statement 2:

string = ''
for word in vocabulary:
    string += word
vocabulary = set(string)

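Purpose: concatenate all review texts and take the set of their characters. This gives a character-level vocabulary from which the dictionaries are built.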
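The following sketch shows the same step on three hypothetical sentences (assumed sample data, not the real dataset). It uses `"".join`, which is equivalent to the `+=` loop. Note that a `set` has no guaranteed iteration order, so the IDs assigned later can differ between runs; that is presumably why the function saves `word_dict.pk`.

```python
# Hypothetical review texts standing in for df['evaluation'].unique().
vocabulary = ["我爱你", "我喜欢你", "我不喜欢你"]

# Join all texts into one string, then keep each distinct character once.
string = "".join(vocabulary)
chars = set(string)

print(len(string))  # 12
print(len(chars))   # 6
```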

Statement 3:

word_dictionary = {word: i + 1 for i, word in enumerate(vocabulary)}
with open('word_dict.pk', 'wb') as f:
    pickle.dump(word_dictionary, f)
inverse_word_dictionary = {i + 1: word for i, word in enumerate(vocabulary)}
label_dictionary = {label: i for i, label in enumerate(labels)}
with open('label_dict.pk', 'wb') as f:
    pickle.dump(label_dictionary, f)
output_dictionary = {i: labels for i, labels in enumerate(labels)}

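Build the lookup dictionaries. Each one is effectively a hash table that assigns every character in the data an integer ID, so that sentences can be converted into a matrix of integers. IDs start at 1 because 0 is used for padding.
For example, "我爱你", "我喜欢你", and "我不喜欢你" become
1 2 3 0 0
1 4 5 3 0
1 6 4 5 3
which is the form the model is trained on later.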
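The matrix in this example can be reproduced with a short pure-Python sketch. Two things here are assumptions made for illustration: `dict.fromkeys` is used instead of `set` so that characters keep their first-seen order (the article's `set` gives no such guarantee), and `max_len = 5` is a hypothetical length chosen to fit the longest sentence.

```python
sentences = ["我爱你", "我喜欢你", "我不喜欢你"]

# dict.fromkeys keeps first-seen order, so the IDs are reproducible here.
chars = list(dict.fromkeys("".join(sentences)))
word_dictionary = {ch: i + 1 for i, ch in enumerate(chars)}

max_len = 5  # hypothetical target length
encoded = []
for sent in sentences:
    ids = [word_dictionary[ch] for ch in sent]
    # Truncate if too long, then pad with 0 at the end.
    ids = ids[:max_len] + [0] * (max_len - len(ids))
    encoded.append(ids)

for row in encoded:
    print(row)
# [1, 2, 3, 0, 0]
# [1, 4, 5, 3, 0]
# [1, 6, 4, 5, 3]
```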

pickle.dump

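pickle.dump(obj, file, protocol=None)
Note: serializes the object obj and writes it to the open file object file. The protocol parameter selects the serialization format. In Python 2 the default is 0, a text-based format; in Python 3 the default is pickle.DEFAULT_PROTOCOL, a binary format, which is why the file is opened in 'wb' mode.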
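A minimal round-trip sketch, assuming a small hypothetical dictionary and a temporary directory so that no files are left behind. `pickle.load` is the counterpart that would be used later to restore `word_dict.pk` at prediction time.

```python
import os
import pickle
import tempfile

word_dictionary = {"我": 1, "爱": 2, "你": 3}  # hypothetical sample

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "word_dict.pk")
    with open(path, "wb") as f:  # binary mode is required in Python 3
        pickle.dump(word_dictionary, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == word_dictionary)  # True
```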

enumerate

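The enumerate() function wraps an iterable (such as a list, tuple, or string) and yields (index, item) pairs, so the data and its position are available together.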
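A short sketch of how `enumerate` is used in the dictionaries above, on a hypothetical three-character list. The article writes `i + 1`; passing `start=1` is an equivalent alternative.

```python
chars = ["我", "爱", "你"]  # hypothetical sample

print(list(enumerate(chars)))  # [(0, '我'), (1, '爱'), (2, '你')]

# Two equivalent ways to number characters from 1.
by_offset = {ch: i + 1 for i, ch in enumerate(chars)}
by_start = {ch: i for i, ch in enumerate(chars, start=1)}

print(by_offset == by_start)  # True
```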

Statement 4:

x = [[word_dictionary[word] for word in sent] for sent in df['evaluation']]
x = pad_sequences(maxlen=input_shape, sequences=x, padding='post', value=0)
y = [[label_dictionary[sent]] for sent in df['label']]
y = [np_utils.to_categorical(label, num_classes=label_size) for label in y]
y = np.array([list(_[0]) for _ in y])

After the second-to-last line converts y to one-hot, y is a list of separate arrays, each of shape (1, label_size):

[array([[1., 0.]], dtype=float32), array([[1., 0.]], dtype=float32)]

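Therefore the last line uses np.array to collect the first row of each array into a single 2-D one-hot array of shape (number of samples, label_size).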
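The reshaping step can be checked in isolation. The sketch below builds a hypothetical stand-in for the list produced by the to_categorical line (two samples, two classes) and applies the same final statement. `np.concatenate` along axis 0 is shown as an alternative that yields the same result; it is a suggestion, not part of the original code.

```python
import numpy as np

# Stand-in for the to_categorical output: one (1, 2) array per sample.
y_list = [np.array([[1., 0.]], dtype="float32"),
          np.array([[0., 1.]], dtype="float32")]

# Same statement as in the article: take row 0 of each array.
y = np.array([list(row[0]) for row in y_list])
print(y.shape)  # (2, 2)

# Alternative: stack the (1, 2) arrays along the first axis.
stacked = np.concatenate(y_list, axis=0)
print(np.array_equal(y, stacked))  # True
```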

pad_sequences syntax:

keras.preprocessing.sequence.pad_sequences(sequences, 
 maxlen=None,
 dtype='int32',
 padding='pre',
 truncating='pre', 
 value=0.)

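sequences: a list of lists (two levels of nesting) of floats or integers
maxlen: None or an integer giving the maximum sequence length. Longer sequences are truncated and shorter ones are padded; if None, sequences are padded to the length of the longest one.
dtype: data type of the returned NumPy array
padding: 'pre' or 'post', whether padding is added at the start or at the end of each sequence (the default 'pre' pads at the start)
truncating: 'pre' or 'post', whether values are removed from the start or from the end when a sequence is too long
value: float, the value used for padding in place of the default 0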
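To make the `padding` and `truncating` options concrete, here is a pure-Python sketch for a single sequence. The helper `pad` is hypothetical and written only for illustration; it mirrors the documented behavior described above, is not the Keras implementation, and assumes `maxlen` is a positive integer.

```python
def pad(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Hypothetical single-sequence helper; not the Keras implementation."""
    if len(seq) > maxlen:
        # 'pre' drops items from the start, 'post' drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    return fill + list(seq) if padding == "pre" else list(seq) + fill


print(pad([1, 2, 3], 5))                              # [0, 0, 1, 2, 3]
print(pad([1, 2, 3], 5, padding="post"))              # [1, 2, 3, 0, 0]
print(pad([1, 2, 3, 4, 5, 6], 4))                     # [3, 4, 5, 6]
print(pad([1, 2, 3, 4, 5, 6], 4, truncating="post"))  # [1, 2, 3, 4]
```

The preprocessing function in this article passes padding='post', so zeros are appended after the character IDs, while truncation keeps its default of 'pre'.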

to_categorical syntax:

to_categorical(y, num_classes=None, dtype='float32')
Converts integer class labels to one-hot encoding.
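One-hot encoding itself is simple enough to sketch without Keras. The helper `one_hot` below is hypothetical and only illustrates the idea for a single class index; the label mapping and sample labels are assumed for the example.

```python
def one_hot(index, num_classes):
    """Hypothetical helper: one-hot vector for a single class index."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


label_dictionary = {"正面": 0, "负面": 1}  # assumed mapping
labels = ["正面", "负面", "正面"]           # assumed sample labels

y = [one_hot(label_dictionary[label], len(label_dictionary)) for label in labels]
print(y)  # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```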
