2019-02 Text Preprocessing

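Text preprocessing broadly breaks down into the following steps: removing stop words, mapping words to indices, padding or truncating, random shuffling, and loading pre-trained word vectors.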

1. Stop Words

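## For English, nltk already ships a curated stop-word list
## (run nltk.download('stopwords') once beforehand)
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set gives fast membership tests
print(stop)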
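
The snippet above only builds the stop-word set. Applying it to a tokenized sentence could look like the minimal sketch below. The helper name `remove_stop_words` and the tiny `demo_stop` set are illustrative assumptions, not part of nltk; in practice the `stop` set built above would be passed instead.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not stop words (case-insensitive match)."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


# Tiny illustrative stop-word set (an assumption for this demo);
# with nltk available, pass the `stop` set built above instead.
demo_stop = {"the", "is", "a", "of", "on"}

tokens = "The cat is on the mat".split()
filtered = remove_stop_words(tokens, demo_stop)
print(filtered)  # ['cat', 'mat']
```

Lower-casing each token before the lookup matters here because the nltk English list is all lower case.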

2. To Word Index

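import tensorflow as tf

# Tokenizer
# num_words caps the vocabulary by word frequency: Keras keeps only the num_words-1
# most frequent entries, because index 0 is reserved (the oov_token also takes an index)
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=20000, oov_token='')
tokenizer.fit_on_texts(train_text)  # build the vocabulary from the training text only
train_idxs = tokenizer.texts_to_sequences(train_text)
test_idxs = tokenizer.texts_to_sequences(test_text)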

train_padded = tf.keras.preprocessing.sequence.pad_sequences(train_idxs, 
                                                             maxlen=MAX_LENGTH, 
                                                             padding='post', 
                                                             truncating='post') ## which side to pad on and which side to truncate from
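
To make the effect of `padding='post'` and `truncating='post'` concrete, here is a pure-Python sketch of the same behavior. The function `pad_post` is a hypothetical re-implementation for illustration only; it is not the Keras code and it returns plain lists rather than a NumPy array.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence at the end, then fill it up at the end with `value`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                    # truncating='post': drop the tail
        seq = seq + [value] * (maxlen - len(seq))   # padding='post': append the filler
        padded.append(seq)
    return padded


demo = pad_post([[5, 8, 2], [7], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(demo)  # [[5, 8, 2, 0], [7, 0, 0, 0], [1, 2, 3, 4]]
```

Note that Keras defaults to `'pre'` for both arguments, so both have to be set explicitly to get the behavior shown here.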

The following lines are also very handy:

word2id = tokenizer.word_index ## dict mapping word -> index
id2word = {idx: word for word, idx in word2id.items()} ## inverse dict mapping index -> word
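
One use of the inverse mapping is turning a padded id sequence back into words, for example to inspect what the model actually sees. The sketch below uses a hypothetical toy `word2id` in place of `tokenizer.word_index`, and the helper name `decode` is an assumption as well.

```python
def decode(ids, id2word, pad_id=0):
    """Map ids back to words, skipping the padding id."""
    return [id2word[i] for i in ids if i != pad_id]


# Hypothetical toy mapping standing in for tokenizer.word_index
word2id = {"good": 1, "movie": 2, "bad": 3}
id2word = {idx: word for word, idx in word2id.items()}

decoded = decode([1, 2, 0, 0], id2word)
print(decoded)  # ['good', 'movie']
```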

3. Shuffle

This section first covers shuffling for small datasets that fit entirely in memory.

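import numpy as np
np.random.shuffle(train_set)  # shuffles in place along the first axis
np.random.shuffle(test_set)
or
import pandas
## train_data / test_data are assumed to be pandas Series or DataFrames
train = train_data.sample(frac=0.9) ## both shuffles and samples (keeps a random 90% of the rows)
test = test_data.sample(frac=1.0) ## frac=1.0 keeps every row, so this is a pure shuffle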
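
When features and labels live in two separate arrays, shuffling each one independently would break the pairing between them. A minimal sketch of one common way around that is to draw a single permutation and index both arrays with it. The function name `shuffle_together` and the toy data are assumptions made for this example.

```python
import numpy as np


def shuffle_together(features, labels, seed=None):
    """Shuffle two equally long arrays with the same random permutation."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    rng = np.random.RandomState(seed)
    perm = rng.permutation(len(features))   # one permutation shared by both arrays
    return features[perm], labels[perm]


x = np.arange(10).reshape(5, 2)   # 5 samples with 2 features each
y = x[:, 0] // 2                  # label derived from each row, so the pairing can be checked
x_shuf, y_shuf = shuffle_together(x, y, seed=0)
print(x_shuf)
print(y_shuf)
```

Unlike `np.random.shuffle`, this returns shuffled copies and leaves the input arrays untouched.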

4. Load Pre-trained Word Embedding

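## First load the pre-trained vector file
def loadGloVe(filename):
    vocab = []
    embd = []
    # Each line of the GloVe text file has the form "word v1 v2 ... vd"
    with open(filename, 'r', encoding='utf-8') as file:
        for line in file:  # read the txt file line by line
            row = line.strip().split(' ')
            vocab.append(row[0])
            embd.append([float(v) for v in row[1:]])  # convert the strings to floats
    print('Loaded GloVe!')
    return vocab, embd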

vocab, embd = loadGloVe(filename)
vocab_size = len(vocab) # size of the vocabulary
embedding_dim = len(embd[0]) # dimensionality of the embeddings
print("Vocab size : ", vocab_size)
print("Embedding dimensions : ", embedding_dim)

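## Convert the text to ids according to vocab
## (tf.contrib is only available in TensorFlow 1.x)
vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_LENGTH)
pretrain = vocab_processor.fit(vocab) # fit on our GloVe vocab
# transform() returns a generator, so materialize it into an array
x_transform_train = np.array(list(vocab_processor.transform(x_train))) # train set
x_transform_test = np.array(list(vocab_processor.transform(x_test))) # test set

vocab = vocab_processor.vocabulary_
vocab_size_after_process = len(vocab) # Note: this size differs from the vocab loaded above, because all non-word symbols are ignored and an extra special token is added
print("Vocab size after process:", vocab_size_after_process)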

## Set up the TensorFlow embedding (tf.placeholder is TensorFlow 1.x API)
embedding_placeholder = tf.placeholder(tf.float32, [vocab_size_after_process, embedding_dim]) # the embedding matrix is fed to the graph through a placeholder
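
The placeholder expects a matrix whose row i is the vector for word id i. Below is a NumPy-only sketch of how such a matrix could be assembled from GloVe-style lines. The function `build_embedding_matrix`, the toy `glove_lines`, and the toy `word2id` are assumptions for illustration; the sketch also assumes ids run from 1 to `len(word2id)` with 0 left for padding, and that words missing from the pre-trained file stay as zero vectors (random initialization is another common choice).

```python
import numpy as np


def build_embedding_matrix(glove_lines, word2id, embedding_dim):
    """Return a matrix whose row i is the pre-trained vector for word id i."""
    # Row 0 stays all zeros for padding; words without a vector also stay zero.
    matrix = np.zeros((len(word2id) + 1, embedding_dim), dtype=np.float32)
    for line in glove_lines:
        parts = line.rstrip().split(' ')
        word, values = parts[0], parts[1:]
        idx = word2id.get(word)
        # Skip words outside our vocabulary and malformed lines.
        if idx is not None and len(values) == embedding_dim:
            matrix[idx] = [float(v) for v in values]
    return matrix


# Toy stand-ins for a GloVe file and a word index (both hypothetical)
glove_lines = ["good 0.1 0.2 0.3", "movie 0.4 0.5 0.6", "unused 0.7 0.8 0.9"]
word2id = {"good": 1, "movie": 2, "rare": 3}

emb = build_embedding_matrix(glove_lines, word2id, embedding_dim=3)
print(emb.shape)  # (4, 3)
```

A matrix built this way, with a row count matching the first dimension of the placeholder, is the kind of value that would be fed in at session run time.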
