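Common NLP data-processing methods

Common NLP utility methods

(If I don't write these down I really will forget them. yongbo)

Evaluation formulas

Accuracy (acc), precision (P), recall (R), F1

TP: a positive sample predicted as positive (true positive)

FN: a positive sample predicted as negative (false negative)

FP: a negative sample predicted as positive (false positive)

TN: a negative sample predicted as negative (true negative)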

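acc: $\mathrm{acc} = \frac{TP + TN}{TP + TN + FP + FN}$

P: $P = \frac{TP}{TP + FP}$

R: $R = \frac{TP}{TP + FN}$

F1: $F_1 = \frac{2PR}{P + R}$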
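The four counts above are enough to compute every metric in this section. Here is a minimal sketch in plain Python; the function name `classification_metrics` and the example counts are made up for illustration, and returning 0.0 for an empty denominator is my own convention.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F1) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    acc = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return acc, precision, recall, f1

# Hypothetical counts, only to exercise the formulas
acc, p, r, f1 = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(acc, p, r, f1)
```

For real predictions, scikit-learn's `accuracy_score` and `precision_recall_fscore_support` cover the same ground.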

1. Dropping missing values

testfile=testfile.dropna(axis=0, how='any')
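To see what `how='any'` actually does, here is a small sketch on a toy DataFrame. The frame `df` and its columns are made up and stand in for `testfile`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `testfile`
df = pd.DataFrame({
    "train": ["what is nlp", None, "how to pad"],
    "label": [1, 0, np.nan],
})

any_dropped = df.dropna(axis=0, how="any")  # drop rows with at least one missing value
all_dropped = df.dropna(axis=0, how="all")  # drop rows only if every value is missing

print(len(any_dropped), len(all_dropped))
```

`dropna` keeps the original row labels. If rows are later accessed by position-like labels such as `testfile['train'][0]`, it may be worth calling `reset_index(drop=True)` afterwards so that label 0 is guaranteed to exist.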

2. Keras Tokenizer

from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(testfile.train)  # build the word-to-index vocabulary
x_train = tokenizer.texts_to_sequences(testfile.train)  # convert each text to a sequence of indices

print(testfile['train'][0])
print(x_train[0])
#What is the step by step guide to invest in share market in india?What is the step by step guide to invest in share market?
#[2, 3, 1, 1221, 57, 1221, 2581, 7, 575, 8, 763, 383, 8, 35, 2, 3, 1, 1221, 57, 1221, 2581, 7, 575, 8, 763, 383]

z_dict = tokenizer.word_index
print(z_dict['what'])
print(z_dict['market'])
#2
#383

Parameter notes:

https://zhuanlan.zhihu.com/p/138054335

keras.preprocessing.text.Tokenizer(num_words=None,
                                   filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n',
                                   lower=True,
                                   split=' ',
                                   char_level=False, 
                                   oov_token=None, 
                                   document_count=0)

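num_words: the maximum number of words to keep, ranked by frequency. The default None keeps every word. If set to an integer, only the num_words - 1 most frequent words are used when converting texts.
filters: characters to strip from the text; the default shown above (punctuation plus tab and newline) is usually fine.
lower: whether to convert the text to lowercase.
split: the word separator, e.g. a space. (So for Chinese, segment the text into words first, then convert.)
char_level: whether to treat every character as a token; default False. Set it to True to treat each Chinese character as a token.
oov_token: if given, it is added to the word index and used to replace out-of-vocabulary words in texts_to_sequences.
document_count: the number of documents; it is normally computed automatically from the texts fed in, so there is no need to pass it.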

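word_counts: dict mapping each word (string) to the number of times it appeared during fitting. Only set after fit_on_texts is called.

word_docs: dict mapping each word (string) to the number of documents or texts it appeared in during fitting. Only set after fit_on_texts is called.

word_index: dict mapping each word (string) to its rank, i.e. its index. Only set after fit_on_texts is called.

document_count: integer, the number of documents (texts or sequences) the tokenizer was fitted on. Only set after fit_on_texts or fit_on_sequences is called.

Related explanation: https://zhuanlan.zhihu.com/p/81720968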

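fit_on_texts(texts)

Builds the vocabulary from the text collection texts. This method must be called before texts_to_sequences or texts_to_matrix.

Parameters:

texts: one of
(1) a list of strings, each string being one text;
(2) a generator of strings (useful when the corpus is too large to load into memory as a list, so texts are read one at a time);
(3) a list of lists of strings, where each text is represented as a list of words rather than a single string. This form suits corpora that need a dedicated word segmenter, such as Chinese text.

texts_to_sequences(texts)

Converts each text in the collection into a sequence of word indices. Note: only words with an index below num_words are converted; any other word is dropped, or replaced by oov_token if one was set.

Parameters:

texts: same as the parameter of fit_on_texts.

Returns: a list of sequences, one list of word indices per text.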
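The interplay of word_index, num_words and oov_token is easier to remember with a toy version. The sketch below is plain Python written only to mimic the behaviour described above, not Keras's actual implementation: it splits on whitespace, lowercases, and skips the filters step entirely. The names `fit_vocab` and `to_sequences` are hypothetical.

```python
from collections import Counter

def fit_vocab(texts, oov_token=None):
    """Rank words by frequency; the most frequent word gets index 1 (0 is left for padding)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    if oov_token is not None:
        ranked.insert(0, oov_token)  # the OOV token takes index 1
    return {word: i for i, word in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words=None, oov_token=None):
    """Replace each word by its index; unknown or too-rare words are dropped or mapped to OOV."""
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            i = word_index.get(word)
            if i is None or (num_words is not None and i >= num_words):
                if oov_index is not None:
                    seq.append(oov_index)
                continue
            seq.append(i)
        sequences.append(seq)
    return sequences

corpus = ["the cat sat", "the cat ran", "the dog"]
vocab = fit_vocab(corpus)
vocab_oov = fit_vocab(corpus, oov_token="<unk>")

print(vocab)
print(to_sequences(["the dog sat"], vocab, num_words=4))             # "dog" has index 5, so it is dropped
print(to_sequences(["the bird sat"], vocab_oov, oov_token="<unk>"))  # "bird" becomes the OOV index
```

With num_words=4 only indices 1 to 3 survive, which is why the parameter effectively keeps num_words - 1 words.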

3. pad_sequences

Padding

from keras.preprocessing.sequence import pad_sequences
tmp = [[1,2],
       [3,4,2],
       [1,3],
       [4,3,1]]
x_train = pad_sequences(tmp, maxlen=6)
print(x_train)
# [[0 0 0 0 1 2]
#  [0 0 0 3 4 2]
#  [0 0 0 0 1 3]
#  [0 0 0 4 3 1]]

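Excerpted from https://zhuanlan.zhihu.com/p/105376030
sequences: list of lists, where each element is a sequence.
maxlen: integer, the length every sequence is padded or truncated to; if not given, the length of the longest sequence is used.
dtype: type of the output sequences. To pad sequences with variable-length strings, use object.
padding: string, 'pre' or 'post': pad before or after each sequence (default 'pre').
truncating: string, 'pre' or 'post': for sequences longer than maxlen, remove values from the beginning or from the end (default 'pre').
value: float or string, the padding value (default 0).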
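The padding and truncating options are easy to mix up, so here is a plain-list sketch of the behaviour described above. It is not the Keras implementation (which returns a NumPy array); the name `pad_sketch` is made up.

```python
def pad_sketch(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence to the same length (plain-list sketch)."""
    if maxlen is None:
        maxlen = max((len(s) for s in sequences), default=0)
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops values from the front, 'post' drops them from the end
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result

tmp = [[1, 2], [3, 4, 2], [1, 3], [4, 3, 1]]
print(pad_sketch(tmp, maxlen=6))                     # zeros are added in front
print(pad_sketch(tmp, maxlen=2))                     # keeps the last two values
print(pad_sketch(tmp, maxlen=2, truncating="post"))  # keeps the first two values
print(pad_sketch(tmp, padding="post"))               # pads at the end, up to the longest length
```

Since both defaults are 'pre', a long sentence loses its beginning rather than its end unless truncating='post' is passed.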

4. Input layer and Embedding

(Code from a Kaggle competition)

Using self-trained word vectors

When using pretrained word vectors, set trainable=False

from keras.models import Sequential
from keras.layers import Embedding, LSTM, GRU, Dense, Dropout
embedding_layer = Embedding(vocab_size, W2V_SIZE, weights=[embedding_matrix], input_length=SEQUENCE_LENGTH, trainable=False)
model = Sequential()
model.add(embedding_layer)

model.add(LSTM(120, dropout=0.2, recurrent_dropout=0.2))
# model.add(GRU(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(3, activation='softmax'))

model.summary()

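How to build the embedding matrix

import numpy as np

# vocab_size must be len(tokenizer.word_index) + 1: Keras word indices start at 1 and row 0 is left for padding
embedding_matrix = np.zeros((vocab_size, W2V_SIZE))
for word, i in tokenizer.word_index.items():
    if word in w2v_model.wv:
        embedding_matrix[i] = w2v_model.wv[word]
    else:
        embedding_matrix[i] = np.random.uniform(-1, 1, W2V_SIZE)  # unknown word: random vector in [-1, 1)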
        
print(embedding_matrix.shape)

Using existing pretrained word vectors

import os
import numpy as np
embeddings_index = {}
with open(os.path.join('data/', 'glove.6B.50d.txt'), encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]  # the word
        coefs = np.asarray(values[1:], dtype='float32')  # its vector
        embeddings_index[word] = coefs  # word -> vector
max_features=100000
embed_size = 50
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))+1
embedding_matrix = np.zeros((nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue  # skip words outside the top max_features (this also guarantees i < nb_words)
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
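To check the loading logic without downloading GloVe, the same steps can be run on a tiny in-memory file. Everything in this sketch is made up for illustration: the 3-dimensional vectors, the `word_index` dict standing in for `tokenizer.word_index`, and the `missing` list.

```python
import io
import numpy as np

# Hypothetical 3-dimensional vectors in GloVe's text format: "word v1 v2 v3"
glove_text = "what 0.1 0.2 0.3\nmarket 0.4 0.5 0.6\n"
word_index = {"what": 1, "market": 2, "yongbo": 3}  # stand-in for tokenizer.word_index

embeddings_index = {}
for line in io.StringIO(glove_text):
    values = line.split()
    embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

embed_size = 3
# One extra row because index 0 is reserved for padding
embedding_matrix = np.zeros((len(word_index) + 1, embed_size))
missing = []
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is None:
        missing.append(word)  # this row stays all zeros
    else:
        embedding_matrix[i] = vector

print(embedding_matrix.shape, missing)
```

Tracking the missing words gives a quick read on how well the pretrained vocabulary covers the corpus. Unlike the random initialisation used for the self-trained vectors, words without a vector here keep an all-zero row.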