Notes on tf.keras.preprocessing.text.Tokenizer

The Tokenizer class in TensorFlow

Official documentation

  • Tokenizer: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

Methods of the class

| Method | Arguments | Return value / notes |
| --- | --- | --- |
| fit_on_texts(texts) | list of texts | Updates the tokenizer's internal vocabulary in place; call this before converting texts. |
| fit_on_sequences(sequences) | list of sequences | Fits the tokenizer on integer sequences; required before sequences_to_matrix if fit_on_texts was never called. |
| get_config() | — | Returns the tokenizer's configuration as a dict. |
| sequences_to_matrix(sequences) | list of sequences | Converts a list of sequences into a matrix of vectors. |
| sequences_to_texts(sequences) | list of sequences | Returns a list of texts. |
| sequences_to_texts_generator(sequences) | list of sequences | Returns a generator that yields the texts one at a time. |
| texts_to_matrix(texts) | list of texts | Returns a matrix of text vectors. |
| texts_to_sequences(texts) | list of texts | Returns a list of integer sequences. |
| texts_to_sequences_generator(texts) | list of texts | Generator version of texts_to_sequences(). |
| to_json() | — | Returns a JSON string describing the tokenizer; save it and reload later with keras.preprocessing.text.tokenizer_from_json(json_string). |
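Of the methods above, only fit_on_sequences and sequences_to_matrix are not covered by the example below. A minimal sketch of that path (the num_words value and the sequences are illustrative, not from the original example):

from tensorflow.keras.preprocessing.text import Tokenizer

# With no raw text available, fit directly on integer sequences.
seq_tokenizer = Tokenizer(num_words=5)
seq_tokenizer.fit_on_sequences([[1, 2, 3], [2, 3, 4]])

# Each sequence becomes a num_words-dim row; column i is 1 if id i occurs.
print(seq_tokenizer.sequences_to_matrix([[1, 2], [3]], mode='binary'))
# Expected: [[0. 1. 1. 0. 0.]
#            [0. 0. 0. 1. 0.]]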

Code example

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ["Hello, nice to meet you.",
             "Nice to meet you too!",
             "Hello, Bob"]

# All constructor arguments below are the defaults, spelled out for reference.
tokenizer = Tokenizer(
    num_words=None,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True, split=' ', char_level=False, oov_token=None,
)
tokenizer.fit_on_texts(sentences)
print("tokenizer config:\n", tokenizer.get_config())

Output:
tokenizer config:
{'num_words': None, 'filters': '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', 'lower': True, 'split': ' ', 'char_level': False, 'oov_token': None, 'document_count': 3, 'word_counts': '{"hello": 2, "nice": 2, "to": 2, "meet": 2, "you": 2, "too": 1, "bob": 1}', 'word_docs': '{"nice": 2, "meet": 2, "hello": 2, "to": 2, "you": 2, "too": 1, "bob": 1}', 'index_docs': '{"2": 2, "4": 2, "1": 2, "3": 2, "5": 2, "6": 1, "7": 1}', 'index_word': '{"1": "hello", "2": "nice", "3": "to", "4": "meet", "5": "you", "6": "too", "7": "bob"}', 'word_index': '{"hello": 1, "nice": 2, "to": 3, "meet": 4, "you": 5, "too": 6, "bob": 7}'}

print("word_index:\n", tokenizer.word_index)

Output:
word_index:
{'hello': 1, 'nice': 2, 'to': 3, 'meet': 4, 'you': 5, 'too': 6, 'bob': 7}
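index_word is the inverse mapping of word_index, and sequences_to_texts uses it to turn id sequences back into (lowercased, filter-stripped) text. A quick check with the tokenizer fitted above:

print(tokenizer.sequences_to_texts([[7, 2, 3, 4, 5]]))
# Expected: ['bob nice to meet you']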

print("word_counts:\n", tokenizer.word_counts)

Output:
word_counts:
OrderedDict([('hello', 2), ('nice', 2), ('to', 2), ('meet', 2), ('you', 2), ('too', 1), ('bob', 1)])

print('texts_to_sequences:\n', tokenizer.texts_to_sequences(['Bob, nice to meet you.']))

Output:
texts_to_sequences:
[[7, 2, 3, 4, 5]]
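Note that with the default oov_token=None, words never seen during fit_on_texts are silently dropped from the output. Passing an oov_token reserves index 1 for unknown words; a minimal sketch (the token string and test sentence are illustrative):

oov_tokenizer = Tokenizer(oov_token='<OOV>')
oov_tokenizer.fit_on_texts(sentences)
# '<OOV>' takes index 1, so every known word's index shifts up by one;
# the unseen word 'stranger' maps to 1 instead of being dropped.
print(oov_tokenizer.texts_to_sequences(['Hello, stranger!']))
# Expected: [[2, 1]]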

print('texts_to_matrix:\n', tokenizer.texts_to_matrix(['Bob, nice to meet you.'], mode='binary'))
# mode can be "binary", "count", "tfidf", or "freq"; the default is "binary".

Output:
texts_to_matrix:
[[0. 0. 1. 1. 1. 1. 0. 1.]]
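The matrix has one row per text and vocab_size + 1 columns (column 0 is reserved and always zero), so the 7-word vocabulary above yields 8 columns, with 1s at the indices of nice, to, meet, you, and bob. With mode='count' the columns hold raw term frequencies instead; a quick sketch:

print(tokenizer.texts_to_matrix(['Hello hello Bob'], mode='count'))
# Expected: [[0. 2. 0. 0. 0. 0. 0. 1.]]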

it = tokenizer.texts_to_sequences_generator(['Hello, Bob.', 'Nice to meet you.'])
print(next(it))
print(next(it))

Output:
[1, 7]
[2, 3, 4, 5]

print(tokenizer.to_json())

Output:
{"class_name": "Tokenizer", "config": {"num_words": null, "filters": "!\"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n", "lower": true, "split": " ", "char_level": false, "oov_token": null, "document_count": 3, "word_counts": "{\"hello\": 2, \"nice\": 2, \"to\": 2, \"meet\": 2, \"you\": 2, \"too\": 1, \"bob\": 1}", "word_docs": "{\"nice\": 2, \"meet\": 2, \"hello\": 2, \"to\": 2, \"you\": 2, \"too\": 1, \"bob\": 1}", "index_docs": "{\"2\": 2, \"4\": 2, \"1\": 2, \"3\": 2, \"5\": 2, \"6\": 1, \"7\": 1}", "index_word": "{\"1\": \"hello\", \"2\": \"nice\", \"3\": \"to\", \"4\": \"meet\", \"5\": \"you\", \"6\": \"too\", \"7\": \"bob\"}", "word_index": "{\"hello\": 1, \"nice\": 2, \"to\": 3, \"meet\": 4, \"you\": 5, \"too\": 6, \"bob\": 7}"}}
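A sketch of the round trip mentioned in the method table, persisting the tokenizer to disk and restoring it later (the file name is arbitrary):

from tensorflow.keras.preprocessing.text import tokenizer_from_json

# Save the JSON string...
with open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(tokenizer.to_json())

# ...and rebuild an equivalent tokenizer from it later.
with open('tokenizer.json', encoding='utf-8') as f:
    restored = tokenizer_from_json(f.read())

print(restored.texts_to_sequences(['Nice to meet you.']))
# Expected: [[2, 3, 4, 5]] -- same ids as the original tokenizer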
