Method | Arguments | Return value / notes |
---|---|---|
fit_on_texts(texts) | list of texts | Returns nothing; updates the tokenizer's internal vocabulary in place. Must be called before texts_to_sequences or texts_to_matrix. |
fit_on_sequences(sequences) | list of sequences | Returns nothing; updates internal document counts in place. Required before sequences_to_matrix (if fit_on_texts was never called); a sketch follows this table. |
get_config() | none | Returns the tokenizer's details as a dict. |
sequences_to_matrix(sequences) | list of sequences | Converts a list of sequences into a matrix, one vector per sequence. |
sequences_to_texts(sequences) | list of sequences | Returns a list of texts. |
sequences_to_texts_generator(sequences) | list of sequences | Returns a generator that yields one text per sequence. |
texts_to_matrix(texts) | list of texts | Returns a matrix of text vectors, one row per text. |
texts_to_sequences(texts) | list of texts | Returns a list of sequences (one word-index list per text). |
texts_to_sequences_generator(texts) | list of texts | Generator version of texts_to_sequences(). |
to_json() | none | Returns a JSON string with the tokenizer's details; save it and restore the tokenizer later with keras.preprocessing.text.tokenizer_from_json(json_string). |
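Of these methods, fit_on_sequences is the only one not exercised in the walkthrough below, so here is a minimal sketch (the sequences values are made up). It is useful when you already have integer sequences but no raw text; num_words must then be given explicitly, because without fit_on_texts there is no word_index from which to infer the matrix width.

from tensorflow.keras.preprocessing.text import Tokenizer

# Hypothetical sequences of word indices produced elsewhere.
sequences = [[1, 2, 3], [2, 3, 4, 4]]

seq_tokenizer = Tokenizer(num_words=5)      # width of the output matrix
seq_tokenizer.fit_on_sequences(sequences)   # updates document counts only, no word_index
print(seq_tokenizer.sequences_to_matrix(sequences, mode='binary'))
# [[0. 1. 1. 1. 0.]
#  [0. 0. 1. 1. 1.]]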
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = ["Hello, nice to meet you.",
"Nice to meet you too!", "Hello, Bob"]
# The arguments below are the defaults, written out for reference.
tokenizer = Tokenizer(
    num_words=None,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True, split=' ', char_level=False, oov_token=None,
)
tokenizer.fit_on_texts(sentences)
print("tokenizer config:\n", tokenizer.get_config())
Output:
tokenizer config:
{'num_words': None, 'filters': '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', 'lower': True, 'split': ' ', 'char_level': False, 'oov_token': None, 'document_count': 3, 'word_counts': '{"hello": 2, "nice": 2, "to": 2, "meet": 2, "you": 2, "too": 1, "bob": 1}', 'word_docs': '{"nice": 2, "meet": 2, "hello": 2, "to": 2, "you": 2, "too": 1, "bob": 1}', 'index_docs': '{"2": 2, "4": 2, "1": 2, "3": 2, "5": 2, "6": 1, "7": 1}', 'index_word': '{"1": "hello", "2": "nice", "3": "to", "4": "meet", "5": "you", "6": "too", "7": "bob"}', 'word_index': '{"hello": 1, "nice": 2, "to": 3, "meet": 4, "you": 5, "too": 6, "bob": 7}'}
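Note that num_words is None above, so no vocabulary cap applies. A minimal sketch of the capped case (not from the original text): word_index still records every word, but the conversion methods only emit indices strictly below num_words, i.e. roughly the num_words-1 most frequent words.

top_tok = Tokenizer(num_words=3)   # only indices 1 and 2 survive conversion
top_tok.fit_on_texts(sentences)
print(top_tok.texts_to_sequences(['Nice to meet you']))
# [[2]] -- 'nice' has index 2; 'to', 'meet', 'you' have indices >= 3 and are dropped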
print("word_index:\n", tokenizer.word_index)
Output:
word_index:
{'hello': 1, 'nice': 2, 'to': 3, 'meet': 4, 'you': 5, 'too': 6, 'bob': 7}
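Indices start at 1; index 0 is never assigned to a word. Because oov_token is None here, words outside the vocabulary simply vanish from converted sequences. A minimal sketch of the alternative (the '<OOV>' string is an arbitrary choice): with oov_token set, it is inserted at index 1 and unseen words map to it.

oov_tok = Tokenizer(oov_token='<OOV>')
oov_tok.fit_on_texts(sentences)
print(oov_tok.texts_to_sequences(['Hello Alice']))
# [[2, 1]] -- 'hello' shifts to index 2; unknown 'alice' maps to the OOV index 1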
print("word_counts:\n", tokenizer.word_counts)
Output:
word_counts:
OrderedDict([('hello', 2), ('nice', 2), ('to', 2), ('meet', 2), ('you', 2), ('too', 1), ('bob', 1)])
print('texts_to_sequences:\n', tokenizer.texts_to_sequences(['Bob, nice to meet you.']))
Output:
texts_to_sequences:
[[7, 2, 3, 4, 5]]
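The reverse direction, sequences_to_texts, rebuilds strings from index lists; it cannot restore the original casing or punctuation, since those were stripped during tokenization. A short sketch:

seqs = tokenizer.texts_to_sequences(['Bob, nice to meet you.'])
print(tokenizer.sequences_to_texts(seqs))
# ['bob nice to meet you'] -- lowercased, punctuation already filtered away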
print('texts_to_matrix:\n', tokenizer.texts_to_matrix(['Bob, nice to meet you.'], mode='binary'))
# The mode argument can be "binary", "count", "tfidf" or "freq"; the default is "binary".
Output:
texts_to_matrix:
[[0. 0. 1. 1. 1. 1. 0. 1.]]
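texts_to_matrix is effectively texts_to_sequences followed by sequences_to_matrix, so the same mode values apply when starting from sequences. A minimal sketch with mode='count', where repeated words accumulate instead of clipping to 1:

seqs = tokenizer.texts_to_sequences(['Nice to meet you, nice to meet you.'])
print(tokenizer.sequences_to_matrix(seqs, mode='count'))
# [[0. 0. 2. 2. 2. 2. 0. 0.]]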
it = tokenizer.texts_to_sequences_generator(['Hello, Bob.', 'Nice to meet you.'])
print(next(it))
print(next(it))
Output:
[1, 7]
[2, 3, 4, 5]
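The symmetric generator, sequences_to_texts_generator, yields one text per sequence; a quick sketch:

it = tokenizer.sequences_to_texts_generator([[1, 7], [2, 3, 4, 5]])
print(next(it))   # hello bob
print(next(it))   # nice to meet you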
print(tokenizer.to_json())
Output:
{"class_name": "Tokenizer", "config": {"num_words": null, "filters": "!\"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n", "lower": true, "split": " ", "char_level": false, "oov_token": null, "document_count": 3, "word_counts": "{\"hello\": 2, \"nice\": 2, \"to\": 2, \"meet\": 2, \"you\": 2, \"too\": 1, \"bob\": 1}", "word_docs": "{\"nice\": 2, \"meet\": 2, \"hello\": 2, \"to\": 2, \"you\": 2, \"too\": 1, \"bob\": 1}", "index_docs": "{\"2\": 2, \"4\": 2, \"1\": 2, \"3\": 2, \"5\": 2, \"6\": 1, \"7\": 1}", "index_word": "{\"1\": \"hello\", \"2\": \"nice\", \"3\": \"to\", \"4\": \"meet\", \"5\": \"you\", \"6\": \"too\", \"7\": \"bob\"}", "word_index": "{\"hello\": 1, \"nice\": 2, \"to\": 3, \"meet\": 4, \"you\": 5, \"too\": 6, \"bob\": 7}"}}
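To close the loop on the to_json() row of the table, a minimal sketch of saving and restoring (the file name is arbitrary):

from tensorflow.keras.preprocessing.text import tokenizer_from_json

with open('tokenizer.json', 'w') as f:       # arbitrary file name
    f.write(tokenizer.to_json())
with open('tokenizer.json') as f:
    restored = tokenizer_from_json(f.read())
print(restored.texts_to_sequences(['Nice to meet you.']))
# [[2, 3, 4, 5]] -- same mapping as the original tokenizer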