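Notes on what I learned about TensorFlow while working on my thesis.
I went straight to the official documentation.
To read data efficiently, it helps to serialize the data and store it in a set of files that can each be read linearly. This is especially useful when the data is streamed over a network, and it also helps with caching preprocessed data.
TFRecord is a simple format for storing a sequence of serialized binary records.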
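To make "a sequence of binary records read linearly" concrete, here is a minimal sketch using only the Python standard library. It is a simplified illustration and not the actual TFRecord layout: the real format also stores checksums next to each length and payload, which are left out here. The names `write_records` and `read_records` are hypothetical.

```python
import io
import struct

def write_records(stream, records):
    """Write each record as an 8-byte little-endian length followed by the payload."""
    for payload in records:
        stream.write(struct.pack('<Q', len(payload)))
        stream.write(payload)

def read_records(stream):
    """Yield payloads one after another by reading the stream front to back."""
    while True:
        header = stream.read(8)
        if not header:
            break  # clean end of stream
        if len(header) < 8:
            raise ValueError('truncated length header')
        (length,) = struct.unpack('<Q', header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError('truncated record')
        yield payload

# Round trip through an in-memory buffer; a file opened in binary mode
# would be used the same way.
buffer = io.BytesIO()
write_records(buffer, [b'cat', b'dog', b'chicken'])
buffer.seek(0)
recovered = list(read_records(buffer))
print(recovered)
```

Because every record carries its own length, a reader never needs an index: it only moves forward, which is what makes this kind of file cheap to stream.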
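Protocol buffers are a cross-platform, cross-language library for serializing data. I looked up what they are; the official site gives the following example.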
message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }
  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }
  repeated PhoneNumber phone = 4;
}
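To see how a schema like this turns into bytes, the following is a minimal hand-rolled sketch of the protobuf wire format for the `name` and `id` fields only, in plain Python. In practice this encoding is done by code generated from the `.proto` file; the helpers `encode_varint` and `encode_person` are hypothetical names, and the sample values are made up for illustration.

```python
def encode_varint(value):
    """Encode a non-negative integer as a protobuf varint (7 bits per byte, low bits first)."""
    if value < 0:
        raise ValueError('this sketch only handles non-negative integers')
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)

def encode_person(name, person_id):
    """Encode `name` (field 1, string) and `id` (field 2, int32) of Person."""
    name_bytes = name.encode('utf-8')
    # Every field starts with a tag: (field_number << 3) | wire_type.
    name_tag = encode_varint((1 << 3) | 2)  # wire type 2: length-delimited
    id_tag = encode_varint((2 << 3) | 0)    # wire type 0: varint
    return (name_tag + encode_varint(len(name_bytes)) + name_bytes
            + id_tag + encode_varint(person_id))

encoded = encode_person('Ann', 150)
print(encoded)
```

The field numbers written in the schema (`= 1`, `= 2`) are what end up in the tag bytes; the field names themselves are never stored in the message.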
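The tf.Example message is a flexible kind of protobuf message, designed specifically for use with TensorFlow.
Example code: the official tutorial provides the following helpers for taking in data.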
from __future__ import absolute_import, division, print_function, unicode_literals
# Notebook-only step (IPython shell command, not valid in a plain Python script):
# !pip install tf-nightly
import tensorflow as tf
import numpy as np
import IPython.display as display
# The following functions can be used to convert a value to a type compatible
# with tf.Example.
def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy()  # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))
def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
Next, test the helpers on some data and see what comes out.
print(_bytes_feature(b'test_string'))
print(_bytes_feature(u'test_bytes'.encode('utf-8')))
print(_float_feature(np.exp(1)))
print(_int64_feature(True))
print(_int64_feature(1))
The output looks very protobuf-like:
bytes_list {
  value: "test_string"
}
bytes_list {
  value: "test_bytes"
}
float_list {
  value: 2.7182817459106445
}
int64_list {
  value: 1
}
int64_list {
  value: 1
}
Next, serialize a feature:
feature = _float_feature(np.exp(1))
feature.SerializeToString()
Result:
b'\x12\x06\n\x04T\xf8-@'
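The eight bytes above can be read by hand, which shows what `SerializeToString()` actually produced. The sketch below uses only the standard library and rests on two assumptions about the `tf.train.Feature` schema that are consistent with these bytes: `float_list` is field 2, and its values are stored packed as little-endian 32-bit floats.

```python
import struct

serialized = b'\x12\x06\n\x04T\xf8-@'

# The first byte is a tag: field number in the high bits, wire type in the low 3 bits.
outer_field, outer_wire_type = serialized[0] >> 3, serialized[0] & 0x07
outer_length = serialized[1]  # number of bytes in the nested FloatList
float_list = serialized[2:2 + outer_length]

# The nested message has the same tag/length/payload shape.
inner_field, inner_wire_type = float_list[0] >> 3, float_list[0] & 0x07
inner_length = float_list[1]  # 4 bytes, i.e. a single float32
(value,) = struct.unpack('<f', float_list[2:2 + inner_length])

print(outer_field, outer_wire_type)
print(inner_field, inner_wire_type)
print(value)
```

The value is stored as a 32-bit float, which is why the printed feature earlier shows 2.7182817459106445 rather than the full double-precision value of e.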
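With the inputs wrapped, the next question is how to build your own data structure from the data, which mainly means constructing a mapping dictionary.
It comes down to the following steps:
1. Convert each value in an observation to a tf.train.Feature holding one of the three compatible types.
2. Build a dictionary that maps each feature name to the encoded feature produced in step 1.
3. Finally, convert that dictionary into a Features message.
The official tutorial continues with an example: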
# The number of observations in the dataset.
n_observations = int(1e4)
# Boolean feature, encoded as False or True.
feature0 = np.random.choice([False, True], n_observations)
# Integer feature, random from 0 to 4.
feature1 = np.random.randint(0, 5, n_observations)
# String feature
strings = np.array([b'cat', b'dog', b'chicken', b'horse', b'goat'])
feature2 = strings[feature1]
# Float feature, from a standard normal distribution
feature3 = np.random.randn(n_observations)
def serialize_example(feature0, feature1, feature2, feature3):
  """
  Creates a tf.Example message ready to be written to a file.
  """
  # Create a dictionary mapping the feature name to the tf.Example-compatible
  # data type.
  feature = {
      'feature0': _int64_feature(feature0),
      'feature1': _int64_feature(feature1),
      'feature2': _bytes_feature(feature2),
      'feature3': _float_feature(feature3),
  }
  # Create a Features message using tf.train.Example.
  example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
  return example_proto.SerializeToString()  # Return the serialized message.
# This is an example observation from the dataset.
example_observation = []
serialized_example = serialize_example(False, 4, b'goat', 0.9876)
serialized_example
Result:
b'\nR\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04[\xd3|?'
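The byte string above can also be walked by hand to see how the feature dictionary is laid out. The following is a rough sketch in plain Python, not how TensorFlow itself parses the message. It assumes the field numbering of the tf.train.Example schema (map entries with key as field 1 and value as field 2; bytes_list, float_list and int64_list as fields 1, 2 and 3 of Feature) and that every list holds exactly one value, as in this example. The helper names are hypothetical.

```python
import struct

serialized_example = (
    b'\nR\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00'
    b'\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04'
    b'\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat'
    b'\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04[\xd3|?'
)

def read_varint(data, pos):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def length_delimited_fields(data):
    """Yield (field_number, payload) pairs; only wire type 2 is handled."""
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        if tag & 0x07 != 2:
            raise ValueError('only length-delimited fields are handled in this sketch')
        length, pos = read_varint(data, pos)
        yield tag >> 3, data[pos:pos + length]
        pos += length

decoded = {}
[(_, features)] = length_delimited_fields(serialized_example)  # Example.features
for _, entry in length_delimited_fields(features):             # one map entry per feature
    parts = dict(length_delimited_fields(entry))               # 1: key, 2: Feature
    [(kind, typed_list)] = length_delimited_fields(parts[2])   # 1: bytes, 2: float, 3: int64
    [(_, payload)] = length_delimited_fields(typed_list)       # the list's `value` field
    if kind == 1:
        value = payload
    elif kind == 2:
        (value,) = struct.unpack('<f', payload)
    else:
        value, _ = read_varint(payload, 0)
    decoded[parts[1].decode('utf-8')] = value

print(decoded)
```

Unlike protobuf field names, the feature names (`feature0` and so on) do appear in the bytes, because they are dictionary keys, that is, data rather than schema.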
To decode it back into protobuf form:
example_proto = tf.train.Example.FromString(serialized_example)
example_proto
Result:
features {
  feature {
    key: "feature0"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "feature1"
    value {
      int64_list {
        value: 4
      }
    }
  }
  feature {
    key: "feature2"
    value {
      bytes_list {
        value: "goat"
      }
    }
  }
  feature {
    key: "feature3"
    value {
      float_list {
        value: 0.9876000285148621
      }
    }
  }
}
To be continued.