So far we have used only datasets that fit in memory, but Deep Learning systems are often trained on very large datasets that will not fit in RAM. Ingesting a large dataset and preprocessing it efficiently can be tricky to implement with other Deep Learning libraries, but TensorFlow makes it easy thanks to the Data API: you just create a dataset object, and tell it where to get the data and how to transform it. TensorFlow takes care of all the implementation details, such as multithreading, queuing, batching, and prefetching. Moreover, the Data API works seamlessly with tf.keras!
Off the shelf, the Data API can read from text files (such as CSV files), binary files with fixed-size records, and binary files that use TensorFlow’s TFRecord format, which supports records of varying sizes. TFRecord is a flexible and efficient binary format usually containing protocol buffers (an open source binary format). The Data API also has support for reading from SQL databases. Moreover, many open source extensions are available to read from all sorts of data sources, such as Google’s BigQuery service.
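For instance, here is a minimal sketch of reading rows from a SQLite database with the experimental SqlDataset class (the my_data.sqlite file and its users table are made up for illustration):
import tensorflow as tf
# Hypothetical SQLite file and table, just to illustrate the API
sql_dataset = tf.data.experimental.SqlDataset(
    driver_name="sqlite",
    data_source_name="my_data.sqlite",
    query="SELECT name, age FROM users",
    output_types=(tf.string, tf.int32))
for name, age in sql_dataset.take(3):
    print(name.numpy(), age.numpy())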
Reading huge datasets efficiently is not the only difficulty: the data also needs to be preprocessed, usually normalized. Moreover, it is not always composed strictly of convenient numerical fields: there may be text features, categorical features, and so on. These need to be encoded, for example using one-hot encoding, bag-of-words encoding, or embeddings (as we will see, an embedding is a trainable dense vector that represents a category or token). One option to handle all this preprocessing is to
write your own custom preprocessing layers. Another is to use the standard preprocessing layers provided by Keras.
In this chapter, we will cover the Data API, the TFRecord format, and how to create custom preprocessing layers and use the standard Keras ones. We will also take a quick look at a few related projects from TensorFlow’s ecosystem:
TF Transform (tf.Transform)
TF Datasets (TFDS)
So let’s get started!
The whole Data API revolves around the concept of a dataset: as you might suspect, this represents a sequence of data items. Usually you will use datasets that gradually read data from disk, but for simplicity
let’s create a dataset entirely in RAM using tf.data.Dataset.from_tensor_slices():
import tensorflow as tf
X = tf.range(10)  # any data tensor
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset
The from_tensor_slices() function takes a tensor and creates a tf.data.Dataset whose elements are all the slices of X (along the first dimension), so this dataset contains 10 items: tensors 0, 1, 2, …, 9.
for item in dataset:
print(item)
In this case we would have obtained an equivalent dataset if we had used tf.data.Dataset.range(10) (except that the items would be 64-bit integers rather than 32-bit integers):
dataset2 = tf.data.Dataset.range(10)
dataset2
You can simply iterate over a dataset’s items like this:
for item in dataset2:
print(item)
Once you have a dataset, you can apply all sorts of transformations to it by calling its transformation methods. Each method returns a new dataset, so you can chain transformations like this (this chain is illustrated in Figure 13-1):
Figure 13-1. Chaining dataset transformations
dataset = dataset.repeat(3).batch(7)
for item in dataset:
print(item)
In this example, we first call the repeat() method on the original dataset, and it returns a new dataset that will repeat the items of the original dataset three times. Of course, this will not copy all the data in memory three times! (If you call this repeat() method with no arguments, the new dataset will repeat the source dataset forever, so the code that iterates over the dataset will have to decide when to stop.)
Then we call the batch() method on this new dataset, and again this creates a new dataset. This one will group the items of the previous dataset in batches of seven items. Finally, we iterate over the items of this final dataset using a for loop. As you can see, the batch() method had to output a final batch of size two instead of seven, but you can call it with drop_remainder=True if you want it to drop this final batch so that all batches have the exact same size.
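For example, here is a quick sketch of both points: repeat() with no argument plus take() to decide when to stop, and batch() with drop_remainder=True to get equal-sized batches:
# repeat() with no argument repeats forever, so take() decides when to stop
for item in tf.data.Dataset.range(3).repeat().take(7):
    print(item)  # 0, 1, 2, 0, 1, 2, 0

# drop_remainder=True drops the final, smaller batch so every batch has exactly 7 items
for item in tf.data.Dataset.range(10).repeat(3).batch(7, drop_remainder=True):
    print(item)  # four batches of 7 items (the last 2 items are dropped)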
The dataset methods do not modify datasets, they create new ones, so make sure to keep a reference to these new datasets (e.g., with dataset = ...), or else nothing will happen.
You can also transform the items by calling the map() method. For example, this creates a new dataset with all items doubled:
dataset = dataset.map(lambda x: x*2) # [0 1 2 3 4 5 6] ==> [0,2,4,6,8,10,12]
for item in dataset:
print(item)
This map() method is the one you will call to apply any preprocessing you want to your data. Sometimes this will include computations that can be quite intensive, such as reshaping or rotating an image, so you will usually want to spawn multiple threads to speed things up: it’s as simple as setting the num_parallel_calls argument. Note that the function you pass to the map() method must be convertible to a TF Function (see Chapter 12).
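For example, here is a small sketch of a parallel map (run on a throwaway dataset so the pipeline built above is left untouched); tf.data.experimental.AUTOTUNE lets TensorFlow pick the number of threads for you:
tmp = tf.data.Dataset.range(10)
tmp = tmp.map(lambda x: x * 2,
              num_parallel_calls=tf.data.experimental.AUTOTUNE)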
While the map() method applies a transformation to each item, the apply() method applies a transformation to the dataset as a whole. For example, the following code unbatches the dataset: this used to require the experimental tf.data.experimental.unbatch() function (applied via apply()), but it has since moved to the core API as the unbatch() method. Each item in the new dataset will be a single-integer tensor instead of a batch of seven integers:
#dataset = dataset.apply(tf.data.experimental.unbatch()) # Now deprecated
dataset = dataset.unbatch()
for item in dataset:
print(item)
It is also possible to simply filter the dataset using the filter() method:
dataset = dataset.filter(lambda x: x<10) # Items: 0 2 4 6 8 0 2 4 6...
for item in dataset:
print(item)
You will often want to look at just a few items from a dataset. You can use the take() method for that:
for item in dataset.take(3):
print(item)
As you know, Gradient Descent works best when the instances in the training set are independent and identically distributed (see Chapter 4). A simple way to ensure this is to shuffle the instances, using the shuffle() method. It will create a new dataset that will start by filling up a buffer with the first items of the source dataset. Then, whenever it is asked for an item, it will pull one out randomly from the buffer and replace it with a fresh one from the source dataset, until it has iterated entirely through the source dataset. At this point it continues to pull out items randomly from the buffer until it is empty. You must specify the buffer size, and it is important to make it large enough, or else shuffling will not be very effective. Just don’t exceed the amount of RAM you have, and even if you have plenty of it, there’s no need to go beyond the dataset’s size. You can provide a random seed if you want the same random order every time you run your program.
For example, the following code creates and displays a dataset containing the integers 0 to 9, repeated 3 times, shuffled using a buffer of size 3 and a random seed of 42, and batched with a batch size of 7:
(For more details on tf.data.Dataset.shuffle(buffer_size), see https://www.jianshu.com/p/1285036e314c.)
tf.random.set_seed(42)
dataset = tf.data.Dataset.range(10).repeat(3) #(0~9, 0~9, 0~9)
dataset = dataset.shuffle(buffer_size=3, seed=42).batch(7)
for item in dataset:
print(item)
If you call repeat() on a shuffled dataset, by default it will generate a new order at every iteration. This is generally a good idea, but if you prefer to reuse the same order at each iteration (e.g., for tests or debugging), you can set reshuffle_each_iteration=False.
tf.random.set_seed(42)
dataset = tf.data.Dataset.range(10)
# tf.Tensor(0, shape=(), dtype=int64)
# tf.Tensor(1, shape=(), dtype=int64)
# tf.Tensor(2, shape=(), dtype=int64)
# tf.Tensor(3, shape=(), dtype=int64)
# tf.Tensor(4, shape=(), dtype=int64)
# tf.Tensor(5, shape=(), dtype=int64)
# tf.Tensor(6, shape=(), dtype=int64)
# tf.Tensor(7, shape=(), dtype=int64)
# tf.Tensor(8, shape=(), dtype=int64)
# tf.Tensor(9, shape=(), dtype=int64)
dataset = dataset.shuffle(buffer_size=3, seed=42).batch(7)
# tf.Tensor([1 3 0 4 2 5 6], shape=(7,), dtype=int64)
# tf.Tensor([8 7 9], shape=(3,), dtype=int64)
dataset = dataset.repeat(3)
for item in dataset:
print(item)
tf.random.set_seed(42)
dataset = tf.data.Dataset.range(10)
# tf.Tensor(0, shape=(), dtype=int64)
# tf.Tensor(1, shape=(), dtype=int64)
# tf.Tensor(2, shape=(), dtype=int64)
# tf.Tensor(3, shape=(), dtype=int64)
# tf.Tensor(4, shape=(), dtype=int64)
# tf.Tensor(5, shape=(), dtype=int64)
# tf.Tensor(6, shape=(), dtype=int64)
# tf.Tensor(7, shape=(), dtype=int64)
# tf.Tensor(8, shape=(), dtype=int64)
# tf.Tensor(9, shape=(), dtype=int64)
dataset = dataset.shuffle(buffer_size=3, seed=42, reshuffle_each_iteration=False).batch(7)
# tf.Tensor([0 2 3 5 6 4 8], shape=(7,), dtype=int64)
# tf.Tensor([9 1 7], shape=(3,), dtype=int64)
dataset = dataset.repeat(3)
for item in dataset:
print(item)
tf.random.set_seed(42)
dataset = tf.data.Dataset.range(10)
# tf.Tensor(0, shape=(), dtype=int64)
# tf.Tensor(1, shape=(), dtype=int64)
# tf.Tensor(2, shape=(), dtype=int64)
# tf.Tensor(3, shape=(), dtype=int64)
# tf.Tensor(4, shape=(), dtype=int64)
# tf.Tensor(5, shape=(), dtype=int64)
# tf.Tensor(6, shape=(), dtype=int64)
# tf.Tensor(7, shape=(), dtype=int64)
# tf.Tensor(8, shape=(), dtype=int64)
# tf.Tensor(9, shape=(), dtype=int64)
dataset = dataset.shuffle(buffer_size=3, seed=42, reshuffle_each_iteration=False).batch(7).repeat(3)
# tf.Tensor([0 2 3 5 6 4 8], shape=(7,), dtype=int64)
# tf.Tensor([9 1 7], shape=(3,), dtype=int64)
for item in dataset:
print(item)
For a large dataset that does not fit in memory, this simple shuffling-buffer approach may not be sufficient, since the buffer will be small compared to the dataset.
One solution is to shuffle the source data itself (for example, on Linux you can shuffle text files using the shuf command). This will definitely improve shuffling a lot! Even if the source data is shuffled, you will usually want to shuffle it some more, or else the same order will be repeated at each epoch, and the model may end up being biased (e.g., due to some spurious patterns present by chance in the source data’s order).
To shuffle the instances some more, a common approach is to split the source data into multiple files, then read them in a random order during training. However, instances located in the same file will still end up close to each other. To avoid this you can pick multiple files randomly and read them simultaneously, interleaving their records. Then on top of that you can add a shuffling buffer using the shuffle() method. If all this sounds like a lot of work, don’t worry: the Data API makes all this possible in just a few lines of code. Let’s see how to do this.
Split the California housing dataset into multiple CSV files
Let's start by loading and preparing the California housing dataset. We first load it, then split it into a training set, a validation set and a test set, and finally we scale it:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
housing.data, housing.target.reshape(-1, 1), random_state=42
)
X_train, X_valid, y_train, y_valid = train_test_split(
X_train_full, y_train_full, random_state=42
)
scaler = StandardScaler()
scaler.fit(X_train)
X_mean = scaler.mean_
X_std = scaler.scale_
For a very large dataset that does not fit in memory, you will typically want to split it into many files first, then have TensorFlow read these files in parallel. To demonstrate this, let's start by splitting the housing dataset and saving it to 20 CSV files:
import os
We will use os.makedirs(name, mode=0o777, exist_ok=False) to create the output directory. It works like mkdir, except that any missing intermediate directories are created as well, and if the target directory already exists, no exception is raised as long as exist_ok is True.
import numpy as np
def save_to_multiple_csv_files(data, name_prefix, header=None, n_parts=10):
    housing_dir = os.path.join("datasets", "housing")
    os.makedirs(housing_dir, exist_ok=True)
    path_format = os.path.join(housing_dir, "my_{}_{:02d}.csv")

    filepaths = []
    m = len(data)
    # file_idx: index of the current file; row_indices: indices of the rows that go into it
    for file_idx, row_indices in enumerate(np.array_split(np.arange(m), n_parts)):  # iterate over files
        part_csv = path_format.format(name_prefix, file_idx)
        filepaths.append(part_csv)
        # "wt": write in text mode
        with open(part_csv, "wt", encoding="utf-8") as f:
            if header is not None:
                f.write(header)
                f.write("\n")
            for row_idx in row_indices:  # iterate over rows
                # repr() gives an unambiguous representation (full float precision),
                # whereas str() aims for a human-friendly one
                f.write(",".join([repr(col) for col in data[row_idx]]))
                f.write("\n")
    return filepaths
train_data = np.c_[X_train, y_train]
valid_data = np.c_[X_valid, y_valid]
test_data = np.c_[X_test, y_test]
# the header: feature names plus the target column name
header_cols = housing.feature_names + ["MedianHouseValue"]
header = ",".join(header_cols)
train_filepaths = save_to_multiple_csv_files( train_data, "train", header, n_parts=20 )
valid_filepaths = save_to_multiple_csv_files( valid_data, "valid", header, n_parts=10 )
test_filepaths = save_to_multiple_csv_files( test_data, "test", header, n_parts=10 )
Okay, now let's take a peek at the first few lines of one of these CSV files:
import pandas as pd
pd.read_csv( train_filepaths[0]).head()
with open(train_filepaths[0]) as f:
for i in range(5):
print(f.readline(), end="")
Now, you’ve loaded the California housing dataset, shuffled it (unless it was already shuffled), and split it into a training set, a validation set, and a test set. Then you split each set into many CSV files that each look like this (each row contains eight input features plus the target median house value).
Let’s also suppose train_filepaths contains the list of training file paths (and you also have valid_filepaths and test_filepaths):
train_filepaths
Now let’s create a dataset containing only these file paths, using the list_files() static method:
@staticmethod
list_files(file_pattern, shuffle=None, seed=None)
Note: the default behavior of this method is to return filenames in a non-deterministic, shuffled order. Pass a seed or shuffle=False to get results in a deterministic order.
file_pattern: a string, a list of strings, or a tf.Tensor of string type (scalar or vector), representing the filename glob (i.e., shell wildcard) pattern(s) that will be matched.
The file_pattern argument should be a small number of glob patterns. If your filenames have already been globbed, use Dataset.from_tensor_slices(filenames) instead, as re-globbing every filename with list_files may result in poor performance with remote storage systems.
https://www.tensorflow.org/api_docs/python/tf/data/Dataset
filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)
for filepath in filepath_dataset:
print(filepath)
By default, the list_files() function returns a dataset that shuffles the file paths. In general this is a good thing, but you can set shuffle=False if you do not want that for some reason.
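For example, here is a sketch of listing the training files in a deterministic order (ordered_filepaths is just an illustrative variable name):
ordered_filepaths = tf.data.Dataset.list_files(train_filepaths, shuffle=False)
for filepath in ordered_filepaths.take(3):
    print(filepath)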
Alternatively, you could use file patterns; for example, train_filepaths = "datasets/housing/my_train_*.csv". Now let’s create a dataset containing only these file paths:
train_filepaths= "datasets/housing/my_train_*.csv"
train_filepaths
filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)
for filepath in filepath_dataset:
print(filepath)
Next, you can call the interleave() method to read from five files (n_readers = 5) at a time and interleave their lines (skipping the first line of each file, which is the header row, using the skip() method):
n_readers = 5
dataset = filepath_dataset.interleave(
lambda filepath: tf.data.TextLineDataset(filepath).skip(1), #skip the header row #map_func
cycle_length=n_readers
)
The interleave() method will create a dataset that will pull five file paths (cycle_length=n_readers) from the filepath_dataset, and for each one it will call the function you gave it (a lambda in this example) to create a new dataset (in this case a TextLineDataset).
To be clear, at this stage there will be seven datasets in all: the filepath dataset, the interleave dataset, and the five TextLineDatasets created internally by the interleave dataset.
When we iterate over the interleave dataset, it will cycle through these five TextLineDatasets, reading one line at a time from each until all datasets are out of items. Then it will get the next five file paths from the filepath_dataset and interleave them the same way, and so on until it runs out of file paths.
For interleaving to work best, it is preferable to have files of identical length; otherwise the ends of the longest files will not be interleaved.
interleave(
map_func, cycle_length=None, block_length=None, num_parallel_calls=None,
deterministic=None
)
By default, interleave() does not use parallelism; it just reads one line at a time from each file, sequentially. If you want it to actually read files in parallel, you can set the num_parallel_calls argument to the number of threads you want (note that the map() method also has this argument). You can even set it to tf.data.experimental.AUTOTUNE to make TensorFlow choose the right number of threads dynamically based on the available CPU (however, this is an experimental feature for now). Let’s look at what the dataset contains now:
for line in dataset.take(5):
print(line.numpy())
These are the first rows (ignoring the header row) of five CSV files, chosen randomly. Looks good! But as you can see, these are just byte strings; we need to parse them and scale the data.
Before writing that function, let's look at tf.io.decode_csv(), which parses one CSV line into a list of tensors. It needs a record_defaults argument containing one default value per column, for example [0, np.nan, tf.constant(np.nan, dtype=tf.float64), "Hello", tf.constant([])]:
tf.io.decode_csv(
records, record_defaults, field_delim=',', use_quote_delim=True, na_value='',
select_cols=None, name=None
)
Convert CSV records to tensors. Each column maps to one tensor.
record_defaults: a list of Tensor objects with specific types. Acceptable types are float32, float64, int32, int64, and string. Provide one tensor per column of the input record, with either a scalar default value for that column or an empty vector if the column is required.
record_defaults = [0,
np.nan,
tf.constant(np.nan, dtype=tf.float64),
"Hello",
tf.constant([])
]
parsed_fields = tf.io.decode_csv( '1,2,3,4,5', record_defaults )
parsed_fields
Note the byte string b'4' in the fourth tensor: since we gave a string default ("Hello") for that column, it is parsed as a string.
Also notice that all missing fields are replaced with their default value, when provided (the last field has no default, so we must supply it; the 5 we pass is parsed as 5.0):
parsed_fields = tf.io.decode_csv(',,,,5', record_defaults)
parsed_fields
The fifth field is compulsory (since we provided tf.constant([]) as its "default value"), and the number of fields must match the number of entries in record_defaults exactly, so we get an exception if we pass the wrong number of fields:
try:
    parsed_fields = tf.io.decode_csv('1,2,3,4,5,6,7', record_defaults)
except tf.errors.InvalidArgumentError as ex:
    print(ex)
Let’s implement a small function that will perform this preprocessing:
https://blog.csdn.net/Linli522362242/article/details/107459161
Once the optimized graph is ready, the TF Function efficiently executes the operations in the graph, in the appropriate order (and in parallel when it can). As a result, a TF Function will usually run much faster than the original Python function, especially if it performs complex computations
# scaler = StandardScaler()
# scaler.fit(X_train)
# X_mean = scaler.mean_
# X_std = scaler.scale_
n_inputs = 8 # X_train.shape[-1] # X_train.shape=(11610, 8)
@tf.function
def preprocess(line):
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]  # record_defaults
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])
    y = tf.stack(fields[-1:])
    return (x - X_mean) / X_std, y
Let’s walk through this code: it assumes we have precomputed the mean and standard deviation of each training feature (X_mean and X_std, obtained earlier with a StandardScaler). The defs list contains a default value of 0. for each of the eight input features, plus an empty tf.float32 tensor for the target, meaning the target has no default value: it is required. tf.io.decode_csv() returns a list of scalar tensors (one per column), so we use tf.stack() to combine the first eight into a 1D feature tensor x and the last one into the target tensor y. Finally, we scale the input features and return the (scaled features, target) pair.
Let’s test this preprocessing function:
preprocess(b'4.2083,44.0,5.3232,0.9171,846.0,2.3370,37.47,-122.2,2.782')
Looks good! We can now apply the function to the dataset.
To make the code reusable, let’s put together everything we have discussed so far into a small helper function: it will create and return a dataset that will efficiently load California housing data from multiple CSV files, preprocess it, shuffle it, optionally
repeat it, and batch it (see Figure 13-2):
def csv_reader_dataset(filepaths, repeat=1, n_readers=5, n_read_threads=None,
                       shuffle_buffer_size=10000, n_parse_threads=5, batch_size=32):
    # list_files() returns a dataset that shuffles the file paths; repeat it 'repeat' times
    dataset = tf.data.Dataset.list_files(filepaths).repeat(repeat)
    # read from cycle_length (=n_readers) files at a time and interleave their lines,
    # skipping each file's header row; each call to the lambda creates a TextLineDataset
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=n_readers,
        num_parallel_calls=n_read_threads)
    # add a shuffling buffer on top of the interleaved records
    dataset = dataset.shuffle(shuffle_buffer_size)
    # parse each CSV line into (scaled features, target) using multiple threads
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    # group the items of the previous dataset into batches of batch_size items
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(1)
Everything should make sense in this code, except the very last line (prefetch(1)), which is important for performance.
Figure 13-2. Loading and preprocessing data from multiple CSV files
By calling prefetch(1) at the end, we are creating a dataset that will do its best to always be one batch ahead. In other words, while our training algorithm is working on one batch, the dataset will already be working in parallel on getting the next batch
ready (e.g., reading the data from disk and preprocessing it). This can improve performance dramatically, as is illustrated in Figure 13-3. If we also ensure that loading and preprocessing are multithreaded (by setting num_parallel_calls when calling interleave() and map()), we can exploit multiple cores on the CPU and hopefully make preparing one batch of data shorter than running a training step on the GPU:
this way the GPU will be almost 100% utilized (except for the data transfer time from the CPU to the GPU), and training will run much faster.
Figure 13-3. With prefetching, the CPU and the GPU work in parallel: as the GPU works on one batch, the CPU works on the next
If you plan to purchase a GPU card, its processing power and its memory size are of course very important (in particular, a large amount of RAM is crucial for computer vision). Just as important to get good performance is its memory bandwidth; this is the number of gigabytes of data it can get into or out of its RAM per second.
tf.random.set_seed(42)
train_set = csv_reader_dataset(train_filepaths, batch_size=3)
for X_batch, y_batch in train_set.take(2):
print('X =', X_batch)
print('y =', y_batch)
print()
If the dataset is small enough to fit in memory, you can significantly speed up training by using the dataset’s cache() method to cache its content to RAM. You should generally do this after loading and preprocessing the data, but before shuffling, repeating, batching, and prefetching. This way, each instance will only be read and preprocessed once (instead of once per epoch), but the data will still be shuffled differently at each epoch, and the next batch will still be prepared in advance.
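For example, here is a sketch of where cache() would fit in a pipeline like ours (only sensible if the preprocessed data fits in RAM; cached_set is just an illustrative name):
cached_set = tf.data.Dataset.list_files(train_filepaths)
cached_set = cached_set.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=5)
cached_set = cached_set.map(preprocess)  # load and preprocess each instance once
cached_set = cached_set.cache()          # then cache the result in RAM
cached_set = cached_set.shuffle(10000).batch(32).prefetch(1)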
You now know how to build efficient input pipelines to load and preprocess data from multiple text files. We have discussed the most common dataset methods, but there are a few more you may want to look at: concatenate(), zip(), window(), reduce(), shard(), flat_map(), and padded_batch(). There are also a couple more class methods: from_generator() and from_tensors(), which create a new dataset from a Python generator or a list of tensors, respectively. Please check the API documentation for more details. Also note that there are experimental features available in tf.data.experimental, many of which will likely make it to the core API in future releases (e.g., check out the CsvDataset class, as well as the make_csv_dataset() method, which takes care of inferring the type of each column).
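For example, here is a quick sketch of three of them: zip(), window() (combined with flat_map()), and from_generator() (the generator shown is made up):
# zip(): combine two datasets element-wise
a = tf.data.Dataset.range(3)        # 0, 1, 2
b = tf.data.Dataset.range(10, 13)   # 10, 11, 12
for x, y in tf.data.Dataset.zip((a, b)):
    print(x.numpy(), y.numpy())     # (0, 10), (1, 11), (2, 12)

# window(): group items into nested datasets of length 3; flat_map() turns each window into a tensor
windows = tf.data.Dataset.range(6).window(3, drop_remainder=True)
for w in windows.flat_map(lambda window: window.batch(3)):
    print(w.numpy())                # [0 1 2] then [3 4 5]

# from_generator(): build a dataset from a plain Python generator
def gen():
    for i in range(3):
        yield i * 10
ds = tf.data.Dataset.from_generator(gen, output_types=tf.int64)
for item in ds:
    print(item.numpy())             # 0, 10, 20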
Here is a short description of each method in the Dataset class:
for m in dir(tf.data.Dataset):
    if not (m.startswith("_") or m.endswith("_")):
        func = getattr(tf.data.Dataset, m)
        if hasattr(func, "__doc__"):
            print("● {:21s}{}".format(m + "()", func.__doc__.split("\n")[0]))
● apply() Applies a transformation function to this dataset.
● as_numpy_iterator() Returns an iterator which converts all elements of the dataset to numpy.
● batch() Combines consecutive elements of this dataset into batches.
● cache() Caches the elements in this dataset.
● concatenate() Creates a `Dataset` by concatenating the given dataset with this dataset.
● element_spec() The type specification of an element of this dataset.
● enumerate() Enumerates the elements of this dataset.
● filter() Filters this dataset according to `predicate`.
● flat_map() Maps `map_func` across this dataset and flattens the result.
● from_generator() Creates a `Dataset` whose elements are generated by `generator`.
● from_tensor_slices() Creates a `Dataset` whose elements are slices of the given tensors.
● from_tensors() Creates a `Dataset` with a single element, comprising the given tensors.
● interleave() Maps `map_func` across this dataset, and interleaves the results.
● list_files() A dataset of all files matching one or more glob patterns.
● map() Maps `map_func` across the elements of this dataset.
● options() Returns the options for this dataset and its inputs.
● padded_batch() Combines consecutive elements of this dataset into padded batches.
● prefetch() Creates a `Dataset` that prefetches elements from this dataset.
● range() Creates a `Dataset` of a step-separated range of values.
● reduce() Reduces the input dataset to a single element.
● repeat() Repeats this dataset so each original value is seen `count` times.
● shard() Creates a `Dataset` that includes only 1/`num_shards` of this dataset.
● shuffle() Randomly shuffles the elements of this dataset.
● skip() Creates a `Dataset` that skips `count` elements from this dataset.
● take() Creates a `Dataset` with at most `count` elements from this dataset.
● unbatch() Splits elements of a dataset into multiple elements.
● window() Combines (nests of) input elements into a dataset of (nests of) windows.
● with_options() Returns a new `tf.data.Dataset` with the given options set.
● zip() Creates a `Dataset` by zipping together the given datasets.
Now we can use the csv_reader_dataset() function to create a dataset for the training set. Note that we do not need to repeat it, as this will be taken care of by tf.keras. We also create datasets for the validation set and the test set:
train_set = csv_reader_dataset(train_filepaths, repeat=None) # 'PrefetchDataset'
valid_set = csv_reader_dataset(valid_filepaths) # 'PrefetchDataset'
test_set = csv_reader_dataset(test_filepaths) # 'PrefetchDataset'
And now we can simply build and train a Keras model using these datasets. All we need to do is pass the training and validation datasets to the fit() method, instead of X_train, y_train, X_valid, and y_valid:
from tensorflow import keras
import numpy as np
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)
model = keras.models.Sequential([
keras.layers.Dense(30, activation='relu', input_shape=X_train.shape[1:]),
keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer=keras.optimizers.SGD(lr=1e-3))
batch_size=32 #default
model.fit(train_set, steps_per_epoch=len(X_train)//batch_size, #=362
epochs=10,
validation_data=valid_set)
The fit() method will take care of repeating the training dataset. Alternatively, you could call repeat() on the training dataset so that it repeats forever and specify the steps_per_epoch argument when calling the fit() method. This may be useful in some rare cases, for example if you want to use a shuffle buffer that crosses over epochs.
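For completeness, here is a sketch of the simpler option, letting Keras handle the epochs with a finite dataset (finite_train_set is just an illustrative name; csv_reader_dataset() does not repeat by default):
finite_train_set = csv_reader_dataset(train_filepaths)  # repeat=1 by default
model.fit(finite_train_set, epochs=10, validation_data=valid_set)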
Similarly, we can pass a dataset to the evaluate() and predict() methods:
model.evaluate( test_set, steps=len(X_test)//batch_size)
test_set
Note that test_set includes the labels (Keras will simply ignore them when predicting).
# csv_reader_dataset(test_filepaths)
new_set = test_set  # we could just pass test_set; Keras would ignore the labels
X_new = X_test
model.predict(new_set, steps=len(X_new)//batch_size)
new_set
Alternatively (after recreating test_set by re-running the cells starting from train_set = csv_reader_dataset(train_filepaths, repeat=None)), we can build a new dataset without labels by mapping over the test set:
# csv_reader_dataset(test_filepaths)
new_set = test_set.map(lambda X, y: X)  # we could instead just pass test_set; Keras would ignore the labels
X_new = X_test
model.predict(new_set, steps=len(X_new)//batch_size)
The lambda X, y: X simply drops the labels and returns only the features X.
new_set
# X_new = X_test
X_new.shape
Unlike the other sets, the new_set will usually not contain labels (if it does, Keras will ignore them). Note that in all these cases, you can still use NumPy arrays instead of datasets if you want (but of course they need to have been loaded and preprocessed first).
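For example, here is a minimal sketch of predicting from a plain NumPy array; it must be scaled the same way as the training data:
X_new_scaled = (X_test - X_mean) / X_std  # preprocess the NumPy array ourselves
model.predict(X_new_scaled[:3])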
If you want to build your own custom training loop (as in Chapter 12, Custom Models and Training with TensorFlow; see https://blog.csdn.net/Linli522362242/article/details/107459161), you can simply iterate over the training set, very naturally:
# train_filepaths = save_to_multiple_csv_files( train_data, "train", header, n_parts=20 )
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)
model = keras.models.Sequential([
keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),
keras.layers.Dense(1),
])
optimizer = keras.optimizers.Nadam(lr=0.01)
loss_fn = keras.losses.mean_squared_error  # returns one loss per instance

@tf.function
def train(model, n_epochs, batch_size=32,
          n_readers=5, n_read_threads=5, shuffle_buffer_size=10000, n_parse_threads=5):
    train_set = csv_reader_dataset(train_filepaths,
                                   repeat=n_epochs,  # list_files() shuffles the file paths, then repeats them
                                   n_readers=n_readers,  # interleave() reads from cycle_length files at a time
                                   n_read_threads=n_read_threads,
                                   shuffle_buffer_size=shuffle_buffer_size,  # shuffle()
                                   n_parse_threads=n_parse_threads,  # map(preprocess, num_parallel_calls=...)
                                   batch_size=batch_size)  # batch(), then prefetch(1)
    n_steps_per_epoch = len(X_train) // batch_size  # 11610 // 32 = 362 steps per epoch
    total_steps = n_epochs * n_steps_per_epoch      # if n_epochs=5, then total_steps=1810
    global_step = 0
    for X_batch, y_batch in train_set.take(total_steps):  # each step takes one batch of 32 instances
        # progress tracking
        global_step += 1
        if tf.equal(global_step % 100, 0):
            # '\r' moves the cursor back to the start of the current row
            tf.print("\rGlobal step", global_step, "/", total_steps)
        with tf.GradientTape() as tape:
            y_pred = model(X_batch)  # prediction
            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))  # mean loss over the batch
            loss = tf.add_n([main_loss] + model.losses)  # model.losses: one regularization loss per layer (if any)
        # compute the gradient of the loss with regard to each trainable variable
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

train(model, 5)
Congratulations, you now know how to build powerful input pipelines using the Data API! However, so far we have used CSV files, which are common, simple, and convenient but not really efficient, and do not support large or complex data structures (such as images or audio) very well. So let’s see how to use TFRecords instead.
If you are happy with CSV files (or whatever other format you are using), you do not have to use TFRecords. As the saying goes, if it ain’t broke, don’t fix it! TFRecords are useful when the bottleneck during training is loading and parsing the data.
The TFRecord format is TensorFlow’s preferred format for storing large amounts of data and reading it efficiently. It is a very simple binary format that just contains a sequence of binary records of varying sizes (each record consists of a length, a CRC checksum to check that the length was not corrupted, then the actual data, and finally a CRC checksum for the data). You can easily create a TFRecord file using the tf.io.TFRecordWriter class:
A TFRecord file is just a list of binary records. You can create one using a tf.io.TFRecordWriter:
tf.io.TFRecordWriter(
path, options=None
)
with tf.io.TFRecordWriter("my_data.tfrecord") as f:
    f.write(b"This is the first record")
    f.write(b"This is the second record")
And you can then use a tf.data.TFRecordDataset to read one or more TFRecord files:
tf.data.TFRecordDataset(
filenames, compression_type=None, buffer_size=None, num_parallel_reads=None
)
filepaths = ["my_data.tfrecord"]
dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
print(item)
This will output the two records as byte-string tensors.
By default, a TFRecordDataset will read files one by one, but you can make it read multiple files in parallel and interleave their records by setting num_parallel_reads. Alternatively, you could obtain the same result by using list_files() and interleave() as we did earlier to read multiple CSV files.
filepaths = ["my_test_{}.tfrecord".format(i) for i in range(5)]
# ['my_test_0.tfrecord',
# 'my_test_1.tfrecord',
# 'my_test_2.tfrecord',
# 'my_test_3.tfrecord',
# 'my_test_4.tfrecord']
for i, filepath in enumerate(filepaths):
    with tf.io.TFRecordWriter(filepath) as f:
        for j in range(3):
            f.write("File {} record {}".format(i, j).encode("utf-8"))
You can read multiple TFRecord files with just one TFRecordDataset. By default it will read them one at a time, but if you set num_parallel_reads=3, it will read 3 files at a time in parallel and interleave their records:
dataset = tf.data.TFRecordDataset(filepaths, num_parallel_reads=3)
for item in dataset:
print(item)
Alternatively, you could obtain the same result by using list_files() and interleave() as we did earlier to read multiple CSV files.
dataset = tf.data.Dataset.list_files(filepaths).repeat(1)
dataset = dataset.interleave(lambda filepath: tf.data.TFRecordDataset(filepath),
                             cycle_length=3,
                             num_parallel_calls=3)
for item in dataset:
print(item)
It can sometimes be useful to compress your TFRecord files, especially if they need to be loaded via a network connection. You can create a compressed TFRecord file by setting the options argument:
options = tf.io.TFRecordOptions( compression_type="GZIP")
with tf.io.TFRecordWriter("my_compressed.tfrecord", options) as f:
f.write(b"This the first record")
f.write(b"This the second record")
When reading a compressed TFRecord file, you need to specify the compression type:
dataset = tf.data.TFRecordDataset(["my_compressed.tfrecord"], compression_type="GZIP")
for item in dataset:
print(item)
Even though each record can use any binary format you want, TFRecord files usually contain serialized protocol buffers (also called protobufs). This is a portable, extensible, and efficient binary format developed at Google back in 2001 and made open source in 2008; protobufs are now widely used, in particular in gRPC, Google’s remote procedure call system. They are defined using a simple language that looks like this:
For this section you need to install protobuf: https://developers.google.com/protocol-buffers/docs/downloads. In general you will not have to do so when using TensorFlow, as it comes with functions to create and parse protocol buffers of type tf.train.Example, which are generally sufficient. However, in this section we will learn about protocol buffers by creating our own simple protobuf definition, so we need the protobuf compiler (protoc): we will use it to compile the protobuf definition to a Python module that we can then use in our code.
First let's write a simple protobuf definition:
Each field you specify in a message can be either singular (zero or one value) or repeated (any number of values, in order, marked with the repeated keyword):
%%writefile person.proto
syntax = "proto3";
message Person {
    string name = 1;
    int32 id = 2;
    repeated string email = 3;
}
This definition says we are using version 3 of the protobuf format, and it specifies that each Person object(Since protobuf objects are meant to be serialized and transmitted, they are called messages) may (optionally) have a name of type string, an id of type int32, and zero or more email fields, each of type string. The numbers 1, 2, and 3 are the field identifiers: they will be used in each record’s binary representation. Once you have a definition in a .proto file, you can compile it. This requires protoc, the protobuf compiler, to generate access classes in Python (or some other language).
Note that the protobuf definitions we will use have already been compiled for you, and their Python classes are part of TensorFlow, so you will not need to use protoc. All you need to know is how to use protobuf access classes in Python. To illustrate the basics, let’s look at a simple example that uses the access classes generated for the Person protobuf (the code is explained in the comments):
And let's compile it (the --descriptor_set_out and --include_imports options are only required for the tf.io.decode_proto() example below). This generates the Python module person_pb2.py in the current directory:
!protoc person.proto --python_out=. --descriptor_set_out=person.desc --include_imports
# -*- coding: utf-8 -*-
# Generated by the protocol buffer compiler. DO NOT EDIT!
# source: person.proto
from google.protobuf import descriptor as _descriptor
from google.protobuf import message as _message
from google.protobuf import reflection as _reflection
from google.protobuf import symbol_database as _symbol_database
# @@protoc_insertion_point(imports)
_sym_db = _symbol_database.Default()
DESCRIPTOR = _descriptor.FileDescriptor(
name='person.proto',
package='',
syntax='proto3',
serialized_options=None,
serialized_pb=b'\n\x0cperson.proto\"1\n\x06Person\x12\x0c\n\x04name\x18\x01 \x01(\t\x12\n\n\x02id\x18\x02 \x01(\x05\x12\r\n\x05\x65mail\x18\x03 \x03(\tb\x06proto3'
)
_PERSON = _descriptor.Descriptor(
name='Person',
full_name='Person',
filename=None,
file=DESCRIPTOR,
containing_type=None,
fields=[
_descriptor.FieldDescriptor(
name='name', full_name='Person.name', index=0,
number=1, type=9, cpp_type=9, label=1,
has_default_value=False, default_value=b"".decode('utf-8'),
message_type=None, enum_type=None, containing_type=None,
is_extension=False, extension_scope=None,
serialized_options=None, file=DESCRIPTOR),
_descriptor.FieldDescriptor(
name='id', full_name='Person.id', index=1,
number=2, type=5, cpp_type=1, label=1,
has_default_value=False, default_value=0,
message_type=None, enum_type=None, containing_type=None,
is_extension=False, extension_scope=None,
serialized_options=None, file=DESCRIPTOR),
_descriptor.FieldDescriptor(
name='email', full_name='Person.email', index=2,
number=3, type=9, cpp_type=9, label=3,
has_default_value=False, default_value=[],
message_type=None, enum_type=None, containing_type=None,
is_extension=False, extension_scope=None,
serialized_options=None, file=DESCRIPTOR),
],
extensions=[
],
nested_types=[],
enum_types=[
],
serialized_options=None,
is_extendable=False,
syntax='proto3',
extension_ranges=[],
oneofs=[
],
serialized_start=16,
serialized_end=65,
)
DESCRIPTOR.message_types_by_name['Person'] = _PERSON
_sym_db.RegisterFileDescriptor(DESCRIPTOR)
Person = _reflection.GeneratedProtocolMessageType('Person', (_message.Message,), {
'DESCRIPTOR' : _PERSON,
'__module__' : 'person_pb2'
# @@protoc_insertion_point(class_scope:Person)
})
_sym_db.RegisterMessage(Person)
# @@protoc_insertion_point(module_scope)
!dir person* # !ls person* #note 'ls' is for linux system
from person_pb2 import Person
person = Person(name='Al', id = 123, email=["[email protected]"]) # create a Person
print(person) # display the Person
person.name # read a field
person.name = 'Alice' # modify a field
person.name
person.email[0] # repeated fields can be accessed like arrays
person.email.append('[email protected]') # add an email address
s = person.SerializeToString()  # serialize the object to a byte string
s
person2 = Person()  # create a new Person
person2.ParseFromString(s)  # parse the byte string (27 bytes long)
person == person2  # now they are equal
In short, we import the Person class generated by protoc, we create an instance and play with it, visualizing it and reading and writing some fields, then we serialize it using the SerializeToString() method. This is the binary data that is ready to be
saved or transmitted over the network. When reading or receiving this binary data, we can parse it using the ParseFromString() method, and we get a copy of the object that was serialized.
person2.name, person2.id, person2.email
We could save the serialized Person object to a TFRecord file, then we could load and parse it: everything would work fine. However, SerializeToString() and ParseFromString() are not TensorFlow operations (and neither are the other operations in this code), so they cannot be included in a TensorFlow Function (except by wrapping them in a tf.py_function() operation, which would make the code slower and less portable, as we saw in Chapter 12). Fortunately, TensorFlow does include special protobuf definitions for which it provides parsing operations.
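For illustration only, here is a hedged sketch of what such a tf.py_function() wrapper could look like for our Person protobuf (parse_person() is a made-up helper; as just noted, you will usually prefer TensorFlow's built-in protobufs instead):
def parse_person(serialized):
    # plain Python, not a TF op: parse a Person protobuf from a byte string
    person = Person()
    person.ParseFromString(serialized.numpy())
    return person.name, person.id, person.email[0]

@tf.function
def tf_parse_person(serialized):
    # wrap the Python parser so it can be used inside a TF Function
    return tf.py_function(parse_person, [serialized],
                          [tf.string, tf.int32, tf.string])

tf_parse_person(tf.constant(s))  # s is the serialized Person from above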
The main protobuf typically used in a TFRecord file is the Example protobuf, which represents one instance in a dataset. It contains a list of named features, where each feature can either be a list of byte strings, a list of floats, or a list of integers. Here is the protobuf definition:
Custom protobuf
In rare cases, you may want to parse a custom protobuf (like the one we just created) in TensorFlow. For this you can use the tf.io.decode_proto() function:
person_tf = tf.io.decode_proto(
bytes=s,
message_type="Person",
field_names=["name", "id", "email"],
output_types=[tf.string, tf.int32, tf.string],
descriptor_source="person.desc"
)
person_tf
person_tf.values
Now back to the Example protobuf. As before, the numbers after each field are the field identifiers: they will be used in each record’s binary representation.
syntax = "proto3";
message BytesList { repeated bytes value = 1; }
message FloatList { repeated float value = 1 [packed = true]; }
message Int64List { repeated int64 value = 1 [packed = true]; }
message Feature {
oneof kind {
BytesList bytes_list = 1;
FloatList float_list = 2;
Int64List int64_list = 3;
}
};
message Features { map feature = 1; };
message Example { Features features = 1; };
The definitions of BytesList, FloatList, and Int64List are straightforward enough. Note that [packed = true] is used for repeated numerical fields, for a more efficient encoding. A Feature contains either a BytesList, a FloatList, or an Int64List. A Features (with an s) contains a dictionary that maps a feature name to the corresponding feature value. And finally, an Example contains only a Features object. Here is how you could create a tf.train.Example representing the same person as earlier and write it to a TFRecord file:
Warning: there's currently a bug preventing from tensorflow.train import X, so we work around it by writing X = tf.train.X. See https://github.com/tensorflow/tensorflow/issues/33289 for more details.
from tensorflow.train import BytesList, FloatList, Int64List
from tensorflow.train import Feature, Features, Example
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
in
----> 1 from tensorflow.train import BytesList, FloatList, Int64List
2 from tensorflow.train import Feature, Features, Example
ModuleNotFoundError: No module named 'tensorflow.train'
Solution:
import tensorflow as tf
BytesList = tf.train.BytesList
FloatList = tf.train.FloatList
Int64List = tf.train.Int64List
Feature = tf.train.Feature
Features = tf.train.Features
Example = tf.train.Example
person_example = Example(
    # message Example { Features features = 1; };
    features=Features(
        # message Features { map<string, Feature> feature = 1; };
        feature={
            "name": Feature(bytes_list=BytesList(value=[b"Alice"])),
            "id": Feature(int64_list=Int64List(value=[123])),
            "emails": Feature(bytes_list=BytesList(value=[b"[email protected]",
                                                          b"[email protected]"])),
        }
    )
)
The code is a bit verbose and repetitive, but it’s rather straightforward (and you could easily wrap it inside a small helper function). Now that we have an Example protobuf, we can serialize it by calling its SerializeToString() method, then write the resulting data to a TFRecord file:
with tf.io.TFRecordWriter("my_contacts.tfrecord") as f:
f.write(person_example.SerializeToString())
Normally you would write much more than one Example! Typically, you would create a conversion script that reads from your current format (say, CSV files), creates an Example protobuf for each instance, serializes them, and saves them to several TFRecord files, ideally shuffling them in the process. This requires a bit of work, so once again make sure it is really necessary (perhaps your pipeline works fine with CSV files).
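Here is a hedged sketch of what such a conversion script could look like for the housing data (create_example() and write_tfrecords() are made-up helper names; it reuses the csv_reader_dataset() pipeline from earlier, so the features it writes are already scaled):
def create_example(x, y):
    # one Example per instance: the eight (scaled) features plus the target
    return Example(features=Features(feature={
        "features": Feature(float_list=FloatList(value=x)),
        "target": Feature(float_list=FloatList(value=y)),
    }))

def write_tfrecords(name_prefix, dataset, n_shards=10):
    paths = ["{}_{:03d}.tfrecord".format(name_prefix, index) for index in range(n_shards)]
    writers = [tf.io.TFRecordWriter(path) for path in paths]
    for batch_index, (X_batch, y_batch) in enumerate(dataset):
        for x, y in zip(X_batch.numpy(), y_batch.numpy()):
            # spread the batches over the shards in round-robin fashion
            writers[batch_index % n_shards].write(create_example(x, y).SerializeToString())
    for writer in writers:
        writer.close()
    return paths

train_tfrecord_paths = write_tfrecords("my_train", csv_reader_dataset(train_filepaths))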
Now that we have a nice TFRecord file containing a serialized Example, let’s try to load it.
To load the serialized Example protobufs, we will use a tf.data.TFRecordDataset once again, and we will parse each Example using tf.io.parse_single_example(). This is a TensorFlow operation, so it can be included in a TF Function. It requires at least two arguments: a string scalar tensor containing the serialized data, and a description of each feature. The description is a dictionary that maps each feature name to either a tf.io.FixedLenFeature descriptor indicating the feature’s shape, type, and default value, or a tf.io.VarLenFeature descriptor indicating only the type (if the length of the feature’s list may vary, such as for the "emails" feature).
The following code defines a description dictionary, then it iterates over the TFRecordDataset and parses (based on the description) the serialized Example protobuf this dataset contains:
feature_description = {
# feature’s shape, type, and default value
"name": tf.io.FixedLenFeature([], tf.string, default_value=""),
"id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
# only the type (if the length of the feature’s list may vary)
"emails": tf.io.VarLenFeature(tf.string)
}
for serialized_example in tf.data.TFRecordDataset(["my_contacts.tfrecord"]):
parsed_example = tf.io.parse_single_example(serialized_example,
feature_description)
parsed_example
The fixed-length features("name", "id") are parsed as regular tensors, but the variable-length features("emails") are parsed as sparse tensors. You can convert a sparse tensor to a dense tensor using tf.sparse.to_dense(), but in this case it is simpler to just access its values:
parsed_example['emails'].values[0]
tf.sparse.to_dense(parsed_example["emails"], default_value=b"")
parsed_example["emails"].values
parsed_example
A BytesList can contain any binary data you want, including any serialized object. For example, you can use tf.io.encode_jpeg() to encode an image using the JPEG format and put this binary data in a BytesList. Later, when your code reads the TFRecord, it will start by parsing the Example, then it will need to call tf.io.decode_jpeg() to parse the data and get the original image (or you can use tf.io.decode_image(), which can decode any BMP, GIF, JPEG, or PNG image). You can also store any tensor you want in a BytesList by serializing the tensor using tf.io.serialize_tensor() then putting the resulting byte string in a BytesList feature. Later, when you parse the TFRecord, you can parse this data using tf.io.parse_tensor().
from sklearn.datasets import load_sample_images
img = load_sample_images()['images'][0]
plt.imshow(img)
plt.axis('off')
plt.title('Original Image')
plt.show()
... ...
ImportError: The Python Imaging Library (PIL) is required to load data from jpeg files. Please refer to https://pillow.readthedocs.io/en/stable/installation.html for installing PIL.
Solution: install Pillow (for example with pip install pillow), then restart the kernel in your Jupyter notebook and re-run:
from sklearn.datasets import load_sample_images
import matplotlib.pyplot as plt
img = load_sample_images()['images'][0]
plt.imshow(img)
plt.axis('off')
plt.title('Original Image')
plt.show()
# encode an image using the JPEG format and put this binary data in a BytesList.
data = tf.io.encode_jpeg(img)
# message Example { Features features = 1; };
example_with_image = Example(features=Features(
    # message Features { map<string, Feature> feature = 1; };
    feature={
        # BytesList bytes_list = 1;
        "image": Feature(bytes_list=BytesList(value=[data.numpy()]))
    }))
serialized_example = example_with_image.SerializeToString()
# This is the binary data that is ready to be saved or transmitted over the network.
# then write the resulting data to a TFRecord file
Later, when your code reads the TFRecord, it will start by parsing the Example, then it will need to call tf.io.decode_jpeg() to parse the data and get the original image
feature_description = { 'image': tf.io.VarLenFeature(tf.string) }
example_with_image = tf.io.parse_single_example( serialized_example, feature_description )
decoded_img = tf.io.decode_jpeg( example_with_image['image'].values[0] )
plt.imshow(decoded_img)
plt.title("Decoded Image")
plt.axis("off")
plt.show()
Or use tf.io.decode_image(), which supports BMP, GIF, JPEG, and PNG formats:
decoded_img = tf.io.decode_image( example_with_image['image'].values[0] )
plt.imshow(decoded_img)
plt.title("Decoded Image")
plt.axis("off")
plt.show()
Tensors can be serialized and parsed easily using tf.io.serialize_tensor() and tf.io.parse_tensor():
t = tf.constant([ [0., 1.], [2., 3.], [4., 5.] ])
# store current tensor in a BytesList by serializing the tensor
s = tf.io.serialize_tensor(t)
s
tf.io.parse_tensor(s, out_type=tf.float32)
parsed_example["emails"]
serialized_sparse = tf.io.serialize_sparse( parsed_example["emails"])
serialized_sparse
BytesList(value=serialized_sparse.numpy())
Instead of parsing examples one by one using tf.io.parse_single_example(), you may want to parse them batch by batch using tf.io.parse_example():
feature_description = {
# feature’s shape, type, and default value
"name": tf.io.FixedLenFeature([], tf.string, default_value=""),
"id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
# only the type (if the length of the feature’s list may vary)
"emails": tf.io.VarLenFeature(tf.string)
}
# for serialized_example in tf.data.TFRecordDataset(["my_contacts.tfrecord"]):
# parsed_example = tf.io.parse_single_example(serialized_example,
# feature_description)
# parsed_example
dataset = tf.data.TFRecordDataset(["my_contacts.tfrecord"]).batch(10)
for serialized_examples in dataset:
parsed_examples = tf.io.parse_example(serialized_examples, feature_description)
parsed_examples
parsed_examples['emails'].values
As you can see, the Example protobuf will probably be sufficient for most use cases. However, it may be a bit cumbersome to use when you are dealing with lists of lists.
For example, suppose you want to classify text documents. Each document may be represented as a list of sentences, where each sentence is represented as a list of words. And perhaps each document also has a list of comments, where each comment is represented as a list of words. There may be some contextual data too, such as the document’s author, title, and publication date. TensorFlow’s SequenceExample protobuf is designed for such use cases.
Here is the definition of the SequenceExample protobuf:
syntax = "proto3";
message FeatureList { repeated Feature feature = 1; };
message FeatureLists { map feature_list = 1; };
message SequenceExample {
Features context = 1;
FeatureLists feature_lists = 2;
};
A SequenceExample contains a Features object for the contextual data and a FeatureLists object that contains one or more named FeatureList objects (e.g., a FeatureList named "content" and another named "comments"). Each FeatureList contains a list of Feature objects, each of which may be a list of byte strings, a list of 64-bit integers, or a list of floats (in this example, each Feature would represent a sentence or a comment, perhaps in the form of a list of word identifiers). Building a SequenceExample, serializing it, and parsing it is similar to building, serializing, and parsing an Example, but you must use tf.io.parse_single_sequence_example() to parse a single SequenceExample or tf.io.parse_sequence_example() to parse a batch. Both functions return a tuple containing the context features (as a dictionary) and the feature lists (also as a dictionary). If the feature lists contain sequences of varying sizes (as in the preceding example), you may want to convert them to ragged tensors, using tf.RaggedTensor.from_sparse():
Warning: there's currently a bug preventing from tensorflow.train import X, so we work around it by writing X = tf.train.X. See https://github.com/tensorflow/tensorflow/issues/33289 for more details.
# from tensorflow.train import FeatureList, FeatureLists, SequenceExample
BytesList = tf.train.BytesList
Int64List = tf.train.Int64List
Feature = tf.train.Feature
Features = tf.train.Features
FeatureList = tf.train.FeatureList
FeatureLists = tf.train.FeatureLists
SequenceExample = tf.train.SequenceExample
# Features context = 1;
context = Features(
    # message Features { map<string, Feature> feature = 1; };
    feature={
        "author_id": Feature(int64_list=Int64List(value=[123])),
        "title": Feature(bytes_list=BytesList(value=[b"A", b"desert", b"place", b"."])),
        "pub_date": Feature(int64_list=Int64List(value=[1623, 12, 25])),
    }
)
content = [["When", "shall", "we", "three", "meet", "again", "?"],
["In", "thunder", ",", "lightning", ",", "or", "in", "rain", "?"]
]
comments = [["When", "the", "hurlyburly", "'s", "done", "."],
["When", "the", "battle", "'s", "lost", "and", "won", "."]
]
# message Feature {
#     oneof kind {
#         BytesList bytes_list = 1;
#         FloatList float_list = 2;
#         Int64List int64_list = 3;
#     }
# };
def words_to_feature(words):
    return Feature(bytes_list=BytesList(value=[word.encode("utf-8") for word in words]))

# repeated Feature feature => [feature, feature, ...]: each Feature represents one sentence or one comment
content_features = [words_to_feature(sentence) for sentence in content]
comments_features = [words_to_feature(comment) for comment in comments]
sequence_example = SequenceExample(
    # Features context = 1;
    context=context,
    # FeatureLists feature_lists = 2;
    feature_lists=FeatureLists(
        # message FeatureLists { map<string, FeatureList> feature_list = 1; };
        feature_list={
            # message FeatureList { repeated Feature feature = 1; };
            # each FeatureList contains a list of Feature objects
            "content": FeatureList(feature=content_features),
            "comments": FeatureList(feature=comments_features),
        }
    )
)
sequence_example
serialized_sequence_example = sequence_example.SerializePartialToString()
Now let's parse it, using tf.io.parse_single_sequence_example() (for a batch we would use tf.io.parse_sequence_example() instead). Both functions return a tuple containing the context features and the feature lists, each as a dictionary. If the feature lists contain sequences of varying sizes (as in this example), you may want to convert them to ragged tensors, using tf.RaggedTensor.from_sparse():
context_feature_descriptions = {
"author_id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
"title": tf.io.VarLenFeature(tf.string),
"pub_date": tf.io.FixedLenFeature([3], tf.int64, default_value=[0,0,0]),
}
sequence_feature_descriptions = {
"content": tf.io.VarLenFeature(tf.string),
"comments": tf.io.VarLenFeature(tf.string)
}
parsed_context, parsed_feature_lists = tf.io.parse_single_sequence_example(
serialized_sequence_example, context_feature_descriptions, sequence_feature_descriptions
)
parsed_context
parsed_context['title'].values
parsed_feature_lists
print(
tf.RaggedTensor.from_sparse(parsed_feature_lists['content'])
)
Now that you know how to efficiently store, load, and parse data, the next step is to prepare it so that it can be fed to a neural network.
https://blog.csdn.net/Linli522362242/article/details/107933572