13_Loading & Prep 4_[..., np.newaxis]_ExitStack_walk_file operations_timeit_regex_feature vector embed_Text Toke


13_Loading and Preprocessing Data from multiple CSV with TensorFlow_custom training loop_TFRecord
https://blog.csdn.net/Linli522362242/article/details/107704824

13_Loading and Preprocessing Data from multiple CSV with TensorFlow 2_Feature Columns_TF eXtended
https://blog.csdn.net/Linli522362242/article/details/107933572

13_Loading & Preprocessing Data with TF 3_TF Datasets_images[index, ...,0]_plt images_profiling data
https://blog.csdn.net/Linli522362242/article/details/108039793

7. Data can be preprocessed directly when writing the data files, or within the tf.data pipeline, or in preprocessing layers within your model, or using TF Transform. Can you list a few pros and cons of each option?

Let’s look at the pros and cons of each preprocessing option:

  • If you preprocess the data when creating the data files, the training script will run faster, since it will not have to perform preprocessing on the fly. In some cases, the preprocessed data will also be much smaller than the original data, so you can save some space and speed up downloads. It may also be helpful to materialize the preprocessed data, for example to inspect it or archive it. However, this approach has a few cons. First, it’s not easy to experiment with various preprocessing logics if you need to generate a preprocessed dataset for each variant. Second, if you want to perform data augmentation, you have to materialize many variants of your dataset, which will use a large amount of disk space and take a lot of time to generate. Lastly, the trained model will expect preprocessed data, so you will have to add preprocessing code in your application before it calls the model.
     
  • If the data is preprocessed with the tf.data pipeline, it’s much easier to tweak the preprocessing logic and apply data augmentation. Also, tf.data makes it easy to build highly efficient preprocessing pipelines (e.g., with multithreading and prefetching). However, preprocessing the data this way will slow down training. Moreover, each training instance will be preprocessed once per epoch rather than just once if the data was preprocessed when creating the data files. Lastly, the trained model will still expect preprocessed data.
     
  • If you add preprocessing layers to your model, you will only have to write the preprocessing code once for both training and inference. If your model needs to be deployed to many different platforms, you will not need to write the preprocessing code multiple times. Plus, you will not run the risk of using the wrong preprocessing logic for your model, since it will be part of the model. On the downside, preprocessing the data will slow down training, and each training instance will be preprocessed once per epoch. Moreover, by default the preprocessing operations will run on the GPU for the current batch (you will not benefit from parallel preprocessing on the CPU, and prefetching). Fortunately, the upcoming Keras preprocessing layers should be able to lift the preprocessing operations from the preprocessing layers and run them as part of the tf.data pipeline, so you will benefit from multithreaded execution on the CPU and prefetching.
     
  • Lastly, using TF Transform for preprocessing gives you many of the benefits from the previous options: the preprocessed data is materialized, each instance is preprocessed just once (speeding up training), and preprocessing layers get generated automatically so you only need to write the preprocessing code once. The main drawback is the fact that you need to learn how to use this tool. A minimal code sketch contrasting the tf.data and preprocessing-layer options follows this list.
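To make the tf.data and preprocessing-layer options above concrete, here is a minimal sketch using random stand-in data (not from the book's solution):

import numpy as np
import tensorflow as tf
from tensorflow import keras

images = np.random.randint(0, 256, size=(1000, 28, 28), dtype=np.uint8)  # stand-in data
labels = np.random.randint(0, 10, size=(1000,))

# Option A: preprocess inside the tf.data pipeline (easy to tweak, runs on the CPU)
def scale(image, label):
    return tf.cast(image, tf.float32) / 255., label

dataset = tf.data.Dataset.from_tensor_slices((images, labels)).map(scale).batch(32).prefetch(1)

# Option B: ship the preprocessing with the model (written once for both training and inference)
model = keras.models.Sequential([
    keras.layers.Lambda(lambda X: tf.cast(X, tf.float32) / 255., input_shape=[28, 28]),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax")
])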

8. Name a few common techniques you can use to encode categorical features. What about text?

Let’s look at how to encode categorical features and text:

  • To encode a categorical feature that has a natural order, such as a movie rating (e.g., “bad,” “average,” “good”), the simplest option is to use ordinal encoding: sort the categories in their natural order and map each category to its rank (e.g., “bad” maps to 0, “average” maps to 1, and “good” maps to 2). However, most categorical features don’t have such a natural order. For example, there’s no natural order for professions or countries. In this case, you can use one-hot encoding or, if there are many categories, embeddings.
     
  • For text, one option is to use a bag-of-words representation: a sentence is represented by a vector of counts for each possible word. Since common words are usually not very important, you’ll want to use TF-IDF (Term Frequency × Inverse Document Frequency) to reduce their weight. Instead of counting words, it is also common to count n-grams, which are sequences of n consecutive words, nice and simple. Alternatively, you can encode each word using word embeddings, possibly pretrained. Rather than encoding words, it is also possible to encode each letter, or subword tokens (e.g., splitting “smartest” into “smart” and “est”). These last two options are discussed in Chapter 16. A short sketch of the categorical encodings follows below.
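A minimal sketch of the categorical encodings mentioned above (the tiny vocabularies here are made up purely for illustration):

import tensorflow as tf

# ordinal encoding: map each category to its rank via a lookup table
rating_vocab = tf.constant(["bad", "average", "good"])
init = tf.lookup.KeyValueTensorInitializer(rating_vocab, tf.range(3, dtype=tf.int64))
ordinal_table = tf.lookup.StaticHashTable(init, default_value=-1)
ordinal_table.lookup(tf.constant(["bad", "good", "average"]))  # [0, 2, 1]

# one-hot encoding (or an Embedding layer) for categories with no natural order
country_ids = tf.constant([1, 0, 2])  # indices into some country vocabulary
tf.one_hot(country_ids, depth=3)      # shape (3, 3)
tf.keras.layers.Embedding(input_dim=3, output_dim=2)(country_ids)  # shape (3, 2)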

9. Load the Fashion MNIST dataset (introduced in 10_Introduction to Artificial Neural Networks w Keras_3_FashionMNIST_pydot_sparse_shift(0.)_plt_imgs https://blog.csdn.net/Linli522362242/article/details/106562190); split it into a training set, a validation set, and a test set;

from tensorflow import keras
import numpy as np
import tensorflow as tf

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
# len(y_train_full), len(y_test) ==> (60000, 10000)
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

shuffle the training set; and save each dataset to multiple TFRecord files (numpy.ndarray ==> tf.data.Dataset.from_tensor_slices() ==> serialized Example ==> example.SerializeToString()). Each record should be a serialized Example protobuf with two features: the serialized image (use tf.io.serialize_tensor() to serialize each image, stored in a BytesList), and the label (stored in an Int64List).

     Note: for large images, you could use tf.io.encode_jpeg() instead. This would save a lot of space, but it would lose a bit of image quality.
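     For reference, a quick round-trip sketch of both options on a single image (the variable names here are just for illustration; both APIs are used again further down):

# lossless: tf.io.serialize_tensor() keeps the exact uint8 values, dtype and shape
image = X_train[0]                                      # (28, 28) uint8
image_data = tf.io.serialize_tensor(image)
restored = tf.io.parse_tensor(image_data, out_type=tf.uint8)

# lossy but much smaller: JPEG needs a channel dimension (height, width, channels)
jpeg_data = tf.io.encode_jpeg(image[..., np.newaxis])   # (28, 28, 1)
decoded = tf.io.decode_jpeg(jpeg_data)                  # approximately equal to the original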

# type(X_train) ==> numpy.ndarray

# from_tensors: combines the input and returns a dataset with a single element 
# t = tf.constant([[1, 2], [3, 4]])
# ds = tf.data.Dataset.from_tensors(t)   ==> [[1, 2], [3, 4]]

# from_tensor_slices: creates a dataset with a separate element for each row of the input tensor
# t = tf.constant([[1, 2], [3, 4]])
# ds = tf.data.Dataset.from_tensor_slices(t)   # [1, 2], [3, 4]
train_set = tf.data.Dataset.from_tensor_slices( (X_train, y_train) ).shuffle( buffer_size=len(X_train) )
valid_set = tf.data.Dataset.from_tensor_slices( (X_valid, y_valid) )
test_set = tf.data.Dataset.from_tensor_slices( (X_test, y_test) )
X_valid[0].shape

 

X_valid[0]

... ... 


valid_set.take(1)

 

BytesList = tf.train.BytesList
Int64List = tf.train.Int64List

Feature = tf.train.Feature
Features = tf.train.Features
Example = tf.train.Example


def create_example(image, label):
    image_data = tf.io.serialize_tensor(image)
    # encode an image using the JPEG format and put this binary data in a BytesList.
    # image_data = tf.io.encode_jpeg( image[..., np.newaxis] ) # OR image[:,:, np.newaxis].shape ==> (28, 28, 1)
    print('image_data:',image_data)    # can be removed
    print('------------------------')  # can be removed
    return Example(
        features = Features(
            feature = {
                'image': Feature( bytes_list=BytesList( value=[image_data.numpy()] ) ),
                'label': Feature( int64_list=Int64List( value=[label.numpy()] ) )
            }
        )
    )

for image, label in valid_set.take(1):
    print( create_example(image, label) )

After image_data = tf.io.serialize_tensor(image) 

After bytes_list=BytesList( value=[image_data.numpy()] )


     The following function saves a given dataset to a set of TFRecord files. The examples are written to the files in a round-robin fashion. To do this, we enumerate all the examples using the dataset.enumerate() method, and we compute index % n_shards to decide which file to write to. We use the standard contextlib.ExitStack class to make sure that all writers are properly closed whether or not an I/O error occurs while writing.

from contextlib import ExitStack

def write_tfrecords(name, dataset, n_shards=10):
    paths=[ "{}.tfrecord-{:05d}-of-{:05d}".format(name, index, n_shards) 
           for index in range(n_shards) ] 
    
    with ExitStack() as stack:
        writers = [stack.enter_context(tf.io.TFRecordWriter(path))
                   for path in paths]
        
        for index, (image, label) in dataset.enumerate():
            shard = index % n_shards
            example = create_example(image, label)
            writers[shard].write(example.SerializeToString())
    return paths

train_filepaths = write_tfrecords("my_fashion_mnist.train", train_set)
valid_filepaths = write_tfrecords("my_fashion_mnist.valid", valid_set)
test_filepaths = write_tfrecords("my_fashion_mnist.test", test_set)


train_filepaths


Then use tf.data to create an efficient dataset for each set.

def preprocess(tfrecord):
    
    feature_descriptions = {
        "image": tf.io.FixedLenFeature([], tf.string, default_value=""), #OR tf.io.VarLenFeature(tf.string)
        "label": tf.io.FixedLenFeature([], tf.int64, default_value=-1)
    }
    example = tf.io.parse_single_example(tfrecord, feature_descriptions)
    
    # out_type=tf.uint8 since the pixel values are 8-bit (0~255) and 2^8 = 256 is enough
    image = tf.io.parse_tensor(example["image"], out_type=tf.uint8) # since image_data = tf.io.serialize_tensor(image)
    # image = tf.io.decode_jpeg(example["image"]) # since image_data = tf.io.encode_jpeg( image[..., np.newaxis] )
    
    image = tf.reshape(image, shape=[28,28]) # <== image[:,:, np.newaxis].shape ==> (28, 28, 1)
    return image, example["label"]

#read data from tfrecords files    
def mnist_dataset(filepaths, 
                  n_read_threads=5, # num_parallel_reads
                  shuffle_buffer_size=None,
                  n_parse_threads=5, 
                  batch_size=5, 
                  cache=True):
    
    dataset = tf.data.TFRecordDataset(filepaths, num_parallel_reads=n_read_threads)
    
    if cache:
        dataset = dataset.cache()
        
    if shuffle_buffer_size:
        dataset = dataset.shuffle(shuffle_buffer_size)
    
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads) #for each record
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(1)

train_set = mnist_dataset(train_filepaths, shuffle_buffer_size=60000)
valid_set = mnist_dataset(valid_filepaths)
test_set = mnist_dataset(test_filepaths)
train_set.take(1)


 

################################################################ VS tensorflow_datasets

import tensorflow_datasets as tfds

fashion_mnist_bldr=tfds.builder('fashion_mnist')

# Download the data, prepare it, and write it to disk
fashion_mnist_bldr.download_and_prepare()

# Load data from disk as tf.data.Datasets
fashion_mnist_datasets = fashion_mnist_bldr.as_dataset(shuffle_files=False)

fashion_mnist_datasets.keys()


C:\Users\LlQ\tensorflow_datasets\fashion_mnist\3.0.1\dataset_info.json

{
  "citation": "@article{DBLP:journals/corr/abs-1708-07747,\n  author    = {Han Xiao and\n               Kashif Rasul and\n               Roland Vollgraf},\n  title     = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning\n               Algorithms},\n  journal   = {CoRR},\n  volume    = {abs/1708.07747},\n  year      = {2017},\n  url       = {http://arxiv.org/abs/1708.07747},\n  archivePrefix = {arXiv},\n  eprint    = {1708.07747},\n  timestamp = {Mon, 13 Aug 2018 16:47:27 +0200},\n  biburl    = {https://dblp.org/rec/bib/journals/corr/abs-1708-07747},\n  bibsource = {dblp computer science bibliography, https://dblp.org}\n}",
  "description": "Fashion-MNIST is a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.",
  "downloadSize": "30878645",
  "location": {
    "urls": [
      "https://github.com/zalandoresearch/fashion-mnist"
    ]
  },
  "name": "fashion_mnist",
  "splits": [
    {
      "name": "test",
      "numBytes": "5470824",
      "shardLengths": [
        "10000"
      ]
    },
    {
      "name": "train",
      "numBytes": "32717636",
      "shardLengths": [
        "60000"
      ]
    }
  ],
  "supervisedKeys": {
    "input": "image",
    "output": "label"
  },
  "version": "3.0.1"
}

C:\Users\LlQ\tensorflow_datasets\fashion_mnist\3.0.1\image.image.json

{"encoding_format": "png", "shape": [28, 28, 1]}
import tensorflow as tf
fashion_mnist__train = fashion_mnist_datasets['train']
assert isinstance(fashion_mnist__train, tf.data.Dataset)

fashion_mnist__train.take(1)

fashion_mnist__train_set = mnist_dataset(r'C:\Users\LlQ\tensorflow_datasets\fashion_mnist\3.0.1\fashion_mnist-train.tfrecord-00000-of-00001')
fashion_mnist__train_set.take(1)
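The TFDS dataset yields dictionaries with "image" and "label" keys (see supervisedKeys in dataset_info.json above), so to get (image, label) batches comparable to the ones built earlier you could, for example, map it yourself. Note that the raw TFDS .tfrecord files store PNG-encoded images (see image.image.json above), so they are not interchangeable with the files written by write_tfrecords() earlier:

ds = fashion_mnist__train.map(lambda item: (item["image"], item["label"])) # image: (28, 28, 1) uint8
ds = ds.shuffle(10000).batch(32).prefetch(1)
# or simply: fashion_mnist_bldr.as_dataset(as_supervised=True) to get (image, label) tuples directly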


################################################################

import matplotlib.pyplot as plt

for X,y in train_set.take(1):
    for i in range(5):
        plt.subplot(1,5, i+1)
        plt.imshow(X[i].numpy(), cmap="binary")
        plt.axis("off")
        plt.title(str(y[i].numpy()))

     Finally, use a Keras model to train on these datasets, including a preprocessing layer to standardize each input feature. Try to make the input pipeline as efficient as possible, using TensorBoard to visualize profiling data.

keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

class Standardization(keras.layers.Layer):
    def adapt(self, data_sample):
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)
        
    def call(self, inputs):
        return (inputs-self.means_) / (self.stds_ + keras.backend.epsilon())
    
standardization = Standardization(input_shape=[28,28])
sample_image_batches = train_set.map(lambda image, label: image)
sample_images = np.concatenate(list(sample_image_batches.as_numpy_iterator()), 
                               axis=0).astype(np.float32)
# sample_images contains all 55,000 training images (train_set uses mnist_dataset's default batch_size of 5)
standardization.adapt(sample_images)

model = keras.models.Sequential([
    standardization,
    keras.layers.Flatten(), #1D
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer='nadam', 
              metrics=["accuracy"])
from datetime import datetime
import os
import math

logs = os.path.join(os.curdir, "my_logs", "run_"+datetime.now().strftime("%Y%m%d_%H%M%S"))

tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir=logs,
                                                histogram_freq=1, 
                                                profile_batch=10)

batch_size=5 # must match the batch_size used in mnist_dataset() above (its default here is 5)
model.fit(train_set, steps_per_epoch=np.ceil(len(X_train)/batch_size),
          epochs=5, 
          validation_data=valid_set, callbacks=[tensorboard_cb])


Warning: The profiling tab in TensorBoard works if you use TensorFlow 2.2+. You also need to make sure tensorboard_plugin_profile is installed (and restart Jupyter if necessary).
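If the plugin is missing, it can be installed from the notebook before launching TensorBoard (adjust to your environment):

!pip install -U tensorboard_plugin_profile  # then restart Jupyter if necessary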

%load_ext tensorboard
%tensorboard --logdir=./my_logs --port=6006

Reason: no GPU is available on this machine (confirmed by the check below).

device_name = tf.test.gpu_device_name()
if not device_name:
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))


################################################################################

################################################################################
10. In this exercise you will download a dataset, split it, create a tf.data.Dataset to load it and preprocess it efficiently, then build and train a binary classification model containing an Embedding layer:

  1. a. Download the Large Movie Review Dataset http://ai.stanford.edu/~amaas/data/sentiment/, which contains 50,000 movie reviews from the Internet Movie Database. The data is organized in two directories, train and test, each containing a pos subdirectory with 12,500 positive reviews and a neg subdirectory with 12,500 negative reviews. Each review is stored in a separate text file. There are other files and folders (including preprocessed bag-of-words), but we will ignore them in this exercise.
    from pathlib import Path
    from tensorflow import keras
    
    DOWNLOAD_ROOT = "http://ai.stanford.edu/~amaas/data/sentiment/"
    FILENAME = "aclImdb_v1.tar.gz"
    filepath = keras.utils.get_file(FILENAME, 
                                    DOWNLOAD_ROOT + FILENAME, extract=True)
    path = Path(filepath).parent / "aclImdb"
    path


     ==> the output is the local path to the extracted aclImdb directory (under Keras's default cache directory, .keras/datasets).
    https://www.geeksforgeeks.org/os-walk-python/

  2. Unzipped file directory tree
         os.walk() generates the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 3-tuple (dirpath, dirnames, filenames).
         root : the directory currently being walked (starting from the top directory you specified).
         dirs : the sub-directories of root.
         files : the files located directly in root.

    # Driver function 
    import os 
    if __name__ == "__main__": 
        for (root, dirs, files) in os.walk('Test', topdown=True): 
            print(root) 
            print(dirs) 
            print(files) 
            print('--------------------------------')
    


    path.parts

    import os
    
    #name: current directory OR folder
    for name, subdirs, files in os.walk(path):
        indent = len(Path(name).parts) - len(path.parts)
        print("    " *indent + Path(name).parts[-1] + os.sep)
        for index, filename in enumerate( sorted(files) ):
            if index == 3:
                print("    " *(indent+1) + "...")
                break # exit the for loop
            print("    " *(indent+1) + filename)

  3. File access paths

    def review_paths(dirpath):
        return [str(path) for path in dirpath.glob("*.txt")]
    
    train_pos = review_paths(path / "train" / "pos") # WindowsPath('C:/Users/LlQ/.keras/datasets/aclImdb/train/pos')
    train_neg = review_paths(path / "train" / "neg")
    
    test_valid_pos = review_paths(path / "test" / "pos")
    test_valid_neg = review_paths(path / "test" / "neg")
    
    len(train_pos), len(train_neg), len(test_valid_pos), len(test_valid_neg)

  • b. Split the test set into a validation set (15,000) and a test set (10,000).
    import numpy as np
    
    np.random.shuffle(test_valid_pos)
    
    test_pos = test_valid_pos[:5000]
    test_neg = test_valid_neg[:5000]
    
    valid_pos = test_valid_pos[5000:]
    valid_neg = test_valid_neg[5000:]
  • c. Use tf.data to create an efficient dataset for each set.
    Since the dataset fits in memory, we can just load all the data using pure Python code and use tf.data.Dataset.from_tensor_slices(): # reading all the files at once
    import tensorflow as tf
    
    def imdb_dataset(filepaths_positive, filepaths_negative):
        reviews = []
        labels = []
        
        for filepaths, label in ( (filepaths_negative, 0), 
                                  (filepaths_positive, 1) ):
            for filepath in filepaths:
                with open(filepath, encoding="utf8") as review_file:
                    reviews.append( review_file.read() )
                labels.append(label)
                
        return tf.data.Dataset.from_tensor_slices(
                                                    ( tf.constant(reviews), 
                                                      tf.constant(labels) 
                                                    )
                )
    
    for X,y in imdb_dataset(train_pos, train_neg).take(1):
        print(X)
        print(y)
        print()


    %timeit -r1 for X, y in imdb_dataset(train_pos, train_neg).repeat(10): pass

    It takes about 1 minute 26 seconds to load the dataset and go through it 10 times.

  • TFRecords

    But let's pretend the dataset does not fit in memory, just to make things more interesting. Luckily, each review fits on just one line (they use <br /> to indicate line breaks), so we can read the reviews using a TextLineDataset. If they didn't, we would have to preprocess the input files (e.g., converting them to TFRecords, as sketched below). For very large datasets, it would make sense to use a tool like Apache Beam for that. # reading the data file by file, several files in parallel
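    As an aside, a minimal sketch of what that TFRecord conversion could look like, reusing the Example / Features / Feature / BytesList / Int64List aliases defined in the Fashion MNIST part above (the output filename is made up and the file is not used in the rest of this exercise):

    def review_to_example(review_bytes, label):
        return Example(features=Features(feature={
            "review": Feature(bytes_list=BytesList(value=[review_bytes])),
            "label": Feature(int64_list=Int64List(value=[label]))
        }))

    # with tf.io.TFRecordWriter("imdb_train.tfrecord") as writer:
    #     for filepath, label in [(p, 1) for p in train_pos] + [(p, 0) for p in train_neg]:
    #         with open(filepath, "rb") as f:
    #             writer.write(review_to_example(f.read(), label).SerializeToString())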

    def imdb_dataset(filepaths_positive, filepaths_negative, n_read_threads=5):
        dataset_neg = tf.data.TextLineDataset( filepaths_negative, num_parallel_reads=n_read_threads )
        dataset_neg = dataset_neg.map( lambda review: (review, 0) )
        
        dataset_pos = tf.data.TextLineDataset( filepaths_positive, num_parallel_reads=n_read_threads )
        dataset_pos = dataset_pos.map( lambda review: (review, 1) )
        
        return tf.data.Dataset.concatenate( dataset_pos, dataset_neg )
    
    %timeit -r1 for X,y in imdb_dataset(train_pos, train_neg).repeat(10): pass 

    Now it takes about 2 minutes 54 seconds to go through the dataset 10 times. That's much slower, essentially because the dataset is not cached in RAM, so it must be reloaded at each epoch. If you add .cache() just before .repeat(10), you will see that this implementation will be about as fast as the previous one.

    %timeit -r1 for X,y in imdb_dataset(train_pos, train_neg).cache().repeat(10): pass 

    batch_size = 32
    
    train_set = imdb_dataset(train_pos, train_neg).shuffle(25000).batch(batch_size).prefetch(1)
    valid_set = imdb_dataset(valid_pos, valid_neg).batch(batch_size).prefetch(1)
    test_set = imdb_dataset(test_pos, test_neg).batch(batch_size).prefetch(1)
    for X,y in train_set.take(1):
        print(X)# print(len(X)) # 32
        print(y)# print(len(y)) # 32
        print()
  • d. Create a binary classification model, using a TextVectorization layer to preprocess each review. If the TextVectorization layer is not yet available (or if you like a challenge), try to create your own custom preprocessing layer: you can use the functions in the tf.strings package, for example lower() to make everything lowercase, regex_replace() to replace punctuation with spaces, and split() to split words on spaces. You should use a lookup table to output word indices, which must be prepared in the adapt() method.

    Let's first write a function to preprocess the reviews, cropping them to 300 characters, converting them to lowercase, then replacing <br /> tags and all non-letter characters with spaces, splitting the reviews into words, and finally padding or cropping each review so it ends up with exactly n_words tokens:
    #################################################

    https://blog.csdn.net/Linli522362242/article/details/87780336
    https://blog.csdn.net/Linli522362242/article/details/90139705
    https://blog.csdn.net/Linli522362242/article/details/103866244
    <br /> ==> b"<br\\s*/?>"
    all non-letter characters ==> b"[^a-z]"
    X_example = tf.constant(["It's a great, great movie! I loved it.", "It was terrible, run away!!!"])
    tf.shape(X_example)
    tf.shape(X_example)*tf.constant([1,0])
    tf.shape(X_example)*tf.constant([1,0])+ tf.constant([0, 50])

    #################################################
  • tokenizer

  • def preprocess(X_batch, n_words=50):
        shape = tf.shape(X_batch) * tf.constant([1,0]) + tf.constant([0, n_words]) # ==> shape (batch_size, n_words)
        Z = tf.strings.substr(X_batch, 0, 300) # crop each review to its first 300 characters
        
        # <br\s*/?> matches a single <br> tag (optional whitespace, optional "/")
        Z = tf.strings.lower(Z)                              # "\" : an escape character
        Z = tf.strings.regex_replace(Z, b"<br\\s*/?>", b" ") # "\s": white space, "\S": non-white space OR ([^\s]) 
        Z = tf.strings.regex_replace(Z, b"[^a-z]", b" ")     # replace all non-letter characters with spaces
        Z = tf.strings.split(Z) # split the reviews into words
        
        return Z.to_tensor(shape=shape, default_value=b"<pad>")
    
    X_example = tf.constant(["It's a great, great movie! I loved it.", "It was terrible, run away!!!"])
    preprocess(X_example)
    Now let's write a second utility function that will take a data sample with the same format as the output of the preprocess() function, and will output the list of the top max_size most frequent words, ensuring that the padding token is first:
     
  • get_vocabulary

    from collections import Counter
    
    def get_vocabulary( data_sample, max_size=1000):
        preprocess_reviews = preprocess(data_sample).numpy()
        counter = Counter()
        
        for words in preprocess_reviews:
            for word in words:
                if word != b"<pad>":
                    counter[word] += 1
        return [b"<pad>"] + [word for word, count in counter.most_common(max_size)]
    
    get_vocabulary(X_example)

    Now we are ready to create the TextVectorization layer. Its constructor just saves the hyperparameters (max_vocabulary_size and n_oov_buckets). The adapt() method computes the vocabulary using the get_vocabulary() function, then it builds a StaticVocabularyTable (see Chapter 16 for more details). The call() method preprocesses the reviews to get a padded list of words for each review, then it uses the StaticVocabularyTable to lookup the index of each word in the vocabulary:

    self.n_oov_buckets: specifies the number of out-of-vocabulary (oov) buckets. If we look up a category that does not exist in the vocabulary, the lookup table will compute a hash of this category and use it to assign the unknown category to one of the oov buckets.
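    A tiny standalone illustration of oov buckets (a made-up three-word vocabulary with 2 buckets):

    words = tf.constant([b"<pad>", b"great", b"movie"])
    word_ids = tf.range(3, dtype=tf.int64)
    vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
    table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets=2)
    table.lookup(tf.constant([b"great", b"terrible"])) # "terrible" is hashed into one of the oov buckets (id 3 or 4)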

  • TextVectorization

    class TextVectorization(keras.layers.Layer):
        def __init__(self, max_vocabulary_size=1000, n_oov_buckets=100, dtype=tf.string, **kwargs):
            super().__init__(dtype=dtype, **kwargs)
            self.max_vocabulary_size = max_vocabulary_size
            self.n_oov_buckets = n_oov_buckets
        
        def adapt(self, data_sample):
            self.vocab = get_vocabulary(data_sample, self.max_vocabulary_size)
            words = tf.constant(self.vocab)
            word_ids = tf.range( len(self.vocab), dtype=tf.int64 )
            
            # https://blog.csdn.net/Linli522362242/article/details/107933572
            vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
            self.table = tf.lookup.StaticVocabularyTable(vocab_init, self.n_oov_buckets) # builds a StaticVocabularyTable
            
        def call(self, inputs):
            preprocessed_inputs = preprocess(inputs) #reviews ==> words array
            return self.table.lookup(preprocessed_inputs)

    Let's try it on our small X_example we defined earlier:

    text_vectorization = TextVectorization()
    
    text_vectorization.adapt(X_example)
    text_vectorization(X_example)

         Looks good! As you can see, each review was cleaned up and tokenized, then each word was encoded as its index in the vocabulary (all the 0s correspond to the <pad> tokens).

         Now let's create another TextVectorization layer and let's adapt it to the full IMDB training set (if the training set did not fit in RAM, we could just use a smaller sample of the training set by calling train_set.take(500)):

    max_vocabulary_size = 1000
    n_oov_buckets = 100
    
    sample_review_batches = train_set.map(lambda review, label: review)               #tensor: [batch, batch, ...]
    sample_reviews = np.concatenate( list( sample_review_batches.as_numpy_iterator() ),
                                   axis=0)                                            # to one list
    
    text_vectorization = TextVectorization(max_vocabulary_size, n_oov_buckets, input_shape=[])
    text_vectorization.adapt(sample_reviews)
    
    text_vectorization(X_example)

    Good! Now let's take a look at the first 10 words in the vocabulary:

    text_vectorization.vocab[:10]

    These are the most common words in the reviews.

  • bags of words

         Now to build our model we will need to encode all these word IDs somehow. One approach is to create bags of words: for each review, and for each word in the vocabulary, we count the number of occurrences of that word in the review. For example:
    simple_example = tf.constant([ [1,3,1,0,0], 
                                   [2,2,0,0,0] ])
    tf.one_hot(simple_example, 4)


    simple_example = tf.constant([ [1,3,1,0,0], 
                                   [2,2,0,0,0] ])
    tf.reduce_sum( tf.one_hot(simple_example, 4), axis=1)

    array([[2., 2., 0., 1.],                 # two 0s, two 1s, no 2s, one 3
           [3., 0., 2., 0.]], dtype=float32) # three 0s, no 1s, two 2s, no 3s
    The first review has 2 times the word 0, 2 times the word 1, 0 times the word 2, and 1 time the word 3, so its bag-of-words representation is [2, 2, 0, 1]. Similarly, the second review has 3 times the word 0, 0 times the word 1, and so on.

    Let's wrap this logic in a small custom layer, and let's test it. We'll drop the counts for the word 0, since this corresponds to the <pad> token, which we don't care about.
     

    class BagOfWords(keras.layers.Layer):
        def __init__(self, n_tokens, dtype=tf.int32, **kwargs):
            super().__init__(dtype=dtype, **kwargs)
            self.n_tokens = n_tokens
            
        def call(self, inputs):
            one_hot = tf.one_hot(inputs, self.n_tokens)
            return tf.reduce_sum(one_hot, axis=1)[:, 1:] # drop column 0 (the <pad> counts)
        
    # Let's test it:
    bag_of_words = BagOfWords(n_tokens=4)
    bag_of_words( simple_example )

  • It works fine! Now let's create another BagOfWords layer with the right vocabulary size for our training set:
    n_tokens = max_vocabulary_size + n_oov_buckets + 1 # add 1 for <pad>
    bag_of_words = BagOfWords(n_tokens)
    We're ready to train the model!
    keras.backend.clear_session()
    tf.random.set_seed(42)
    np.random.seed(42)
    
    model = keras.models.Sequential([
        text_vectorization, 
        bag_of_words,
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    
    model.compile(loss="binary_crossentropy", optimizer="nadam", metrics=["accuracy"])
    model.fit(train_set, epochs=5, validation_data=valid_set)

    We get about 73.5% accuracy on the validation set after just the first epoch, but after that the model makes no progress. We will do better in Chapter 16. For now the point is just to perform efficient preprocessing using tf.data and Keras preprocessing layers.
  • Embedding

  • e. Add an Embedding layer and compute the mean embedding for each review, multiplied by the square root of the number of words (see Chapter 16). This rescaled mean embedding can then be passed to the rest of your model.

  • To be precise, the sentence embedding (here, one sentence per review) is equal to the mean word embedding multiplied by the square root of the number of words in the sentence. This compensates for the fact that the mean of n vectors gets shorter as n grows.


    To compute the mean embedding for each review, and multiply it by the square root of the number of words in that review, we will need a little function:

    # another_example[review1[ 1_word_dimensions[], 1_word_dimensions[], 1_word_dimensions[] ], 
    #                 review2[ 1_word_dimensions[], 1_word_dimensions[], 1_word_dimensions[] ]
    #                ]                  #d1  d2  d3
    # another_example = tf.constant([ [ [1., 2., 3.], # word
    #                                   [4., 5., 0.], # word
    #                                   [0., 0., 0.]  # word
    #                                 ],# one sentence(1 line) represents one review
    #                                 [ [6., 0., 0.], 
    #                                   [0., 0., 0.], 
    #                                   [0., 0., 0.] 
    #                                 ]
    #                               ])
    # # not_pad = tf.math.count_nonzero(inputs, axis=-1)########################
    # tf.math.count_nonzero(another_example, axis=-1)
    # 
    
    # # n_words = tf.math.count_nonzero(not_pad, axis=-1, keepdims=True)########
    # tf.math.count_nonzero(tf.math.count_nonzero(another_example, axis=-1), 
    #                       axis=-1, 
    #                       keepdims=True)
    # 
    #The first review contains 2 words (the last token is a zero vector,
    #which represents the <pad> token). The second review contains 1 word. 
    
    
    # # sqrt_n_words = tf.math.sqrt( tf.cast(n_words, tf.float32) )##############
    # tf.math.sqrt( tf.cast(tf.math.count_nonzero(tf.math.count_nonzero(another_example, axis=-1), 
    #                       axis=-1, 
    #                       keepdims=True), tf.float32) )
    # 
    
    # # return tf.reduce_mean(inputs, axis=1) * sqrt_n_words ####################
    
    # tf.reduce_mean(another_example, axis=1)
    # 
    
    # 1.4142135*1.6666666 = 2.3570224
    # 1.4142135*2.3333333 = 3.2998314
    
    # tf.reduce_mean(another_example, axis=1) * tf.math.sqrt( tf.cast(tf.math.count_nonzero(tf.math.count_nonzero(another_example, axis=-1), 
    #                                                                                       axis=-1, 
    #                                                                                       keepdims=True), 
    #                                                                 tf.float32) 
    #                                                       )
    
    # 
    
    
    def compute_mean_embedding(inputs):
        not_pad = tf.math.count_nonzero(inputs, axis=-1)
        n_words = tf.math.count_nonzero(not_pad, axis=-1, keepdims=True)
        sqrt_n_words = tf.math.sqrt( tf.cast(n_words, tf.float32) )
        return tf.reduce_mean(inputs, axis=1) * sqrt_n_words
    
    another_example = tf.constant([ [ [1., 2., 3.], 
                                      [4., 5., 0.], 
                                      [0., 0., 0.] 
                                    ],
                                    [ [6., 0., 0.], 
                                      [0., 0., 0.], 
                                      [0., 0., 0.] 
                                    ]
                                  ])
    compute_mean_embedding(another_example)

    Let's check that this is correct. The first review contains 2 words (the last token is a zero vector, which represents the <pad> token). The second review contains 1 word. So we need to compute the mean embedding for each review, and multiply the first one by the square root of 2, and the second one by the square root of 1:
    tf.reduce_mean(another_example, axis=1) * tf.sqrt([[2.], [1.]])

  • Perfect. Now we're ready to train our final model. It's the same as before, except we replaced the BagOfWords layer with an Embedding layer followed by a Lambda layer that calls the compute_mean_embedding() function:

    embedding_size=20
    model = keras.models.Sequential([
        text_vectorization,
        keras.layers.Embedding(input_dim=n_tokens, # n_tokens = max_vocabulary_size + n_oov_buckets + 1 (the +1 is for <pad>)
                               output_dim=embedding_size,
                               mask_zero=True), # <pad> tokens => zero vectors
        keras.layers.Lambda(compute_mean_embedding),
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid")
    ])

     

  • f. Train the model and see what accuracy you get. Try to optimize your pipelines to make training as fast as possible.
     
    model.compile(loss="binary_crossentropy", optimizer="nadam", metrics=["accuracy"])
    model.fit(train_set, epochs=5, validation_data= valid_set)

    The model is not better using embeddings (but we will do better in Chapter 16). The pipeline looks fast enough (we optimized it earlier).

  • g. Use TFDS to load the same dataset more easily: tfds.load("imdb_reviews"). 
    import tensorflow_datasets as tfds
    
    datasets = tfds.load(name="imdb_reviews")
    train_set, test_set = datasets["train"], datasets["test"]
    for example in train_set.take(1):
        print(example["text"])
        print(example["label"])
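    By default each TFDS example is a dictionary with "text" and "label" keys. A short sketch (same batching choices as earlier) using as_supervised=True to get (text, label) tuples directly:

    datasets = tfds.load(name="imdb_reviews", as_supervised=True)
    train_set = datasets["train"].shuffle(25000).batch(32).prefetch(1)
    test_set = datasets["test"].batch(32).prefetch(1)
    for X, y in train_set.take(1):
        print(X[0])
        print(y[0])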
