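This post explains how to manually download the mnist.npz dataset on Ubuntu and put it in the location where tensorflow looks for it.
tensorflow version: 2.3
The fix first: put the downloaded file in the
~/.keras/datasets/
directory and name it mnist.npz.
Done.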
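The copy step can also be scripted. The sketch below is a minimal example, assuming mnist.npz has already been downloaded somewhere on the local disk. The function name install_mnist_npz and its cache_dir parameter are made up for this illustration and are not part of tensorflow.

```python
import os
import shutil


def install_mnist_npz(src_path, cache_dir=None):
    """Copy a local mnist.npz into the Keras dataset cache.

    src_path: path to an mnist.npz file that is already on disk
    cache_dir: target directory (hypothetical parameter for this sketch);
               defaults to ~/.keras/datasets
    return -- the path of the copied file
    """
    if cache_dir is None:
        cache_dir = os.path.join(os.path.expanduser('~'), '.keras', 'datasets')
    # Create the cache directory if it does not exist yet.
    os.makedirs(cache_dir, exist_ok=True)
    dst_path = os.path.join(cache_dir, 'mnist.npz')
    shutil.copyfile(src_path, dst_path)
    return dst_path
```

Calling install_mnist_npz('/path/to/mnist.npz') with the real location of the download would place the file in the default cache directory.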
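When loading MNIST with tensorflow, I found that the data has to be downloaded from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
One look at that domain and it was obvious the download was not going to succeed!
So I wondered whether I could reuse the MNIST dataset I had downloaded earlier, and took a look at the source code that loads the MNIST dataset, shown below: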
# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""MNIST handwritten digits dataset.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
from tensorflow.python.keras.utils.data_utils import get_file
from tensorflow.python.util.tf_export import keras_export
@keras_export('keras.datasets.mnist.load_data')
def load_data(path='mnist.npz'):
  """Loads the [MNIST dataset](http://yann.lecun.com/exdb/mnist/).

  This is a dataset of 60,000 28x28 grayscale images of the 10 digits,
  along with a test set of 10,000 images.
  More info can be found at the
  [MNIST homepage](http://yann.lecun.com/exdb/mnist/).

  Arguments:
      path: path where to cache the dataset locally
          (relative to `~/.keras/datasets`).

  Returns:
      Tuple of Numpy arrays: `(x_train, y_train), (x_test, y_test)`.
      **x_train, x_test**: uint8 arrays of grayscale image data with shapes
        (num_samples, 28, 28).
      **y_train, y_test**: uint8 arrays of digit labels (integers in range 0-9)
        with shapes (num_samples,).

  License:
      Yann LeCun and Corinna Cortes hold the copyright of MNIST dataset,
      which is a derivative work from original NIST datasets.
      MNIST dataset is made available under the terms of the
      [Creative Commons Attribution-Share Alike 3.0 license.](
      https://creativecommons.org/licenses/by-sa/3.0/)
  """
  origin_folder = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/'
  path = get_file(
      path,
      origin=origin_folder + 'mnist.npz',
      file_hash=
      '731c5ac602752760c8e48fbffcf8c3b850d9dc2a2aedcf2cc48468fc17b673d1')
  with np.load(path, allow_pickle=True) as f:
    x_train, y_train = f['x_train'], f['y_train']
    x_test, y_test = f['x_test'], f['y_test']

    return (x_train, y_train), (x_test, y_test)
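The docstring makes it clear that the dataset is cached under ~/.keras/datasets. When the data is loaded, if the cache directory already contains the corresponding file and its hash matches, the cached file is loaded; otherwise the file is downloaded from the network.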
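To check in advance whether a local copy will pass that hash test, its digest can be compared with the file_hash value from the source above. The sketch below is a minimal example, assuming the 64-character value is a SHA-256 digest (its length suggests so). The helper names sha256_of_file and matches_expected_hash are made up for this illustration.

```python
import hashlib

# The file_hash value passed to get_file in the load_data source above.
MNIST_NPZ_SHA256 = (
    '731c5ac602752760c8e48fbffcf8c3b850d9dc2a2aedcf2cc48468fc17b673d1')


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    hasher = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_expected_hash(path, expected=MNIST_NPZ_SHA256):
    """Return True if the file's SHA-256 digest equals the expected value."""
    return sha256_of_file(path) == expected
```

If matches_expected_hash returns False for the file in the cache directory, load_data will most likely try to download the file again.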
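That leaves two possible solutions: get hold of a copy of mnist.npz some other way and put it in the cache directory, or build mnist.npz from an MNIST dataset that is already on disk.
I went with the first one, but just in case (say, after searching online for ages and still not finding the file), I wrote some code that generates mnist.npz from an existing MNIST dataset: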
import struct
import os
import numpy as np


def load_images(filename):
    """load images
    filename: the name of the file containing data
    return -- a uint8 array of images with shape (num, rows, columns)
    """
    with open(filename, 'rb') as f:
        data = f.read()
    # Header: magic number, image count, rows, columns (big-endian int32).
    magic, num, rows, columns = struct.unpack('>iiii', data[:16])
    dimension = rows * columns
    X = np.zeros((num, rows, columns), dtype='uint8')
    offset = 16
    # Pixel data follows the 16-byte header, one image after another.
    for i in range(num):
        a = np.frombuffer(data, dtype=np.uint8, count=dimension, offset=offset)
        X[i] = a.reshape((rows, columns))
        offset += dimension
    return X


def load_labels(filename):
    """load labels
    filename: the name of the file containing data
    return -- a row vector containing labels
    """
    with open(filename, 'rb') as f:
        data = f.read()
    # Header: magic number and label count (big-endian int32).
    magic, num = struct.unpack('>ii', data[:8])
    d = np.frombuffer(data, dtype=np.uint8, count=num, offset=8)
    return d


def load_data(foldername):
    """Load the MNIST dataset.
    foldername: the name of the folder containing datasets
    return -- train_X, train_y, test_X, test_y
    train_X: training images
    train_y: labels for the training images
    test_X: test images
    test_y: labels for the test images
    """
    # filenames of datasets
    train_X_name = "train-images-idx3-ubyte"
    train_y_name = "train-labels-idx1-ubyte"
    test_X_name = "t10k-images-idx3-ubyte"
    test_y_name = "t10k-labels-idx1-ubyte"
    train_X = load_images(os.path.join(foldername, train_X_name))
    train_y = load_labels(os.path.join(foldername, train_y_name))
    test_X = load_images(os.path.join(foldername, test_X_name))
    test_y = load_labels(os.path.join(foldername, test_y_name))
    return train_X, train_y, test_X, test_y


def build_mnist_npz(src_folder, dst_folder):
    """
    src_folder: directory containing the MNIST dataset files
    dst_folder: directory where the output file mnist.npz is written
    """
    train_X, train_y, test_X, test_y = load_data(src_folder)
    with open(os.path.join(dst_folder, 'mnist.npz'), 'wb') as f:
        np.savez_compressed(f, x_test=test_X, x_train=train_X,
                            y_train=train_y, y_test=test_y)


# The four MNIST files are stored in mnist-data
build_mnist_npz('./mnist-data', '/tmp')
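One caveat: load_data passes a fixed file_hash to get_file, and an archive built locally will most likely not produce the same hash, so it may be treated as outdated and downloaded again. In that case the generated file can be read directly with np.load instead. The sketch below is a minimal example, assuming the archive uses the same four keys as above; the function name load_mnist_npz is made up for this illustration.

```python
import numpy as np


def load_mnist_npz(path):
    """Load MNIST arrays straight from an .npz archive, skipping get_file.

    path: path to an archive with keys x_train, y_train, x_test, y_test
    return -- (x_train, y_train), (x_test, y_test)
    """
    with np.load(path) as f:
        x_train, y_train = f['x_train'], f['y_train']
        x_test, y_test = f['x_test'], f['y_test']
    return (x_train, y_train), (x_test, y_test)
```

For the file written by build_mnist_npz above, the call would be load_mnist_npz('/tmp/mnist.npz'), and the result has the same structure as the return value of tensorflow's load_data.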