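Manually downloading the mnist.npz dataset for TensorFlow 2.3

Introduction

This post explains how to manually download the mnist.npz dataset on Ubuntu and put it where TensorFlow expects to find it.
TensorFlow version: 2.3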

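Solution

The solution first:

  1. Download the file. Baidu Netdisk link: https://pan.baidu.com/s/1jH6uFFC, password: dw3d. This is not my own share; thanks to the author of the blog post it came from for both the post and the link.
  2. Once the download finishes, put the file in the ~/.keras/datasets/ directory and name it mnist.npz.

Done.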
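
Step 2 can also be scripted. The following is a minimal sketch, assuming the downloaded file sits somewhere on local disk; the function name `install_mnist` and the example path in the comment are hypothetical and only illustrate the copy into the cache directory.

```python
import shutil
from pathlib import Path


def install_mnist(downloaded_file, cache_dir=None):
    """Copy a manually downloaded file into the Keras cache as mnist.npz.

    install_mnist is a hypothetical helper, not part of TensorFlow.
    """
    # Default cache location, as stated in the load_data docstring.
    if cache_dir is None:
        cache_dir = Path.home() / ".keras" / "datasets"
    cache_dir = Path(cache_dir)
    # Create the directory if TensorFlow has not created it yet.
    cache_dir.mkdir(parents=True, exist_ok=True)

    target = cache_dir / "mnist.npz"
    shutil.copyfile(Path(downloaded_file).expanduser(), target)
    return target


# Example call (the source path is an assumption):
# install_mnist("~/Downloads/mnist.npz")
```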

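How I got there

While loading MNIST with TensorFlow, I found that it has to download the file from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz. One look at that domain told me the download would never succeed from my network!
So I wondered whether I could reuse the MNIST dataset I had already downloaded, and took a look at the source code that loads MNIST: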

# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""MNIST handwritten digits dataset.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np

from tensorflow.python.keras.utils.data_utils import get_file
from tensorflow.python.util.tf_export import keras_export


@keras_export('keras.datasets.mnist.load_data')
def load_data(path='mnist.npz'):
  """Loads the [MNIST dataset](http://yann.lecun.com/exdb/mnist/).

  This is a dataset of 60,000 28x28 grayscale images of the 10 digits,
  along with a test set of 10,000 images.
  More info can be found at the
  [MNIST homepage](http://yann.lecun.com/exdb/mnist/).


  Arguments:
      path: path where to cache the dataset locally
          (relative to `~/.keras/datasets`).

  Returns:
      Tuple of Numpy arrays: `(x_train, y_train), (x_test, y_test)`.

      **x_train, x_test**: uint8 arrays of grayscale image data with shapes
        (num_samples, 28, 28).

      **y_train, y_test**: uint8 arrays of digit labels (integers in range 0-9)
        with shapes (num_samples,).

  License:
      Yann LeCun and Corinna Cortes hold the copyright of MNIST dataset,
      which is a derivative work from original NIST datasets.
      MNIST dataset is made available under the terms of the
      [Creative Commons Attribution-Share Alike 3.0 license.](
      https://creativecommons.org/licenses/by-sa/3.0/)
  """
  origin_folder = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/'
  path = get_file(
      path,
      origin=origin_folder + 'mnist.npz',
      file_hash=
      '731c5ac602752760c8e48fbffcf8c3b850d9dc2a2aedcf2cc48468fc17b673d1')
  with np.load(path, allow_pickle=True) as f:
    x_train, y_train = f['x_train'], f['y_train']
    x_test, y_test = f['x_test'], f['y_test']

    return (x_train, y_train), (x_test, y_test)

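The docstring makes it clear that the dataset is cached under ~/.keras/datasets. When the data is loaded, if the cache directory already contains the file and its hash matches, the cached file is used; otherwise the file is downloaded from the network.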
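
Before relying on a manually downloaded file, its hash can be compared with the one in the source above. The following is a minimal sketch that assumes the `file_hash` value is a SHA-256 digest (it has 64 hex characters); the helper names are hypothetical, and this is an equivalent check written by hand, not the code `get_file` itself runs.

```python
import hashlib

# Hash copied from the load_data source shown above.
EXPECTED_SHA256 = (
    "731c5ac602752760c8e48fbffcf8c3b850d9dc2a2aedcf2cc48468fc17b673d1"
)


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's SHA-256 digest equals the expected hash."""
    return sha256_of(path) == expected
```

If `matches_expected` returns False for the file in the cache directory, the cached copy will not be accepted and a download will be attempted instead.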

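That leaves two options:

  1. Find an mnist.npz online whose sha256sum equals the hash in the source code, and put it in the cache directory. This needs no code changes.
  2. Build mnist.npz yourself, compute its hash, change the hash in the code to match, and put mnist.npz in the cache directory.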
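
For a self-built file there is also a route that avoids editing the hash in the TensorFlow source: read the archive directly with NumPy, the same way `load_data` does after `get_file` returns. This is an alternative to option 2 rather than what the post describes, and `load_mnist_npz` is a hypothetical name.

```python
import numpy as np


def load_mnist_npz(path):
    """Load an mnist.npz archive directly, skipping download and hash check.

    Returns the same tuple layout as keras.datasets.mnist.load_data.
    """
    with np.load(path) as f:
        x_train, y_train = f["x_train"], f["y_train"]
        x_test, y_test = f["x_test"], f["y_test"]
    return (x_train, y_train), (x_test, y_test)
```

The trade-off is that calling code has to use this function instead of `tf.keras.datasets.mnist.load_data()`.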

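I went with the first option, but just in case (in case the file can no longer be found online), I wrote some code that builds mnist.npz from an existing copy of the MNIST dataset: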

import struct
import os
import numpy as np

def load_images(filename):
    """load images
    filename: the name of the file containing data
    return -- a uint8 array of shape (num, rows, columns), one image per entry
    """
    with open(filename, 'rb') as f:
        data = f.read()

    magic, num, rows, columns = struct.unpack('>iiii',data[:16])

    dimension = rows*columns
    
    X = np.zeros((num,rows,columns), dtype='uint8')

    offset = 16
    for i in range(num):
        a = np.frombuffer(data, dtype=np.uint8, count=dimension, offset=offset)
        X[i] = a.reshape((rows, columns))
        offset += dimension

    return X


def load_labels(filename):
    """load labels
    filename: the name of the file containing data
    return -- a row vector containing labels
    """
    with open(filename,'rb') as f:
        data = f.read()

    magic, num = struct.unpack('>ii', data[:8])

    d = np.frombuffer(data,dtype=np.uint8, count=num, offset=8)

    return d

def load_data(foldername):
    """Load the MNIST dataset.
    foldername: the name of the folder containing datasets
    return -- train_X, train_y, test_X, test_y
    train_X: training images
    train_y: labels for the training images
    test_X: test images
    test_y: labels for the test images
    """

    # filenames of datasets
    train_X_name = "train-images-idx3-ubyte"
    train_y_name = "train-labels-idx1-ubyte"
    test_X_name = "t10k-images-idx3-ubyte"
    test_y_name = "t10k-labels-idx1-ubyte"

    train_X = load_images(os.path.join(foldername,train_X_name))
    train_y = load_labels(os.path.join(foldername,train_y_name))
    test_X = load_images(os.path.join(foldername, test_X_name))
    test_y = load_labels(os.path.join(foldername, test_y_name))

    return train_X, train_y, test_X, test_y


def build_mnist_npz(src_folder, dst_folder):
    """
    src_folder: directory containing the MNIST dataset files
    dst_folder: directory where the output file mnist.npz is written
    """

    train_X, train_y, test_X, test_y = load_data(src_folder)
    with open(os.path.join(dst_folder, 'mnist.npz'), 'wb') as f:
        np.savez_compressed(f, x_test=test_X,x_train=train_X, y_train=train_y,y_test=test_y)

# The four MNIST files are stored in mnist-data
build_mnist_npz('./mnist-data', '/tmp')
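
A generated file can be sanity-checked before it goes into the cache directory. The following is a minimal sketch; `check_mnist_npz` and `EXPECTED_SHAPES` are hypothetical names, and the expected shapes are taken from the `load_data` docstring (60,000 training and 10,000 test images of 28x28 pixels, all uint8).

```python
import numpy as np

# Array names and shapes as described in the load_data docstring.
EXPECTED_SHAPES = {
    "x_train": (60000, 28, 28),
    "y_train": (60000,),
    "x_test": (10000, 28, 28),
    "y_test": (10000,),
}


def check_mnist_npz(path, expected=EXPECTED_SHAPES):
    """Return a list of problems found in an npz file (empty if none)."""
    problems = []
    with np.load(path) as f:
        for key, shape in expected.items():
            if key not in f.files:
                problems.append(f"missing array: {key}")
                continue
            arr = f[key]
            if arr.shape != shape:
                problems.append(f"{key}: shape {arr.shape}, expected {shape}")
            if arr.dtype != np.uint8:
                problems.append(f"{key}: dtype {arr.dtype}, expected uint8")
    return problems


# Example call, using the output path from the script above:
# print(check_mnist_npz('/tmp/mnist.npz'))
```

An empty list only says the arrays have the right names, shapes and dtypes; it does not mean the file's hash matches the one in the TensorFlow source.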
