Image retrieval has been a very active and fast-moving research area over the past decade, the best-known systems being Google Image Search and Pinterest Visual Pin Search. In this article, we will build a very simple image retrieval system using a special type of neural network called an autoencoder. We will do it in an unsupervised way, i.e. without ever looking at the image labels; in fact, we will retrieve images using only their visual content (textures, shapes, ...). This type of image retrieval is called content-based image retrieval (CBIR), as opposed to keyword- or text-based image retrieval.
In this article, we will use images of handwritten digits from the MNIST dataset and the Keras deep learning framework.
In a nutshell, an autoencoder is a neural network that aims to copy its input to its output. It works by compressing the input into a latent-space representation and then reconstructing the output from that representation. Such a network is composed of two parts:
Encoder: the part of the network that compresses the input into a latent-space representation. It can be represented by an encoding function $h = f(x)$.
Decoder: the part that aims to reconstruct the input from the latent-space representation. It can be represented by a decoding function $r = g(h)$.
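Training the autoencoder therefore amounts to minimizing a reconstruction loss that measures how far the reconstruction $g(f(x))$ is from the original input $x$. Since we compile the model with a mean squared error loss below, the per-image objective is simply:

$$L(x) = \| x - g(f(x)) \|^2$$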
If you want to know more about autoencoders, I suggest reading Deep inside: Autoencoders.
This latent representation, or code, is what we are interested in, as it is the way the neural network has found to compress the visual content of each image. It means that all similar images will (hopefully) be encoded in a similar way.
There are several types of autoencoders, but since we are dealing with images, the most effective choice is the convolutional autoencoder, which uses convolutional layers to encode and decode the images.
from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model

input_img = Input(shape=(28, 28, 1))

# Encoder: compress the 28x28x1 image down to a 4x4x8 code
x = Conv2D(16, (3, 3), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same', name='encoder')(x)

# Decoder: reconstruct the 28x28x1 image from the code
x = Conv2D(8, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
x = Conv2D(16, (3, 3), activation='relu')(x)  # 'valid' padding shrinks 16x16 to 14x14, so the last upsampling yields 28x28
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)

autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
So the first step is to train our autoencoder on our training set, so that it learns how to encode images into latent-space representations.
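In Keras this is a single fit call, training the network to reproduce its own input (shown here with the same hyperparameters as in the notebook below; more epochs would likely give better reconstructions):

autoencoder.fit(X_train, X_train, epochs=2, batch_size=32, validation_data=(X_test, X_test))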
Once the training is done, we only need the encoding part of the network.
encoder = Model(inputs=autoencoder.input, outputs=autoencoder.get_layer('encoder').output)
The same encoding must be applied to our search database, i.e. the set of images in which we want to look for images similar to the query image. We can then compare the code of the query with the codes of the database and try to find the closest ones. To perform this comparison, we will use a nearest-neighbors technique.
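Concretely, encoding is just a forward pass through the encoder. A minimal sketch, using the MNIST test set as the search database and query as the query image, as in the notebook below:

codes = encoder.predict(X_test)  # one 4x4x8 code per database image
query_code = encoder.predict(query.reshape(1, 28, 28, 1))  # code of the query image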
The way we retrieve the closest codes is by running a nearest-neighbors algorithm. The principle behind nearest-neighbors methods is to find a predefined number of samples closest in distance to a new point. The distance can be any metric, but the most common choice is the Euclidean distance. For a query image $q$ and a sample $s$, both of dimension $n$, this distance can be computed with the following formula:

$$d(q, s) = \sqrt{\sum_{i=1}^{n} (q_i - s_i)^2}$$
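For two flattened code vectors, this is just the L2 norm of their difference, which we can sanity-check with NumPy (q and s being hypothetical 1-D code arrays):

import numpy as np
d = np.linalg.norm(q - s)  # Euclidean distance between the two codes

In practice, we let scikit-learn's NearestNeighbors handle these computations: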
from sklearn.neighbors import NearestNeighbors
# Flatten the 4x4x8 codes into 128-dimensional vectors, as scikit-learn expects 2D arrays
codes = codes.reshape(-1, 4*4*8)
query_code = query_code.reshape(1, 4*4*8)
# Fit the NN algorithm to the encoded test set
nbrs = NearestNeighbors(n_neighbors=5).fit(codes)
# Find the closest images to the encoded query image
distances, indices = nbrs.kneighbors(query_code)
And these are the images we retrieved. Looking great! All of the retrieved images are very similar to our query image, and they all correspond to the same digit. It shows that, even without ever being shown the corresponding labels, the autoencoder found a way to encode similar images in a very similar fashion.
Unsupervised Image Retrieval
Import the libraries
In [1]:
import numpy as np
from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model
from keras.datasets import mnist
import matplotlib.pyplot as plt
Using TensorFlow backend.
Load the data
In [2]:
(X_train, _), (X_test, _) = mnist.load_data()
Normalize the data
In [3]:
X_train = X_train.astype('float32') / 255.
X_test = X_test.astype('float32') / 255.
Reshape the data to have 1 channel
In [4]:
print(X_train.shape, X_test.shape)
(60000, 28, 28) (10000, 28, 28)
In [5]:
X_train = np.reshape(X_train, (-1, 28, 28, 1))
X_test = np.reshape(X_test, (-1, 28, 28, 1))
In [6]:
print(X_train.shape, X_test.shape)
(60000, 28, 28, 1) (10000, 28, 28, 1)
Create the autoencoder
In [7]:
input_img = Input(shape=(28,28,1))
x = Conv2D(16,(3,3), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2,2), padding='same')(x)
x = Conv2D(8,(3,3), activation='relu', padding='same')(x)
x = MaxPooling2D((2,2), padding='same')(x)
x = Conv2D(8,(3,3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2,2), padding='same', name='encoder')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
x = Conv2D(16, (3, 3), activation='relu')(x)
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)
autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
Train it
In [8]:
autoencoder.fit(X_train, X_train, epochs=2, batch_size=32, validation_data=(X_test, X_test))
Train on 60000 samples, validate on 10000 samples
Epoch 1/2
60000/60000 [==============================] - 85s 1ms/step - loss: 0.1125 - val_loss: 0.1140
Epoch 2/2
60000/60000 [==============================] - 77s 1ms/step - loss: 0.1120 - val_loss: 0.1140
Out[8]:
<keras.callbacks.History at 0x12a58a908>
In [65]:
autoencoder.save('autoencoder.h5')
In [9]:
autoencoder.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 28, 28, 1) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 28, 28, 16) 160
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 14, 14, 16) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 14, 14, 8) 1160
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 7, 7, 8) 0
_________________________________________________________________
conv2d_3 (Conv2D) (None, 7, 7, 8) 584
_________________________________________________________________
encoder (MaxPooling2D) (None, 4, 4, 8) 0
_________________________________________________________________
conv2d_4 (Conv2D) (None, 4, 4, 8) 584
_________________________________________________________________
up_sampling2d_1 (UpSampling2 (None, 8, 8, 8) 0
_________________________________________________________________
conv2d_5 (Conv2D) (None, 8, 8, 8) 584
_________________________________________________________________
up_sampling2d_2 (UpSampling2 (None, 16, 16, 8) 0
_________________________________________________________________
conv2d_6 (Conv2D) (None, 14, 14, 16) 1168
_________________________________________________________________
up_sampling2d_3 (UpSampling2 (None, 28, 28, 16) 0
_________________________________________________________________
conv2d_7 (Conv2D) (None, 28, 28, 1) 145
=================================================================
Total params: 4,385
Trainable params: 4,385
Non-trainable params: 0
_________________________________________________________________
Create the encoder part
The encoder part is the first half of the autoencoder, i.e. the part that will encode the input into a latent space representation. In this case, the dimension of this representation is $4 \times 4 \times 8$.
In [10]:
encoder = Model(inputs=autoencoder.input, outputs=autoencoder.get_layer('encoder').output)
In [66]:
encoder.save('encoder.h5')
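Saving both models means the retrieval pipeline can later be rebuilt without retraining; a minimal sketch of reloading the saved encoder:

from keras.models import load_model
encoder = load_model('encoder.h5')  # restores the trained encoder from disk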
Load the query image
We take a query image from the test set
In [11]:
query = X_test[7]
In [12]:
plt.imshow(query.reshape(28,28), cmap='gray')
Out[12]:
<matplotlib.image.AxesImage at 0x139a16320>
Encode the test images and the query image
In [13]:
X_test.shape
Out[13]:
(10000, 28, 28, 1)
We remove the query image from the test set (the set in which we will search for close images)
In [55]:
X_test = np.delete(X_test, 7, axis=0)
In [33]:
X_test.shape
Out[33]:
(9999, 28, 28, 1)
Encode the query image and the test set
In [56]:
codes = encoder.predict(X_test)
In [57]:
query_code = encoder.predict(query.reshape(1,28,28,1))
In [58]:
codes.shape
Out[58]:
(9999, 4, 4, 8)
In [59]:
query_code.shape
Out[59]:
(1, 4, 4, 8)
Find the closest images
In [60]:
from sklearn.neighbors import NearestNeighbors
We will find the 5 closest images
In [89]:
n_neigh = 5
In [90]:
codes = codes.reshape(-1, 4*4*8); print(codes.shape)
query_code = query_code.reshape(1, 4*4*8); print(query_code.shape)
(9999, 128)
(1, 128)
Fit the nearest-neighbors model to the encoded test set
In [91]:
nbrs = NearestNeighbors(n_neighbors=n_neigh).fit(codes)
In [92]:
distances, indices = nbrs.kneighbors(np.array(query_code))
In [93]:
closest_images = X_test[indices]
In [97]:
closest_images = closest_images.reshape(-1,28,28,1); print(closest_images.shape)
(5, 28, 28, 1)
Display the query image and the closest images
In [98]:
plt.imshow(query.reshape(28,28), cmap='gray')
Out[98]:
<matplotlib.image.AxesImage at 0x1a436d5ef0>
In [99]:
plt.figure(figsize=(20, 6))
for i in range(n_neigh):
    # display the i-th closest image
    ax = plt.subplot(1, n_neigh, i+1)
    plt.imshow(closest_images[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()