I’m keen to explore some challenges in multimodal learning, such as jointly learning visual and textual semantics. However, I would rather not start by attempting to train an image recognition system from scratch, and prefer to leave this part to researchers who are more experienced in vision and image analysis.
Therefore, the goal is to use an existing image recognition system to extract useful features for a dataset of images, which can then be used as input to a separate machine learning system or neural network. We start with a directory of images and create a text file containing a feature vector for each image.
Caffe is an open-source neural network library developed at UC Berkeley, with a focus on image recognition. It can be used to construct and train your own network, or to load one of the pretrained models. A web demo is available if you want to test it out.
Follow the installation instructions to compile Caffe. You will need to install quite a few dependencies (Boost, OpenCV, ATLAS, etc.), but at least on Ubuntu 14.04 they were all available in the public repositories.
Once you’re done, run:

```shell
make test
make runtest
```
This will run the tests and make sure the installation is working properly.
Put all the images you want to process into one directory, then generate a file containing the path to each image, one image per line. We will use this file to read the images, and it will help you map images to the correct vectors later.
You can run something like this:

```shell
find `pwd`/images -type f -exec echo {} \; > images.txt
```

This will find all files in the subdirectory called “images” and write their paths to images.txt.
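If you would rather stay in Python, the same list can be generated with os.walk; a minimal sketch (the write_image_list helper name is my own, not from the find command above):

```python
import os

def write_image_list(image_dir, output_path):
    """Walk image_dir recursively and write one absolute file path per line."""
    with open(output_path, 'w') as out:
        for root, _dirs, files in os.walk(image_dir):
            for name in sorted(files):
                out.write(os.path.abspath(os.path.join(root, name)) + '\n')

if __name__ == '__main__':
    write_image_list('images', 'images.txt')
```

Either way, the result is the same images.txt file with one path per line.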
There are a number of pretrained models publicly available for Caffe. Four main models are part of the original Caffe distribution, but more are available on the Model Zoo wiki page, provided by community members and other researchers.
We’ll be using the BVLC GoogLeNet model, which is based on the model described in Going Deeper with Convolutions by Szegedy et al. (2014). It is a 22-layer deep convolutional network, trained on ImageNet data to detect 1,000 different image types. Just for fun, here’s a diagram of the network, rotated 90 degrees:
The Caffe models consist of two parts:

1. a prototxt file, describing the network architecture, and
2. a caffemodel file, containing the trained network parameters.

The prototxt files are small, and they come included with the Caffe code. The parameters, however, are large and need to be downloaded separately. Run the following command in your main Caffe directory to download the parameters for the GoogLeNet model:
```shell
python scripts/download_model_binary.py models/bvlc_googlenet
```
This will find out where to download the caffemodel file, based on information already in the models/bvlc_googlenet/ directory, and will then place it into the same directory.
In addition, run this command:

```shell
./data/ilsvrc12/get_ilsvrc_aux.sh
```
It will download some auxiliary files for the ImageNet dataset, including the file of class labels which we will be using later.
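Each line of synset_words.txt pairs a WordNet synset ID with its human-readable labels, which is what the script below prints for the most probable class. As a rough sketch of how such a line can be split (the parse_synset_line helper is my own, for illustration only):

```python
def parse_synset_line(line):
    """Split a synset_words.txt-style line into (wnid, label), e.g.
    'n01440764 tench, Tinca tinca' -> ('n01440764', 'tench, Tinca tinca')."""
    wnid, _, label = line.strip().partition(' ')
    return wnid, label

# Example lines in the format used by synset_words.txt:
sample = ["n01440764 tench, Tinca tinca", "n11939491 daisy"]
labels = dict(parse_synset_line(l) for l in sample)
```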
Now is the time to load the model into Caffe, process each image, and print a corresponding vector into a file. I created a script for that (see below, also available as a Gist):
```python
import numpy as np
import os, sys, getopt

# Main path to your caffe installation
caffe_root = '/path/to/your/caffe/'

# Model prototxt file
model_prototxt = caffe_root + 'models/bvlc_googlenet/deploy.prototxt'

# Model caffemodel file
model_trained = caffe_root + 'models/bvlc_googlenet/bvlc_googlenet.caffemodel'

# File containing the class labels
imagenet_labels = caffe_root + 'data/ilsvrc12/synset_words.txt'

# Path to the mean image (used for input processing)
mean_path = caffe_root + 'python/caffe/imagenet/ilsvrc_2012_mean.npy'

# Name of the layer we want to extract
layer_name = 'pool5/7x7_s1'

sys.path.insert(0, caffe_root + 'python')
import caffe

def main(argv):
    inputfile = ''
    outputfile = ''

    try:
        opts, args = getopt.getopt(argv, "hi:o:", ["ifile=", "ofile="])
    except getopt.GetoptError:
        print 'caffe_feature_extractor.py -i <inputfile> -o <outputfile>'
        sys.exit(2)

    for opt, arg in opts:
        if opt == '-h':
            print 'caffe_feature_extractor.py -i <inputfile> -o <outputfile>'
            sys.exit()
        elif opt in ("-i", "--ifile"):
            inputfile = arg
        elif opt in ("-o", "--ofile"):
            outputfile = arg

    print 'Reading images from', inputfile
    print 'Writing vectors to', outputfile

    # Setting this to CPU, but feel free to use GPU if you have CUDA installed
    caffe.set_mode_cpu()

    # Loading the Caffe model, setting preprocessing parameters
    net = caffe.Classifier(model_prototxt, model_trained,
                           mean=np.load(mean_path).mean(1).mean(1),
                           channel_swap=(2, 1, 0),
                           raw_scale=255,
                           image_dims=(256, 256))

    # Loading class labels
    with open(imagenet_labels) as f:
        labels = f.readlines()

    # This prints information about the network layers (names and sizes)
    # You can uncomment this, to have a look inside the network and choose which layer to print
    #print [(k, v.data.shape) for k, v in net.blobs.items()]
    #exit()

    # Processing one image at a time, printing predictions and writing the vector to a file
    with open(inputfile, 'r') as reader:
        with open(outputfile, 'w') as writer:
            writer.truncate()
            for image_path in reader:
                image_path = image_path.strip()
                input_image = caffe.io.load_image(image_path)
                prediction = net.predict([input_image], oversample=False)
                print os.path.basename(image_path), ' : ', labels[prediction[0].argmax()].strip(), ' (', prediction[0][prediction[0].argmax()], ')'
                np.savetxt(writer, net.blobs[layer_name].data[0].reshape(1, -1), fmt='%.8g')

if __name__ == "__main__":
    main(sys.argv[1:])
```
You will first need to set the caffe_root variable to point to your Caffe installation. Then run it with:
```shell
python caffe_feature_extractor.py -i <inputfile> -o <outputfile>
```
It will first print out a lot of model-specific debugging information, and will then print a line for each input image containing the image name, the label of the most probable class, and the class probability.
```
flower.jpg : n11939491 daisy ( 0.576037 )
horse.jpg : n02389026 sorrel ( 0.996444 )
beach.jpg : n09428293 seashore, coast, seacoast, sea-coast ( 0.568305 )
```
At the same time, it will also print vectors into the output file. By default, it will extract the layer pool5/7x7_s1 after processing each image. This is the last layer before the final softmax, and it contains 1024 elements. I haven’t experimented with choosing different layers yet, but this seemed like a reasonable place to start: it should contain all the high-level processing done in the network, before the network is forced to choose a specific class. Feel free to choose a different layer; just change the corresponding parameter in the script. If you find that specific layers work better, let me know as well.
The output file will contain a vector for each image: one line of values per input image, with 1024 values on each line (if you printed the default layer). Mission accomplished!
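As a sanity check, the vectors can be read back with numpy and paired with the lines of images.txt; a minimal sketch (the load_features helper name is my own, and it assumes one vector row per image, as produced above):

```python
import numpy as np

def load_features(list_path, vector_path):
    """Return {image_path: feature_vector} by pairing the lines of the
    image list file with the rows of the vector file."""
    with open(list_path) as f:
        paths = [line.strip() for line in f if line.strip()]
    vectors = np.atleast_2d(np.loadtxt(vector_path))  # handle the single-image case
    assert len(paths) == vectors.shape[0], "expected one vector per image"
    return dict(zip(paths, vectors))
```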
There you have it – going from images to vectors. Now you can use these vectors to represent your images in various tasks, such as classification, multimodal learning, or clustering. Ideally, you will probably want to train the whole network on a specific task, including the visual component, but for starters these pretrained vectors should be quite helpful as well.
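For instance, a quick way to put the vectors to use is to compare two images by the cosine similarity of their feature vectors; a minimal sketch, independent of Caffe:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Images with similar high-level content should end up with similar pool5/7x7_s1 vectors, and therefore a higher cosine similarity.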
These instructions and the script are loosely based on Caffe examples on ImageNet classification and filter visualisation. If the code here isn’t doing quite what you want it to, it’s worth looking at these other similar applications.
If you have any suggestions or fixes, let me know and I’ll be happy to incorporate them in this post.