TRANSFORMING IMAGES TO FEATURE VECTORS

I’m keen to explore some challenges in multimodal learning, such as jointly learning visual and textual semantics. However, I would rather not start by attempting to train an image recognition system from scratch, and prefer to leave this part to researchers who are more experienced in vision and image analysis.

Therefore, the goal is to use an existing image recognition system, in order to extract useful features for a dataset of images, which can then be used as input to a separate machine learning system or neural network. We start with a directory of images, and create a text file containing feature vectors for each image.

1. Install Caffe

Caffe is an open-source neural network library developed at Berkeley, with a focus on image recognition. It can be used to construct and train your own network, or to load one of the pretrained models. A web demo is available if you want to test it out.

Follow the installation instructions to compile Caffe. You will need to install quite a few dependencies (Boost, OpenCV, ATLAS, etc.), but at least for Ubuntu 14.04 they were all available in public repositories.

Once you’re done, run

make test
make runtest

This will run the tests and make sure the installation is working properly.

2. Prepare your dataset

Put all the images you want to process into one directory. Then generate a file containing the path to each image, one image per line. We will use this file to read the images, and it will help you map images to the correct vectors later.

You can run something like this:

find `pwd`/images -type f -exec echo {} \; > images.txt

This will find all files in the subdirectory called “images” and write their absolute paths to images.txt.
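If you would rather stay in Python, or you are on a system without find, the same list can be built with the standard library. This is only a minimal sketch of the step above: the function names list_image_paths and write_image_list are my own, and unlike find it sorts the paths so that the line order is reproducible.

```python
import os


def list_image_paths(image_dir):
    """Return the sorted absolute paths of all regular files under image_dir."""
    paths = []
    for root, _dirs, files in os.walk(image_dir):
        for name in files:
            paths.append(os.path.abspath(os.path.join(root, name)))
    return sorted(paths)


def write_image_list(image_dir, list_path):
    """Write one absolute path per line; return how many paths were written."""
    paths = list_image_paths(image_dir)
    with open(list_path, "w") as f:
        for path in paths:
            f.write(path + "\n")
    return len(paths)


# Example: write_image_list("images", "images.txt")
```

Like the find command, this does not filter by file type, so anything that is not an image should be kept out of the directory.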

3. Download the model

There are a number of pretrained models publicly available for Caffe. Four main models are part of the original Caffe distribution, but more are available in the Model Zoo wiki page, provided by community members and other researchers.

We’ll be using the BVLC GoogLeNet model, which is based on the model described in Going Deeper with Convolutions by Szegedy et al. (2014). It is a 22-layer deep convolutional network, trained on ImageNet data to classify images into 1,000 different categories. Just for fun, here’s a diagram of the network, rotated 90 degrees:

[Figure: diagram of the GoogLeNet network, rotated 90 degrees]

The Caffe models consist of two parts:

  1. A description of the model (in the form of *.prototxt files)
  2. The trained parameters of the model (in the form of a *.caffemodel file)

The prototxt files are small, and they come included with the Caffe code. But the parameters are large and need to be downloaded separately. Run the following command in your main Caffe directory to download the parameters for the GoogLeNet model:

python scripts/download_model_binary.py models/bvlc_googlenet

This will find out where to download the caffemodel file, based on information already in the models/bvlc_googlenet/ directory, and will then place it into the same directory.
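The caffemodel file is large, so it is worth checking that the download finished cleanly before moving on. Here is a small sketch that computes the SHA-1 digest of a file so you can compare it with a published checksum. I believe the readme in the model directory lists one, but treat that as an assumption and check your own copy. The function names are hypothetical.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_looks_complete(path, expected_sha1):
    """True if the file's digest matches a checksum you copied from a trusted source."""
    return sha1_of_file(path) == expected_sha1.strip().lower()


# Example (expected_sha1 is a value you supply, not one defined here):
# download_looks_complete("models/bvlc_googlenet/bvlc_googlenet.caffemodel", expected_sha1)
```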

In addition, run this command as well:

./data/ilsvrc12/get_ilsvrc_aux.sh

It will download some auxiliary files for the ImageNet dataset, including the file of class labels which we will be using later.
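Judging by the labels that the script prints later (for example "n11939491 daisy"), each line of the class label file holds a WordNet ID, a space, and then one or more comma-separated names. If you want the ID and the readable names separately, a sketch like the following should do. It assumes that line format, and the function names are my own.

```python
def parse_synset_line(line):
    """Split a line such as 'n11939491 daisy' into ('n11939491', 'daisy')."""
    wnid, _, names = line.strip().partition(" ")
    return wnid, names


def load_labels(path):
    """Read the label file into a list of (wnid, names) tuples, skipping blank lines."""
    with open(path) as f:
        return [parse_synset_line(line) for line in f if line.strip()]


# Example: labels = load_labels(caffe_root + 'data/ilsvrc12/synset_words.txt')
```

The position of a line in the file is the class index, so the list order must be preserved.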

4. Process images and print vectors

Now is the time to load the model into Caffe, process each image, and print a corresponding vector into a file. I created a script for that (see below, also available as a Gist):

import numpy as np
import os, sys, getopt

# Main path to your caffe installation
caffe_root = '/path/to/your/caffe/'

# Model prototxt file
model_prototxt = caffe_root + 'models/bvlc_googlenet/deploy.prototxt'

# Model caffemodel file
model_trained = caffe_root + 'models/bvlc_googlenet/bvlc_googlenet.caffemodel'

# File containing the class labels
imagenet_labels = caffe_root + 'data/ilsvrc12/synset_words.txt'

# Path to the mean image (used for input processing)
mean_path = caffe_root + 'python/caffe/imagenet/ilsvrc_2012_mean.npy'

# Name of the layer we want to extract
layer_name = 'pool5/7x7_s1'

# Make the Caffe Python bindings importable
sys.path.insert(0, caffe_root + 'python')
import caffe

USAGE = 'caffe_feature_extractor.py -i <inputfile> -o <outputfile>'

def main(argv):
    inputfile = ''
    outputfile = ''

    try:
        opts, args = getopt.getopt(argv, "hi:o:", ["ifile=", "ofile="])
    except getopt.GetoptError:
        print(USAGE)
        sys.exit(2)

    for opt, arg in opts:
        if opt == '-h':
            print(USAGE)
            sys.exit()
        elif opt in ("-i", "--ifile"):
            inputfile = arg
        elif opt in ("-o", "--ofile"):
            outputfile = arg

    # Both files are required
    if not inputfile or not outputfile:
        print(USAGE)
        sys.exit(2)

    print('Reading images from "%s"' % inputfile)
    print('Writing vectors to "%s"' % outputfile)

    # Setting this to CPU, but feel free to use GPU if you have CUDA installed
    caffe.set_mode_cpu()
    # Loading the Caffe model, setting preprocessing parameters
    net = caffe.Classifier(model_prototxt, model_trained,
                           mean=np.load(mean_path).mean(1).mean(1),
                           channel_swap=(2, 1, 0),
                           raw_scale=255,
                           image_dims=(256, 256))

    # Loading class labels
    with open(imagenet_labels) as f:
        labels = f.readlines()

    # This prints information about the network layers (names and sizes)
    # You can uncomment this, to have a look inside the network and choose which layer to print
    #print([(k, v.data.shape) for k, v in net.blobs.items()])
    #exit()

    # Processing one image at a time, printing predictions and writing the vector to a file
    with open(inputfile, 'r') as reader:
        with open(outputfile, 'w') as writer:
            for image_path in reader:
                image_path = image_path.strip()
                input_image = caffe.io.load_image(image_path)
                prediction = net.predict([input_image], oversample=False)
                # Index of the most probable class for this image
                top = prediction[0].argmax()
                print('%s  :  %s  ( %g )' % (os.path.basename(image_path), labels[top].strip(), prediction[0][top]))
                # Flatten the chosen layer into a single row and append it to the output file
                np.savetxt(writer, net.blobs[layer_name].data[0].reshape(1, -1), fmt='%.8g')

if __name__ == "__main__":
    main(sys.argv[1:])

You will first need to set the caffe_root variable to point to your Caffe installation. Then run it with:

python caffe_feature_extractor.py -i <inputfile> -o <outputfile>

It will first print out a lot of model-specific debugging information, and will then print a line for each input image containing the image name, the label of the most probable class, and the class probability.

flower.jpg  :  n11939491 daisy  ( 0.576037 )
horse.jpg  :  n02389026 sorrel  ( 0.996444 )
beach.jpg  :  n09428293 seashore, coast, seacoast, sea-coast  ( 0.568305 )

At the same time, it will also print vectors into the output file. By default, it will extract the layer pool5/7x7_s1 after processing each image. This is the final pooling layer, which feeds the last fully connected classification layer and its softmax, and it contains 1024 elements. I haven’t experimented with choosing different layers yet, but this seemed like a reasonable place to start – it should contain all the high-level processing done in the network, but before forcing it to choose a specific class. Feel free to choose a different layer though; just change the layer_name variable in the script. If you find that specific layers work better, let me know as well.

The output file will contain the vectors for your images. There will be one line of values for each input image, in the same order as the input file, and every line will contain 1024 values (if you printed the default layer). Mission accomplished!
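To use the vectors later, you need to read them back and pair each row with its image path. Here is a minimal sketch, assuming the vectors file was produced from the same images.txt and that every image was processed successfully, so that row i belongs to line i. The function name load_features is my own.

```python
import numpy as np


def load_features(list_path, vectors_path):
    """Return (paths, vectors) where vectors[i] is the feature vector for paths[i]."""
    with open(list_path) as f:
        paths = [line.strip() for line in f if line.strip()]
    # ndmin=2 keeps a 2-D array even when the file holds a single image
    vectors = np.loadtxt(vectors_path, ndmin=2)
    if len(paths) != vectors.shape[0]:
        raise ValueError("%d paths but %d vectors" % (len(paths), vectors.shape[0]))
    return paths, vectors


# Example: paths, vectors = load_features("images.txt", "vectors.txt")
# With the default layer, vectors.shape should be (number_of_images, 1024).
```

The length check is there because a mismatch would silently assign vectors to the wrong images.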

Epilogue

There you have it – going from images to vectors. Now you can use these vectors to represent your images in various tasks, such as classification, multimodal learning, or clustering. Ideally, you will probably want to train the whole network on a specific task, including the visual component, but for starters these pretrained vectors should be quite helpful as well.
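As a quick sanity check on the vectors, you can look up the most similar image for a given one. The sketch below uses cosine similarity, which is my own choice for illustration and not something the extraction script depends on. It expects a 2-D array with one row per image, and the function names are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows
    unit = vectors / np.maximum(norms, 1e-12)
    return unit.dot(unit.T)


def most_similar(vectors, index):
    """Index of the row most similar to row `index`, excluding the row itself."""
    sims = cosine_similarity_matrix(vectors)[index].copy()
    sims[index] = -np.inf
    return int(np.argmax(sims))


# Example: most_similar(vectors, 0) returns the row index of the image closest to image 0.
```

If similar-looking images come out as neighbours, the features are probably being extracted and aligned correctly.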

These instructions and the script are loosely based on Caffe examples on ImageNet classification and filter visualisation. If the code here isn’t doing quite what you want it to, it’s worth looking at these other similar applications.

If you have any suggestions or fixes, let me know and I’ll be happy to incorporate them in this post.
