A Little Progress Every Day: Tricks

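I am currently doing research in deep learning and mainly use Python. While writing programs I often run into small tricks, so I am recording them here and will keep updating this post. If you have any questions, please leave a comment below or contact me by email at [email protected].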

one-hot vector

One-hot vectors are very important in natural language processing. They are often used as the input to a neural network, where they act as an index. As the name indicates, a one-hot vector has exactly one element equal to 1 and all other elements equal to 0. Suppose we have a vocabulary of 4000 words for text generation: each word then gets its own one-hot vector, giving 4000 distinct vectors in total.
How do we build such a matrix in practice? Different tasks call for different ways of initializing the vectors. We start with a small dataset whose samples are labeled with two classes, 0 and 1.

classification
Suppose there are only two classes, 0 and 1. The two one-hot vectors are [1,0] and [0,1]. Suppose we have six training samples whose labels are stored in an array such as [0,1,0,1,1,0]. We first build an identity matrix and then index it with the label array, so that each label selects its one-hot row and the result is a matrix covering all samples.

>>> import numpy as np
>>> x = np.eye(2) # Two types of vectors
>>> y = np.array([0,1,0,1,1,0]) # classes
>>> x
array([[ 1.,  0.],
       [ 0.,  1.]])
>>> y
array([0, 1, 0, 1, 1, 0])
>>> x[y] # By indexing, we generate a matrix for learning
array([[ 1.,  0.],
       [ 0.,  1.],
       [ 1.,  0.],
       [ 0.,  1.],
       [ 0.,  1.],
       [ 1.,  0.]])
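
The same indexing trick extends to any number of classes. Below is a minimal sketch, assuming the labels are integers in the range 0 to `num_classes - 1`. The helper name `one_hot` is hypothetical and not part of NumPy.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=np.int64)
    if labels.min() < 0 or labels.max() >= num_classes:
        raise ValueError("labels must lie in [0, num_classes)")
    # Row i of the identity matrix is the one-hot vector for class i,
    # so indexing with the label array picks one row per sample.
    return np.eye(num_classes, dtype=np.float32)[labels]

encoded = one_hot([0, 1, 0, 1, 1, 0], num_classes=2)
print(encoded.shape)  # (6, 2)
```

Taking `encoded.argmax(axis=1)` recovers the original labels, which is a quick way to check the encoding.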

float32 (theano)

NumPy's default floating-point type is float64. However, data must be converted to float32 before Theano can store it on the GPU.

convert to float32

epsilon = np.float32(0.01)
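
To see the default dtype and the effect of the cast on an array, here is a small NumPy-only sketch. The array shape is arbitrary and chosen only for illustration.

```python
import numpy as np

weights = np.random.randn(3, 2)          # NumPy creates float64 by default
print(weights.dtype)                     # float64

weights32 = weights.astype(np.float32)   # astype returns a converted copy
print(weights32.dtype)                   # float32

# float32 uses 4 bytes per element instead of 8
print(weights32.nbytes, weights.nbytes)  # 24 48
```

Because `astype` returns a copy, the original float64 array is left unchanged.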

use a shared variable

import numpy as np
import theano
import theano.tensor as T
# input_dimension and output_dimension are the layer sizes, defined elsewhere
w = theano.shared(np.random.randn(input_dimension, output_dimension).astype('float32'), name='w')

MNIST dataset

MNIST is a widely used dataset for handwritten digit recognition. Its main characteristics are:

training set: 50,000 examples; validation set: 10,000; test set: 10,000
28 x 28 pixels (each example is represented as a 1-dimensional array of length 784)
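
The relation between a 28 x 28 image and its 784-element row can be sketched with NumPy. The array below is random data standing in for real digits, not actual MNIST content.

```python
import numpy as np

# Synthetic stand-in for a batch of 5 grayscale digit images.
images = np.random.rand(5, 28, 28).astype(np.float32)

flat = images.reshape(len(images), -1)   # one row of 784 values per image
print(flat.shape)                        # (5, 784)

restored = flat.reshape(-1, 28, 28)      # back to 2-D images, e.g. for plotting
print(np.array_equal(restored, images))  # True
```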
Now we open the dataset in Python and prepare it for GPU acceleration.
```
 import cPickle, gzip, numpy, theano
 import theano.tensor as T
```

 
 Load the dataset 
 f = gzip.open('mnist.pkl.gz', 'rb')
 train_set, valid_set, test_set = cPickle.load(f)
 f.close() 
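
The loading code above is Python 2 (`cPickle`). Under Python 3 the same steps use `pickle`, and a pickle written by Python 2 generally needs `encoding='latin1'` so that NumPy arrays load correctly. The sketch below is self-contained: it first writes a tiny synthetic file (the name `tiny_mnist.pkl.gz` and the helper `make_split` are hypothetical) with the same three-way layout, then loads it the same way.

```python
import gzip
import os
import pickle
import tempfile

import numpy as np

def make_split(n):
    # Synthetic stand-in: n flattened 28x28 images and n labels in 0..9.
    x = np.random.rand(n, 784).astype(np.float32)
    y = np.random.randint(0, 10, size=n)
    return x, y

path = os.path.join(tempfile.mkdtemp(), "tiny_mnist.pkl.gz")
with gzip.open(path, "wb") as f:
    pickle.dump((make_split(5), make_split(2), make_split(2)), f)

# Loading mirrors the original snippet; `with` closes the file for us.
# encoding="latin1" is assumed to be needed for a Python 2 pickle
# such as mnist.pkl.gz, and is harmless for the file written above.
with gzip.open(path, "rb") as f:
    train_set, valid_set, test_set = pickle.load(f, encoding="latin1")

train_x, train_y = train_set
print(train_x.shape, train_y.shape)  # (5, 784) (5,)
```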
 Next, store the data into GPU memory 
 def share_dataset(data_xy):
     # Wrap the data in Theano shared variables
     data_x, data_y = data_xy
     shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
     shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
     # The following syntax also works:
     #   shared_x = theano.shared(data_x.astype('float32'))
     #   shared_y = theano.shared(data_y.astype('float32'))
     # The labels should be integers, not floats, so cast them
     return shared_x, T.cast(shared_y, 'int32')
 Now try it! 
 test_set_x, test_set_y = share_dataset(test_set)
 valid_set_x, valid_set_y = share_dataset(valid_set)
 train_set_x, train_set_y = share_dataset(train_set)
