Since my research is in deep learning, the language I mainly use is Python. While writing programs I often run into small tricks worth keeping, so I record them here and will keep updating this post. If you have any questions, please leave a comment below or email [email protected].
one-hot vector
The one-hot vector is very important in NLP; it is often fed to a neural network as input and acts as an index. As its name indicates, it is a vector in which exactly one element is 1 and all the others are 0. Suppose we have a vocabulary of 4,000 words for text generation; then there should be 4,000 unique one-hot vectors, one per word. Different tasks call for different ways to build these vectors. Let us start with a small dataset, say data labeled with two classes, 0 and 1.
- classification
Suppose there are only 2 classes, 0 and 1. The two one-hot vectors should then be [1, 0] and [0, 1]. Suppose further that we have six training samples whose labels are stored in an array like [0, 1, 0, 1, 1, 0]. We first build an identity matrix and then let the label array index into it, so that each sample selects the row it belongs to; the result is a matrix containing the one-hot vectors of all samples.
>>> import numpy as np
>>> x = np.eye(2) # Two types of vectors
>>> y = np.array([0,1,0,1,1,0]) # classes
>>> x
array([[ 1.,  0.],
       [ 0.,  1.]])
>>> y
array([0, 1, 0, 1, 1, 0])
>>> x[y] # By indexing, we generate a matrix for learning
array([[ 1.,  0.],
       [ 0.,  1.],
       [ 1.,  0.],
       [ 0.,  1.],
       [ 0.,  1.],
       [ 1.,  0.]])
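The same indexing trick scales beyond two classes. Here is a minimal sketch for the 4,000-word vocabulary mentioned above; vocab_size and word_ids are hypothetical names introduced only for illustration:
import numpy as np

vocab_size = 4000                    # hypothetical vocabulary size
word_ids = np.array([3, 17, 3999])   # hypothetical word indices

# Each row of the identity matrix is the one-hot vector of one word
one_hot = np.eye(vocab_size, dtype=np.float32)[word_ids]
print one_hot.shape    # (3, 4000)
For very large vocabularies, materializing the full identity matrix wastes memory; it is usually better to keep the integer indices and let an embedding lookup do the indexing.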
float32 (theano)
NumPy's default floating-point dtype is float64; however, data must be converted to float32 before it can be stored on the GPU, since Theano's GPU backend only supports float32.
- convert to float32
import numpy as np

epsilon = np.float32(0.01)
- use shared statement
import numpy as np
import theano
import theano.tensor as T

# input_dimension and output_dimension are the layer sizes, defined elsewhere
w = theano.shared(np.random.randn(input_dimension, output_dimension).astype('float32'), name='w')
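Hard-coding 'float32' works, but Theano also provides theano.config.floatX (settable with THEANO_FLAGS=floatX=float32 or in .theanorc), which keeps the same code runnable on both CPU and GPU. A minimal sketch, with hypothetical layer sizes:
import numpy as np
import theano

input_dimension, output_dimension = 784, 10   # hypothetical layer sizes

# floatX defaults to 'float64'; run with THEANO_FLAGS=floatX=float32 for the GPU
w = theano.shared(
    np.random.randn(input_dimension, output_dimension).astype(theano.config.floatX),
    name='w')
print w.dtype    # 'float32' when floatX is set accordingly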
MNIST dataset
The MNIST dataset is a universally used dataset for digit recognition; its characteristics can be summarized as follows:
- training set: 50,000 samples; validation set: 10,000; test set: 10,000
- 28 x 28 pixels per image (each training example is represented as a 1-dimensional array of length 784).
Now, let us open the dataset in Python and prepare it for GPU acceleration.
import cPickle, gzip, numpy, theano
import theano.tensor as T
Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()
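Each of the three sets is an (inputs, labels) tuple; here is a quick sanity check of the shapes (these are the shapes the standard mnist.pkl.gz from deeplearning.net ships with):
train_x, train_y = train_set
print train_x.shape   # (50000, 784): 28 x 28 images, flattened
print train_y.shape   # (50000,): one integer label per image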
Next, store the data in GPU memory as Theano shared variables
def share_dataset(data_xy):
    # Wrap the data in Theano shared variables so it can live in GPU memory
    data_x, data_y = data_xy
    shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
    shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
    '''
    The following syntax also works:
    shared_x = theano.shared(data_x.astype('float32'))
    shared_y = theano.shared(data_y.astype('float32'))
    '''
    # Since the labels are used as integers, not floats, we cast them back
    return shared_x, T.cast(shared_y, 'int32')
Now try it!
test_set_x, test_set_y = share_dataset(test_set)
valid_set_x, valid_set_y = share_dataset(valid_set)
train_set_x, train_set_y = share_dataset(train_set)
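With the datasets living on the GPU as shared variables, a Theano function can slice out minibatches directly, so only an integer index crosses to the device on each call. A minimal sketch, assuming a hypothetical batch_size of 500:
batch_size = 500            # hypothetical minibatch size
index = T.lscalar('index')  # symbolic minibatch index

# The slicing happens on the GPU; only 'index' is transferred per call
get_batch = theano.function(
    [index],
    [train_set_x[index * batch_size: (index + 1) * batch_size],
     train_set_y[index * batch_size: (index + 1) * batch_size]])

batch_x, batch_y = get_batch(0)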