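The previous post showed how to tackle Kaggle's handwritten digit recognition task with Theano and logistic regression, and ended by noting that running it on the CPU was far too slow. After finishing that experiment I went through the Theano documentation and learned that Theano officially supports GPU computation only through CUDA, not OpenCL, which in practice means it officially supports only NVIDIA cards. The reason is that CUDA and OpenCL are two separate GPU computing platforms: CUDA runs only on NVIDIA cards, while OpenCL is a vendor-neutral standard that AMD, Intel and NVIDIA hardware can all run. Please look up the detailed differences yourself. Unfortunately, my laptop has an Intel integrated GPU plus an entry-level AMD discrete card. Theano does provide unofficial OpenCL support through libgpuarray, so I spent a great deal of time trying to install libgpuarray.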

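libgpuarray supports Debian 6, Ubuntu 14.04, Mac OS X 10.11 and Windows 7. I could find only two blog posts online that describe a successful libgpuarray installation, both on Mac OS. The links are given below for later readers: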


  • The latest AMD graphics driver; see the AMD website for details
  • AMD APP SDK, which provides OpenCL
  • CMake >= 3.0 (cmake)
  • g++, which on Windows can usually be installed through MinGW or TDM-GCC
  • Visual Studio
  • clBLAS (clblas)
  • libcheck
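
Before starting the build, it can save time to check which of the command-line tools are already on PATH. The following is only a minimal sketch: the tool names in `TOOLS` are my own assumption about what the binaries are called on an Ubuntu system (they are not taken from the libgpuarray documentation), and libraries such as clBLAS or libcheck are not covered because they are not executables.

```python
import os

# Hypothetical list of command names; adjust it to the tools you actually need.
TOOLS = ["cmake", "g++", "git", "clinfo"]


def is_on_path(tool, path=None):
    """Return True if an executable file named `tool` sits in one of the PATH dirs."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries instead of searching the current dir
        candidate = os.path.join(directory, tool)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return True
    return False


def find_missing_tools(tools, path=None):
    """Return the subset of `tools` that could not be found."""
    return [tool for tool in tools if not is_on_path(tool, path)]


if __name__ == "__main__":
    missing = find_missing_tools(TOOLS)
    print("Missing: %s" % (", ".join(missing) or "none"))
```

The search is written by hand so that the same file runs under both Python 2.7 (used in this post) and Python 3.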



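For my Windows 7 / Ubuntu 14.04 dual-boot installation I followed http://m.blog.csdn.net/article/details?id=43987599 . The procedure in that post is straightforward, so I will not go into it here.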





marcovaldo@marcovaldong:~$ fglrxinfo
display: :0  screen: 0
OpenGL vendor string: Advanced Micro Devices, Inc.
OpenGL renderer string: AMD Radeon HD 7400M Series
OpenGL version string: 4.5.13399 Compatibility Profile Context 15.201.1151



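Download the SDK from the AMD website (mind the OS and whether you need the 32-bit or 64-bit build). I downloaded the 64-bit Linux version, AMD APP SDK 3.0. Extracting the archive yields a .sh file; run the following command in a terminal: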

sudo sh AMD-APP-SDK-v3.0.130.136-GA-linux64.sh


marcovaldo@marcovaldong:~$ clinfo
Number of platforms:                 1
  Platform Profile:              FULL_PROFILE
  Platform Version:              OpenCL 2.0 AMD-APP (1800.11)
  Platform Name:                 AMD Accelerated Parallel Processing
  Platform Vendor:               Advanced Micro Devices, Inc.
  Platform Extensions:               cl_khr_icd cl_amd_event_callback cl_amd_offline_devices 

  Platform Name:                 AMD Accelerated Parallel Processing
Number of devices:               2
  Device Type:                   CL_DEVICE_TYPE_GPU
  Vendor ID:                     1002h
  Board name:                    AMD Radeon HD 7400M Series
  Device Topology:               PCI[ B#1, D#0, F#0 ]
  Max compute units:                 2
  Max work items dimensions:             3
    Max work items[0]:               256
    Max work items[1]:               256
    Max work items[2]:               256
  Max work group size:               256
  Preferred vector width char:           16
  Preferred vector width short:          8
  Preferred vector width int:            4
  Preferred vector width long:           2
  Preferred vector width float:          4
  Preferred vector width double:         0
  Native vector width char:          16
  Native vector width short:             8
  Native vector width int:           4
  Native vector width long:          2
  Native vector width float:             4
  Native vector width double:            0
  Max clock frequency:               700Mhz
  Address bits:                  32
  Max memory allocation:             134217728
  Image support:                 Yes
  Max number of images read arguments:       128
  Max number of images write arguments:      8
  Max image 2D width:                16384
  Max image 2D height:               16384
  Max image 3D width:                2048
  Max image 3D height:               2048
  Max image 3D depth:                2048
  Max samplers within kernel:            16
  Max size of kernel argument:           1024
  Alignment (bits) of base address:      2048
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                     No
    Quiet NaNs:                  Yes
    Round to nearest even:           Yes
    Round to zero:               Yes
    Round to +ve and infinity:           Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                    None
  Cache line size:               0
  Cache size:                    0
  Global memory size:                536870912
  Constant buffer size:              65536
  Max number of constant args:           8
  Local memory type:                 Scratchpad
  Local memory size:                 32768
  Max pipe arguments:                0
  Max pipe active reservations:          0
  Max pipe packet size:              0
  Max global variable size:          0
  Max global variable preferred total size:  0
  Max read/write image args:             0
  Max on device events:              0
  Queue on device max size:          0
  Max on device queues:              0
  Queue on device preferred size:        0
  SVM capabilities:              
    Coarse grain buffer:             No
    Fine grain buffer:               No
    Fine grain system:               No
    Atomics:                     No
  Preferred platform atomic alignment:       0
  Preferred global atomic alignment:         0
  Preferred local atomic alignment:      0
  Kernel Preferred work group size multiple:     64
  Error correction support:          0
  Unified memory for Host and Device:        0
  Profiling timer resolution:            1
  Device endianess:              Little
  Available:                     Yes
  Compiler available:                Yes
  Execution capabilities:                
    Execute OpenCL kernels:          Yes
    Execute native function:             No
  Queue on Host properties:              
    Out-of-Order:                No
    Profiling :                  Yes
  Queue on Device properties:                
    Out-of-Order:                No
    Profiling :                  No
  Platform ID:                   0x7f98e6833430
  Name:                      Caicos
  Vendor:                    Advanced Micro Devices, Inc.
  Device OpenCL C version:           OpenCL C 1.2 
  Driver version:                1800.11
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 1.2 AMD-APP (1800.11)
  Extensions:                    cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_amd_image2d_from_buffer_read_only cl_khr_spir cl_khr_gl_event 

  Device Type:                   CL_DEVICE_TYPE_CPU
  Vendor ID:                     1002h
  Board name:                    
  Max compute units:                 4
  Max work items dimensions:             3
    Max work items[0]:               1024
    Max work items[1]:               1024
    Max work items[2]:               1024
  Max work group size:               1024
  Preferred vector width char:           16
  Preferred vector width short:          8
  Preferred vector width int:            4
  Preferred vector width long:           2
  Preferred vector width float:          8
  Preferred vector width double:         4
  Native vector width char:          16
  Native vector width short:             8
  Native vector width int:           4
  Native vector width long:          2
  Native vector width float:             8
  Native vector width double:            4
  Max clock frequency:               2299Mhz
  Address bits:                  64
  Max memory allocation:             2147483648
  Image support:                 Yes
  Max number of images read arguments:       128
  Max number of images write arguments:      64
  Max image 2D width:                8192
  Max image 2D height:               8192
  Max image 3D width:                2048
  Max image 3D height:               2048
  Max image 3D depth:                2048
  Max samplers within kernel:            16
  Max size of kernel argument:           4096
  Alignment (bits) of base address:      1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                     Yes
    Quiet NaNs:                  Yes
    Round to nearest even:           Yes
    Round to zero:               Yes
    Round to +ve and infinity:           Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                    Read/Write
  Cache line size:               64
  Cache size:                    32768
  Global memory size:                6161788928
  Constant buffer size:              65536
  Max number of constant args:           8
  Local memory type:                 Global
  Local memory size:                 32768
  Max pipe arguments:                16
  Max pipe active reservations:          16
  Max pipe packet size:              2147483648
  Max global variable size:          1879048192
  Max global variable preferred total size:  1879048192
  Max read/write image args:             64
  Max on device events:              0
  Queue on device max size:          0
  Max on device queues:              0
  Queue on device preferred size:        0
  SVM capabilities:              
    Coarse grain buffer:             No
    Fine grain buffer:               No
    Fine grain system:               No
    Atomics:                     No
  Preferred platform atomic alignment:       0
  Preferred global atomic alignment:         0
  Preferred local atomic alignment:      0
  Kernel Preferred work group size multiple:     1
  Error correction support:          0
  Unified memory for Host and Device:        1
  Profiling timer resolution:            1
  Device endianess:              Little
  Available:                     Yes
  Compiler available:                Yes
  Execution capabilities:                
    Execute OpenCL kernels:          Yes
    Execute native function:             Yes
  Queue on Host properties:              
    Out-of-Order:                No
    Profiling :                  Yes
  Queue on Device properties:                
    Out-of-Order:                No
    Profiling :                  No
  Platform ID:                   0x7f98e6833430
  Name:                      Intel(R) Core(TM) i3-2350M CPU @ 2.30GHz
  Vendor:                    GenuineIntel
  Device OpenCL C version:           OpenCL C 1.2 
  Driver version:                1800.11 (sse2,avx)
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 1.2 AMD-APP (1800.11)
  Extensions:                    cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_spir cl_khr_gl_event 
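
The clinfo listing is long, and the only parts needed later are which devices exist and in what order. Below is a rough sketch that pulls those out of the text. It assumes the AMD clinfo layout shown above (one "Number of devices:" line per platform, and each device introduced by "Device Type:" and closed by a "Name:" line). The `openclN:M` labels follow my understanding that Theano's device string means platform N, device M; treat that mapping, and the assumption that libgpuarray enumerates devices in the same order as clinfo, as unverified.

```python
def list_opencl_devices(clinfo_text):
    """Return (label, device_type, name) tuples parsed from clinfo output."""
    devices = []
    platform_index = -1
    device_index = -1
    device_type = None
    for line in clinfo_text.splitlines():
        key, _, value = line.strip().partition(":")
        value = value.strip()
        if key == "Number of devices":
            # A new platform's device section starts here.
            platform_index += 1
            device_index = -1
        elif key == "Device Type":
            device_index += 1
            device_type = value
        elif key == "Name" and device_type is not None:
            label = "opencl%d:%d" % (platform_index, device_index)
            devices.append((label, device_type, value))
            device_type = None
    return devices


if __name__ == "__main__":
    import sys
    # Usage: clinfo | python list_devices.py  (the script name is hypothetical)
    for entry in list_opencl_devices(sys.stdin.read()):
        print("%s  %s  %s" % entry)
```

If the mapping holds, the GPU (Caicos) in the listing above would be `opencl0:0` and the CPU device `opencl0:1`.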


export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:"/opt/AMDAPPSDK-3.0/lib/x86_64":"/opt/AMDAPPSDK-3.0/lib/x86"
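
To confirm that the directories added to LD_LIBRARY_PATH really contain an OpenCL library, a small check like the one below can help. It is a sketch under one assumption: that the library file is named `libOpenCL.so` (possibly with a version suffix). Where exactly the SDK places that file may differ from the two directories exported above, so an empty result only means the file is not directly inside the listed directories.

```python
import glob
import os


def find_opencl_libs(ld_library_path):
    """Return files matching libOpenCL.so* in the given path-separated dirs."""
    found = []
    for directory in ld_library_path.split(os.pathsep):
        if not directory:
            continue  # ignore empty entries such as a leading ':'
        pattern = os.path.join(directory, "libOpenCL.so*")
        found.extend(sorted(glob.glob(pattern)))
    return found


if __name__ == "__main__":
    libs = find_opencl_libs(os.environ.get("LD_LIBRARY_PATH", ""))
    print("\n".join(libs) if libs else "No libOpenCL.so* found on LD_LIBRARY_PATH")
```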

At this point the AMD APP SDK is installed. Here are a few more blog posts I consulted:



sudo add-apt-repository ppa:fkrull/deadsnakes-python2.7
sudo apt-get update  
sudo apt-get upgrade



sudo apt-get install python-virtualenv
sudo apt-get install python-pip
virtualenv venv
source venv/bin/activate


pip install numpy
pip install Cython
pip install Scipy



pip install git+https://github.com/Theano/Theano.git
# I used robberphex's CSDN mirror here; many thanks
# pip install git+https://code.csdn.net/u010096836/theano.git


sudo apt-get install check


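git clone https://github.com/Theano/libgpuarray.git
cd libgpuarray
mkdir Build
cd Build
# Configure and compile first; an empty Build directory has nothing to install
cmake .. -DCMAKE_BUILD_TYPE=Release
make
make install
# Return to the libgpuarray root, where setup.py lives
cd ..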

export LIBRARY_PATH=$LIBRARY_PATH:$PWD/../venv/lib
export CPATH=$CPATH:$PWD/../venv/

python setup.py build
python setup.py install
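
After `setup.py install` finishes, a quick way to see whether the Python side is usable is to try importing the modules inside the virtualenv. This is just a sketch: `pygpu` and `theano` are the module names that appear in this post, and a successful import says nothing yet about whether the OpenCL device can actually be initialised.

```python
import importlib


def can_import(module_name):
    """Return True if `module_name` imports cleanly, False on ImportError."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    for name in ("pygpu", "theano"):
        print("%s: %s" % (name, "ok" if can_import(name) else "not importable"))
```

Run it from a directory other than the libgpuarray source tree, so that Python picks up the installed package rather than the local source folder.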


from theano import function, config, shared, tensor, sandbox
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')


(venv)marcovaldo@marcovaldong:~/desktop$ python test.py
Looping 1000 times took 7.7898850441 seconds
Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
Used the cpu


(venv)marcovaldo@marcovaldong:~/desktop$ THEANO_FLAGS=mode=FAST_RUN,floatX=float32 python test.py
Looping 1000 times took 3.86811089516 seconds
Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
Used the cpu
(venv)marcovaldo@marcovaldong:~/desktop$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python test.py
Looping 1000 times took 3.84727883339 seconds
Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
Used the cpu
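
Typing the THEANO_FLAGS prefix by hand for every run gets tedious. As a sketch, the runs above can be driven from a small helper that assembles the flag string and passes it to the script through the environment. The helper names are my own and `test.py` is the script from this post; nothing here is part of Theano's API beyond the THEANO_FLAGS environment variable itself.

```python
import os
import subprocess
import sys


def build_theano_flags(pairs):
    """Join (key, value) pairs into a THEANO_FLAGS string such as 'a=1,b=2'."""
    return ",".join("%s=%s" % (key, value) for key, value in pairs)


def run_with_flags(script, pairs):
    """Run `script` with THEANO_FLAGS set from `pairs`; return its exit code."""
    env = dict(os.environ)
    env["THEANO_FLAGS"] = build_theano_flags(pairs)
    return subprocess.call([sys.executable, script], env=env)


CPU_FLOAT32 = [("mode", "FAST_RUN"), ("device", "cpu"), ("floatX", "float32")]

if __name__ == "__main__":
    print(build_theano_flags(CPU_FLOAT32))
    # To actually launch the benchmark:
    # run_with_flags("test.py", CPU_FLOAT32)
```

Swapping the `device` value for an OpenCL device string gives the flags for the GPU attempt without retyping the other two settings.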


(venv)marcovaldo@marcovaldong:~/desktop$ THEANO_FLAGS=mode=FAST_RUN,device=opencl0:0,floatX=float32 python test.py
ERROR (theano.sandbox.gpuarray): Could not initialize pygpu, support disabled
Traceback (most recent call last):
  File "/home/marcovaldo/myvenv/venv/local/lib/python2.7/site-packages/theano/sandbox/gpuarray/__init__.py", line 96, in 
  File "/home/marcovaldo/myvenv/venv/local/lib/python2.7/site-packages/theano/sandbox/gpuarray/__init__.py", line 47, in init_dev
    "Make sure Theano and libgpuarray/pygpu "
RuntimeError: ('Wrong major API version for gpuarray:', -9997, 'Make sure Theano and libgpuarray/pygpu are in sync.')
Looping 1000 times took 3.86138486862 seconds
Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
Used the cpu


RuntimeError: ('Wrong major API version for gpuarray:', -9997, 'Make sure Theano and libgpuarray/pygpu are in sync.')
RuntimeError: ('Wrong major API version for gpuarray:', -9998, 'Make sure Theano and libgpuarray/pygpu are in sync.')
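
Because Theano silently falls back to the CPU when pygpu fails to initialise, the error is easy to miss in a long log. The sketch below scans captured output for this particular message and extracts the reported number. It makes no claim about what -9997 or -9998 mean; the only guidance I have is the message's own hint that Theano and libgpuarray/pygpu should be brought in sync.

```python
import re

# Matches the tuple-style message printed in the traceback above.
API_ERROR = re.compile(r"Wrong major API version for gpuarray:',\s*(-?\d+)")


def gpuarray_api_error_code(log_text):
    """Return the reported API version number, or None if the error is absent."""
    match = API_ERROR.search(log_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    import sys
    # Usage: python test.py 2>&1 | python check_log.py  (script name is hypothetical)
    code = gpuarray_api_error_code(sys.stdin.read())
    if code is None:
        print("No gpuarray API version error found")
    else:
        print("gpuarray API version error, reported value %d" % code)
```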





