Python Acceleration

Preface: It has been a while since I last updated this blog, so I am reorganizing my old technical notes. When processing large-scale data, compute resources become the bottleneck, and various acceleration techniques are needed. For numerical computation there are tools such as Cython and numba; but for workloads like large-scale tokenization and other NLP text processing, those are less convenient, and multithreading or multiprocessing is the practical way to speed things up.

Contents

1. Cython computation speedup

2. multiprocessing subprocess speedup, v1

3. multiprocessing subprocess speedup, v2

4. numba numerical computation speedup

5. Multithreading for crawlers


1. Cython computation speedup

  • https://zhuanlan.zhihu.com/p/24168485
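
  • The linked article covers Cython in detail. As a minimal sketch of the usual workflow (file and function names here are hypothetical, and this assumes Cython is installed): declare C types in a `.pyx` file, compile it into an extension module, then import it from ordinary Python.

    ```cython
    # fib.pyx -- hypothetical module; cdef static types remove interpreter overhead
    def fib(int n):
        cdef int i
        cdef long a = 0, b = 1
        for i in range(n):
            a, b = b, a + b
        return a
    ```

    Build with `cythonize -i fib.pyx` (or a setup.py), after which `from fib import fib` works from plain Python; the typed loop runs at C speed.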

2. multiprocessing subprocess speedup, v1

  • Spawn multiple worker processes. On an 8-core CPU (multiprocessing.cpu_count() == 8), the theoretical best case is roughly an 8x speedup.
    • import multiprocessing
      from multiprocessing import Pool
      import time
      
      def do_anything(subprocess_id):
          # Dummy CPU-bound workload. For purely numerical code, numba or
          # Cython would also work; this pattern targets the general case.
          result = 0
          for x in range(100):
              for j in range(50):
                  for k in range(10):
                      result = x + j + k + subprocess_id
          return result
      
      # Plain sequential loop, for comparison
      cpu_count = multiprocessing.cpu_count()
      start = time.time()
      for i in range(500):
          result = do_anything(i)
      end = time.time()
      print(end - start)
      
      # Speed up with a process pool; collect return values via get() at the end
      start = time.time()
      p = Pool(cpu_count + 1)  # pool size = cpu_count + 1
      results = []
      for i in range(500):  # 500 tasks
          results.append(p.apply_async(do_anything, args=(i,)))
          # Do NOT call get() here: get() blocks, which would serialize
          # the tasks and reduce this to the plain loop above.
      p.close()
      p.join()
      for r in results:
          res = r.get()
      end = time.time()
      print(end - start)

       

    • References:

      • https://thief.one/2016/11/24/Multiprocessing子进程返回值/
      • https://www.liaoxuefeng.com/wiki/897692888725344/923056295693632
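
    • A sketch of the same pattern using Pool.map, which collects results in input order without explicit get() calls (the trivial square function is a stand-in for a real workload):

      ```python
      from multiprocessing import Pool

      def square(n):
          # Stand-in for a real per-task workload
          return n * n

      if __name__ == "__main__":
          with Pool(4) as p:
              results = p.map(square, range(10))  # blocks until all tasks finish
          print(results)
      ```

      Unlike apply_async, map blocks until every task is done and returns the results already ordered to match the input.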

3. multiprocessing subprocess speedup, v2

  • Use a Manager shared variable to collect values written by multiple subprocesses.
    • # Collect results through a Manager shared dict
      import multiprocessing
      from multiprocessing import Manager
      
      def worker(procnum, return_dict):
          '''Trivial worker: report own id'''
          print(str(procnum) + ' represent!')
          return_dict[procnum] = procnum
      
      def do_anything(procnum, return_dict):
          # Dummy CPU-bound workload (same idea as in v1)
          result = 0
          for x in range(100):
              for j in range(50):
                  for k in range(10):
                      result = x + j + k + procnum
          return_dict[procnum] = result
      
      manager = Manager()
      return_dict = manager.dict()
      jobs = []
      for i in range(5):
          p = multiprocessing.Process(target=worker, args=(i, return_dict))
          jobs.append(p)
          p.start()
      
      for proc in jobs:
          proc.join()
      # return_dict now holds the values written by all subprocesses
      print(return_dict)

       

  • References:
    • https://segmentfault.com/q/1010000010403117
    • https://blog.csdn.net/haeasringnar/article/details/79917003
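
  • Besides a Manager dict, a multiprocessing.Queue can carry results back to the parent without shared state; a minimal sketch with a trivial squaring worker (names are illustrative):
    • ```python
      from multiprocessing import Process, Queue

      def worker(procnum, q):
          # Push (id, result) back to the parent instead of mutating shared state
          q.put((procnum, procnum * procnum))

      if __name__ == "__main__":
          q = Queue()
          jobs = [Process(target=worker, args=(i, q)) for i in range(5)]
          for p in jobs:
              p.start()
          # Drain the queue before join() to avoid deadlock on a full pipe
          results = dict(q.get() for _ in jobs)
          for p in jobs:
              p.join()
          print(results)
      ```

      A Queue avoids the proxy overhead of a Manager, at the cost of assembling the result collection yourself.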

4. numba numerical computation speedup

  • How it works:
    • numba parses the Python function's AST via its meta module and annotates each variable with type information, then calls llvmpy to generate machine code plus a Python calling interface for it (this describes older Numba releases; current ones are built on llvmlite). Numba can JIT-compile Python functions that operate on NumPy arrays down to machine code, which can speed them up by up to a couple of orders of magnitude. It provides decorators that JIT-compile the decorated function into machine code and return a wrapper object callable from Python. To compile a Python function into fast machine code, the JIT compiler needs the types of the parameters and the return value. To let the JIT handle arguments of any type, older Numba offered autojit; modern @jit without a signature compiles lazily based on the argument types it first sees.

    • Reference: https://zhuanlan.zhihu.com/p/33556376

  • Supported types:
    • # Listing from a legacy Numba release (the nb.minivect API no longer exists)
      print([obj for obj in nb.__dict__.values() if isinstance(obj, nb.minivect.minitypes.Type)])

      [size_t, Py_uintptr_t, uint16, complex128, float, complex256, void, int , long double,
       unsigned PY_LONG_LONG, uint32, complex256, complex64, object_, npy_intp, const char *,
       double, unsigned short, float, object_, float, uint64, uint32, uint8, complex128, uint16,
       int, int , uint8, complex64, int8, uint64, double, long double, int32, double, long double,
       char, long, unsigned char, PY_LONG_LONG, int64, int16, unsigned long, int8, int16, int32,
       unsigned int, short, int64, Py_ssize_t]

  • Example:
    • from numba import jit
      import time
      
      # Plain Python loop
      def foo1(x, y):
          tt = time.time()
          s = 0
          for i in range(x, y):
              s += i
          print('Time used: {} sec'.format(time.time() - tt))
          return s
      print(foo1(1, 100000000))
      
      # Same loop, JIT-compiled with numba
      @jit
      def foo2(x, y):
          tt = time.time()
          s = 0
          for i in range(x, y):
              s += i
          print('Time used: {} sec'.format(time.time() - tt))
          return s
      print(foo2(1, 100000000))
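
  • To illustrate the earlier point about declaring parameter and return types, a sketch of eager compilation with an explicit signature (assuming numba is installed; the function name is illustrative):
    • ```python
      from numba import jit, int64

      # Explicit signature: compiled once at decoration time, not on first call
      @jit(int64(int64, int64), nopython=True)
      def range_sum(x, y):
          s = 0
          for i in range(x, y):
              s += i
          return s

      print(range_sum(1, 100))
      ```

      With a signature, the first call pays no compilation cost, but the function only accepts the declared types.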

       

5. Multithreading for crawlers

  • A multithreaded Tieba crawler. This trick is specific to crawler-style workloads: they spend most of their time waiting on network I/O, which is exactly where threads help.

    See my blog post: https://blog.csdn.net/u010454729/article/details/49765929

  • #!/usr/bin/env python
    # coding=utf-8
    from multiprocessing.dummy import Pool as ThreadPool
    import requests
    import time
     
    def getsource(url):
        return requests.get(url).text
     
    urls = []
    for i in range(1, 21):
        newpage = "http://tieba.baidu.com/p/3522395718?pn=" + str(i)
        urls.append(newpage)  # build the url list
     
    time1 = time.time()
    for url in urls:
        print(url)
        getsource(url)
    time2 = time.time()
    print("single-threaded: " + str(time2 - time1))
     
    pool = ThreadPool(4)  # set to your core count; originally run on a 4-core Dell machine under Ubuntu 14.04
    time3 = time.time()
    results = pool.map(getsource, urls)
    pool.close()
    pool.join()
    time4 = time.time()
    print("multi-threaded: " + str(time4 - time3))
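
    The same fan-out is available in the standard library via concurrent.futures; a sketch with a stand-in fetch function so it runs offline (the example.com URLs are placeholders, and fetch would be requests.get(url).text in real use):

    • ```python
      from concurrent.futures import ThreadPoolExecutor

      def fetch(url):
          # Stand-in for requests.get(url).text so the sketch needs no network
          return len(url)

      urls = ["http://example.com/page/" + str(i) for i in range(5)]
      with ThreadPoolExecutor(max_workers=4) as ex:
          results = list(ex.map(fetch, urls))  # order matches the input list
      print(results)
      ```

      The context manager handles pool shutdown, so there is no explicit close()/join() pair to get wrong.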

    An example where threading is the wrong tool: reading local files is disk-bound, and the per-line string work holds the GIL, so the thread pool adds overhead with little or no real speedup:

    • from multiprocessing.dummy import Pool as ThreadPool
      import os
      
      # Worker: read a file and join its stripped lines
      def getWords(filename):
          res = []
          with open(filename, 'r') as f:
              for line in f:
                  res.append(line.strip())
          return '\n'.join(res)
      
      # Build the list of file paths
      file_path = 'tmpdata/'
      file_name_list = []
      for home, dirs, files in os.walk(file_path):
          for file_name in files:
              file_name_list.append(os.path.join(home, file_name))
      
      # Run in a thread pool
      pool = ThreadPool(4)
      results = pool.map(getWords, file_name_list)
      pool.close()
      pool.join()
      
      # Inspect the first result
      print(results[0])

       
