工匠若水

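Python 3.X Crawler in Practice (Concurrent Crawling)

【工匠若水 http://blog.csdn.net/yanbober Reproduction without permission is strictly prohibited; please respect the author's work. Contact me by private message.】

1 Background

As we said before this series began, writing a simple crawler is easy, but building an efficient and robust one is not. So far the series has covered the following core crawler topics.

"Regular Expression Basics"
"Python 3.X Crawler in Practice (Get Crawling First)"
"Python 3.X Crawler in Practice (Static Downloader and Parser)"

With the articles above we already know enough to use a crawler as a handy development tool. (Say your boss asks you to keep an eye on user-behavior data after a feature you built goes live, so that the team can spot potential risks early. You can write a small Python crawler that checks the admin backend on a schedule and emails you the results, so you no longer have to remember to open the web page yourself.) However, we have not yet discussed crawling dynamic pages, processing the crawled data, or crawl performance... Wow, there is still a lot left to dig into. So in this article we start with concurrent crawling in Python, which really just means concurrency in Python. Sob!

We bring this topic up to fix the efficiency problem in "Python 3.X Crawler in Practice (Static Downloader and Parser)", where we used lxml to parse and crawl the Meitulu picture site. That program is very slow: downloading the pictures takes a long time because the requests run sequentially, and we have not even crawled the whole site yet. A full-site crawl would be frightening. If you don't believe it, use the site: search trick introduced in "Python 3.X Crawler in Practice (Get Crawling First)" to check how many pages of this site can be crawled.

That is still not a huge number, but the crawl speed is already unbearable, so we need to solve this problem, and that is what this article is about. You do need some Python concurrency basics first. If you are not comfortable with them yet, read the Python concurrent programming series by Python之美 on Zhihu, which is well written, or the book Core Python Programming.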


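2 Python 3.X Concurrency Groundwork

This subsection is not strictly necessary, but it is included for completeness (note: if you already know the basics of concurrency, skip to Part 3, the concurrent crawler in practice). The concepts of processes and threads, and how they relate and differ, do not depend on a particular programming language. If you have met them before in computer fundamentals, advanced Unix C programming, Java, Android, and so on, Python concurrency is easy to understand; the only differences are the syntax and the names and usage of the APIs.

On POSIX systems Python 3 threads are built on pthreads, and the standard library provides several modules for multithreaded programming, such as _thread, threading, queue, and the concurrent.futures package. _thread and threading let us create and manage threads. The main difference is that _thread (called thread in Python 2; that name no longer exists in Python 3, where the module was renamed _thread) offers only basic thread and lock support, while threading provides much more powerful thread management. queue (named Queue in Python 2) gives us a queue for sharing data between threads. The concurrent.futures package joined the standard library in Python 3.2; its ThreadPoolExecutor and ProcessPoolExecutor are high-level abstractions over threading and multiprocessing that expose a uniform interface for asynchronous calls.

2-1 The Python 3.X _thread Module

This is the concurrency module everyone has abandoned. It was called thread in older Python versions and was renamed _thread in Python 3; it still exists for compatibility but is no longer recommended, roughly for these reasons:

_thread has only one synchronization primitive, which is rather weak, while threading has many;
the higher-level threading module appeared after _thread, so which one would you pick;
_thread has no support for daemon threads and the like, so you have almost no control over when the process ends (when the main thread ends, all other threads are killed without warning or cleanup), whereas threading can basically guarantee that important child threads finish before the main thread exits;

In the end it is because I am not good enough to tame _thread, haha, so I shamelessly chose threading. Enough talk; here is a piece of demo code. It is the classic multithreading example found in every language, nothing special:
[Full source for this example: demo_thread.py]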

import _thread
import time
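'''
Python 3.X _thread module demo
With self.lock.acquire() and self.lock.release() commented out, the final count can be an arbitrary value such as 467195: a race condition.
With self.lock.acquire() and self.lock.release() kept, the final count is 1000000: the lock keeps the concurrent updates correct.
time.sleep(5) works around the weakness of _thread; without it the child threads never get a chance to run.
'''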
class ThreadTest(object):
    def __init__(self):
        self.count = 0
        self.lock = None

    def runnable(self):
        self.lock.acquire()
        print('thread ident is '+str(_thread.get_ident())+', lock acquired!')
        for i in range(0, 100000):
            self.count += 1
        print('thread ident is ' + str(_thread.get_ident()) + ', pre lock release!')
        self.lock.release()

    def test(self):
        self.lock = _thread.allocate_lock()
        for i in range(0, 10):
            _thread.start_new_thread(self.runnable, ())

if __name__ == '__main__':
    test = ThreadTest()
    test.test()
    print('thread is running...')
    time.sleep(5)
    print('test finish, count is:' + str(test.count))

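So it is plain to see why this module deserves its bad reputation. Let's look at threading instead.

2-2 The Python 3.X threading Module

To see what the threading module provides, we can look directly at the __all__ definition in the threading.py source, which lists everything: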

__all__ = ['get_ident', 'active_count', 'Condition', 'current_thread',
           'enumerate', 'main_thread', 'TIMEOUT_MAX',
           'Event', 'Lock', 'RLock', 'Semaphore', 'BoundedSemaphore', 'Thread',
           'Barrier', 'BrokenBarrierError', 'Timer', 'ThreadError',
           'setprofile', 'settrace', 'local', 'stack_size']

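After reading this definition and the official API docs, I also came across a good article on the topic (linked in the original post) that you can study if you are interested. First, here is the usual way to use the Thread class from threading:
[Full source for this example: demo_threading.py]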

import threading
from threading import Thread
import time
'''
Python 3.X threading module demo

Basic usage of the threading Thread class (subclassing and overriding run, and passing a function directly)
'''
class NormalThread(Thread):
    '''
    Override run, similar to the run method of a Java Runnable
    '''
    def __init__(self, name=None):
        Thread.__init__(self, name=name)
        self.counter = 0

    def run(self):
        print(self.name + ' thread is start!')
        self.do_customer_things()
        print(self.name + ' thread is end!')

    def do_customer_things(self):
        while self.counter < 10:
            time.sleep(1)
            print('do customer things counter is:'+str(self.counter))
            self.counter += 1


def loop_runner(max_counter=5):
    '''
    Function passed directly to Thread as its target
    '''
    print(threading.current_thread().name + " thread is start!")
    cur_counter = 0
    while cur_counter < max_counter:
        time.sleep(1)
        print('loop runner current counter is:' + str(cur_counter))
        cur_counter += 1
    print(threading.current_thread().name + " thread is end!")


if __name__ == '__main__':
    print(threading.current_thread().name + " thread is start!")

    normal_thread = NormalThread("Normal Thread")
    normal_thread.start()

    loop_thread = Thread(target=loop_runner, args=(10,), name='LOOP THREAD')
    loop_thread.start()

    loop_thread.join()
    normal_thread.join()

    print(threading.current_thread().name + " thread is end!")

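The most obvious benefit: the main thread no longer has to guess how long to sleep while the child threads finish, as it did with _thread. With the Thread class we can simply call join to wait for them (there are other ways too, which you can explore yourself). Both styles look a lot like Java threads, which is nice. Next, a simple example of synchronization with a lock:
[Full source for this example: demo_threading_lock.py]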

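'''
Python 3.X threading module demo

Lock synchronization with threading
With self.lock.acquire() and self.lock.release() commented out, the final count can be a value such as 467195: a race condition.
With self.lock.acquire() and self.lock.release() kept, the final count is 1000000: the lock keeps the concurrent updates correct.
'''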
import threading
from threading import Thread

class LockThread(Thread):
    count = 0

    def __init__(self, name=None, lock=None):
        Thread.__init__(self, name=name)
        self.lock = lock

    def run(self):
        self.lock.acquire()
        print('thread is '+threading.current_thread().name+', lock acquired!')
        for i in range(0, 100000):
            LockThread.count += 1
        print('thread is '+threading.current_thread().name+', pre lock release!')
        self.lock.release()


if __name__ == '__main__':
    threads = list()
    lock = threading.Lock()
    for i in range(0, 10):
        thread = LockThread(name=str(i), lock=lock)
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()
    print('Main Thread finish, LockThread.count is:'+str(LockThread.count))
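
The acquire/release pair in the demo above can also be written with a with statement, because threading.Lock is a context manager; the lock is then released even if the guarded code raises. The following is only a minimal sketch: the Counter class and the run_workers helper are illustrative names that do not appear in the original demo.

```python
import threading


class Counter:
    """Hypothetical shared counter guarded by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def add(self, times):
        for _ in range(times):
            # "with" acquires the lock and always releases it on exit.
            with self._lock:
                self.value += 1


def run_workers(thread_count=4, times=10000):
    """Start thread_count threads that each add `times` to one counter."""
    counter = Counter()
    threads = [threading.Thread(target=counter.add, args=(times,))
               for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == '__main__':
    print(run_workers())
```

Locking around every single increment is slower than holding the lock for the whole loop as the demo does; it is done here only to show the with form at its smallest scope.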

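For ordinary synchronization a Lock is enough. Simple, right? For the other primitives (see the __all__ definition above), consult other material; we only touch on them here. Next, let's look at the thread-safe queues that crawlers commonly use: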

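Python 3's queue module provides synchronized, thread-safe queue classes: the FIFO Queue, the LIFO LifoQueue, and the priority queue PriorityQueue. All of them implement the required locking, so they can be used directly from multiple threads and can also serve to synchronize threads. Here is a simple but classic example (the producer-consumer problem):
[Full source for this example: demo_threading_queue.py]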

from queue import Queue
from random import randint
from threading import Thread
from time import sleep
'''
Python 3.X threading plus queue demo
The classic concurrent producer-consumer model
'''

class TestQueue(object):
    def __init__(self):
        self.queue = Queue(2)

    def writer(self):
        print('Producer start write to queue.')
        self.queue.put('key', block=True)
        print('Producer write to queue end. size is:'+str(self.queue.qsize()))

    def reader(self):
        value = self.queue.get(block=True)
        print('Consumer read from queue end. size is:'+str(self.queue.qsize()))

    def producter(self):
        for i in range(5):
            self.writer()
            sleep(randint(0, 3))

    def consumer(self):
        for i in range(5):
            self.reader()
            sleep(randint(2, 4))

    def go(self):
        print('TestQueue Start!')
        threads = []
        functions = [self.consumer, self.producter]
        for func in functions:
            thread = Thread(target=func, name=func.__name__)
            thread.start()
            threads.append(thread)
        for thread in threads:
            thread.join()
        print('TestQueue Done!')

if __name__ == '__main__':
    TestQueue().go()
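
The three queue classes named earlier differ only in the order in which items come back out. As a rough illustration, the sketch below fills each one with the same items and drains it in a single thread; drain, retrieval_order, and the sample (priority, url) tuples are made up for this example.

```python
from queue import Queue, LifoQueue, PriorityQueue


def drain(q):
    """Take every item out of a queue and return them in retrieval order."""
    items = []
    while not q.empty():
        items.append(q.get())
    return items


def retrieval_order(queue_cls, items):
    """Put items into a fresh queue of the given class, then drain it."""
    q = queue_cls()
    for item in items:
        q.put(item)
    return drain(q)


if __name__ == '__main__':
    # (priority, url) tuples; PriorityQueue serves the lowest number first.
    tasks = [(2, '/page/b'), (1, '/page/a'), (3, '/page/c')]
    print(retrieval_order(Queue, tasks))
    print(retrieval_order(LifoQueue, tasks))
    print(retrieval_order(PriorityQueue, tasks))
```

Checking empty() before get() is only safe here because a single thread owns the queue; with several consumers, a blocking get() as in the producer-consumer demo is the safer choice.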

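As you can see, these are the main thread-related pieces of Python 3 that crawlers commonly need. There are more advanced techniques and thread classes that we did not mention; you will need to pick those up over time and choose the right helpers for your own crawler. We stop here for lack of space, because using threads well is a deep topic in any language and involves many issues, but for ordinary work the material above is enough.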

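2-3 Python 3.X Process Modules

Above we covered the basics of thread-based concurrency in Python 3. Besides multithreading there is also multiprocessing, and the two differ in how memory is divided up, as we know from other languages. In Python, however, if we want to make full use of a multi-core CPU we have to use multiple processes, because the global interpreter lock (GIL) in standard CPython lets only one thread execute Python bytecode at a time. Python provides a very convenient package for this, multiprocessing, with tools for child processes, inter-process communication, and shared data.

Let's first look at the usual pattern for multiprocessing's Process (it is used just like threading; only the meaning and the underlying mechanism differ):
[Full source for this example: demo_multiprocessing.py]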

import multiprocessing
import time
from multiprocessing import Process
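'''
Python 3.X multiprocessing module demo
Used just like threading; only the meaning and the underlying mechanism differ
Basic usage of the multiprocessing Process class (subclassing and overriding run, and passing a function directly)
'''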
class NormalProcess(Process):
    def __init__(self, name=None):
        Process.__init__(self, name=name)
        self.counter = 0

    def run(self):
        print(self.name + ' process is start!')
        self.do_customer_things()
        print(self.name + ' process is end!')

    def do_customer_things(self):
        while self.counter < 10:
            time.sleep(1)
            print('do customer things counter is:'+str(self.counter))
            self.counter += 1


def loop_runner(max_counter=5):
    print(multiprocessing.current_process().name + " process is start!")
    cur_counter = 0
    while cur_counter < max_counter:
        time.sleep(1)
        print('loop runner current counter is:' + str(cur_counter))
        cur_counter += 1
    print(multiprocessing.current_process().name + " process is end!")


if __name__ == '__main__':
    print(multiprocessing.current_process().name + " process is start!")
    print("cpu count:"+str(multiprocessing.cpu_count())+", active child count:"+str(len(multiprocessing.active_children())))
    normal_process = NormalProcess("NORMAL PROCESS")
    normal_process.start()

    loop_process = Process(target=loop_runner, args=(10,), name='LOOP PROCESS')
    loop_process.start()

    print("cpu count:" + str(multiprocessing.cpu_count()) + ", active child count:" + str(len(multiprocessing.active_children())))
    normal_process.join()
    loop_process.join()
    print(multiprocessing.current_process().name + " process is end!")

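As you can see, the two ways of using Process look very much like Thread above; only the meaning, the mechanism, and the memory model differ. With that in place, let's look at locks and data sharing between processes (the memory model is where processes differ from threads, in any language):
[Full source for this example: demo_multiprocessing_lock.py]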

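'''
Python 3.X multiprocessing module demo

Lock synchronization and data sharing between processes
With self.lock.acquire() and self.lock.release() commented out, the final count can be a value such as 467195: a race condition.
With self.lock.acquire() and self.lock.release() kept, the final count is 1000000: the lock keeps the concurrent updates correct.
'''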
import multiprocessing
from multiprocessing import Process

class LockProcess(Process):
    def __init__(self, name=None, lock=None, m_count=None):
        Process.__init__(self, name=name)
        self.lock = lock
        self.m_count = m_count

    def run(self):
        self.lock.acquire()
        print('process is '+multiprocessing.current_process().name+', lock acquired!')
        # Performance: 100000 iterations, so read the shared value once, count locally, then write it back
        count = self.m_count.value
        for i in range(0, 100000):
            count += 1
        self.m_count.value = count
        print('process is '+multiprocessing.current_process().name+', pre lock release!')
        self.lock.release()


if __name__ == '__main__':
    processes = list()
    lock = multiprocessing.Lock()
    m_count = multiprocessing.Manager().Value('i', 0)  # first argument is a typecode; 'i' means signed int

    for i in range(0, 10):
        process = LockProcess(name=str(i), lock=lock, m_count=m_count)
        process.start()
        processes.append(process)

    for process in processes:
        process.join()
    print('Main Process finish, LockProcess.count is:' + str(m_count.value))

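Well, a bit sentimental of me, but it really is the same routine as threading. The only differences stem from the fundamental difference between threads and processes, while the usage is the same. multiprocessing's Queue works like the one used with threading, so we give no further example; try it out yourself.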

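2-4 Python 3.X Concurrency Pools

Having walked from concurrent threads to concurrent processes, you can see that the _thread, threading, and multiprocessing modules in the standard library are great. But have you considered that in a real project (this applies to other languages too, such as C or Java) creating and destroying threads or processes frequently and at scale is very expensive? That is how the idea of a pool came about (trading space for time). Fortunately, since Python 3.2 the standard library has included the concurrent.futures module. It contains the ThreadPoolExecutor and ProcessPoolExecutor classes (their base class is the abstract Executor, which cannot be used directly), which are high-level abstractions over threading and multiprocessing and directly support thread pools and process pools. We just submit tasks to the pool and let it schedule them, without maintaining our own Queue and worrying about deadlocks.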
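
The uniform interface mentioned above means the calling code barely depends on which executor it gets. Here is a small sketch under that assumption; page_size and run_in_pool are hypothetical helpers, and page_size merely stands in for real work such as downloading or parsing.

```python
from concurrent.futures import ThreadPoolExecutor


def page_size(text):
    """Stand-in for real work such as downloading or parsing a page."""
    return len(text)


def run_in_pool(executor_cls, func, inputs, workers=2):
    """Run func over inputs in a pool created from executor_cls."""
    # The with block waits for all tasks and then shuts the pool down.
    with executor_cls(max_workers=workers) as pool:
        # map() yields results in the same order as the inputs.
        return list(pool.map(func, inputs))


if __name__ == '__main__':
    print(run_in_pool(ThreadPoolExecutor, page_size, ['a', 'bb', 'ccc']))
```

Passing ProcessPoolExecutor instead should work the same way, provided func is a module-level function that can be pickled and the call sits under the if __name__ == '__main__' guard.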

First, a thread pool example:
[Full source for this example: demo_thread_pool_executor.py]

'''
Python 3.X ThreadPoolExecutor demo
'''
import concurrent
from concurrent.futures import ThreadPoolExecutor
from urllib import request

class TestThreadPoolExecutor(object):
    def __init__(self):
        self.urls = [
            'https://www.baidu.com/',
            'http://blog.jobbole.com/',
            'http://www.csdn.net/',
            'https://juejin.im/',
            'https://www.zhihu.com/'
        ]

    def get_web_content(self, url=None):
        print('start get web content from: ' + url)
        try:
            headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
            req = request.Request(url, headers=headers)
            content = request.urlopen(req).read().decode("utf-8")
        except Exception as e:
            print(str(e))
            return None
        print('get web content end from: ' + str(url))
        return content

    def runner(self):
        futures = dict()
        # The with block shuts the pool down once every submitted task has finished.
        with ThreadPoolExecutor(max_workers=2, thread_name_prefix='DEMO') as thread_pool:
            for url in self.urls:
                future = thread_pool.submit(self.get_web_content, url)
                futures[future] = url

            for future in concurrent.futures.as_completed(futures):
                url = futures[future]
                try:
                    data = future.result()
                except Exception as e:
                    print('Run thread url (' + url + ') error. ' + str(e))
                else:
                    if data is None:
                        # get_web_content returns None when the request failed.
                        print(url + ' Request failed.')
                    else:
                        print(url + ' Request data ok. size=' + str(len(data)))
        print('Finished!')

if __name__ == '__main__':
    TestThreadPoolExecutor().runner()

Now a process pool example:
[Full source for this example: demo_process_pool_executor.py]

'''
Python 3.X ProcessPoolExecutor demo
'''
import concurrent
from concurrent.futures import ProcessPoolExecutor
from urllib import request

class TestProcessPoolExecutor(object):
    def __init__(self):
        self.urls = [
            'https://www.baidu.com/',
            'http://blog.jobbole.com/',
            'http://www.csdn.net/',
            'https://juejin.im/',
            'https://www.zhihu.com/'
        ]

    def get_web_content(self, url=None):
        print('start get web content from: ' + url)
        try:
            headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
            req = request.Request(url, headers=headers)
            content = request.urlopen(req).read().decode("utf-8")
        except Exception as e:
            print(str(e))
            return None
        print('get web content end from: ' + str(url))
        return content

    def runner(self):
        futures = dict()
        # The with block shuts the pool down once every submitted task has finished.
        with ProcessPoolExecutor(max_workers=4) as process_pool:
            for url in self.urls:
                future = process_pool.submit(self.get_web_content, url)
                futures[future] = url

            for future in concurrent.futures.as_completed(futures):
                url = futures[future]
                try:
                    data = future.result()
                except Exception as e:
                    print('Run process url (' + url + ') error. ' + str(e))
                else:
                    if data is None:
                        # get_web_content returns None when the request failed.
                        print(url + ' Request failed.')
                    else:
                        print(url + ' Request data ok. size=' + str(len(data)))
        print('Finished!')

if __name__ == '__main__':
    TestProcessPoolExecutor().runner()

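Well, programming languages really do have a lot in common. Once you understand one language in depth, the others come easily, and all you have to adapt to is the syntax. There is still much more to explore in Python 3 concurrency, such as asynchronous I/O and the various kinds of locks. Choose the approach that fits your own needs; only then is it the right one. In short, there is one routine for learning concurrency: practice, observe, and think.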
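
To give a first taste of the asynchronous I/O mentioned above, here is a very small asyncio sketch. It assumes Python 3.7 or later (for asyncio.run), and fake_fetch is a made-up coroutine in which asyncio.sleep stands in for waiting on the network; no real request is made.

```python
import asyncio


async def fake_fetch(url, delay):
    """Stand-in for a network request; asyncio.sleep simulates waiting on I/O."""
    await asyncio.sleep(delay)
    return url + ' done'


async def fetch_all(jobs):
    """Run one fake_fetch per (url, delay) pair concurrently."""
    # gather() runs the coroutines concurrently and keeps the input order.
    return await asyncio.gather(*(fake_fetch(url, delay) for url, delay in jobs))


if __name__ == '__main__':
    print(asyncio.run(fetch_all([('/a', 0.2), ('/b', 0.1)])))
```

Everything here runs in one thread: while one coroutine is waiting, the event loop switches to another, which is why this model suits I/O-heavy work such as crawling.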


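3 Concurrent Crawlers in Practice

Pretty cool, right? We rambled on about Python concurrency above (and still left a lot out, since this is not an article dedicated to Python 3 concurrency) purely for the sake of the crawler examples in this part; otherwise what would be the point? Without further ado: the crawlers we wrote before all ran on a single main thread. Their fatal flaw is that once one link hangs, everything else can only sit and wait. The other problem: my computer is so powerful, so why is my crawler still crawling serially and so slowly? The two examples below are meant to put an end to both complaints.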
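
The "one stuck link blocks everything" problem can be illustrated without touching the network. In the sketch below, fake_download and download_with_deadline are hypothetical names, and time.sleep stands in for a slow or hung server; the point is that the caller gets the fast results back at the deadline instead of waiting for the slow task.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def fake_download(name, seconds):
    """Stand-in for a request; sleeping simulates a slow or stuck server."""
    time.sleep(seconds)
    return name


def download_with_deadline(jobs, deadline):
    """Return (finished names, still-pending names) once the deadline passes."""
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = {pool.submit(fake_download, name, secs): name
                   for name, secs in jobs}
        # wait() returns after the deadline even if some tasks are unfinished.
        done, not_done = wait(futures, timeout=deadline)
        finished = sorted(f.result() for f in done)
        pending = sorted(futures[f] for f in not_done)
    # Leaving the with block still waits for the slow task to end.
    return finished, pending


if __name__ == '__main__':
    print(download_with_deadline([('fast', 0), ('slow', 1.0)], 0.3))
```

For real requests it also helps to pass a timeout to urlopen, for example request.urlopen(req, timeout=10), so that a hung connection raises an error instead of occupying a worker forever.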

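3-1 Multithreaded Crawler in Practice

No more talk, let's get to it and go straight to the code. And no plain-threads demo here; we go directly to a thread pool. I won't explain the crawler much; read the comments in the code below or just run it and you'll understand.
[Full source for this example: spider_multithread.py]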

import os
from concurrent.futures import ThreadPoolExecutor
from urllib import request
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup
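'''
Example that crawls and parses in one thread pool and stores the parsed results in a separate thread pool
Crawls the summary of the Baidu Baike entry for Android and of the entries it links to, and writes the results to the output directory under the current directory
'''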

class CrawlThreadPool(object):
    '''
    Crawls URLs and parses the results in a thread pool with at most 5 concurrent threads;
    the parsed result is delivered through the complete_callback argument of crawl
    '''
    def __init__(self):
        self.thread_pool = ThreadPoolExecutor(max_workers=5)

    def _request_parse_runnable(self, url):
        print('start get web content from: ' + url)
        try:
            headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
            req = request.Request(url, headers=headers)
            content = request.urlopen(req).read().decode("utf-8")
            soup = BeautifulSoup(content, "html.parser")  # content is already a str, so from_encoding is not needed
            new_urls = set()
            links = soup.find_all("a", href=re.compile(r"/item/\w+"))
            for link in links:
                new_urls.add(urljoin(url, link["href"]))
            data = {"url": url, "new_urls": new_urls}
            data["title"] = soup.find("dd", class_="lemmaWgt-lemmaTitle-title").find("h1").get_text()
            data["summary"] = soup.find("div", class_="lemma-summary").get_text()
        except BaseException as e:
            print(str(e))
            data = None
        return data

    def crawl(self, url, complete_callback):
        future = self.thread_pool.submit(self._request_parse_runnable, url)
        future.add_done_callback(complete_callback)


class OutPutThreadPool(object):
    '''
    Stores the results from the crawl pool above in a thread pool with at most 5 concurrent threads;
    '''
    def __init__(self):
        self.thread_pool = ThreadPoolExecutor(max_workers=5)

    def _output_runnable(self, crawl_result):
        try:
            url = crawl_result['url']
            title = crawl_result['title']
            summary = crawl_result['summary']
            save_dir = 'output'
            print('start save %s as %s.txt.' % (url, title))
            if os.path.exists(save_dir) is False:
                os.makedirs(save_dir)
            save_file = save_dir + os.path.sep + title + '.txt'
            if os.path.exists(save_file):
                print('file %s is already exist!' % title)
                return
            with open(save_file, "w", encoding="utf-8") as file_input:
                file_input.write(summary)
        except Exception as e:
            print('save file error.'+str(e))

    def save(self, crawl_result):
        self.thread_pool.submit(self._output_runnable, crawl_result)


class CrawlManager(object):
    '''
    Crawler manager that owns the crawl-and-parse thread pool and the storage thread pool
    '''
    def __init__(self):
        self.crawl_pool = CrawlThreadPool()
        self.output_pool = OutPutThreadPool()

    def _crawl_future_callback(self, crawl_url_future):
        try:
            data = crawl_url_future.result()
            for new_url in data['new_urls']:
                self.start_runner(new_url)
            self.output_pool.save(data)
        except Exception as e:
            print('Run crawl url future thread error. '+str(e))

    def start_runner(self, url):
        self.crawl_pool.crawl(url, self._crawl_future_callback)


if __name__ == '__main__':
    root_url = 'http://baike.baidu.com/item/Android'
    CrawlManager().start_runner(root_url)

Compared with the encyclopedia crawler from the first article in this series, this one is far more efficient; it flies.
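
Two things are worth hedging about the listing above: it never records which URLs it has already scheduled, so the same entry can be fetched again and again, and it schedules new work from callbacks after the main thread has returned, which recent Python versions may reject once interpreter shutdown has begun. The sketch below shows one possible alternative, not the author's method: a driver loop that stays in the main thread, with a visited set and a page cap. The names crawl, fetch_links, and max_pages are assumptions; fetch_links could wrap a function like _request_parse_runnable and return its new_urls.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def crawl(root_url, fetch_links, max_pages=100, workers=5):
    """Crawl from root_url; fetch_links(url) must return an iterable of URLs."""
    visited = {root_url}   # URLs that were already scheduled
    crawled = []           # URLs whose fetch succeeded
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(fetch_links, root_url): root_url}
        while pending:
            # Block in the main thread until at least one fetch finishes.
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                url = pending.pop(future)
                try:
                    new_urls = future.result()
                except Exception as e:
                    print('crawl %s failed: %s' % (url, e))
                    continue
                crawled.append(url)
                for new_url in new_urls:
                    # Skip duplicates and stop scheduling at the page cap.
                    if new_url in visited or len(visited) >= max_pages:
                        continue
                    visited.add(new_url)
                    pending[pool.submit(fetch_links, new_url)] = new_url
    return crawled
```

Because only the main thread touches visited and pending, no extra lock is needed; the worker threads just run fetch_links and hand their results back through the futures.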

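3-2 Multiprocess Crawler in Practice

Enough said. After the thread-based crawler, the natural next step is a process-based one. Again, no more concepts; we covered plenty above. Roll up your sleeves, here is the code, and don't ask what the crawler does; just read the comments:
[Full source for this example: spider_multiprocess.py]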

import os
from concurrent.futures import ProcessPoolExecutor
from urllib import request
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup
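'''
Example that crawls, parses, and stores the parsed results using a process pool
Crawls the summary of the Baidu Baike entry for Android and of the entries it links to, and writes the results to the output directory under the current directory
'''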


class CrawlProcess(object):
    '''
    Works with the process pool to crawl URLs and parse the results;
    the parsed result is delivered through the complete_callback argument of crawl
    '''
    def _request_parse_runnable(self, url):
        print('start get web content from: ' + url)
        try:
            headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
            req = request.Request(url, headers=headers)
            content = request.urlopen(req).read().decode("utf-8")
            soup = BeautifulSoup(content, "html.parser")  # content is already a str, so from_encoding is not needed
            new_urls = set()
            links = soup.find_all("a", href=re.compile(r"/item/\w+"))
            for link in links:
                new_urls.add(urljoin(url, link["href"]))
            data = {"url": url, "new_urls": new_urls}
            data["title"] = soup.find("dd", class_="lemmaWgt-lemmaTitle-title").find("h1").get_text()
            data["summary"] = soup.find("div", class_="lemma-summary").get_text()
        except BaseException as e:
            print(str(e))
            data = None
        return data

    def crawl(self, url, complete_callback, process_pool):
        future = process_pool.submit(self._request_parse_runnable, url)
        future.add_done_callback(complete_callback)


class OutPutProcess(object):
    '''
    Works with the process pool to store the results produced by the crawl-and-parse tasks above;
    '''
    def _output_runnable(self, crawl_result):
        try:
            url = crawl_result['url']
            title = crawl_result['title']
            summary = crawl_result['summary']
            save_dir = 'output'
            print('start save %s as %s.txt.' % (url, title))
            if os.path.exists(save_dir) is False:
                os.makedirs(save_dir)
            save_file = save_dir + os.path.sep + title + '.txt'
            if os.path.exists(save_file):
                print('file %s is already exist!' % title)
                return None
            with open(save_file, "w", encoding="utf-8") as file_input:
                file_input.write(summary)
        except Exception as e:
            print('save file error.'+str(e))
        return crawl_result

    def save(self, crawl_result, process_pool):
        process_pool.submit(self._output_runnable, crawl_result)


class CrawlManager(object):
    '''
    Crawler manager; one process pool schedules both the crawl-and-parse tasks and the storage tasks
    '''
    def __init__(self):
        self.crawl = CrawlProcess()
        self.output = OutPutProcess()
        self.crawl_pool = ProcessPoolExecutor(max_workers=8)
        self.crawl_deep = 100   # crawl limit: caps how many URLs get scheduled
        self.crawl_cur_count = 0

    def _crawl_future_callback(self, crawl_url_future):
        try:
            data = crawl_url_future.result()
            self.output.save(data, self.crawl_pool)
            for new_url in data['new_urls']:
                self.start_runner(new_url)
        except Exception as e:
            print('Run crawl url future process error. '+str(e))

    def start_runner(self, url):
        if self.crawl_cur_count > self.crawl_deep:
            return
        self.crawl_cur_count += 1
        self.crawl.crawl(url, self._crawl_future_callback, self.crawl_pool)


if __name__ == '__main__':
    root_url = 'http://baike.baidu.com/item/Android'
    CrawlManager().start_runner(root_url)

Well, not much to say about the result: it is similar to the thread pool crawler above, just switched to a process pool.


4 Concurrent Crawler Summary

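This article ended up looking less like an introduction to concurrent crawlers and more like a primer on concurrent programming in Python 3. Oh well. Either way, we finished with two small concurrent crawlers built on Python 3 thread pools and process pools; small as they are, they have all the essential parts. We did not go deep into concurrent crawling (or Python 3 concurrency), but the basic goal was met. Learning concurrency in depth takes far more than a day or two, concurrency in large projects is a subject in its own right, and there is a long road ahead. Still, with this groundwork you can explore the basic principles of distributed crawlers on your own, which are essentially multiprocess crawlers, and also Python's asynchronous I/O, which is the real core and cannot be covered in one or two articles either.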


据说是2012年10月人人网校招的一道笔试题-给出一个重物重量为X,另外提供的小砝码重量分别为1，3，9。。。3^N。将重物放到天平左侧，问在两边如何添加砝码 bylijinnan java
public class ScalesBalance { /** * 题目： * 给出一个重物重量为X,另外提供的小砝码重量分别为1，3，9。。。3^N。（假设N无限大，但一种重量的砝码只有一个） * 将重物放到天平左侧，问在两边如何添加砝码使两边平衡 * * 分析： * 三进制 * 我们约定括号表示里面的数是三进制，例如 47=(1202
dom4j最常用最简单的方法 chiangfai dom4j
要使用dom4j读写XML文档,需要先下载dom4j包,dom4j官方网站在 http://www.dom4j.org/目前最新dom4j包下载地址:http://nchc.dl.sourceforge.net/sourceforge/dom4j/dom4j-1.6.1.zip 解开后有两个包,仅操作XML文档的话把dom4j-1.6.1.jar加入工程就可以了,如果需要使用XPath的话还需要
简单HBase笔记 chenchao051 hbase
一、Client-side write buffer 客户端缓存请求描述：可以缓存客户端的请求，以此来减少RPC的次数，但是缓存只是被存在一个ArrayList中，所以多线程访问时不安全的。可以使用getWriteBuffer()方法来取得客户端缓存中的数据。默认关闭。二、Scan的Caching 描述： next( )方法请求一行就要使用一次RPC,即使
mysqldump导出时出现when doing LOCK TABLES daizj mysql mysqdump 导数据
　　执行　mysqldump -uxxx -pxxx -hxxx -Pxxxx database tablename > tablename.sql　导出表时，会报 mysqldump: Got error: 1044: Access denied for user 'xxx'@'xxx' to database 'xxx' when doing LOCK TABLES 解决
CSS渲染原理 dcj3sjt126com Web
从事Web前端开发的人都与CSS打交道很多，有的人也许不知道css是怎么去工作的，写出来的css浏览器是怎么样去解析的呢？当这个成为我们提高css水平的一个瓶颈时，是否应该多了解一下呢？一、浏览器的发展与CSS
《阿甘正传》台词 dcj3sjt126com
Part Ⅰ: 《阿甘正传》Forrest Gump经典中英文对白 Forrest: Hello! My names Forrest. Forrest Gump. You wanna Chocolate? I could eat about a million and a half othese. My momma always said life was like a box ochocol
Java处理JSON dyy_gusi json
Json在数据传输中很好用，原因是JSON 比 XML 更小、更快，更易解析。在Java程序中，如何使用处理JSON，现在有很多工具可以处理，比较流行常用的是google的gson和alibaba的fastjson，具体使用如下： 1、读取json然后处理 class ReadJSON { public static void main(String[] args)
win7下nginx和php的配置 geeksun nginx
1. 安装包准备 nginx : 从nginx.org下载nginx-1.8.0.zip php：从php.net下载php-5.6.10-Win32-VC11-x64.zip， php是免安装文件。 RunHiddenConsole: 用于隐藏命令行窗口 2. 配置 # java用8080端口做应用服务器，nginx反向代理到这个端口即可 p
基于2.8版本redis配置文件中文解释 hongtoushizi redis
转载自： http://wangwei007.blog.51cto.com/68019/1548167 在Redis中直接启动redis-server服务时, 采用的是默认的配置文件。采用redis-server xxx.conf 这样的方式可以按照指定的配置文件来运行Redis服务。下面是Redis2.8.9的配置文
第五章常用Lua开发库3-模板渲染 jinnianshilongnian nginx lua
动态web网页开发是Web开发中一个常见的场景，比如像京东商品详情页，其页面逻辑是非常复杂的，需要使用模板技术来实现。而Lua中也有许多模板引擎，如目前我在使用的lua-resty-template，可以渲染很复杂的页面，借助LuaJIT其性能也是可以接受的。如果学习过JavaEE中的servlet和JSP的话，应该知道JSP模板最终会被翻译成Servlet来执行；而lua-r
JZSearch大数据搜索引擎颠覆者 JavaScript
系统简介：大数据的特点有四个层面：第一，数据体量巨大。从TB级别，跃升到PB级别；第二，数据类型繁多。网络日志、视频、图片、地理位置信息等等。第三，价值密度低。以视频为例，连续不间断监控过程中，可能有用的数据仅仅有一两秒。第四，处理速度快。最后这一点也是和传统的数据挖掘技术有着本质的不同。业界将其归纳为4个“V”——Volume，Variety，Value，Velocity。大数据搜索引
10招让你成为杰出的Java程序员 pda158 java 编程框架
如果你是一个热衷于技术的 Java 程序员，那么下面的 10 个要点可以让你在众多 Java 开发人员中脱颖而出。　　 1. 拥有扎实的基础和深刻理解 OO 原则　　对于 Java 程序员，深刻理解 Object Oriented Programming（面向对象编程）这一概念是必须的。没有 OOPS 的坚实基础，就领会不了像 Java 这些面向对象编程语言
tomcat之oracle连接池配置小网客 oracle
tomcat版本7.0 配置oracle连接池方式：修改tomcat的server.xml配置文件： <GlobalNamingResources> <Resource name="utermdatasource" auth="Container" type="javax.sql.DataSou
Oracle 分页算法汇总 vipbooks oracle sql 算法 .net
这是我找到的一些关于Oracle分页的算法，大家那里还有没有其他好的算法没？我们大家一起分享一下！ -- Oracle 分页算法一 select * from ( select page.*,rownum rn from (select * from help) page -- 20 = (currentPag