Rhine_Yu

网页爬虫入门--莫烦教程笔记

网页爬虫入门–莫烦教程笔记

教程推荐:

莫烦教程–网页爬虫

崔庆才–Python爬虫学习系列教程

知乎问答中的各种推荐

孔淼–一看就明白的爬虫入门讲解

课程逻辑:

网页爬虫 → 解析网页 → 高效爬虫 → 爬虫高级库

爬虫简介

# 用Python登录网页

from urllib.request import urlopen

# if has Chinese, apply decode()
html = urlopen(
    "https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
print(html)




    
    Scraping tutorial 1 | 莫烦Python
    


    爬虫测试1
    
        这是一个在 莫烦Python
        爬虫教程 中的简单测试.

注意:

我用jupyter notebook做动态交互, 运行时一直报错”ImportError: No module named request”. 但这段代码在IDE上没有出错, 并且这个模块是存在的. 所有我花了一些时间解决这个问题, 在此记录, 以防再次犯错.

首先分析是否是安装问题, 由于我很久没有用jupyter, 再次使用时忘了anaconda自带jupyter,所有用pip和pip3都分别安装了一次. 出问题后又用uninstall卸载, 但是问题没有解决, 依然报错. 并且发现anaconda里有jupyter.
google以及stackoverflow上找办法, 无果, 感觉问题还是出在Python版本上. 于是修改了配置文件, 还是出错.
分析了一下原因, 觉得问题应该出在anaconda自带jupyer上, 其默认的环境不是我正在使用的环境python3.5, 一番周折之后找到解决方案. 原来jupyter的ipykernel是使用一个叫kernel.json的文件管理的, 所以直接为我想要的环境安装ipykernel包:

$ conda install -n python35 ipykernel 
$ python -m ipykernel install --user #激活这个环境

# 匹配网页内容,初级网页匹配使用正则,繁琐匹配推荐使用BeautifulSoup

# 想要找到title的话
import re
res=re.findall(r'(.+?)',html)
print('\nPage title is:',res[0])

# 想要找到中间的那个段落
res=re.findall(r'(.+?)
',html,flags=re.DOTALL) #re.DOTALL if multil line
print('\nPage paragraph is:',res[0])

# 找一找所有的链接
res=re.findall(r'href=(.+?)',html)
print('\nAll links:',res)

Page title is: Scraping tutorial 1 | 莫烦Python

Page paragraph is: 
        这是一个在 莫烦Python
        爬虫教程 中的简单测试.


All links: ['"', '"', '"']

BeautifulSoup解析网页

爬网页流程:
1. 选择要爬的网址(url)
2. 使用Python登录上这个网址(urlopen等)
3. 读取网页信息(read()出来)
4. 将读取的信息放入BeautifulSoup
5. 使用BeautifulSoup选取tag信息等(代替正则表达式)

安装BeautifulSoup

$ pip install beautifulsoup4 # Python 2+ 
$ pip3 install beautifulsoup4 # Python 3+ 
$ conda install beautifulsoup4 # anaconda

BeautifulSoup 4.2.0文档

BeautifulSoup解析网页: 基础

# 按常规读取网页
from bs4 import BeautifulSoup
from urllib.request import urlopen

#if has Chinese, apple decode()
html = urlopen(
    "https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
print(html)

# 将网页信息加载进BeautifulSoup,以lxml的形式
soup=BeautifulSoup(html,features='lxml')
print(soup.h1)
print('\n',soup.p)

# 用find_all找到所有选项,但真正link在里面
# 可以看做是的一个属性,可以用字典形式的key来读取
all_href=soup.find_all('a')
all_href=[l["href"] for l in all_href]
print('\n',all_href)




    
    Scraping tutorial 1 | 莫烦Python
    


    爬虫测试1
    
        这是一个在 莫烦Python
        爬虫教程 中的简单测试.
    



爬虫测试1

 
        这是一个在 莫烦Python
爬虫教程 中的简单测试.
    

 ['https://morvanzhou.github.io/', 'https://morvanzhou.github.io/tutorials/scraping']

BeautifulSoup解析网页: CSS

# 先读取页面
from bs4 import BeautifulSoup
from urllib.request import urlopen

html=urlopen(
    "https://morvanzhou.github.io/static/scraping/list.html"
).read().decode('utf-8')
print(html)

# 将网页信息加载进BeautifulSoup,以lxml的形式
soup=BeautifulSoup(html,features='lxml')




    
    爬虫练习 列表 class | 莫烦 Python
    




列表 爬虫练习

这是一个在 莫烦 Python 的 爬虫教程
    里无敌简单的网页, 所有的 code 让你一目了然, 清晰无比.


    一月
    
        一月一号
        一月二号
        一月三号
    
    二月
    三月
    四月
    五月

# 按CSS class 匹配
month=soup.find_all('li',{"class":"month"})
for m in month:
    print(m.get_text())

一月
二月
三月
四月
五月

jan=soup.find('ul',{'class':'jan'})
d_jan=jan.find_all('li')
for d in d_jan:
    print(d.get_text())

一月一号
一月二号
一月三号

BeautifulSoup解析网页: 正则表达

from  bs4 import BeautifulSoup
from urllib.request import urlopen
import re

html=urlopen(
    "https://morvanzhou.github.io/static/scraping/table.html"
).read().decode('utf-8')

soup=BeautifulSoup(html,features='lxml')

img_links=soup.find_all('img',{'src':re.compile('.*?\.jpg')})
for link in img_links:
    print(link['src'])

https://morvanzhou.github.io/static/img/course_cover/tf.jpg
https://morvanzhou.github.io/static/img/course_cover/rl.jpg
https://morvanzhou.github.io/static/img/course_cover/scraping.jpg

course_links=soup.find_all('a',{'href':re.compile('https://morvan.*')})
for link in course_links:
    print(link['href'])

https://morvanzhou.github.io/
https://morvanzhou.github.io/tutorials/scraping
https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/
https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/
https://morvanzhou.github.io/tutorials/data-manipulation/scraping/

小练习: 爬百度百科

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import random

# 设置起始页,并将/item/...的网页放在his中,做一个备案,记录我们浏览过的网页
base_url="https://baike.baidu.com"
his=["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]

# select the last sub url i 'his', print the title and url
url=base_url+his[-1]
html=urlopen(url).read().decode('utf-8')

soup=BeautifulSoup(html,features='lxml')
print(soup.find('h1').get_text(),'  url: ',his[-1])

网络爬虫   url:  /item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711

# find valid urls
sub_urls=soup.find_all('a',{'target':'_blank','href':re.compile('/item/(%.{2})+$')})

if len(sub_urls)!=0:
    his.append(random.sample(sub_urls,1)[0]['href'])
else:
    # no valid sub link found
    his.pop()
print(his)

['/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711', '/item/%E6%8E%92%E5%BA%8F%E7%AE%97%E6%B3%95']

# 放在一个for loop中,让它在各种不同的网页中跳来跳去
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]

for i in range(20):
    url = base_url + his[-1]

    html = urlopen(url).read().decode('utf-8')
    soup = BeautifulSoup(html, features='lxml')
    print(i, soup.find('h1').get_text(), '    url: ', his[-1])

    # find valid urls
    sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})

    if len(sub_urls) != 0:
        his.append(random.sample(sub_urls, 1)[0]['href'])
    else:
        # no valid sub link found
        his.pop()

0 网络爬虫     url:  /item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711
1 搜索引擎     url:  /item/%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E
2 全文索引     url:  /item/%E5%85%A8%E6%96%87%E7%B4%A2%E5%BC%95
3 搜索引擎     url:  /item/%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E
4 网站     url:  /item/%E7%BD%91%E7%AB%99
5 通信协议     url:  /item/%E9%80%9A%E4%BF%A1%E5%8D%8F%E8%AE%AE
6 数据单元     url:  /item/%E6%95%B0%E6%8D%AE%E5%8D%95%E5%85%83
7 数据包     url:  /item/%E6%95%B0%E6%8D%AE%E5%8C%85
8 会话层     url:  /item/%E4%BC%9A%E8%AF%9D%E5%B1%82
9 数据流     url:  /item/%E6%95%B0%E6%8D%AE%E6%B5%81
10 聚类分析     url:  /item/%E8%81%9A%E7%B1%BB%E5%88%86%E6%9E%90
11 中数据     url:  /item/%E4%B8%AD%E6%95%B0%E6%8D%AE
12 中数据     url:  /item/%E4%B8%AD%E6%95%B0%E6%8D%AE
13 中数据     url:  /item/%E4%B8%AD%E6%95%B0%E6%8D%AE
14 企业邮箱     url:  /item/%E4%BC%81%E4%B8%9A%E9%82%AE%E7%AE%B1
15 域名     url:  /item/%E5%9F%9F%E5%90%8D
16 商标权     url:  /item/%E5%95%86%E6%A0%87%E6%9D%83
17 商标注册不当     url:  /item/%E5%95%86%E6%A0%87%E6%B3%A8%E5%86%8C%E4%B8%8D%E5%BD%93
18 中国驰名商标     url:  /item/%E9%A9%B0%E5%90%8D%E5%95%86%E6%A0%87
19 商标评审委员会     url:  /item/%E5%95%86%E6%A0%87%E8%AF%84%E5%AE%A1%E5%A7%94%E5%91%98%E4%BC%9A

注意: Python错误记录：IndexError: list index out of range
1. 第1种可能情况: list[index]index超出范围
2. 第2种可能情况: list是一个空的,没有一个元素, 进行list[0]就会出现该错误

加速你的爬虫

多进程分布式

import multiprocessing as mp
import time
from urllib.request import urlopen,urljoin
from bs4 import BeautifulSoup
import re

base_url = 'https://morvanzhou.github.io/'

def crawl(url):
    response=urlopen(url)
    time.sleep(0.1)
    return response.read().decode()

def parse(html):
    soup=BeautifulSoup(html,'lxml')
    urls=soup.find_all('a',{'href':re.compile('^/.+?/$')})
    title=soup.find('h1').get_text().strip()
    page_urls=set([urljoin(base_url,url['href']) for url in urls])
    url=soup.find('meta',{'property':'og:url'})['content']
    return title,page_urls,url

测试普通爬法

unseen = set([base_url,])
seen = set()

if base_url != "http://127.0.0.1:4000/":
    restricted_crawl = True
else:
    restricted_crawl = False

count, t1 = 1, time.time()

while len(unseen) != 0:                 # still get some url to visit
    if restricted_crawl and len(seen) > 20:
            break

    print('\nDistributed Crawling...')
    htmls = [crawl(url) for url in unseen]

    print('\nDistributed Parsing...')
    results = [parse(html) for html in htmls]

    print('\nAnalysing...')
    seen.update(unseen)         # seen the crawled
    unseen.clear()              # nothing unseen

    for title, page_urls, url in results:
        print(count, title, url)
        count += 1
        unseen.update(page_urls - seen)     # get new url to crawl
print('Total time: %.1f s' % (time.time()-t1, ))    # 53 s

Distributed Crawling...

Distributed Parsing...

Analysing...
1 教程 https://morvanzhou.github.io/

Distributed Crawling...

Distributed Parsing...

Analysing...
2 数据处理教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/
3 Matplotlib 画图教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/plt/
4 机器学习系列 https://morvanzhou.github.io/tutorials/machine-learning/
5 为了更优秀 https://morvanzhou.github.io/support/
6 高级爬虫: 高效无忧的 Scrapy 爬虫库 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/5-02-scrapy/
7 机器学习实践 https://morvanzhou.github.io/tutorials/machine-learning/ML-practice/
8 网页爬虫教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/
9 Why? https://morvanzhou.github.io/tutorials/data-manipulation/scraping/1-00-why/
10 multiprocessing 多进程教程系列 https://morvanzhou.github.io/tutorials/python-basic/multiprocessing/
11 Linux 简易教学 https://morvanzhou.github.io/tutorials/others/linux-basic/
12 关于莫烦 https://morvanzhou.github.io/about/
13 近期更新 https://morvanzhou.github.io/recent-posts/
14 高级爬虫: 让 Selenium 控制你的浏览器帮你爬 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/5-01-selenium/
15 进化算法 Evolutionary Strategies 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/evolutionary-algorithm/
16 Theano 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/theano/
17 Threading 多线程教程系列 https://morvanzhou.github.io/tutorials/python-basic/threading/
18 说吧~ https://morvanzhou.github.io/discuss/
19 有趣的机器学习系列 https://morvanzhou.github.io/tutorials/machine-learning/ML-intro/
20 Python基础 教程系列 https://morvanzhou.github.io/tutorials/python-basic/
21 Keras 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/keras/
22 Pytorch 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/torch/
23 推荐学习顺序 https://morvanzhou.github.io/learning-steps/
24 Git 版本管理 教程系列 https://morvanzhou.github.io/tutorials/others/git/
25 Tkinter GUI 教程系列 https://morvanzhou.github.io/tutorials/python-basic/tkinter/
26 其他教学系列 https://morvanzhou.github.io/tutorials/others/
27 基础教程系列 https://morvanzhou.github.io/tutorials/python-basic/basic/
28 Tensorflow 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/
29 迁移学习 Transfer Learning https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/5-16-transfer-learning/
30 Numpy & Pandas 教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/np-pd/
31 Sklearn 通用机器学习 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/sklearn/
32 强化学习 Reinforcement Learning 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/
33 迁移学习 Transfer Learning https://morvanzhou.github.io/tutorials/machine-learning/ML-intro/2-9-transfer-learning/
Total time: 19.8 s

测试分布式爬法

unseen = set([base_url,])
seen = set()

pool = mp.Pool(4)                       
count, t1 = 1, time.time()
while len(unseen) != 0:                 # still get some url to visit
    if restricted_crawl and len(seen) > 20:
            break
    print('\nDistributed Crawling...')
    crawl_jobs = [pool.apply_async(crawl, args=(url,)) for url in unseen]
    htmls = [j.get() for j in crawl_jobs]                                       # request connection

    print('\nDistributed Parsing...')
    parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
    results = [j.get() for j in parse_jobs]                                     # parse html

    print('\nAnalysing...')
    seen.update(unseen)         # seen the crawled
    unseen.clear()              # nothing unseen

    for title, page_urls, url in results:
        print(count, title, url)
        count += 1
        unseen.update(page_urls - seen)     # get new url to crawl
print('Total time: %.1f s' % (time.time()-t1, ))    # 16 s !!!

Distributed Crawling...

Distributed Parsing...

Analysing...
1 教程 https://morvanzhou.github.io/

Distributed Crawling...

Distributed Parsing...

Analysing...
2 数据处理教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/
3 Matplotlib 画图教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/plt/
4 机器学习系列 https://morvanzhou.github.io/tutorials/machine-learning/
5 为了更优秀 https://morvanzhou.github.io/support/
6 高级爬虫: 高效无忧的 Scrapy 爬虫库 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/5-02-scrapy/
7 机器学习实践 https://morvanzhou.github.io/tutorials/machine-learning/ML-practice/
8 网页爬虫教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/
9 Why? https://morvanzhou.github.io/tutorials/data-manipulation/scraping/1-00-why/
10 multiprocessing 多进程教程系列 https://morvanzhou.github.io/tutorials/python-basic/multiprocessing/
11 Linux 简易教学 https://morvanzhou.github.io/tutorials/others/linux-basic/
12 关于莫烦 https://morvanzhou.github.io/about/
13 近期更新 https://morvanzhou.github.io/recent-posts/
14 高级爬虫: 让 Selenium 控制你的浏览器帮你爬 https://morvanzhou.github.io/tutorials/data-manipulation/scraping/5-01-selenium/
15 进化算法 Evolutionary Strategies 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/evolutionary-algorithm/
16 Theano 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/theano/
17 Threading 多线程教程系列 https://morvanzhou.github.io/tutorials/python-basic/threading/
18 说吧~ https://morvanzhou.github.io/discuss/
19 有趣的机器学习系列 https://morvanzhou.github.io/tutorials/machine-learning/ML-intro/
20 Python基础 教程系列 https://morvanzhou.github.io/tutorials/python-basic/
21 Keras 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/keras/
22 Pytorch 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/torch/
23 推荐学习顺序 https://morvanzhou.github.io/learning-steps/
24 Git 版本管理 教程系列 https://morvanzhou.github.io/tutorials/others/git/
25 Tkinter GUI 教程系列 https://morvanzhou.github.io/tutorials/python-basic/tkinter/
26 其他教学系列 https://morvanzhou.github.io/tutorials/others/
27 基础教程系列 https://morvanzhou.github.io/tutorials/python-basic/basic/
28 Tensorflow 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/
29 迁移学习 Transfer Learning https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/5-16-transfer-learning/
30 Numpy & Pandas 教程系列 https://morvanzhou.github.io/tutorials/data-manipulation/np-pd/
31 Sklearn 通用机器学习 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/sklearn/
32 强化学习 Reinforcement Learning 教程系列 https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/
33 迁移学习 Transfer Learning https://morvanzhou.github.io/tutorials/machine-learning/ML-intro/2-9-transfer-learning/
Total time: 7.8 s

异步加载Asyncio

基本用法

# asyncio: 在单线程里使用异步计算,下载网页的时候和处理网页的时候是不连续的
# 更有效利用了等待下载的这段时间
import time

def job(t):
    print('Start job',t)
    time.sleep(t)
    print('Job',t,' takes',t,'s')

def main():
    [job(t) for t in range(1,3)]

t1=time.time()
main()
print('No async total time: ',time.time()-t1)

Start job 1
Job 1  takes 1 s
Start job 2
Job 2  takes 2 s
No async total time:  3.00662899017334

import asyncio

async def job(t): #async形式的功能
    print('Start job',t)
    await asyncio.sleep(t)  #等待t秒,期间切换其他任务
    print('Job',t,' takes',t,'s')

async def main(loop): #async形式的功能
    tasks=[loop.create_task(job(t)) for t in range(1,3)] #创建任务,但不执行
    await asyncio.wait(tasks) #执行并等待所有任务完成

t1=time.time()
loop=asyncio.get_event_loop() #建立loop
loop.run_until_complete(main(loop)) #执行loop
print("Async total time:", time.time()-t1)

Start job 1
Start job 2
Job 1  takes 1 s
Job 2  takes 2 s
Async total time: 2.0041818618774414

aiohttp

$ pip3 install aiohttp

import requests

URL = 'https://morvanzhou.github.io/'

def normal():
    for i in range(2):
        r=requests.get(URL)
        url=r.url
        print(url)

t1=time.time()
normal()
print('Normal total time:',time.time()-t1)

https://morvanzhou.github.io/
https://morvanzhou.github.io/
Normal total time: 0.6391615867614746

import aiohttp

async def job(session):
    response=await session.get(URL) #等待并切换
    return str(response.url)

async def main(loop):
    async with aiohttp.ClientSession() as session: #官网推荐建立Session的形式
        tasks=[loop.create_task(job(session)) for _ in range(2)] 
        finished,unfinished=await asyncio.wait(tasks)
        all_results=[r.result() for r in finished] #获取所有结果
        print(all_results)

t1=time.time()
loop=asyncio.get_event_loop()
loop.run_until_complete(main(loop))
loop.close()
print('Async total time:',time.time()-t1)

['https://morvanzhou.github.io/', 'https://morvanzhou.github.io/']
Async total time: 0.30881452560424805

和多进程分布式爬虫对比

%E5%9B%BE%E7%89%87.png

import aiohttp
import asyncio
import time
from bs4 import BeautifulSoup
from urllib.request import urljoin
import re
import multiprocessing as mp

base_url = "https://morvanzhou.github.io/"

# DON'T OVER CRAWL THE WEBSITE OR YOU MAY NEVER VISIT AGAIN
if base_url != "http://127.0.0.1:4000/":
    restricted_crawl = True
else:
    restricted_crawl = False


seen = set()
unseen = set([base_url])


def parse(html):
    soup = BeautifulSoup(html, 'lxml')
    urls = soup.find_all('a', {"href": re.compile('^/.+?/$')})
    title = soup.find('h1').get_text().strip()
    page_urls = set([urljoin(base_url, url['href']) for url in urls])
    url = soup.find('meta', {'property': "og:url"})['content']
    return title, page_urls, url


async def crawl(url, session):
    r = await session.get(url)
    html = await r.text()
    await asyncio.sleep(0.1)        # slightly delay for downloading
    return html


async def main(loop):
    pool = mp.Pool(8)               # slightly affected
    async with aiohttp.ClientSession() as session:
        count = 1
        while len(unseen) != 0:
            print('\nAsync Crawling...')
            tasks = [loop.create_task(crawl(url, session)) for url in unseen]
            finished, unfinished = await asyncio.wait(tasks)
            htmls = [f.result() for f in finished]

            print('\nDistributed Parsing...')
            parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
            results = [j.get() for j in parse_jobs]

            print('\nAnalysing...')
            seen.update(unseen)
            unseen.clear()
            for title, page_urls, url in results:
                # print(count, title, url)
                unseen.update(page_urls - seen)
                count += 1

if __name__ == "__main__":
    t1 = time.time()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(loop))
    #loop.close()
    print("Async total time: ", time.time() - t1)

---------------------------------------------------------------------------

RuntimeError                              Traceback (most recent call last)

 in ()
     61     t1 = time.time()
     62     loop = asyncio.get_event_loop()
---> 63     loop.run_until_complete(main(loop))
     64     loop.close()
     65     print("Async total time: ", time.time() - t1)


~/anaconda2/envs/python35/lib/python3.5/asyncio/base_events.py in run_until_complete(self, future)
    441         Return the Future's result, or raise its exception.
    442         """
--> 443         self._check_closed()
    444 
    445         new_task = not futures.isfuture(future)


~/anaconda2/envs/python35/lib/python3.5/asyncio/base_events.py in _check_closed(self)
    355     def _check_closed(self):
    356         if self._closed:
--> 357             raise RuntimeError('Event loop is closed')
    358 
    359     def _asyncgen_finalizer_hook(self, agen):


RuntimeError: Event loop is closed

from urllib.request import urlopen, urljoin
from bs4 import BeautifulSoup
import multiprocessing as mp
import re
import time


def crawl(url):
    response = urlopen(url)
    time.sleep(0.1)             # slightly delay for downloading
    return response.read().decode()


def parse(html):
    soup = BeautifulSoup(html, 'lxml')
    urls = soup.find_all('a', {"href": re.compile('^/.+?/$')})
    title = soup.find('h1').get_text().strip()
    page_urls = set([urljoin(base_url, url['href']) for url in urls])
    url = soup.find('meta', {'property': "og:url"})['content']
    return title, page_urls, url


if __name__ == '__main__':
    base_url = 'https://morvanzhou.github.io/'
    #base_url = "http://127.0.0.1:4000/"

    # DON'T OVER CRAWL THE WEBSITE OR YOU MAY NEVER VISIT AGAIN
    if base_url != "http://127.0.0.1:4000/":
        restricted_crawl = True
    else:
        restricted_crawl = False

    unseen = set([base_url,])
    seen = set()

    pool = mp.Pool(8)                       # number strongly affected
    count, t1 = 1, time.time()
    while len(unseen) != 0:                 # still get some url to visit
        if restricted_crawl and len(seen) > 20:
            break
        print('\nDistributed Crawling...')
        crawl_jobs = [pool.apply_async(crawl, args=(url,)) for url in unseen]
        htmls = [j.get() for j in crawl_jobs]                                       # request connection
        htmls = [h for h in htmls if h is not None]     # remove None

        print('\nDistributed Parsing...')
        parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
        results = [j.get() for j in parse_jobs]                                     # parse html

        print('\nAnalysing...')
        seen.update(unseen)
        unseen.clear()

        for title, page_urls, url in results:
            # print(count, title, url)
            count += 1
            unseen.update(page_urls - seen)

    print('Total time: %.1f s' % (time.time()-t1, ))

高级爬虫

高级爬虫:让Selenium控制你的浏览器帮你爬

Selenium: 能控制你的浏览器,有模有样地学人类”看”网页

什么时候用到Selenium:
- 发现用普通方法爬不到想要的内容
- 网站跟你玩”捉迷藏”,太多JavaScript内容
- 需要像人一样浏览的爬虫

安装Selenium

$ pip install selenium #python 2+

$ pip install selenium #python 3+

下载针对Linux和macOS的driver:

Chrome driver
Edge driver
Firefox driver
Safari driver

Linux和MacOS用户下载好之后, 需要将下载好的”geckodriver”文件放在你的计算机的”/usr/bin”或者”/usr/local/bin”目录, 在终端执行:

$ sudo cp 你的geckodriver位置 /usr/local/bin

$ sudo chmod +x /usr/local/bin/geckodriver

下载火狐浏览器插件Katalon Recorder,记录重复性工作

点击插件, 开始各种点击工作, 每当点击时, 插件就会记录下这些点击, 形成一些log, 最后点击Export按钮, 观看帮你生成的浏览记录代码, 可以输出Python2版本的代码, 复制你需要的代码.

import os

os.makedirs('./images/',exist_ok=True)

from selenium import webdriver

driver=webdriver.Firefox() #打开火狐浏览器

# 用Katalon复制的内容
driver.get("https://morvanzhou.github.io/")
driver.find_element_by_xpath(u"//img[@alt='强化学习 (Reinforcement Learning)']").click()
driver.find_element_by_link_text("About").click()
driver.find_element_by_link_text(u"赞助").click()
driver.find_element_by_link_text(u"教程 ▾").click()
driver.find_element_by_link_text(u"数据处理 ▾").click()
driver.find_element_by_link_text(u"网页爬虫").click()

# 得到网页html,还能截图
html=driver.page_source
driver.get_screenshot_as_file('./images/sreenshot1.png')
driver.close()
print(html[:200])


  
  
   
  # 让selenium不弹出浏览器窗口,让它安静地执行这些操作
from selenium.webdriver.firefox.options import Options

firefox_options=Options()
firefox_options.add_argument('--headless') #define headless

driver=webdriver.Firefox(firefox_options=firefox_options)

# 用Katalon复制的内容
driver.get("https://morvanzhou.github.io/")
driver.find_element_by_xpath(u"//img[@alt='强化学习 (Reinforcement Learning)']").click()
driver.find_element_by_link_text("About").click()
driver.find_element_by_link_text(u"赞助").click()
driver.find_element_by_link_text(u"教程 ▾").click()
driver.find_element_by_link_text(u"数据处理 ▾").click()
driver.find_element_by_link_text(u"网页爬虫").click()

# 得到网页html,还能截图
html=driver.page_source
driver.get_screenshot_as_file('./images/sreenshot2.png')
driver.close()
print(html[:200]) 
  
  
  
  高级爬虫: 高效无忧的Scrapy爬虫库 
  $ pip3 install scrapy 
  Scrapy是一个爬虫的框架, 而不是一个简单的爬虫, 它整合了爬取, 处理数据, 存储数据的一条龙服务. 
  import scrapy

class MofanSpider(scrapy.Spider):
    name = "mofan"
    start_urls = [
        'https://morvanzhou.github.io/',
    ]
    # unseen = set()
    # seen = set()      # we don't need these two as scrapy will deal with them automatically

    def parse(self, response):
        yield {     # return some results
            'title': response.css('h1::text').extract_first(default='Missing').strip().replace('"', ""),
            'url': response.url,
        }

        urls = response.css('a::attr(href)').re(r'^/.+?/$')     # find all sub urls
        for url in urls:
            yield response.follow(url, callback=self.parse)     # it will filter duplication automatically


# lastly, run this in terminal
# scrapy runspider 5-2-scrapy.py -o res.json

2021-05-21 python中curses基本用法 zerfew python curses cli
有时候linux系统没有界面，可能需要在terminal终端完成程序的交互和状态显示，C语言的ncurses支持命令行界面程序开发，curses是基于ncurses实现的python终端界面库。本文实现一个简单的demo小程序，方便初学者学习和使用。首先上demo主程序#-*-coding:UTF-8-*-importcursesimportlocalelocale.setlocale(local
Fatal Python error: initfsencoding: unable to load the file system codec 珞珈山小裁缝11-8 python
FatalPythonerror:initfsencoding:unabletoloadthefilesystemcodecModuleNotFoundError:Nomodulenamed'encodings'Currentthread0x00007668(mostrecentcallfirst):问题原因：python路径错误我是直接运行exe程序（几乎没有配置python环境），我的pyth
Python中三种表示NA的方式风语者666 python
Python中三种表示NA的方式#-*-coding:utf-8-*-importnumpyasnpimportpandasaspd#data_frame=np.load('a.npy',allow_pickle=True)#print(data_frame.columns)df=pd.DataFrame({'one':[1,2,3,pd.NA]})df=pd.DataFrame({'one':[
DeepSeek面试——分词算法 mzgong 人工智能算法
DeepSeek-V3分词算法一、核心算法：字节级BPE（Byte-levelBPE，BBPE）DeepSeek-V3采用字节级BPE（BBPE）作为核心分词算法，这是对传统BPE（BytePairEncoding）算法的改进版本。其核心原理是将文本分解为字节（Byte）序列，通过统计高频相邻字节对的共现频率进行逐层合并，最终形成128K扩展词表。二、BBPE的核心优势1.多语言统一处理能力跨语言
北本海硕腾讯二面没过，该如何准备才能再战大厂？程序员yt 面试
今天给大家分享的是一位粉丝的提问，北本海硕腾讯二面没过，该如何准备才能再战大厂？接下来把粉丝的具体提问和我的回复分享给大家，希望也能给一些类似情况的小伙伴一些启发和帮助。同学提问：bg北本海硕两段大厂实习，感觉自己bg其实不错但问题是实习干的都是些crud当时也没意识记录一些更深刻的技术或者架构上的问题，八股又欠背(大致看完了小林codinga和csguilde的c++指北但实际中延伸出来的问题太
HMML——3D AI Coding的基础语言 AIGC5D-Longan 人工智能
编程语言（如Python、Java、C++等），作为2D编程的语言，也是AI开发的主力工具。2D编程语言内容呈现和交互，与3D世界、物理世界的高维复杂性之间的割裂日益凸显。HMML（超多元空间标记语言HyperMultspaceMarkupLanguage），是新的3D编程语言，也是3DAICoding的基础语言。3DAICoding的诞生，标志编程语言首次实现与人类多维认知的深度对齐。通过HMM
python/R 连接 clickhouse weixin_41283198 python clickhouse r语言 python 大数据 r语言
1、python-clickhouseimportnumpyasnpfromclickhouse_driverimportClientimportpandasaspdsql=open('/opt/check_detect_local.sql','r',encoding='utf8')sqltxt=sql.readlines()print(len(sqltxt))sqls=[]foriinnp.ar
爬虫中一些有用的用法才不是小emo的小杨爬虫 xpath
文本和标签在一个级别下如果文本和a标签在一个级别下比如：#获取a标签后的第一个文本节点text_node=a.xpath('following-sibling::text()[1]')[0].strip()将xpath的html代码转换成字符串etree.tostring(root,pretty_print=True,encoding="utf-8")获取所有同级标签的最后一个data_list=
python爬虫网络中断_如何解决Python爬虫中的网络掉线问题？ weixin_39767645 python爬虫网络中断
在学校里的时候，除了上课，还有一大幸福的事情，就是用着学校的网线网络。当然玩的时候很开心，就是没事关键词时刻掉链子。时不时地网络掉线让人非常恼火，什么团战在梦游啊，看剧卡住不动了，相信能引起很多小伙伴的共鸣。所以，为了大家的快乐，小编找到了一个解决办法，分享给大家。以山东大学网络为例，别的话不多说，直接上程序__author__='CQC'#-*-coding:utf-8-*-importurll
解决问题：Android Studio启动不了 piggy514 android studio android ide
1、启动不了之前的操作：build报错的提示的都是乱码，于是网上搜了下，去菜单Help>EditCustomVMoptions此时AS打开了AS安装目录下bin/studio64.exe.vmoptions这个文件根据网上说法在里边加一句-Dfile.encoding=UTF-8即可，于是加了，结果AS闪退。再也启动不了，重启电脑后也不行，重新安装也不行。看来有时不要轻信网上操作。怎么解决乱码问题
C++实现哈夫曼编码的技术详解金外飞176 算法 c++开发语言
C++实现哈夫曼编码的技术详解哈夫曼编码（HuffmanCoding）是一种基于字符出现频率的无损数据压缩算法，由DavidA.Huffman在1952年提出。它通过构建最优二叉树（哈夫曼树）为字符分配变长编码，使得高频字符使用较短的编码，低频字符使用较长的编码，从而实现数据的高效压缩。本文将详细介绍哈夫曼编码的原理，并通过C++代码实现其核心功能。1.哈夫曼编码的基本原理哈夫曼编码的核心思想是贪
基于python 利用ERA5 资料绘制水汽剖面图 happycatherin python numpy matplotlib
#-*-coding:utf-8-*-"""CreatedonMonApr309:28:072023@author:PC"""#-*-coding:utf-8-*-"""CreatedonMonJul1116:54:302022@author:PC"""importcartopy.crsasccrsimportcartopy.featureascfeatureimportmatplotlib.py
python中的文件操作 Mswanga python python 开发语言
1.创建文件python中使用open()函数创建或者打开文件，语法格式：open(file,mode='r',buffering=-1,encoding=None,errors=None,newline=None,closefd=True,opener=None)file：表示要打开的文件的路径，也可以是被封装的整数类型文件描述符mode：用于指定文件的打开模式，默认是’r‘（以文本模式打开并且
20250310：OpenCV mat对象与base64互转微风❤水墨 AI模型部署 Mat转base64
代码：https://github.com/ReneNyffenegger/cpp-base64指南：https://renenyffenegger.ch/notes/development/Base64/Encoding-and-decoding-base-64-with-cpp/实操：
python远程连接mysql数据库_python远程连接MySQL数据库 weixin_39528697
python远程连接MySQL数据库本文实例为大家分享了python远程连接MySQL数据库的具体代码，供大家参考，具体内容如下连接数据库这里默认大家都已经配置安装好MySQL和Python的MySQL模块，且默认大家的DB内表和访问账号权限均已设置无误，下面直接代码演示：#-*-coding:utf-8-*-"""CreatedonFriDec3010:43:352016@author:zhen
视频文件的几个关键参数 buleideli Android Camera camera android
参数的解释VideoEncodingBitRate（视频编码比特率）比特率是指每秒钟视频使用的数据量，通常以bps（bitspersecond）为单位。比特率越高，视频质量越好，但同时也会导致文件体积增大。比特率直接影响视频文件的大小，是影响最大的因素之一。高比特率意味着更高的画质和更大的文件尺寸。VideoFrameRate（视频帧率）帧率指的是每秒钟显示的图像帧数，通常用fps（framesp
requests入门以及requests库实例和with,os的解释（Python网络爬虫和信息提取）眸生 Python爬虫 python 爬虫开发语言笔记
导学定向网络数据爬取和网页解析的基本能力requests入门安装方法首先cmdpipinstallrequests然后打开idle测试**>>>importrequests>>>r=requests.get("http://www.baidu.com")>>>r.status_code200>>>r.encoding='utf-8'>>>r.text**requests库的7个主要方法reques
python之使用scapy扫描本机局域网主机，输出IP/MAC表敲键盘的Q python
安装scapy库pipinstallscapy-ihttps://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple扫描本机局域网的所有主机，输出IP/MAC对于表#-*-coding:UTF-8-*-importnetifacesfromscapy.allimportsrpfromscapy.layers.l2importARP,Etherimportipa
效果媲美Cursor的开源替代：Roo-Cline 开源项目精选 github 开源
Roo-Cline是一个基于Cline的增强版AICoding扩展插件。作者在Cline项目的基础上做了能力增强，并增加了多项实验性功能，包括多语言支持、图片拖放、消息管理等创新特性。作为一个开源项目，它在保持原版Cline所有核心功能的同时，通过持续的社区贡献不断改进和扩展其功能集。Stars数2353Forks数177主要特点命令、写入、浏览器操作的自动审批功能支持每个项目的.clinerul
Python解析PDF：支持本地/在线文档的解析、提取文本及表格信息（采用pdfplumber包）二师父 #Python【文件相关】python
番外话被PDF折磨了两年多，今天终于找到一个比较好的解析方案，是用pdfplumber包解析的，并做了一些封装。之前用pdfminer解析的效果很一般，提取效果无法忍受的那种。把解析方法写出来后，我自己也是老泪纵横，给大家分享一下注意事项pdfplumber是对pdfminer的再封装，最好先安装pdfminer3k，再安装pdfplumber，否则代码很容易报错源码#-*-coding:utf-
携程开源的分布式apollo技术，整合springboot集成实现动态刷新配置 2401_84584854 程序员 java 面试学习
最后这份文档从构建一个键值数据库的关键架构入手，不仅带你建立起全局观，还帮你迅速抓住核心主线。除此之外，还会具体讲解数据结构、线程模型、网络框架、持久化、主从同步和切片集群等，帮你搞懂底层原理。相信这对于所有层次的Redis使用者都是一份非常完美的教程了。整理不易，觉得有帮助的朋友可以帮忙点赞分享支持一下小编~你的支持，我的动力；祝各位前程似锦，offer不断！！！本文已被CODING开源项目：【
Python爬虫实战010：反爬取机制学习若北辰 Python爬虫教程 python 爬虫开发语言
#-*-coding:utf-8-*-"""@ModuleName:demo_001@Function:@Author:@Time:2020/12/28上午11:21"""fromlxmlimportetreeimportpandasaspdimportreimportrandomimporturllibimportrequestsimporttimeimportosimportjson
python动态SQL并执行查询 IT-例子 python python sql 数据库
python动态SQL#coding=utf-8#sqlserver的连接importpymssqlimportdatetimeimporttimestart=time.perf_counter()print('程序正在运行,请稍等...')print("数据读取中...")today_now=datetime.datetime.now()print("现在时间是：",today_now)"""这
【LLM】从零开始实现 LLaMA3 FOUR_A LLM 人工智能机器学习大模型 llama 算法
分词器在这里，我们不会实现一个BPE分词器（但AndrejKarpathy有一个非常简洁的实现）。BPE（BytePairEncoding，字节对编码）是一种数据压缩算法，也被用于自然语言处理中的分词方法。它通过逐步将常见的字符或子词组合成更长的词元（tokens），从而有效地表示文本中的词汇。在自然语言处理中的BPE分词器的工作原理如下：初始化：首先，将所有词汇表中的单词分解为单个字符或符号。例
数字IP转换成字符串IP 故事里故去 C#C#字符串处理时间性能 IP地址构造字节操作
DateTimelulu=DateTime.Now;byte[][]data=newbyte[256][];for(inti=0;i<256;i++){data[i]=Encoding.Default.GetBytes("."+i.ToString());}byte[]buff1=newbyte[4];buff1[0]=230;buff1[1]=220;buff1[2]=123;buff1[3]=
python系列【仅供参考】：python3 生成pdf 中文乱码问题处理坦笑&&life #python python pdf 开发语言
python3生成pdf中文乱码问题处理python3生成pdf中文乱码问题处理1.首先上代码：2.乱码原因：3.安装字体库4.找一台安装了中文字体的服务器python3生成pdf中文乱码问题处理1.首先上代码：importpdfkit#urlPath是待导出的链接pdfkit.from_url(urlPath,'test.pdf',options={'encoding':'UTF-8'
python爬虫（7）爬虫实例（3）丁叔叔爬虫实例
#-*-coding:utf-8-*-importrequestsimportosfromlxmlimportetree#解析库XPath#在本地建立一个文件夹，命名为pic_truck，用于存放下载的图片folder='pic_truck'ifnotos.path.exists(folder):os.makedirs(folder)#定义下载函数，用于下载图片defdownload(url):r
Python爬虫之爬取酷狗音乐进击的Loser‭
Python爬虫之爬取酷狗音乐废话不说，上代码：#!Python#-*-encoding:utf-8-*-'''1.文件名称:酷我音乐爬虫.py2.创建时间:2021/03/2117:29:093.作者名称:ZAY4.Python版本:3.7.0'''importosimportgetpassimportrequestsfromurllib.parseimportquoteclassSpider(
java.sql.SQLNonTransientConnectionException: Public Key Retrieval is not allowed 二十七剑 java 开发语言
只需要在url:jdbc:mysql://xxx?serverTimezone=Asia/Shanghai&useUnicode=true&characterEncoding=utf8&useSSL=false后面加上&allowPublicKeyRetrieval=true即url:jdbc:mysql://xxx?serverTimezone=Asia/Shanghai&useUnicode=
接口自动化如何封装mysql操作天才测试猿 mysql python 软件测试测试工具测试用例自动化测试数据库
数据查询类封装1.功能分析可以连接不同sql数据库查一条数据，多条数据可以获取不同格式的数据2.封装成数据库查询类封装思路:数据库查询模块有多个功能，且需要复用，所以封装成类创建对象方法实现各种查询在构造方法中创建连接废话不多说，直接上代码！！！#-*-coding:utf-8-*-#@Time:2019/11/1314:51#@Author:kira#@Email:[email protected]
Java实现的基于模板的网页结构化信息精准抽取组件：HtmlExtractor yangshangchuan 信息抽取 HtmlExtractor 精准抽取信息采集
HtmlExtractor是一个Java实现的基于模板的网页结构化信息精准抽取组件，本身并不包含爬虫功能，但可被爬虫或其他程序调用以便更精准地对网页结构化信息进行抽取。 HtmlExtractor是为大规模分布式环境设计的，采用主从架构，主节点负责维护抽取规则，从节点向主节点请求抽取规则，当抽取规则发生变化，主节点主动通知从节点，从而能实现抽取规则变化之后的实时动态生效。如
java编程思想 -- 多态百合不是茶 java 多态详解
一: 向上转型和向下转型面向对象中的转型只会发生在有继承关系的子类和父类中（接口的实现也包括在这里）。父类：人子类：男人向上转型： Person p = new Man() ; //向上转型不需要强制类型转化向下转型： Man man =
[自动数据处理]稳扎稳打,逐步形成自有ADP系统体系 comsci dp
对于国内的IT行业来讲,虽然我们已经有了"两弹一星",在局部领域形成了自己独有的技术特征,并初步摆脱了国外的控制...但是前面的路还很长.... 首先是我们的自动数据处理系统还无法处理很多高级工程...中等规模的拓扑分析系统也没有完成,更加复杂的
storm 自定义日志文件商人shang storm cluster logback
Storm中的日志级级别默认为INFO，并且，日志文件是根据worker号来进行区分的，这样，同一个log文件中的信息不一定是一个业务的，这样就会有以下两个需求出现： 1. 想要进行一些调试信息的输出 2. 调试信息或者业务日志信息想要输出到一些固定的文件中不要怕，不要烦恼，其实Storm已经提供了这样的支持，可以通过自定义logback 下的 cluster.xml 来输
Extjs3 SpringMVC使用 @RequestBody 标签问题记录 21jhf
springMVC使用 @RequestBody(required = false) UserVO userInfo 传递json对象数据，往往会出现http 415，400,500等错误，总结一下需要使用ajax提交json数据才行，ajax提交使用proxy，参数为jsonData，不能为params；另外，需要设置Content-type属性为json，代码如下：（由于使用了父类aaa
一些排错方法文强chu 方法
1、java.lang.IllegalStateException: Class invariant violation at org.apache.log4j.LogManager.getLoggerRepository(LogManager.java:199)at org.apache.log4j.LogManager.getLogger(LogManager.java:228) at o
Swing中文件恢复我觉得很难小桔子 swing
我那个草了！老大怎么回事，怎么做项目评估的？只会说相信你可以做的，试一下，有的是时间！用java开发一个图文处理工具，类似word，任意位置插入、拖动、删除图片以及文本等。文本框、流程图等，数据保存数据库，其余可保存pdf格式。ok,姐姐千辛万苦，
php 文件操作 aichenglong PHP 读取文件写入文件
1 写入文件 @$fp=fopen("$DOCUMENT_ROOT/order.txt", "ab"); if(!$fp){ echo "open file error" ; exit; } $outputstring="date:"." \t tire:".$tire."
MySQL的btree索引和hash索引的区别 AILIKES 数据结构 mysql 算法
Hash 索引结构的特殊性，其检索效率非常高，索引的检索可以一次定位，不像B-Tree 索引需要从根节点到枝节点，最后才能访问到页节点这样多次的IO访问，所以 Hash 索引的查询效率要远高于 B-Tree 索引。可能很多人又有疑问了，既然 Hash 索引的效率要比 B-Tree 高很多，为什么大家不都用 Hash 索引而还要使用 B-Tree 索引呢
JAVA的抽象--- 接口 --实现百合不是茶
抽象接口实现接口 //抽象类 ,方法 //定义一个公共抽象的类 ,并在类中定义一个抽象的方法体抽象的定义使用abstract abstract class A 定义一个抽象类例如： //定义一个基类 public abstract class A{ //抽象类不能用来实例化，只能用来继承 //
JS变量作用域实例 bijian1013 作用域
<script> var scope='hello'; function a(){ console.log(scope); //undefined var scope='world'; console.log(scope); //world console.log(b);
TDD实践（二） bijian1013 java TDD
实践题目：分解质因数 Step1：单元测试： package com.bijian.study.factor.test; import java.util.Arrays; import junit.framework.Assert; import org.junit.Before; import org.junit.Test; import com.bijian.
[MongoDB学习笔记一]MongoDB主从复制 bit1129 mongodb
MongoDB称为分布式数据库，主要原因是1.基于副本集的数据备份， 2.基于切片的数据扩容。副本集解决数据的读写性能问题，切片解决了MongoDB的数据扩容问题。事实上，MongoDB提供了主从复制和副本复制两种备份方式，在MongoDB的主从复制和副本复制集群环境中，只有一台作为主服务器，另外一台或者多台服务器作为从服务器。本文介绍MongoDB的主从复制模式，需要指明
【HBase五】Java API操作HBase bit1129 hbase
import java.io.IOException; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.hbase.HBaseConfiguration; import org.apache.hadoop.hbase.HColumnDescriptor; import org.apache.ha
python调用zabbix api接口实时展示数据 ronin47
zabbix api接口来进行展示。经过思考之后，计划获取如下内容： 1、获得认证密钥 2、获取zabbix所有的主机组 3、获取单个组下的所有主机 4、获取某个主机下的所有监控项
jsp取得绝对路径 byalias 绝对路径
在JavaWeb开发中，常使用绝对路径的方式来引入JavaScript和CSS文件，这样可以避免因为目录变动导致引入文件找不到的情况，常用的做法如下：一、使用${pageContext.request.contextPath} 　　代码” ${pageContext.request.contextPath}”的作用是取出部署的应用程序名，这样不管如何部署，所用路径都是正确的。
Java定时任务调度：用ExecutorService取代Timer bylijinnan java
《Java并发编程实战》一书提到的用ExecutorService取代Java Timer有几个理由，我认为其中最重要的理由是：如果TimerTask抛出未检查的异常，Timer将会产生无法预料的行为。Timer线程并不捕获异常，所以 TimerTask抛出的未检查的异常会终止timer线程。这种情况下，Timer也不会再重新恢复线程的执行了;它错误的认为整个Timer都被取消了。此时，已经被
SQL 优化原则 chicony sql
一、问题的提出　在应用系统开发初期，由于开发数据库数据比较少，对于查询SQL语句，复杂视图的的编写等体会不出SQL语句各种写法的性能优劣，但是如果将应用系统提交实际应用后，随着数据库中数据的增加，系统的响应速度就成为目前系统需要解决的最主要的问题之一。系统优化中一个很重要的方面就是SQL语句的优化。对于海量数据，劣质SQL语句和优质SQL语句之间的速度差别可以达到上百倍，可见对于一个系统
java 线程弹球小游戏 CrazyMizzz java 游戏
最近java学到线程，于是做了一个线程弹球的小游戏，不过还没完善这里是提纲 1.线程弹球游戏实现 1.实现界面需要使用哪些API类 JFrame JPanel JButton FlowLayout Graphics2D Thread Color ActionListener ActionEvent MouseListener Mouse
hadoop jps出现process information unavailable提示解决办法 daizj hadoop jps
hadoop jps出现process information unavailable提示解决办法 jps时出现如下信息： 3019 -- process information unavailable3053 -- process information unavailable2985 -- process information unavailable2917 --
PHP图片水印缩放类实现 dcj3sjt126com PHP
<?php class Image{ private $path; function __construct($path='./'){ $this->path=rtrim($path,'/').'/'; } //水印函数，参数：背景图，水印图，位置，前缀,TMD透明度 public function water($b,$l,$pos
IOS控件学习：UILabel常用属性与用法 dcj3sjt126com ios UILabel
参考网站： http://shijue.me/show_text/521c396a8ddf876566000007 http://www.tuicool.com/articles/zquENb http://blog.csdn.net/a451493485/article/details/9454695 http://wiki.eoe.cn/page/iOS_pptl_artile_281
完全手动建立maven骨架 eksliang java eclipse Web
建一个 JAVA 项目： mvn archetype:create -DgroupId=com.demo -DartifactId=App [-Dversion=0.0.1-SNAPSHOT] [-Dpackaging=jar] 建一个 web 项目： mvn archetype:create -DgroupId=com.demo -DartifactId=web-a
配置清单 gengzg 配置
1、修改grub启动的内核版本 vi /boot/grub/grub.conf 将default 0改为1 拷贝mt7601Usta.ko到/lib文件夹拷贝RT2870STA.dat到 /etc/Wireless/RT2870STA/文件夹拷贝wifiscan到bin文件夹，chmod 775 /bin/wifiscan 拷贝wifiget.sh到bin文件夹，chm
Windows端口被占用处理方法 huqiji windows
以下文章主要以80端口号为例，如果想知道其他的端口号也可以使用该方法..........................1、在windows下如何查看80端口占用情况?是被哪个进程占用?如何终止等. 这里主要是用到windows下的DOS工具,点击"开始"--"运行",输入&
开源ckplayer 网页播放器，跨平台(html5, mobile)，flv, f4v, mp4, rtmp协议. webm, ogg, m3u8 ！天梯梦 mobile
CKplayer，其全称为超酷flv播放器，它是一款用于网页上播放视频的软件，支持的格式有：http协议上的flv,f4v,mp4格式，同时支持rtmp视频流格式播放，此播放器的特点在于用户可以自己定义播放器的风格，诸如播放/暂停按钮，静音按钮，全屏按钮都是以外部图片接口形式调用，用户根据自己的需要制作出播放器风格所需要使用的各个按钮图片然后替换掉原始风格里相应的图片就可以制作出自己的风格了，
简单工厂设计模式 hm4123660 java 工厂设计模式简单工厂模式
简单工厂模式（Simple Factory Pattern）属于类的创新型模式，又叫静态工厂方法模式。是通过专门定义一个类来负责创建其他类的实例，被创建的实例通常都具有共同的父类。简单工厂模式是由一个工厂对象决定创建出哪一种产品类的实例。简单工厂模式是工厂模式家族中最简单实用的模式，可以理解为是不同工厂模式的一个特殊实现。
maven笔记 zhb8015 maven
跳过测试阶段： mvn package -DskipTests 临时性跳过测试代码的编译： mvn package -Dmaven.test.skip=true maven.test.skip同时控制maven-compiler-plugin和maven-surefire-plugin两个插件的行为，即跳过编译，又跳过测试。指定测试类 mvn test
非mapreduce生成Hfile，然后导入hbase当中 Stark_Summer map hbase reduce Hfile path实例
最近一个群友的boss让研究hbase，让hbase的入库速度达到5w+/s，这可愁死了，4台个人电脑组成的集群，多线程入库调了好久，速度也才1w左右，都没有达到理想的那种速度，然后就想到了这种方式，但是网上多是用mapreduce来实现入库，而现在的需求是实时入库，不生成文件了，所以就只能自己用代码实现了，但是网上查了很多资料都没有查到，最后在一个网友的指引下，看了源码，最后找到了生成Hfile
jsp web tomcat 编码问题王新春 tomcat jsp pageEncode
今天配置jsp项目在tomcat上，windows上正常，而linux上显示乱码，最后定位原因为tomcat 的server.xml 文件的配置，添加 URIEncoding 属性： <Connector port="8080" protocol="HTTP/1.1" connectionTi

网页爬虫入门--莫烦教程笔记

爬虫简介

爬虫测试1

BeautifulSoup解析网页

安装BeautifulSoup

BeautifulSoup解析网页: 基础

爬虫测试1

爬虫测试1

BeautifulSoup解析网页: CSS

列表爬虫练习

BeautifulSoup解析网页: 正则表达

小练习: 爬百度百科

更多请求/下载方式

多功能的Requests

安装requests

reguests get 请求

requests post 请求

上传图片

登录

Welcome to the Website!

使用Session登录

Welcome to the Website!

下载文件

使用urlretrieve

使用request

小练习: 下载美图

加速你的爬虫

多进程分布式

测试普通爬法

测试分布式爬法

异步加载Asyncio

基本用法

aiohttp

和多进程分布式爬虫对比

高级爬虫

高级爬虫:让Selenium控制你的浏览器帮你爬

安装Selenium

你可能感兴趣的:(coding)

网页爬虫入门--莫烦教程笔记

爬虫简介

爬虫测试1

BeautifulSoup解析网页

安装BeautifulSoup

BeautifulSoup解析网页: 基础

爬虫测试1

爬虫测试1

BeautifulSoup解析网页: CSS

列表 爬虫练习

BeautifulSoup解析网页: 正则表达

小练习: 爬百度百科

更多请求/下载方式

多功能的Requests

安装requests

reguests get 请求

requests post 请求

上传图片

登录

Welcome to the Website!

使用Session登录

Welcome to the Website!

下载文件

使用urlretrieve

使用request

小练习: 下载美图

加速你的爬虫

多进程分布式

测试普通爬法

测试分布式爬法

异步加载Asyncio

基本用法

aiohttp

和多进程分布式爬虫对比

高级爬虫

高级爬虫:让Selenium控制你的浏览器帮你爬

安装Selenium

你可能感兴趣的:(coding)

列表爬虫练习