import urllib.request

urllib.request.urlopen("http://www.baidu.com")
This library works together with browser drivers to crawl dynamically rendered pages.
(1) ChromeDriver
To use it, first download chromedriver.exe, put it in the same directory as chrome.exe (the default installation path), and then add that directory to PATH.

import selenium
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.baidu.com")
driver.page_source

The only drawback of this approach is that a browser window pops up, which we usually do not want, so we can run in headless mode to hide the browser UI (create an Options() object and pass the headless flag via add_argument).
import os
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
base_url = "http://www.baidu.com/"
# directory where the corresponding chromedriver is placed
driver = webdriver.Chrome(executable_path=(r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe'), chrome_options=chrome_options)
driver.get(base_url + "/")
start_time = time.time()
print('this is start_time ', start_time)
driver.find_element_by_id("kw").send_keys("selenium webdriver")
driver.find_element_by_id("su").click()
driver.save_screenshot('screen.png')
driver.close()
end_time = time.time()
print('this is end_time ', end_time)
(2) PhantomJS
This is another way to run without a browser window. PhantomJS is no longer maintained and all sorts of oddities can show up while using it, but it is still worth a brief introduction.
As with ChromeDriver, we first download PhantomJS and then put it on PATH so that it can be invoked later.

import selenium
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("http://www.baidu.com")
driver.page_source
This library (lxml) is what we use for XPath-based parsing.
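Since lxml only gets a passing mention here, below is a minimal sketch of querying a fragment with XPath directly; the HTML snippet and the "msg" class name are made up purely for illustration.

from lxml import etree

# parse an HTML fragment and query it with an XPath expression
html = etree.HTML('<div><p class="msg">hello</p><p class="msg">world</p></div>')
print(html.xpath('//p[@class="msg"]/text()'))   # ['hello', 'world']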
When installing with pip, note that the package name is beautifulsoup4 (the fourth major version), and that the library depends on lxml, so install lxml first.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<html></html>', 'lxml')

Like BeautifulSoup, PyQuery is also a web page parsing library, but its syntax is a bit simpler (it is modeled on jQuery).

from pyquery import PyQuery as pq

page = pq('<html>hello world</html>')
result = page('html').text()
result
This library lets Python talk to MySQL.

import pymysql

conn = pymysql.connect(host='localhost', user='root', password='root', port=3306, db='test')
cursor = conn.cursor()
result = cursor.execute('select * from user where id = 1')
print(cursor.fetchone())

import pymongo

client = pymongo.MongoClient('localhost')
db = client['newtestdb']
db['table'].insert({'name': 'Bob'})
db['table'].find_one({'name': 'Bob'})

import redis

r = redis.Redis('localhost', 6379)
r.set("name", "Bob")
r.get('name')
Flask may come in handy later when we work with proxies.

from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello():
    return "hello world"

if __name__ == '__main__':
    app.run(debug=True)

Django may be useful for maintaining a distributed crawler.
A web-based notebook.
(1) What is a crawler
A crawler is an automated tool that requests web pages and extracts data from them.
(2) The basic workflow of a crawler
1. Send the request:
Use an HTTP library to send a request to the target site, i.e. a Request (optionally with extra header information), then wait for the server to respond.
2. Get the response content
If the server responds normally, we get a Response whose body is the content of the page we want; it may be HTML, JSON, binary data (images, video), and so on.
3. Parse the content
HTML can be parsed with regular expressions or an HTML parsing library; JSON can be converted into a JSON object; binary data can be saved or processed further.
4. Save the data
The data can be saved in many forms: plain text, a database, or a file in some specific format (a minimal end-to-end sketch follows).
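As a quick illustration of steps 1 through 4, here is a minimal sketch that requests a page, pulls out its title with a regular expression, and saves it to a text file; the URL and the output file name are arbitrary choices, not anything prescribed by these notes.

import re
import requests

# 1. send the request
res = requests.get("http://www.baidu.com")
res.encoding = 'utf-8'

# 2. get the response content (HTML text)
html = res.text

# 3. parse the content: grab the <title> text with a regex
match = re.search(r'<title>(.*?)</title>', html, re.S)
title = match.group(1) if match else ''

# 4. save the data as plain text
with open('title.txt', 'w', encoding='utf-8') as f:
    f.write(title)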
(3) The basic elements of a request
1. Request method
2. Request URL
3. Request headers
4. Request body (POST requests only)
(4) The basic elements of a response
1. Status code
2. Response headers
3. Response body
(5) Example code:
1. Requesting page data

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
res = requests.get("http://www.baidu.com", headers=headers)
print(res.status_code)
print(res.headers)
print(res.text)

Here we used res.text, which returns the body as text; if the response is binary data (an image, for example), we should use res.content instead.
2. Requesting binary data

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
res = requests.get("https://ss2.bdstatic.com/lfoZeXSm1A5BphGlnYG/icon/95486.png", headers=headers)
print(res.content)
with open(r'E:\桌面\1.png', 'wb') as f:
    f.write(res.content)
(6) Parsing methods
1. Process the raw text directly
2. Convert it to a JSON object
3. Regular expressions
4. BeautifulSoup
5. PyQuery
6. XPath
(7) Why the response differs from what the browser shows
When our script requests a page, that single request returns only the raw page source. The source usually references many remote JS and CSS files that our script neither fetches nor executes, whereas a browser issues further requests for those resources and uses them to load and render the page. The page we see in the browser is therefore the result of many requests plus rendering, so it will naturally differ from what one request returns.
(8) How to handle JS-rendered pages
The essence of every solution is to simulate the browser's loading and rendering, then hand back the rendered page.
1. Analyze the Ajax requests (see the sketch after this list)
2. selenium + webdriver (recommended)
3. Splash
4. PyV8, Ghost.py
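For option 1, the idea is to find the XHR endpoint that the page calls (visible in the browser's developer tools) and request it directly; the endpoint, parameters and JSON layout below are hypothetical placeholders that only show the pattern.

import requests

# hypothetical JSON endpoint discovered in the Network/XHR tab of the developer tools
ajax_url = 'https://example.com/api/list'
params = {'page': 1, 'size': 20}
headers = {'X-Requested-With': 'XMLHttpRequest'}

res = requests.get(ajax_url, params=params, headers=headers)
data = res.json()                       # the endpoint returns JSON rather than rendered HTML
for item in data.get('items', []):      # 'items' is an assumed field name
    print(item)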
(9) How to store data (a short sketch follows this list)
1. Plain text
2. Relational databases
3. Non-relational databases
4. Binary files
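A minimal sketch of the plain-text and "specific file format" options, assuming we already hold a list of scraped records (the records and file names here are invented).

import json

items = [{'name': 'Tom', 'score': 90}, {'name': 'Bob', 'score': 85}]  # pretend these were scraped

# plain text, one record per line
with open('result.txt', 'w', encoding='utf-8') as f:
    for item in items:
        f.write('{name}\t{score}\n'.format(**item))

# a specific format: JSON, keeping the original structure
with open('result.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, ensure_ascii=False, indent=2)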
urllib is a request library built into Python.
urllib.request ----------> the request module
urllib.error ------------> the exception handling module
urllib.parse ------------> the URL parsing module
urllib.robotparser ------> the robots.txt parsing module (see the sketch below)
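These notes never come back to urllib.robotparser, so here is a minimal sketch of what it does; the robots.txt URL is just an example.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.baidu.com/robots.txt')
rp.read()
# ask whether a given user agent is allowed to fetch a URL
print(rp.can_fetch('*', 'https://www.baidu.com/s?wd=python'))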
(1) Function prototype
urllib.request.urlopen(url,data,timeout...)
(2) Example 1: a GET request

import urllib.request

res = urllib.request.urlopen("http://www.baidu.com")
print(res.read().decode('utf-8'))
(3) Example 2: a POST request

import urllib.request
import urllib.parse
from pprint import pprint

data = bytes(urllib.parse.urlencode({'world': 'hello'}), encoding='utf8')
res = urllib.request.urlopen('https://httpbin.org/post', data=data)
pprint(res.read().decode('utf-8'))
(4) Example 3: setting a timeout

import urllib.request

res = urllib.request.urlopen("http://httpbin.org/get", timeout=1)
print(res.read().decode('utf-8'))
(5) Example: getting the status code, response headers and response body

import urllib.request

res = urllib.request.urlopen("http://httpbin.org/get")
print(res.status)
print(res.getheaders())
print(res.getheader('Server'))
# read() returns bytes, so decode('utf-8') is needed to turn the body into a string
print(res.read().decode('utf-8'))
(6) The Request object

from urllib import request, parse
from pprint import pprint

url = "https://httpbin.org/post"
headers = {
    'User-Agent': 'hello world',
    'Host': 'httpbin.org'
}
dict = {
    'name': 'Tom',
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
res = request.urlopen(req)
pprint(res.read().decode('utf-8'))
(1) Proxies

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
res = opener.open('https://www.taobao.com')
print(res.read())
(2) Cookies
1. Getting cookies

import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name + "=" + item.value)
2. Saving cookies to a text file
Format 1:

import http.cookiejar, urllib.request

filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
Format 2:

import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
3. Using the cookies stored in a file

import http.cookiejar, urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
(3) Exception handling
1. Example 1: URLError

from urllib import request
from urllib import error

try:
    request.urlopen("http://httpbin.org/xss")
except error.URLError as e:
    print(e.reason)
2. Example 2: HTTPError

from urllib import request, error

try:
    response = request.urlopen('http://httpbin.org/xss')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
3. Example 3: checking the exception type

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
(4) URL parsing utilities
1. urlparse

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)
2. urlunparse

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
3. urljoin

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
4. urlencode

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
This library is built on top of urllib3 and smooths over urllib's rather clunky API: things like setting cookies or using a proxy take only a few simple lines, which is very convenient.
(1) Getting response information

import requests

res = requests.get("http://www.baidu.com")
print(res.status_code)
print(res.text)
print(res.cookies)
(2) The various request methods

import requests

requests.get("http://httpbin.org/get")
requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.head("http://httpbin.org/get")
requests.delete("http://httpbin.org/delete")
requests.options("http://httpbin.org/get")
(3) GET requests with parameters

import requests

params = {
    'id': 1,
    'user': 'Tom',
    'pass': '123456'
}
res = requests.get('http://httpbin.org/get', params=params)
print(res.text)
(4) Parsing JSON

import requests

res = requests.get("http://httpbin.org/get")
print(res.json())
(5) Getting binary data

import requests

response = requests.get("https://github.com/favicon.ico")
with open('favicon.ico', 'wb') as f:
    f.write(response.content)
(6) Adding headers

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
response = requests.get("https://www.zhihu.com/explore", headers=headers)
print(response.text)
(7) POST requests

import requests

data = {
    'id': 1,
    'user': 'Tom',
    'pass': '123456',
}
res = requests.post('http://httpbin.org/post', data=data)
print(res.text)
(8) Response attributes

import requests

data = {
    'id': 1,
    'user': 'Tom',
    'pass': '123456',
}
res = requests.post('http://httpbin.org/post', data=data)
print(res.text)
print(res.status_code)
print(res.headers)
print(res.cookies)
print(res.history)
print(res.url)
(9) Response status codes
Every status code has one or more names; we can refer to a code by name when checking the result.
100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),

# Redirection.
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect', 'resume_incomplete', 'resume',),  # These 2 to be removed in 3.0

# Client Error.
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),

# Server Error.
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication'),
Example:

import requests

response = requests.get('http://www.jianshu.com/hello.html')
exit() if not response.status_code == requests.codes.not_found else print('404 Not Found')
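The one-liner above is a bit contorted; a more typical way to use the named codes might look like the following sketch (same URL, same idea).

import requests

response = requests.get('http://www.jianshu.com/hello.html')
if response.status_code == requests.codes.not_found:
    print('404 Not Found')
elif response.status_code == requests.codes.ok:
    print('Request Successfully')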
(1) File upload

import requests

files = {'file': open('E:\\1.png', 'rb')}
res = requests.post('http://httpbin.org/post', files=files)
print(res.text)
(2) Getting cookies

import requests

res = requests.get("http://www.baidu.com")
for key, value in res.cookies.items():
    print(key + "=" + value)
(3) Session persistence
This usage matters a lot: it is bound to come up whenever we simulate a login, and it also shows up constantly when writing scripts for CTF challenges, so it deserves a slightly more detailed explanation.
One thing to be clear about with requests.get: every call is effectively a freshly opened browser, so a cookie set during one requests.get does not carry over to the next request. Look at the following example.
Example:

import requests

# set a cookie here
requests.get('http://httpbin.org/cookies/set/number/123456789')
# issue another request to see whether the cookie we just set is carried along
res = requests.get('http://httpbin.org/cookies')
print(res.text)
Output:

{ "cookies": {} }

As the analysis above predicted, the cookie set by the first request did not take effect in the second request. What can we do? The Session() object solves exactly this problem.
Example:

import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
res = s.get('http://httpbin.org/cookies')
print(res.text)

Output:

{ "cookies": { "number": "123456789" } }
(4) Certificate verification
When we visit an HTTPS site, the browser first verifies the site's certificate; if the certificate was not issued by a trusted authority, the browser shows a warning page instead of the site, and a crawler will raise an exception. If we want the crawler to ignore the certificate problem and continue, we have to configure it to do so.
1. Skipping certificate verification

import requests

response = requests.get('https://www.heimidy.cc/', verify=False)
print(response.status_code)

This still produces a warning, which we can silence by importing urllib3 and calling its disable_warnings() method.

import requests
from requests.packages import urllib3

urllib3.disable_warnings()
response = requests.get('https://www.heimidy.cc/', verify=False)
print(response.status_code)
2. Verifying against a locally specified certificate

import requests

response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)
(5) Proxy settings
Besides the common HTTP and HTTPS proxies, we can also use SOCKS proxies, which requires pip-installing the requests[socks] extra.

import requests

proxies = {
    "http": "http://127.0.0.1:1080",
    "https": "https://127.0.0.1:1080"
}
res = requests.get("https://www.google.com", proxies=proxies)
print(res.status_code)
One thing I have not resolved: using a SOCKS proxy to reach Google fails for me with the error below.
Example:

import requests

proxies = {
    "http": "socks5://127.0.0.1:1080",
    "https": "socks5://127.0.0.1:1080"
}
res = requests.get("https://www.google.com", proxies=proxies, verify=False)
print(res.status_code)

Output:

SSLError: SOCKSHTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")))

I tried a few workarounds without success; this needs further investigation.
(6) Timeout settings

import requests
from requests.exceptions import ReadTimeout

try:
    response = requests.get("http://httpbin.org/get", timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
(7) Basic authentication
Example 1:

import requests
from requests.auth import HTTPBasicAuth

r = requests.get('http://120.27.34.24:9001', auth=HTTPBasicAuth('user', '123'))
print(r.status_code)

Example 2:

import requests

r = requests.get('http://120.27.34.24:9001', auth=('user', '123'))
print(r.status_code)
(8) Exception handling

import requests
from requests.exceptions import ReadTimeout, ConnectionError, RequestException

try:
    response = requests.get("http://httpbin.org/get", timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
except ConnectionError:
    print('Connection error')
except RequestException:
    print('Error')
A regular expression is a logical formula for operating on strings: from a set of predefined special characters, and combinations of them, we build a rule string that expresses a filtering rule to apply to other strings. In Python this is done with the re library.
re.match(pattern, string, flags=0)
(1) Ordinary matching
span() returns the range of the match and group() returns the matched text.
Example:

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
res = re.match(r'^\w{5}\s\d{3}\s\d{4}\s\w{10}.*Demo$', content)
print(res.span())
print(res.group())
(2) Generic matching

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
res = re.match(r'^Hello.*Demo$', content)
print(res.span())
print(res.group())
(3) Matching specific content
To capture a specific part of the match, wrap it in parentheses.

import re

content = 'Hello 1234567 World_This is a Regex Demo'
res = re.match(r'^Hello\s(\d+)\s.*Demo$', content)
print(res.span(1))
print(res.group(1))
(4) Greedy vs. non-greedy matching
Greedy mode means that .* matches as many characters as it can. Look at the following example.
Example:

import re

content = 'Hello 1234567 World_This is a Regex Demo'
res = re.match(r'^He.*(\d+).*Demo$', content)
print(res.span(1))
print(res.group(1))

Output:

(12, 13)
7
We meant to capture 1234567, but we only got 7, because the greedy .* before the group swallowed 123456. To fix this we can add ? after .* to make it non-greedy.
Example:

import re

content = 'Hello 1234567 World_This is a Regex Demo'
res = re.match(r'^He.*?(\d+).*Demo$', content)
print(res.span(1))
print(res.group(1))

Output:

(6, 13)
1234567
(5) Match flags
Match flags take care of details such as whether matching is case-insensitive and whether . can match a newline (a case-insensitive sketch appears after this example).
Example:

import re

content = '''Hello 1234567 World_This
is a Regex Demo'''
res = re.match(r'^He.*?(\d+).*Demo$', content, re.S)
print(res.span(1))
print(res.group(1))

Output:

(6, 13)
1234567
As you can see, .* normally cannot match a newline, but once we pass the re.S flag the match works fine.
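The other detail mentioned above, case-insensitive matching, works the same way through the re.I flag; a minimal sketch with a made-up string:

import re

content = 'hello WORLD'
# re.I makes the match case-insensitive
res = re.match(r'^Hello\s+world$', content, re.I)
print(res.group())   # hello WORLD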
(6) Escaping special characters
If the target string contains characters that are special in regular expressions, we have to escape them.

import re

content = 'price is $5.00'
res = re.match(r'price is \$5\.00', content)
print(res.group())
re.match has one drawback: it only matches from the beginning of the string, so if the pattern does not match the very first characters, nothing in the middle can be matched either. That is why we have another weapon, re.search, which scans the whole string and returns the first successful match.
Example:

import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
res = re.search(r'Hello.*?(\d+).*?Demo', content)
print(res.group(1))

Output:

1234567

Because this behavior greatly reduces the effort of writing patterns, prefer search over match whenever search will do.
Matching practice:
Example:

import re

html = '''...'''   # an HTML song list: several <li> items, most of which contain <a href="/N.mp3" singer="...">song title</a>
res = re.search(r'<li.*?singer="(.*?)">(.*?)</a>', html, re.S)
print(res.group(1), res.group(2))

Output:

任贤齐 沧海一声笑
Unlike the previous two functions, re.findall scans the whole string and returns a list of every matching substring.
Matching practice 1:
Example:

import re

html = '''...'''   # the same HTML song list as above
results = re.findall(r'<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)
print(results)
for result in results:
    print(result[0], result[1], result[2])

Output:

[('/2.mp3', '任贤齐', '沧海一声笑'), ('/3.mp3', '齐秦', '往事随风'), ('/4.mp3', 'beyond', '光辉岁月'), ('/5.mp3', '陈慧琳', '记事本'), ('/6.mp3', '邓丽君', '但愿人长久')]
/2.mp3 任贤齐 沧海一声笑
/3.mp3 齐秦 往事随风
/4.mp3 beyond 光辉岁月
/5.mp3 陈慧琳 记事本
/6.mp3 邓丽君 但愿人长久
Matching practice 2:
Example:

import re

html = '''...'''   # the same HTML song list as above; the first <li> has no <a> around its song name
results = re.findall(r'<li.*?>\s*?(<a.*?>)?(\w+)(</a>)?\s*?</li>', html, re.S)
for result in results:
    print(result[1])

Output:

一路上有你
沧海一声笑
往事随风
光辉岁月
记事本
但愿人长久
re.sub replaces every match of the pattern in the string and returns the resulting string.
Example 1:

import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
res = re.sub(r'\d+', 'K0rz3n', content)
print(res)

Output:

Extra stings Hello K0rz3n World_This is a Regex Demo Extra stings
Sometimes the replacement needs to keep the original matched text; for that we use a backreference.
Example 2:

import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
content = re.sub(r'(\d+)', r'\1 8910', content)
print(content)

Output:

Extra stings Hello 1234567 8910 World_This is a Regex Demo Extra stings
re.compile turns a pattern string into a compiled pattern object so that it can be reused conveniently later.
Example:

import re

content = '''Hello 1234567 World_This
is a Regex Demo'''
pattern = re.compile('Hello.*Demo', re.S)
res = re.match(pattern, content)
print(res.group(0))

import requests
import re

content = requests.get('http://book.douban.com/').text
pattern = re.compile(r'...', re.S)   # a pattern that captures each book's link, title, author and publication date on the Douban Books homepage
results = re.findall(pattern, content)
for result in results:
    print(result)
BeautifulSoup is a handy web page parsing library that lets us extract information from pages without writing regular expressions.
Example:

html = """..."""   # the classic "The Dormouse's story" snippet: a <title>, a <p class="title" name="dromouse"> heading, and a story <p> containing three <a> links (Elsie, Lacie, Tillie)
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)

Output: the prettified HTML, followed by the title text "The Dormouse's story".
(1) Selecting elements
Elements are selected with soup.<tag> attribute access; if several tags match, only the first one is returned.
Example:

html = """..."""   # the same "Dormouse's story" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.head)
print(soup.title)
print(soup.p)

Output: the <head> tag, the <title> tag, and the first <p> tag.
(2) Getting attributes
Example:

html = """..."""   # the same "Dormouse's story" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])

Output:

dromouse
dromouse
(3) Getting content
Example:

html = """..."""   # the same "Dormouse's story" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)

Output:

The Dormouse's story
(4) Nested selection
Example:

html = """..."""   # the same "Dormouse's story" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)

Output:

The Dormouse's story
(5) Getting children and descendants
1. contents
This returns the tag's direct children as a list.
Example:

html = """..."""   # a "story" snippet: a <title> plus a single <p class="story"> whose children are text nodes and three <a> links (Elsie, Lacie, Tillie)
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)

Output: a list containing the text nodes and the three <a> tags.
2. children
This returns the children as an iterator.
Example:

html = """..."""   # the same "story" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

Output: the iterator object, followed by the index and content of each child (text nodes and the three <a> tags).
3. descendants
This returns all descendants; unlike children above, it also walks down into grandchildren.
Example:

html = """..."""   # the same "story" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)

Output: every descendant node, i.e. each <a> tag followed by the text it contains.
(6) Parent and ancestor nodes
1. parent
Example:

html = """..."""   # the same "story" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)

Output: the <p class="story"> paragraph that contains the first <a> tag.
2. parents
This yields every ancestor node (here printed as an enumerated list).

html = """..."""   # the same "story" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.parents)))
(7) Sibling nodes
Example:

html = """..."""   # the same "story" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))

Output: the following siblings (text nodes plus the Lacie and Tillie <a> tags), then the single preceding text sibling.
The tag selectors above are fast, but they are coarse and rarely flexible enough in practice, so we need more powerful selectors.
find_all( name , attrs , recursive , text , **kwargs )
Documents can be searched by tag name, attributes, or text.
(1) name
Example 1:

html = '''...'''   # a "panel" snippet: a "Hello" heading plus two <ul class="list"> lists (id="list-1" name="elements" with items Foo, Bar, Jay; id="list-2" with items Foo, Bar)
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
soup.find_all('ul')

Output: a list containing the two <ul> tags.
If we also want the tags nested inside, we can call find_all() again on each <ul> we just found.
Example 2:

html = '''...'''   # the same "panel" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for i in soup.find_all('ul'):
    print(i.find_all('li'))

Output: the list of <li> tags inside each <ul>.
(2) attrs
Pass in the attribute key/value pair you want to match and the element is located.
Example 1:

html = '''...'''   # the same "panel" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

Output: both calls return the first <ul> (id="list-1", name="elements").
Alternatively, if that feels verbose, we can simply pass attribute=value keyword arguments to locate elements.
Example 2:

html = '''...'''   # the same "panel" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='panel-heading'))

Output: the <ul id="list-1"> list and the panel-heading <div>.
Note:
class is a Python keyword, so we cannot pass class directly as an argument name without causing a conflict; that is why the argument is written class_.
(3) text
Example:

html = '''...'''   # the same "panel" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))

Output:

['Foo', 'Foo']
(4) Others
find( name , attrs , recursive , text , **kwargs )
find_all() returns all matching elements, while find() returns a single element.
find_parents() / find_parent()
find_parents() returns all ancestors; find_parent() returns the direct parent.
find_next_siblings() / find_next_sibling()
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
find_previous_siblings() / find_previous_sibling()
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
find_all_next() / find_next()
find_all_next() returns all matching nodes after the current node; find_next() returns the first one.
find_all_previous() and find_previous()
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first one.
(a short sketch of a few of these variants follows)
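A minimal sketch of a few of these variants, using a small two-paragraph snippet invented for the purpose:

from bs4 import BeautifulSoup

html = '<div><p id="p1">first <b>one</b></p><p id="p2">second</p></div>'
soup = BeautifulSoup(html, 'lxml')

b = soup.find('b')                                   # a single element, unlike find_all()
print(b.find_parent('p')['id'])                      # p1: the nearest matching ancestor
print(soup.find('p').find_next_sibling('p')['id'])   # p2: the first following sibling
print(b.find_previous(text=True))                    # 'first ': the first text node before <b>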
(1) Basic usage
Pass a CSS selector directly to select() to make a selection.
Example 1:

html = '''...'''   # the same "panel" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel-heading'))
print(soup.select('#list-1'))
print(soup.select('li'))

Output: the panel-heading <div>, the <ul id="list-1"> list, and all five <li> elements, each wrapped in a list.
Example 2:

html = '''...'''   # the same "panel" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

Output: the list of <li> tags inside each <ul>.
(2) Getting attributes
Example:

html = '''...'''   # the same "panel" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

Output:

list-1
list-1
list-2
list-2
(3) Getting content
Example:

html = '''...'''   # the same "panel" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())

Output:

Foo
Bar
Jay
Foo
Bar
PyQuery is another powerful page parsing library; its syntax is ported straight from jQuery, which makes it an excellent choice for developers who already know jQuery.
(1) Initializing from a string
Example:

html = '''...'''   # a list snippet: <div class="wrap"><div id="container"><ul class="list"> with five <li> items, "first item" through "fifth item"; the third has class "item-0 active" and wraps an <a href="link3.html">
from pyquery import PyQuery as pq

doc = pq(html)
print(doc('li'))

Output: the five <li> elements.
(2) Initializing from a URL
Example:

from pyquery import PyQuery as pq

doc = pq(url='http://www.baidu.com')
print(doc('head'))

Output: the <head> of the Baidu homepage (its title is 百度一下,你就知道).
(3) Initializing from a file
Example:

from pyquery import PyQuery as pq

doc = pq(filename='demo.html')
print(doc('li'))
Example:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
print(doc('#container .list li'))

Output: the five <li> elements inside #container .list.
(1) Child elements
Example 1:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
li = doc('.list').find('li')
print(li)

Output: the five <li> elements.
Besides find(), we can also use the children() method.
Example 2:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
items = doc('.list')
lis = items.children('.active')
print(lis)

Output: only the <li> children that carry the active class.
(2) Parent elements
Example 1:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
items = doc('.list')
container = items.parent()
print(container)

Output: the <div id="container"> that directly wraps the list.
parent() returns the direct parent, whereas parents() returns all ancestors.
Example 2:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
items = doc('.list')
parents = items.parents('.wrap')
print(parents)

Output: the ancestor <div class="wrap">.
(3) Sibling nodes
Example:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
li = doc('.list .item-0.active')
print(li.siblings('.active'))

Output: the sibling <li> that also carries the active class.
Example:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
lis = doc('li').items()
for i in lis:
    print(i)

Output: each of the five <li> elements, printed in turn.
(1) Getting attributes
Example:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
a = doc('.list .item-0.active a')
print(a.attr.href)
print(a.attr('href'))

Output:

link3.html
link3.html
(2) Getting text
Example:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
a = doc('.item-0.active a')
print(a.text())

Output:

third item
(3) Getting HTML
Example:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
li = doc('.item-0.active')
print(li.html())

Output: the inner HTML of the active <li>, i.e. the link around "third item".
(1) addClass and removeClass
Example:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
li = doc('.item-0.active')
print(li)
li.removeClass('active')
print(li)
li.addClass('active')
print(li)

Output: the <li> printed with, then without, then again with the active class.
(2) attr and css
Example:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
li = doc('.item-0.active')
print(li)
li.attr('name', 'link')
print(li)
li.css('font-size', '14px')
print(li)

Output: the <li> printed as-is, then with a name="link" attribute, then with an inline font-size style added.
(3) remove
Example:

html = '''<div class="wrap">Hello, World<p>This is a paragraph.</p></div>'''
from pyquery import PyQuery as pq

doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
wrap.find('p').remove()
print(wrap.text())

Output:

Hello, World This is a paragraph.
Hello, World
(4) Others
https://pyquery.readthedocs.io/en/latest/api.html
This library drives real browser engines (and PhantomJS) for automated testing; for crawling, its main use is to solve the problem that pages rendered by JavaScript cannot be scraped directly.
Example:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

browser = webdriver.Chrome()
try:
    browser.get('https://www.baidu.com')
    input = browser.find_element_by_id('kw')
    input.send_keys('Python')
    input.send_keys(Keys.ENTER)
    wait = WebDriverWait(browser, 10)
    wait.until(EC.presence_of_element_located((By.ID, 'content_left')))
    print(browser.current_url)
    print(browser.get_cookies())
    print(browser.page_source)
finally:
    browser.close()
from selenium import webdriver

browser = webdriver.Chrome()
browser = webdriver.Firefox()
browser = webdriver.Edge()
browser = webdriver.PhantomJS()
browser = webdriver.Safari()
from selenium import webdriver

browser = webdriver.Chrome()
browser.get("https://www.baidu.com")
print(browser.page_source)
browser.close()
(1) Finding a single element
Example 1:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
input_first = browser.find_element_by_id('q')
input_second = browser.find_element_by_css_selector('#q')
input_third = browser.find_element_by_xpath('//*[@id="q"]')
print(input_first, input_second, input_third)
browser.close()

Output: three WebElement objects, all pointing at the same search box.
Note:
There are several other single-element lookup methods as well:
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
Example 2:

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
input_first = browser.find_element(By.ID, 'q')
print(input_first)
browser.close()

Output: the WebElement for the search box.
(2) Finding multiple elements
Example 1:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
lis = browser.find_elements_by_css_selector('.service-bd li')
print(lis)
browser.close()

Example 2:

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
lis = browser.find_elements(By.CSS_SELECTOR, '.service-bd li')
print(lis)
browser.close()
Note:
Besides the methods above, these are the common ways to find multiple elements:
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
(3) Interacting with elements
We can call interaction methods on the elements we find.
Example:

from selenium import webdriver
import time

browser = webdriver.Chrome()
browser.get("http://www.taobao.com")
input = browser.find_element_by_id('q')
input.send_keys('iphone')
time.sleep(1)
input.clear()
input.send_keys('ipad')
button = browser.find_element_by_class_name('btn-search')
button.click()
Note:
Official documentation:
http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.remote.webelement
(4) Action chains
Actions are appended to an ActionChains object and executed in order; this is the standard way to simulate mouse and keyboard operations with selenium.
Example:

from selenium import webdriver
from selenium.webdriver import ActionChains

browser = webdriver.Chrome()
url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
browser.get(url)
browser.switch_to.frame('iframeResult')
source = browser.find_element_by_css_selector('#draggable')
target = browser.find_element_by_css_selector('#droppable')
actions = ActionChains(browser)
actions.drag_and_drop(source, target)
actions.perform()
Note:
Official documentation:
http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.common.action_chains
ActionChains method list
click(on_element=None) - click the left mouse button
click_and_hold(on_element=None) - click and hold the left mouse button
context_click(on_element=None) - click the right mouse button
double_click(on_element=None) - double-click the left mouse button
drag_and_drop(source, target) - drag to an element and release
drag_and_drop_by_offset(source, xoffset, yoffset) - drag to an offset and release
key_down(value, element=None) - press a key
key_up(value, element=None) - release a key
move_by_offset(xoffset, yoffset) - move the mouse from its current position by an offset
move_to_element(to_element) - move the mouse onto an element
move_to_element_with_offset(to_element, xoffset, yoffset) - move to a position offset from an element's top-left corner
perform() - perform all actions queued in the chain
release(on_element=None) - release the left mouse button on an element
send_keys(keys_to_send) - send keys to the element that currently has focus
send_keys_to_element(element, keys_to_send) - send keys to the given element
(a short sketch stringing a few of these together follows this list)
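A minimal sketch that strings a few of these methods together; the element ids used here are placeholders and would need to match whatever page you actually drive.

from selenium import webdriver
from selenium.webdriver import ActionChains

browser = webdriver.Chrome()
browser.get('https://www.baidu.com')

menu = browser.find_element_by_id('s-usersetting-top')   # placeholder id of an element to hover over
box = browser.find_element_by_id('kw')                   # placeholder id of an input box

actions = ActionChains(browser)
actions.move_to_element(menu)        # hover the mouse over the first element
actions.click(box)                   # click the input box
actions.send_keys('selenium')        # type into the element that now has focus
actions.perform()                    # run the queued actions in order
browser.close()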
(5) Executing JavaScript
When there is no ready-made API for something, we can run JavaScript ourselves, for example to drag a scroll bar to the bottom of the page.
Example:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.zhihu.com/explore')
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
browser.execute_script('alert("To Bottom")')
browser.close()
(1) Getting attributes
Example:

from selenium import webdriver
from selenium.webdriver import ActionChains

browser = webdriver.Chrome()
url = 'https://www.zhihu.com/explore'
browser.get(url)
logo = browser.find_element_by_id('zh-top-link-logo')
print(logo)
print(logo.get_attribute('class'))
browser.close()
(2) Getting text
Example:

from selenium import webdriver

browser = webdriver.Chrome()
url = 'https://www.zhihu.com/explore'
browser.get(url)
input = browser.find_element_by_class_name('zu-top-add-question')
print(input.text)
browser.close()
(3) Getting the ID, location, tag name and size
Example:

from selenium import webdriver

browser = webdriver.Chrome()
url = 'https://www.zhihu.com/explore'
browser.get(url)
input = browser.find_element_by_class_name('zu-top-add-question')
print(input.id)
print(input.location)
print(input.tag_name)
print(input.size)
browser.close()
If the page contains a frame or iframe, we must switch into that frame before we can operate on anything inside it.
Example:

import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

browser = webdriver.Chrome()
url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
browser.get(url)
browser.switch_to.frame('iframeResult')
source = browser.find_element_by_css_selector('#draggable')
print(source)
try:
    logo = browser.find_element_by_class_name('logo')
except NoSuchElementException:
    print('NO LOGO')
browser.switch_to.parent_frame()
logo = browser.find_element_by_class_name('logo')
print(logo)
print(logo.text)
browser.close()
(1) Implicit waits
Implicit waits are aimed at pages that load content via Ajax: when webdriver looks for an element (or elements) that is not immediately present, an implicit wait keeps polling the DOM for the configured amount of time before giving up; the default is 0.
Example:

from selenium import webdriver

browser = webdriver.Chrome()
browser.implicitly_wait(10)
browser.get('https://www.zhihu.com/explore')
input = browser.find_element_by_class_name('zu-top-add-question')
print(input)
browser.close()

Output: the WebElement, once it has been found.
(2) Explicit waits
An explicit wait specifies a condition and a maximum wait time; an exception is raised only if the condition still does not hold when the time runs out.
Example:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('https://www.taobao.com/')
wait = WebDriverWait(browser, 10)
input = wait.until(EC.presence_of_element_located((By.ID, 'q')))
button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.btn-search')))
print(input, button)
browser.close()

Output: the two WebElements, once their conditions are met.
Note:
Common wait conditions:
title_is - the title is exactly the given text
title_contains - the title contains the given text
presence_of_element_located - the element is present in the DOM; takes a locator tuple such as (By.ID, 'p')
visibility_of_element_located - the element is visible; takes a locator tuple
visibility_of - the element is visible; takes an element object
presence_of_all_elements_located - all matching elements are present
text_to_be_present_in_element - the element's text contains the given string
text_to_be_present_in_element_value - the element's value attribute contains the given string
frame_to_be_available_and_switch_to_it - the frame is available; switch into it
invisibility_of_element_located - the element is invisible
element_to_be_clickable - the element is clickable
staleness_of - the element is no longer attached to the DOM; useful for checking whether the page has refreshed
element_to_be_selected - the element can be selected; takes an element object
element_located_to_be_selected - the element can be selected; takes a locator tuple
element_selection_state_to_be - takes an element object and a state; True when they match
element_located_selection_state_to_be - takes a locator tuple and a state; True when they match
alert_is_present - an alert is present
(a small sketch using title_contains follows the documentation link below)
Official documentation:
http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.support.expected_conditions
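A small sketch using one more of these conditions, title_contains; the page and the expected title fragment are only examples.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('https://www.baidu.com')
wait = WebDriverWait(browser, 10)
# block until the page title contains the given text (or raise TimeoutException)
wait.until(EC.title_contains('百度'))
print(browser.title)
browser.close()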
Example:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get("http://www.baidu.com")
browser.get("http://www.taobao.com")
browser.get("http://www.zhihu.com")
browser.back()
browser.forward()
browser.close()
Example:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('http://www.baidu.com')
print(browser.get_cookies())
browser.add_cookie({'name': 'Tom', 'pass': '123456', 'value': 'germey'})
print(browser.get_cookies())
browser.delete_all_cookies()
print(browser.get_cookies())
browser.close()
Example:

from selenium import webdriver
import time

browser = webdriver.Chrome()
browser.get('http://www.baidu.com')
browser.execute_script('window.open()')
print(browser.window_handles)
browser.switch_to.window(browser.window_handles[1])
browser.get('http://www.taobao.com')
time.sleep(1)
browser.switch_to.window(browser.window_handles[0])
browser.get('http://httpbin.org')
browser.close()

Output:

['CDwindow-3FCC47842DFF6841B4C86EE72CB7DB93', 'CDwindow-CCFA4494DE4B6C99494BE87524153E4E']
Example:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, NoSuchElementException

browser = webdriver.Chrome()
try:
    browser.get('https://www.baidu.com')
except TimeoutException:
    print('Time Out')
try:
    browser.find_element_by_id('hello')
except NoSuchElementException:
    print('No Element')
finally:
    browser.close()

Output:

No Element
Note:
Official documentation: http://selenium-python.readthedocs.io/api.html#module-selenium.common.exceptions