WeChat subscription account: 客玉京
Command-line installation
pip 2.x: install requests with `pip install requests`
pip 3.x: install requests with `pip3 install requests`
PyCharm installation
File --> Default Settings
--> Project Interpreter
--> search for: requests
--> Install Package
--> OK
Import the requests library in your code
import requests
Method | Description |
---|---|
requests.request() | Constructs a request; the base method underlying all the methods below |
requests.get() | The main method for fetching an HTML page |
requests.head() | Fetches only a page's header information |
requests.post() | Submits a POST request to an HTML page |
requests.put() | Submits a PUT request to an HTML page |
requests.patch() | Submits a partial-modification (PATCH) request |
requests.delete() | Submits a DELETE request to an HTML page |
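Each convenience method above is a thin wrapper over requests.request(). A minimal sketch, using requests.Request to build and inspect a request without ever sending it (the URL is a placeholder):

```python
import requests

# requests.get(url, params=...) is equivalent to
# requests.request("GET", url, params=...); preparing a Request
# shows the same machinery without touching the network.
req = requests.Request("GET", "http://example.com/api", params={"q": "python"})
prepared = req.prepare()

print(prepared.method)  # GET
print(prepared.url)     # http://example.com/api?q=python
```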
HTTP methods:
Method | Description |
---|---|
GET | Request the resource at the URL |
HEAD | Request the response headers for the resource at the URL, i.e. its header information only |
POST | Append new data to the resource at the URL |
PUT | Store a resource at the URL, overwriting the resource already there |
PATCH | Partially update the resource at the URL, i.e. change part of its content |
DELETE | Delete the resource stored at the URL |
root1@ubuntu:~/桌面$ su root
Password:
root@ubuntu:/home/root1/桌面# python3
Python 3.7.5 (default, Apr 19 2020, 20:18:17)
>>> import requests
>>> r = requests.get("http://www.baidu.com")
>>> r.status_code
200
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK-->........
<title>ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93</title>
....... </html>\r\n'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
>>> r.encoding='utf-8'
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK-->
........
<title>百度一下,你就知道</title>
.......</html>\r\n'
Exception | Description |
---|---|
requests.ConnectionError | Network connection error, e.g. DNS lookup failure or refused connection |
requests.HTTPError | HTTP error |
requests.URLRequired | A URL is missing |
requests.TooManyRedirects | The maximum number of redirects was exceeded |
requests.ConnectTimeout | Timed out while connecting to the remote server |
requests.Timeout | The request to the URL timed out |
r.raise_for_status() | Raises requests.HTTPError if the status code is not 200 |
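raise_for_status() is the usual way to turn a bad status code into an exception. A small offline sketch, constructing a Response by hand (no network needed) just to show the HTTPError:

```python
import requests

# Build a Response manually so we can see raise_for_status()
# in action without any network traffic.
r = requests.Response()
r.status_code = 404
r.url = "http://example.com/missing"  # placeholder URL

try:
    r.raise_for_status()
except requests.HTTPError as e:
    print("caught:", e)

r.status_code = 200
r.raise_for_status()  # a 2xx status raises nothing
```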
# Import the requests library
import requests
# Product page for a Huawei phone on JD: https://item.jd.com/10023108638660.html
# JD phones landing page: https://shuma.jd.com/
url = "https://shuma.jd.com/"
# Exception handling: a network request may fail, so wrap it in try/except
try:
    # get() fetches the page at the given url
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:10000])
except requests.RequestException:
    print("Crawl failed!")
Almost every browser's User-Agent string begins with Mozilla/5.0; what follows is the operating-system information, then the rendering-engine information, and only the final token (here, Mobile Safari) names the browser actually in use, i.e. the request claims the capabilities of mobile Safari.
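To send a custom User-Agent, pass a headers dict to the request. This sketch prepares the request offline (placeholder URL) so you can inspect the header that would be sent:

```python
import requests

headers = {"User-Agent": "Mozilla/5.0"}  # minimal spoofed UA
req = requests.Request("GET", "http://example.com/", headers=headers)
prepared = req.prepare()

# The prepared request carries exactly the headers we supplied.
print(prepared.headers["User-Agent"])  # Mozilla/5.0
```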
import requests
url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
try:
    # Spoof a browser User-Agent so the site doesn't reject the request
    kv = {'user-agent': 'Mozilla/5.0'}
    # get() fetches the page at the given url
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1500:2000])
except requests.RequestException:
    print("Crawl failed!")
Search-engine keyword submission interface:
# Crawl Baidu search results
import requests
keyword = "python"
url = "http://www.baidu.com/s"
try:
    # Baidu passes the search keyword in the 'wd' parameter
    kv = {'wd': keyword}
    r = requests.get(url, params=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("Crawl failed!")
# Crawl 360 search results
import requests
keyword = "python"
url = "http://www.so.com/s"
try:
    # 360 passes the search keyword in the 'q' parameter
    kv = {'q': keyword}
    r = requests.get(url, params=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("Crawl failed!")
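The two snippets follow the same pattern and differ only in the query parameter name, so the interface can be wrapped in a small helper. This sketch only constructs the URL (via requests.Request) without contacting either engine; build_search_url is a hypothetical helper name:

```python
import requests

def build_search_url(base, key, keyword):
    """Return the URL a keyword search would request, without sending it.
    `key` is the engine's query parameter: 'wd' for Baidu, 'q' for 360."""
    req = requests.Request("GET", base, params={key: keyword})
    return req.prepare().url

print(build_search_url("http://www.baidu.com/s", "wd", "python"))
# http://www.baidu.com/s?wd=python
print(build_search_url("http://www.so.com/s", "q", "python"))
# http://www.so.com/s?q=python
```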
Crawl an image with code and save it locally
# Import the module
import requests
# Image to download (fill in the URL of the image you want to crawl)
url = "https://www.baidu.com/img/bdlogo.png"
# Send the request and get the response
response = requests.get(url)
# Save the image
with open('image.png', 'wb') as f:
    f.write(response.content)
import requests
import os

url = "https://www.baidu.com/img/bdlogo.png"
root = "F://好看图片//"
path = root + url.split('/')[-1]
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, 'wb') as f:
            f.write(r.content)
        print("File saved successfully!")
    else:
        print("File already exists!")
except Exception:
    print("Crawl failed!")
import requests

url = "http://m.ip138.com/ip.asp?ip="
try:
    r = requests.get(url + '14.215.177.38')
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])  # show the last 500 characters
except requests.RequestException:
    print("Crawl failed!")
Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to scrape; because it is so simple, a complete application takes very little code. Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8, so you do not need to think about encodings unless the document fails to declare one, in which case Beautiful Soup cannot detect the encoding automatically and you only need to state the original encoding yourself. Beautiful Soup works with excellent parsers such as lxml and html5lib, giving you flexible parsing strategies as well as speed.
pip install beautifulsoup4
# Option 1
from bs4 import BeautifulSoup
# Option 2
import bs4
Basic element | Description |
---|---|
Tag | A tag, the basic unit of information, delimited by an opening <tag> and a closing </tag> |
Name | The tag's name; <p>…</p> has the name 'p'; accessed as <tag>.name |
Attributes | The tag's attributes, organized as a dict; accessed as <tag>.attrs |
NavigableString | The non-attribute string inside a tag, i.e. the text between <tag> and </tag>; accessed as <tag>.string |
Comment | A comment string inside a tag; a special type of NavigableString |
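All five elements can be seen on a tiny hand-written document (a minimal sketch; the markup below is invented for illustration):

```python
from bs4 import BeautifulSoup

html = '<p class="title"><b>Hello</b><!--a comment--></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                  # Tag
print(tag.name)               # 'p'               (Name)
print(tag.attrs)              # {'class': ['title']}  (Attributes)
print(soup.b.string)          # 'Hello'           (NavigableString)

# The second child of <p> is the comment; its type distinguishes it
# from ordinary text.
comment = soup.p.contents[1]
print(type(comment).__name__)  # 'Comment'
```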
No. | Parser | Usage | Requirement | Strengths | Weaknesses |
---|---|---|---|---|---|
1 | Python standard library | BeautifulSoup(html, "html.parser") | install bs4 | Built into the standard library; moderate speed | Weaker error tolerance |
2 | lxml HTML parser | BeautifulSoup(html, "lxml") | pip install lxml | Fast; tolerant of bad markup | Requires installing the lxml C library |
3 | lxml XML parser | BeautifulSoup(html, "xml") | pip install lxml | Fast; tolerant; supports XML | Requires the lxml C library |
4 | html5lib parser | BeautifulSoup(html, "html5lib") | pip install html5lib | Parses the way a browser does; best error tolerance | Slow |
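Since lxml and html5lib are optional installs, a common pattern is to try the faster parser and fall back to the built-in one. A sketch (make_soup is a hypothetical helper; bs4 raises FeatureNotFound when the requested parser is not installed):

```python
from bs4 import BeautifulSoup

def make_soup(markup):
    """Prefer lxml for speed, fall back to the built-in parser.
    Assumes lxml may or may not be installed."""
    try:
        return BeautifulSoup(markup, "lxml")
    except Exception:  # bs4 raises FeatureNotFound if lxml is missing
        return BeautifulSoup(markup, "html.parser")

soup = make_soup("<html><head><title>t</title></head></html>")
print(soup.title.string)  # t
```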
C:\Users\Administrator>python
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)]
on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p cl
ass="title"><b>The demo python introduces several python courses.</b></p>\r\n<p
class="course">Python is a wonderful general-purpose programming language. You c
an learn Python from novice to professional by tracking the following courses:\r
\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">B
asic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" cl
ass="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo=r.text
>>> from bs4 import BeautifulSoup
# BeautifulSoup(text or markup string to parse, parser)
>>> soup = BeautifulSoup(demo,"html.parser")
# soup.prettify(): BeautifulSoup's pretty-printing method
>>> print(soup.prettify())
<html>
<head>
<title>
This is a python demo page
</title>
</head>
<body>
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Pyt
hon from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="lin
k2">
Advanced Python
</a>
.
</p>
</body>
</html>
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.title
<title>This is a python demo page</title>
# name of the a tag
>>> soup.a.name
'a'
# name of the a tag's parent
>>> soup.a.parent.name
'p'
# name of the a tag's grandparent
>>> soup.a.parent.parent.name
'body'
>>> tag = soup.a # extract the a tag on its own
>>> type(tag) # bs4 has a dedicated Tag type for storing tags
<class 'bs4.element.Tag'>
# the a tag's attributes
>>> soup.a.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
# the attributes are stored as a dict
>>> type(tag.attrs)
<class 'dict'>
# look up the value of the id attribute
>>> soup.a.attrs['id']
'link1'
# the a tag's text content
>>> tag.string
'Basic Python'
# the a tag's text content
>>> soup.a.string
'Basic Python'
# the p tag's text content
>>> soup.p.string
'The demo python introduces several python courses.'
# the type of the p tag's content
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<b><!--a comment--></b><p>content of the p tag</p>","html.parser")
# print the b tag
>>> soup.b
<b><!--a comment--></b>
# .string returns only the comment's text, without the comment markers
>>> soup.b.string
'a comment'
# the type tells us this is a comment
>>> type(soup.b.string)
<class 'bs4.element.Comment'>
# the p tag's content is likewise printed as plain text
>>> soup.p.string
'content of the p tag'
# the type of the p tag's content
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
# the only way to tell a comment from ordinary text is to check its type with type()
# An overview of the three traversal directions
# 1. Upward traversal goes from the leaf nodes toward the root
# 2. Downward traversal goes from the root toward the leaf nodes
# 3. Parallel traversal moves between sibling tags at the same level
Attribute | Description |
---|---|
.contents | A list of the node's direct children |
.children | An iterator over the direct children, for use in loops |
.descendants | An iterator over all descendants, for looping over the whole subtree |
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.head
# list of the head tag's children
>>> soup.head.contents
# list of the body tag's children
>>> soup.body.contents
# length of body's child list
>>> len(soup.body.contents)
# index into body's child list to get the second element
>>> soup.body.contents[1]
# loop over the direct children
>>> for child in soup.body.children:
...     print(child)
# loop over all descendants
>>> for child in soup.body.descendants:
...     print(child)
Attribute | Description |
---|---|
.parent | The node's parent tag |
.parents | An iterator over the node's ancestors, for looping upward through them |
>>> soup.title.parent
<head><title>This is a python demo page</title></head>
>>> soup.html.parent
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. Yo
u can learn Python from novice to professional by tracking the following courses
:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Bas
ic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001
870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.parent
>>>
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
# print the names of the a tag's ancestors: upward traversal of the tag tree
>>> for parent in soup.a.parents:
... if parent is None:
... print(parent)
... else:
... print(parent.name)
...
p
body
html
[document]
Attribute | Description |
---|---|
.next_sibling | The next sibling tag in HTML document order |
.previous_sibling | The previous sibling tag in HTML document order |
.next_siblings | An iterator over all following siblings in HTML document order |
.previous_siblings | An iterator over all preceding siblings in HTML document order |
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.next_sibling.next_sibling.next_sibling
'.'
>>> soup.a.next_sibling.next_sibling.next_sibling.next_sibling
>>>
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
# loop over the preceding siblings
>>> for psibling in soup.a.previous_siblings:
... print(psibling)
...
Python is a wonderful general-purpose programming language. You can learn Python
from novice to professional by tracking the following courses:
# loop over the following siblings
>>> for nsibling in soup.a.next_siblings:
... print(nsibling)
...
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2"
>Advanced Python</a>
.
>>>
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. Yo
u can learn Python from novice to professional by tracking the following courses
:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Bas
ic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001
870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.prettify()
'<html>\n <head>\n <title>\n This is a python demo page\n </title>\n </head>
\n <body>\n <p class="title">\n <b>\n The demo python introduces several p
ython courses.\n </b>\n </p>\n <p class="course">\n Python is a wonderful
general-purpose programming language. You can learn Python from novice to profes
sional by tracking the following courses:\n <a class="py1" href="http://www.ic
ourse163.org/course/BIT-268001" id="link1">\n Basic Python\n </a>\n and\n
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="lin
k2">\n Advanced Python\n </a>\n .\n </p>\n </body>\n</html>'
>>> print(soup.prettify())
<html>
<head>
<title>
This is a python demo page
</title>
</head>
<body>
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Pyt
hon from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="lin
k2">
Advanced Python
</a>
.
</p>
</body>
</html>
>>> soup.a.prettify()
'<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n
Basic Python\n</a>\n'
>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
>>>
>>> from bs4 import BeautifulSoup
>>> soupZN = BeautifulSoup("<b>一段中文</b>","html.parser")
>>> soupZN.b.string
'一段中文'
>>> print(soupZN.b.prettify())
<b>
一段中文
</b>
XML: eXtensible Markup Language.
Its main forms:
<name>...</name> # the usual form
<name /> # shorthand for an empty element
<!-- --> # a comment
JSON: JavaScript Object Notation.
Its main forms:
"key":"value"
"key":["value1","value2"]
"key":{"subkey":"value"}
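These forms map directly onto Python types through the standard-library json module:

```python
import json

text = '{"name": "requests", "tags": ["http", "python"], "meta": {"stars": 1}}'

obj = json.loads(text)              # JSON object -> dict
print(obj["tags"][1])               # python
print(obj["meta"]["stars"])         # 1

print(json.dumps(obj, indent=2))    # dict -> formatted JSON text
```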
YAML: YAML Ain't Markup Language.
Its main forms:
key: value
key: #comment          # "#" introduces a comment
- value1               # "-" marks parallel (sequence) items
- value2
key:
    subkey: subvalue   # indentation expresses nesting
Like Python, YAML uses indentation to express nesting, "-" for parallel items, "#" for comments, and "|" for a literal block of text.
Approach 1: parse the document's markup completely, then extract the key information
Text formats: XML, JSON, YAML
Requires a markup parser,
e.g. traversing the tag tree with the bs4 library
Advantage: the parsed information is accurate
Disadvantage: the extraction process is tedious
and requires a thorough understanding of how the document is organized
Approach 2: ignore the markup and search directly for the key information
Search the text
using plain text-search functions
Advantage: simple and fast
Disadvantage: the accuracy of the results depends on the content
Combined approach: parse the structure where needed and search elsewhere to extract the key information.
Text formats: XML, JSON, YAML
Search the text
Requires both a markup parser and text-search functions.
Example: extract every URL link in an HTML page
1) find all the a tags
2) parse each a tag and extract the link from its href attribute.
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> for link in soup.find_all('a'):
... print(link.get('href'))
...
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
<>.find_all(name, attrs, recursive, string, **kwargs) | Returns a list of the matching results |
---|---|
name | A string to match against tag names |
attrs | A string to match against tag attribute values; a specific attribute can be named |
recursive | Whether to search all descendants; defaults to True (bool) |
string | A string to match against the text between <tag> and </tag> |
Example: show all a tags
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Ba
sic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-100187
0001" id="link2">Advanced Python</a>]
Example: show all a and b tags (['a','b']: pass a list as the argument)
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href=
"http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a cl
ass="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Adva
nced Python</a>]
Example: show all tags
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> for tag in soup.find_all(True):
... print(tag.name)
...
html
head
title
body
p
b
p
a
a
Python's re module provides string matching built on regular expressions. A regular expression performs fuzzy matching on strings so you can extract exactly the parts you need, and the regular-expression language itself is common to virtually all programming languages.
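A quick look at the two re calls used below, compile() plus search()/findall():

```python
import re

# compile() turns the pattern into a reusable object;
# findall() returns every non-overlapping match,
# search() finds the first match anywhere in the string.
pattern = re.compile(r"py\w+")

print(pattern.findall("python pypi pip"))  # ['python', 'pypi']
m = pattern.search("learn python here")
print(m.group())                           # python
```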
Example: show tags whose names start with b
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> import re
>>> for tag in soup.find_all(re.compile('b')):
... print(tag.name)
...
body
b
Example: show p tags whose class attribute is course
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. Y
ou can learn Python from novice to professional by tracking the following course
s:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Bas
ic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001
870001" id="link2">Advanced Python</a>.</p>]
Example: show tags whose id attribute equals link1
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.find_all(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Ba
sic Python</a>]
Example: show tags whose id attribute contains link
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> import re
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
Example: find a tags among all descendants vs. only among the direct children (recursive=False)
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> import re
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a',recursive=False)
[]
Example: match against the string between <tag> and </tag>
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
# search for the exact string 'Basic Python'
>>> soup.find_all(string='Basic Python')
['Basic Python']
Example: find strings containing a particular substring
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> import re
# strings containing "python"
>>> soup.find_all(string=re.compile("python"))
['This is a python demo page', 'The demo python introduces several python course
s.']
# strings containing "Python"
>>> soup.find_all(string=re.compile("Python"))
['Python is a wonderful general-purpose programming language. You can learn Pyth
on from novice to professional by tracking the following courses:\r\n', 'Basic P
ython', 'Advanced Python']
# soup(...) is shorthand for soup.find_all(...)
# strings containing "Python"
>>> soup(string=re.compile("Python"))
['Python is a wonderful general-purpose programming language. You can learn Pyth
on from novice to professional by tracking the following courses:\r\n', 'Basic P
ython', 'Advanced Python']
Method | Description |
---|---|
<>.find() | Returns only the first matching result (a Tag); same parameters as .find_all() |
<>.find_parents() | Searches the ancestor nodes; returns a list; same parameters as .find_all() |
<>.find_parent() | Returns one result from the ancestor nodes; same parameters as .find() |
<>.find_next_siblings() | Searches the following siblings; returns a list; same parameters as .find_all() |
<>.find_next_sibling() | Returns one result from the following siblings; same parameters as .find() |
<>.find_previous_siblings() | Searches the preceding siblings; returns a list; same parameters as .find_all() |
<>.find_previous_sibling() | Returns one result from the preceding siblings; same parameters as .find() |
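A short sketch of the find() family on a tiny hand-made tree (the markup is invented for illustration):

```python
from bs4 import BeautifulSoup

html = ('<body><p id="first">one</p>'
        '<p id="second"><a href="#x">link</a></p></body>')
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match (a Tag), not a list
a = soup.find("a")
print(a.find_parent("p")["id"])            # second

# find_next_sibling() walks forward among siblings
first = soup.find("p", id="first")
print(first.find_next_sibling("p")["id"])  # second
```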