Web Scraping Basics (Complete)

The Requests Library

WeChat subscription account: 客玉京
  • A high-level wrapper built on top of Python's built-in modules
  • With Requests you can easily perform almost any operation a browser can

Installation

  1. From the command line

    1. pip 2.x: pip install requests

    2. pip 3.x: pip3 install requests

  2. From PyCharm

     File --> Default Settings
           --> Project Interpreter
            --> search for: requests
            --> Install Package
             --> OK

  3. Import the library in your code

    import requests
    

The 7 methods of requests

Method  Description
requests.request()  constructs a request; the base method that underpins the methods below
requests.get()  the main method for fetching an HTML page (HTTP GET)
requests.head()  fetches only a page's response headers (HTTP HEAD)
requests.post()  submits a POST request to a page
requests.put()  submits a PUT request to a page
requests.patch()  submits a partial-modification (PATCH) request to a page
requests.delete()  submits a DELETE request to a page

HTTP methods:

Method  Description
GET  requests the resource at the URL
HEAD  requests the response headers for the resource at the URL, i.e. only its header information
POST  appends new data to the resource at the URL
PUT  stores a resource at the URL, overwriting the resource that was there
PATCH  partially updates the resource at the URL, changing only part of its content
DELETE  deletes the resource stored at the URL
root1@ubuntu:~/桌面$ su root
密码: 
root@ubuntu:/home/root1/桌面# python3
Python 3.7.5 (default, Apr 19 2020, 20:18:17) 
>>> import requests
>>> r = requests.get("http://www.baidu.com")
>>> r.status_code
200
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK-->........
<title>ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93</title>
....... </html>\r\n'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
>>> r.encoding='utf-8'
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK-->
........
<title>百度一下,你就知道</title>
.......</html>\r\n'

Exceptions in the Requests library

Exception  Description
requests.ConnectionError  network connection error, e.g. DNS lookup failure or refused connection
requests.HTTPError  HTTP error
requests.URLRequired  a URL is missing
requests.TooManyRedirects  the maximum number of redirects was exceeded
requests.ConnectTimeout  timed out while connecting to the remote server
requests.Timeout  the request for the URL timed out
r.raise_for_status()  raises requests.HTTPError if the status code is not 200
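
The exceptions above can be caught individually. Below is a minimal sketch (the URL is a deliberately unresolvable placeholder) of turning the common failure modes into short messages:

```python
import requests

def fetch(url):
    """Fetch a URL and translate common Requests failures into short messages."""
    try:
        r = requests.get(url, timeout=3)
        r.raise_for_status()              # raises requests.HTTPError for 4xx/5xx
        return r.text
    except requests.ConnectionError:      # DNS failure, refused connection, connect timeout
        return "connection error"
    except requests.Timeout:              # the server took too long to respond
        return "timeout"
    except requests.HTTPError as e:
        return "http error: %d" % e.response.status_code

# the reserved .invalid TLD never resolves, so this hits ConnectionError
print(fetch("http://nonexistent.invalid/"))
```

All of these exception classes share the base class requests.RequestException, so a single `except requests.RequestException:` catches every one of them at once.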

Scraping a JD.com page

# import the requests library
import requests
# a Huawei phone product page on JD: https://item.jd.com/10023108638660.html
# JD's digital products page: https://shuma.jd.com/
url = "https://shuma.jd.com/"
# exception handling: a network request can always fail, so wrap it in try/except
try:
    # get() fetches the page at the given url
    r = requests.get(url)
    # raise requests.HTTPError if the status code is not 200
    r.raise_for_status()
    # decode using the encoding guessed from the page content
    r.encoding = r.apparent_encoding
    print(r.text[:10000])
except requests.RequestException:
    print("Scraping failed!")

Scraping an Amazon page

Almost every browser's User-Agent string begins with Mozilla/5.0; what follows is the operating-system information, then the rendering-engine information, and only the final part (for example Mobile Safari) names the actual browser. Sending a browser-like User-Agent makes the request look like it comes from a real browser.

import requests
url = "https://www.amazon.com/gp/product/B01M8L5Z3Y"
try:
    # disguise the request as coming from a browser
    kv = {'user-agent': 'Mozilla/5.0'}
    # get() fetches the page at the given url
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1500:2000])
except requests.RequestException:
    print("Scraping failed!")

Scraping Baidu and 360 search results

Search-engine keyword submission interfaces:

  1. Baidu keyword interface: http://www.baidu.com/s?wd=keyword
  2. 360 keyword interface: http://www.so.com/s?q=keyword
# scrape Baidu search results
import requests
keyword = "python"
url = "http://www.baidu.com/s"
try:
    kv = {'wd': keyword}
    r = requests.get(url, params=kv)
    # show the URL that was actually requested
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("Scraping failed!")
# scrape 360 search results
import requests
keyword = "python"
url = "http://www.so.com/s"
try:
    kv = {'q': keyword}
    r = requests.get(url, params=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("Scraping failed!")

Downloading and saving images from the web

Fetch an image and save it to disk:

# import the module
import requests
# the image to download: fill in the image URL you want to scrape
url = "https://www.baidu.com/img/bdlogo.png"
# send the request and get the response
response = requests.get(url)
# save the image: response.content holds the raw bytes
with open('image.png', 'wb') as f:
    f.write(response.content)

A fuller version that derives the file name from the URL and skips the download if the file already exists:

import requests
import os
url = "https://www.baidu.com/img/bdlogo.png"
root = "F://好看图片//"
# use the last segment of the URL as the file name
path = root + url.split('/')[-1]
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, 'wb') as f:
            f.write(r.content)
        print("File saved!")
    else:
        print("File already exists!")
except Exception:
    print("Scraping failed!")

Looking up where an IP address belongs

import requests
url = "http://m.ip138.com/ip.asp?ip="
try:
    r = requests.get(url + '14.215.177.38')
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])  # show the last 500 characters
except requests.RequestException:
    print("Scraping failed!")

beautifulsoup4

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to scrape; because it is so simple, a complete application takes very little code. Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8, so you normally do not need to think about encodings at all; only when a document declares no encoding (and detection fails) do you need to state the original encoding yourself. Used together with parsers such as lxml and html5lib, it lets you choose flexibly between different parsing strategies and raw speed.

Installing bs4

pip install beautifulsoup4

Importing bs4

# option one
from bs4 import BeautifulSoup
# option two
import bs4
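
The automatic Unicode conversion described above can be seen directly: feed BeautifulSoup raw bytes and it detects the encoding and hands back decoded strings. A minimal sketch:

```python
from bs4 import BeautifulSoup

# a page encoded as UTF-8 bytes, with the charset declared in the markup
raw = '<html><head><meta charset="utf-8"></head><body><p>百度一下</p></body></html>'.encode('utf-8')

soup = BeautifulSoup(raw, 'html.parser')
print(soup.p.string)           # already decoded to a Python str: 百度一下
print(soup.original_encoding)  # the encoding Beautiful Soup detected
```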

Basic elements of the BeautifulSoup class

Element  Description
Tag  a tag, the most basic unit of information, delimited by <...> at the start and </...> at the end
Name  the tag's name; <p>…</p> has the name 'p'; accessed as <tag>.name
Attributes  the tag's attributes, organized as a dictionary; accessed as <tag>.attrs
NavigableString  the non-attribute string inside a tag, i.e. the text between <>…</>; accessed as <tag>.string
Comment  a comment string inside a tag, a special Comment type

BeautifulSoup parsers

No.  Parser  Usage  Requirement  Strengths  Weaknesses
1  Python standard library  BeautifulSoup(html, 'html.parser')  install bs4  built into Python; fast  weaker error tolerance
2  lxml HTML parser  BeautifulSoup(html, 'lxml')  pip install lxml  fast; strong error tolerance  needs installing; depends on C libraries
3  lxml XML parser  BeautifulSoup(html, 'xml')  pip install lxml  fast; strong error tolerance; supports XML  depends on C libraries
4  html5lib parser  BeautifulSoup(html, 'html5lib')  pip install html5lib  parses the way a browser does; best error tolerance  slow
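
The differences in the table can be tried out on a deliberately broken fragment. This sketch uses only the standard-library parser unconditionally and treats lxml and html5lib as optional, since they may not be installed:

```python
from bs4 import BeautifulSoup

# unclosed tags: each parser repairs broken markup in its own way
broken = "<p>first<p>second"

# the standard-library parser ships with Python, only bs4 is needed
print(BeautifulSoup(broken, "html.parser").prettify())

# lxml and html5lib are separate installs; skip them when missing
for features in ("lxml", "html5lib"):
    try:
        print(features, ":", BeautifulSoup(broken, features).prettify())
    except Exception:   # bs4 raises FeatureNotFound for missing parsers
        print(features, "is not installed")
```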

A quick look at page tags

(figure: the structure of an HTML page and its tags)

A small HTML-parsing example

C:\Users\Administrator>python
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)]
 on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p cl
ass="title"><b>The demo python introduces several python courses.</b></p>\r\n<p
class="course">Python is a wonderful general-purpose programming language. You c
an learn Python from novice to professional by tracking the following courses:\r
\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">B
asic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" cl
ass="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo=r.text
>>> from bs4 import BeautifulSoup
# BeautifulSoup(text or markup string to parse, parser name)
>>> soup = BeautifulSoup(demo,"html.parser")
# soup.prettify(): BeautifulSoup's pretty-printing method
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Pyt
hon from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">

    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="lin
k2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.title
<title>This is a python demo page</title>
# name of the a tag
>>> soup.a.name
'a'
# name of the a tag's parent
>>> soup.a.parent.name
'p'
# name of the a tag's grandparent
>>> soup.a.parent.parent.name
'body'
>>> tag = soup.a    # extract the a tag on its own
>>> type(tag)       # bs4 stores tags in a dedicated Tag type
<class 'bs4.element.Tag'>
# the a tag's attributes
>>> soup.a.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
# the attributes are stored as a dict
>>> type(tag.attrs)
<class 'dict'>
# value of the id attribute
>>> soup.a.attrs['id']
'link1'
# text content of the a tag
>>> tag.string
'Basic Python'
# text content of the a tag
>>> soup.a.string
'Basic Python'
# text content of the p tag
>>> soup.p.string
'The demo python introduces several python courses.'
# type of the p tag's content
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<b><!--a comment--></b><p>content of the p tag</p>","html.parser")
# print the b tag
>>> soup.b
<b><!--a comment--></b>
# .string returns only the comment's text, without the comment markers
>>> soup.b.string
'a comment'
# the type tells us this is a comment
>>> type(soup.b.string)
<class 'bs4.element.Comment'>
# .string on the p tag likewise returns plain text
>>> soup.p.string
'content of the p tag'
# the type of the p tag's content
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
# the only way to tell a comment apart from ordinary text is to check its type()

Traversing HTML content with bs4

(figure: the three traversal directions of the tag tree)

# the three traversal directions:
# 1. bottom-up traversal goes from the leaf nodes toward the root
# 2. top-down traversal goes from the root toward the leaf nodes
# 3. sideways traversal moves between tags at the same level

Downward traversal

Attribute  Description
.contents  a list of the node's children
.children  an iterator over the node's children, for looping over them
.descendants  an iterator over all descendants, for looping over the whole subtree
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.head
# list of the head tag's children
>>> soup.head.contents
# list of the body tag's children
>>> soup.body.contents
# number of children of body
>>> len(soup.body.contents)
# get the second element of body's child list
>>> soup.body.contents[1]
# loop over the children
>>> for child in soup.body.children:
...     print(child)
# loop over all descendants
>>> for child in soup.body.descendants:
...     print(child)

Upward traversal

Attribute  Description
.parent  the node's parent tag
.parents  an iterator over the node's ancestor tags, for looping over them
>>> soup.title.parent
<head><title>This is a python demo page</title></head>
>>> soup.html.parent
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. Yo
u can learn Python from novice to professional by tracking the following courses
:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Bas
ic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001
870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.parent
>>>
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
# upward traversal of the tag tree: print the name of each ancestor of the a tag
>>> for parent in soup.a.parents:
...     if parent is None:
...             print(parent)
...     else:
...             print(parent.name)
...
p
body
html
[document]

Sideways traversal

Attribute  Description
.next_sibling  the next sibling tag in HTML document order
.previous_sibling  the previous sibling tag in HTML document order
.next_siblings  an iterator over all following sibling tags, in document order
.previous_siblings  an iterator over all preceding sibling tags, in document order
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.next_sibling.next_sibling.next_sibling
'.'
>>> soup.a.next_sibling.next_sibling.next_sibling.next_sibling
>>>

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
# loop over the preceding siblings
>>> for psibling in soup.a.previous_siblings:
...     print(psibling)
...
Python is a wonderful general-purpose programming language. You can learn Python
 from novice to professional by tracking the following courses:

        
# loop over the following siblings
>>> for nsibling in soup.a.next_siblings:
...     print(nsibling)
...
 and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2"
>Advanced Python</a>
.
>>>

Pretty-printed HTML output with bs4

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. Yo
u can learn Python from novice to professional by tracking the following courses
:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Bas
ic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001
870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.prettify()
'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>
\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several p
ython courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful
general-purpose programming language. You can learn Python from novice to profes
sional by tracking the following courses:\n   <a class="py1" href="http://www.ic
ourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="lin
k2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Pyt
hon from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">

    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="lin
k2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
>>> soup.a.prettify()
'<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n
 Basic Python\n</a>\n'
>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>


>>>
>>> from bs4 import BeautifulSoup
>>> soupZN = BeautifulSoup("<b>一段中文</b>","html.parser")
>>> soupZN.b.string
'一段中文'
>>> print(soupZN.b.prettify())
<b>
 一段中文
</b>

Three forms of information marking

Information marking
  • gives information an organizational structure and adds a dimension to it
  • the markup is as valuable as the information itself
  • marked-up information can be used for communication, storage, and display
  • marked-up information is easier for programs to understand and use

XML

XML: eXtensible Markup Language.

The main constructs:
<name>...</name> # the common element form
<name /> # shorthand for an empty element
<!-- --> # a comment

JSON

JSON: JavaScript Object Notation.

The main constructs:
"key":"value"
"key":["value1","value2"]
"key":{"subkey":"value"}
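
In Python, JSON text in the forms above maps straight onto dicts and lists via the standard json module. A minimal sketch (the keys are made up for illustration):

```python
import json

# JSON text mirroring the three key/value forms shown above
text = '{"key1": "value", "key2": ["value1", "value2"], "key3": {"subkey": "value"}}'

data = json.loads(text)        # parse JSON text into Python objects
print(data["key2"][0])         # value1
print(data["key3"]["subkey"])  # value
print(json.dumps(data))        # serialize back to a JSON string
```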

YAML

YAML: YAML Ain't Markup Language.

The main constructs:
key: value
key: value # "#" introduces a comment
key:
 - value1 # "-" marks items standing side by side
 - value2
key:
    subkey: subvalue # indentation expresses nesting

Like Python, YAML uses indentation to express nesting; "-" marks parallel items, "#" introduces a comment, and "|" introduces a whole block of literal text.

Information extraction

Method one

Fully parse the document's markup, then extract the key information.
Text formats: XML, JSON, YAML.
Requires a markup parser, for example bs4's tag-tree traversal.
Advantage: extraction is accurate.
Disadvantage: the process is tedious, and it requires a thorough understanding of how the document is organized.

Method two

Ignore the markup and search for the key information directly.
Treat the document as text and search it; only text-search functions are needed.
Advantage: the process is simple and fast.
Disadvantage: the accuracy of the results depends on the content.

Combined method

Combined method: couple structural parsing with searching to extract the key information.
Text formats: XML, JSON, YAML.
Requires both a markup parser and text-search functions.

Example: extracting all URL links from an HTML page

1) Find all a tags.
2) Parse each a tag and read the link from its href attribute.

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001

基于bs4的内容查找

<>.find_all(name,attrs,recursive,string,**kwargs) 返回一个列表类型,储存查找的结果
name 对标签名称的检索字符串
attrs 对标签属性值的检索字符串,可标注属性检索
recursive 是否对子孙全部检索,默认True(布尔型)
string <>…中字符串区域的检索字符串

Example: show all a tags

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Ba
sic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-100187
0001" id="link2">Advanced Python</a>]

Example: show all a and b tags (pass the names as a list: ['a','b'])

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href=
"http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a cl
ass="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Adva
nced Python</a>]

Example: show all tags

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> for tag in soup.find_all(True):
...     print(tag.name)
...
html
head
title
body
p
b
p
a
a

The re module is Python's string-matching module; much of what it provides is built on regular expressions, which perform fuzzy matching on strings so that you can extract just the parts you need. The regular-expression syntax itself is common to virtually all languages.

Example: show the tags whose names contain the letter b

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> import re
>>> for tag in soup.find_all(re.compile('b')):
...     print(tag.name)
...
body
b

Example: show the p tags whose class is course

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. Y
ou can learn Python from novice to professional by tracking the following course
s:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Bas
ic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001
870001" id="link2">Advanced Python</a>.</p>]

Example: show the tag whose id attribute is link1

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.find_all(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Ba
sic Python</a>]

Example: show the tags whose id attribute contains link

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> import re
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

Example: searching all descendants for a tags versus searching only the current tag's direct children

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> import re
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a',recursive=False)
[]

Example: searching the string region between <>…</>

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
# search for the exact string Basic Python
>>> soup.find_all(string='Basic Python')
['Basic Python']

示例:查询包含特定字符串

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> import re
# strings that contain "python"
>>> soup.find_all(string=re.compile("python"))
['This is a python demo page', 'The demo python introduces several python course
s.']
# strings that contain "Python"
>>> soup.find_all(string=re.compile("Python"))
['Python is a wonderful general-purpose programming language. You can learn Pyth
on from novice to professional by tracking the following courses:\r\n', 'Basic P
ython', 'Advanced Python']
# soup(...) is shorthand for soup.find_all(...)
>>> soup(string=re.compile("Python"))
['Python is a wonderful general-purpose programming language. You can learn Pyth
on from novice to professional by tracking the following courses:\r\n', 'Basic P
ython', 'Advanced Python']
Method  Description
<>.find()  searches and returns only the first result (a Tag); same parameters as .find_all()
<>.find_parents()  searches the ancestor nodes; returns a list; same parameters as .find_all()
<>.find_parent()  returns the first result among the ancestors; same parameters as .find()
<>.find_next_siblings()  searches the following siblings; returns a list; same parameters as .find_all()
<>.find_next_sibling()  returns the first result among the following siblings; same parameters as .find()
<>.find_previous_siblings()  searches the preceding siblings; returns a list; same parameters as .find_all()
<>.find_previous_sibling()  returns the first result among the preceding siblings; same parameters as .find()
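
A short sketch of these variants on a made-up fragment shaped like the demo page (the markup and ids here are hypothetical):

```python
from bs4 import BeautifulSoup

# hypothetical markup resembling the demo page used throughout this section
html = """<p class="course">Intro text
<a id="link1" href="http://example.com/a">Basic</a> and
<a id="link2" href="http://example.com/b">Advanced</a>.</p>"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                     # first match only: a Tag, not a list
print(first["id"])                         # link1
print(first.find_parent("p")["class"])     # ['course']
print(first.find_next_sibling("a")["id"])  # link2
print(soup.find(id="link2").find_previous_sibling("a")["id"])  # link1
```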
