Fetching a web page and formatting its content

Method 1:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)  # fetch the page
    except (HTTPError, URLError):  # page missing or server unreachable: return None
        return None

    try:
        bs = BeautifulSoup(html.read(), 'html.parser')  # parse the fetched page
        title = bs.body  # this example grabs the body tag
    except AttributeError:  # raised when accessing an attribute of a tag that does not exist
        return None
    return title  # the parsed body, or None if the page has no body tag

title = getTitle('http://www.baidu.com')
if title is None:
    print('Title could not be found')
else:
    print(title)
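
The two failure modes that the second try block guards against can be seen offline. This is a minimal sketch on a made-up HTML string (not one of the pages above): accessing a tag that does not exist gives None, and accessing anything on that None raises AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a body but no h1 tag.
bs = BeautifulSoup('<html><body><p>no heading here</p></body></html>',
                   'html.parser')

# A missing tag does not raise; BeautifulSoup returns None.
missing = bs.body.h1

# Going one level further means calling an attribute on None,
# which is what raises AttributeError.
try:
    bs.body.h1.span
    error_raised = False
except AttributeError:
    error_raised = True

print(missing, error_raised)
```

This is why a function such as getTitle needs both the None check and the AttributeError handler.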

Method 2:

import requests  # uses the requests library
from bs4 import BeautifulSoup

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)  # send a GET request
        r.raise_for_status()  # raise requests.HTTPError for 4xx/5xx status codes
        r.encoding = r.apparent_encoding  # use the encoding detected from the content
        demo = r.text  # the page content for this url
    except requests.RequestException:
        return None

    soup = BeautifulSoup(demo, 'html.parser')  # parse the HTML
    return soup.prettify()  # prettify() re-indents the markup so it is easier to read


url = 'http://www.baidu.com'
h = getHTMLText(url)
if h is None:
    print('Page could not be found')
else:
    print(h)

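BeautifulSoup's find() and find_all()

find_all(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)

· tag: a tag name, or a list of tag names, that restricts which tags are matched.
· attributes: a dictionary of tag attributes; tags that match those attributes are returned.
· recursive: a Boolean. If True (the default), the search covers children, children's children, and so on; if False, only the direct children of the tag being searched are examined.
· text: matches on the text content of tags rather than on their names or attributes (recent BeautifulSoup versions also accept this argument under the name string). Example: counting how many times a given text appears: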

nameList = bs.find_all(text='the prince')
print(len(nameList))

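· limit: return only the first x results found in the page. find() is equivalent to find_all() with a limit of 1.
· keyword: select tags that have the given attributes. Example:

title = bs.find_all(id='title', class_='text')

The code above returns every tag whose id attribute is title and whose class attribute contains the word text. Because class is a reserved word in Python, the keyword argument is spelled class_.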
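
The parameters above can be tried out without any network access. The following is a minimal sketch on an inline HTML string; the markup, the green and red class names, and the title id are all made up for illustration. It uses string=, the newer name for the text argument.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
html = """
<html><body>
<h1 id="title" class="text">War and Peace</h1>
<h2>Chapter 1</h2>
<p><span class="green">the prince</span> spoke to <span class="red">Anna</span>.</p>
<p><span class="green">the prince</span> left with <span class="green">Pierre</span>.</p>
</body></html>
"""
bs = BeautifulSoup(html, 'html.parser')

# tag: a single name or a list of names
headings = bs.find_all(['h1', 'h2'])

# attributes: a dict mapping attribute names to the wanted values
green = bs.find_all('span', {'class': 'green'})

# limit: stop after the first N matches
first_two = bs.find_all('span', limit=2)

# text/string: match on the text content itself
princes = bs.find_all(string='the prince')

# keyword arguments: class_ because class is reserved in Python
title = bs.find_all(id='title', class_='text')

print(len(headings), len(green), len(first_two), len(princes), len(title))
```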

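Other BeautifulSoup objects

BeautifulSoup object

The bs in the earlier code examples.

Tag object

Example: bs.div.h1

NavigableString object

Represents the text inside a tag.

Comment object

Represents an HTML comment in the document.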
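
To see the four object types side by side, here is a small sketch on a made-up snippet. The HTML string is an assumption for illustration; the lambda passed to string= is one way to pick out comments, not the only one.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup containing a tag, some text, and a comment.
html = '<div><h1>Heading</h1><!-- a hidden note --></div>'
bs = BeautifulSoup(html, 'html.parser')

h1 = bs.div.h1    # a Tag object
text = h1.string  # the NavigableString inside the tag

# Comment is a subclass of NavigableString, so it can be found
# by filtering strings on their type.
comment = bs.find(string=lambda s: isinstance(s, Comment))

for obj in (bs, h1, text, comment):
    print(type(obj).__name__, repr(str(obj)))
```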

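Navigating trees

Find tags by their position in the document.
1. Dealing with children and other descendants
Children are exactly one level below their parent, while descendants can be at any level below it. By default, BeautifulSoup functions operate on the descendants of the currently selected tag.
Examples:
bs.body.h1: selects the first h1 tag that is a descendant of the body tag
bs.div.find_all("img"): finds the first div tag in the document, then returns a list of all img tags that are descendants of that div

Finding children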

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

for child in bs.find('table',{'id':'giftList'}).children:
    print(child)
# prints every product row in the giftList table
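
The difference between .children and .descendants is easier to see on a small table. This sketch uses a made-up table that only imitates the giftList structure; the rows and prices are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table, written without whitespace between tags so that
# no whitespace-only text nodes show up among the children.
html = ('<table id="giftList">'
        '<tr><th>Item</th><th>Cost</th></tr>'
        '<tr><td>Vegetable Basket</td><td>$15.00</td></tr>'
        '<tr><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>'
        '</table>')
bs = BeautifulSoup(html, 'html.parser')
table = bs.find('table', {'id': 'giftList'})

# .children: only the tags directly inside the table
children = [c.name for c in table.children if isinstance(c, Tag)]

# .descendants: every tag at any depth below the table
descendants = [d.name for d in table.descendants if isinstance(d, Tag)]

print(children)
print(descendants)
```

On a real page the children usually also include text nodes that hold only whitespace, which is why the isinstance filter is there.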

2. Dealing with siblings
next_siblings: a generator over the siblings that come after the current tag

Sibling tags

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html,'html.parser')

for sibling in bs.find('table', {'id':'giftList'}).tr.next_siblings:
    print(sibling)
    
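# prints every product row in the table except the first (title) row

Likewise, there are next_sibling, previous_sibling, and previous_siblings; the singular forms return a single node instead of a generator.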
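
A compact sketch of the four sibling attributes, again on a made-up table. The row ids (header, gift1, and so on) are hypothetical and exist only to make the output readable.

```python
from bs4 import BeautifulSoup

# Hypothetical table; no whitespace between rows, so every sibling is a tag.
html = ('<table id="giftList">'
        '<tr id="header"><th>Item</th></tr>'
        '<tr id="gift1"><td>A</td></tr>'
        '<tr id="gift2"><td>B</td></tr>'
        '<tr id="gift3"><td>C</td></tr>'
        '</table>')
bs = BeautifulSoup(html, 'html.parser')

# A tag is never its own sibling, so the header row is skipped.
header = bs.find('table', {'id': 'giftList'}).tr
after_header = [row['id'] for row in header.next_siblings]

gift2 = bs.find('tr', {'id': 'gift2'})
next_id = gift2.next_sibling['id']      # the single node after gift2
prev_id = gift2.previous_sibling['id']  # the single node before gift2

# previous_siblings walks backwards, nearest sibling first.
gift3 = bs.find('tr', {'id': 'gift3'})
before_gift3 = [row['id'] for row in gift3.previous_siblings]

print(after_header, next_id, prev_id, before_gift3)
```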

3. Dealing with parents
.parent returns the immediate parent of a tag; .parents iterates over all of its ancestors.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html,'html.parser')
print(bs.find('img',{'src':'../img/gifts/img1.jpg'}).parent.previous_sibling.get_text())
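
The same parent-then-sibling hop can be followed step by step on a single made-up table row. The row content is an assumption that imitates the layout the code above relies on: the price cell sits directly before the cell holding the image.

```python
from bs4 import BeautifulSoup

# Hypothetical row: name, price, then the image cell.
html = ('<table><tr>'
        '<td>Vegetable Basket</td>'
        '<td>$15.00</td>'
        '<td><img src="../img/gifts/img1.jpg"></td>'
        '</tr></table>')
bs = BeautifulSoup(html, 'html.parser')

img = bs.find('img', {'src': '../img/gifts/img1.jpg'})
cell = img.parent                        # the td that wraps the image
price = cell.previous_sibling.get_text() # the td just before it

# .parents walks all the way up to the document root.
ancestors = [p.name for p in img.parents]

print(price)
print(ancestors)
```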

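Regular expressions

A regular expression describes a set of strings through a sequence of linear rules, for example:
1. "a" appears at least once
2. followed by "b", exactly 5 times
3. followed by "c", any even number of times
4. ending with either "d" or "e"
Result: aa*bbbbb(cc)*(d|e)
aa*: a followed by a*, where a* means any number of a's, including 0, so at least one a is guaranteed
bbbbb: exactly five b's in a row
(cc)*: any number of repetitions of cc, including 0
(d|e): a single d or a single e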
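
The four rules can be checked with Python's re module. This is a small sketch; the candidate strings are made up to exercise each rule in turn.

```python
import re

pattern = re.compile(r'aa*bbbbb(cc)*(d|e)')

# Hypothetical test strings: the first three follow all four rules,
# the last three each break one of them.
candidates = [
    'abbbbbd',       # one a, five b, no c, d
    'aaabbbbbcce',   # three a, five b, two c, e
    'aabbbbbccccd',  # two a, five b, four c, d
    'bbbbbd',        # breaks rule 1: no a
    'abbbbbcd',      # breaks rule 3: odd number of c
    'abbbbd',        # breaks rule 2: only four b
]

# fullmatch requires the whole string to fit the pattern.
results = {s: bool(pattern.fullmatch(s)) for s in candidates}

for s, ok in results.items():
    print(s, ok)
```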

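Common regular expression symbols

Symbol	Meaning	Example	Matches
*	Matches the preceding character 0 or more times	a*b*	aaaaaa, aaabbb, bbbbbb
+	Matches the preceding character at least once	a+b+	aaaaab, aaaabbb, abbbb
[ ]	Matches any one character inside the brackets	[A-Z]*	APPLE, CAPITALS
( )	Groups a subexpression; groups are evaluated first
{m,n}	Matches the preceding character between m and n times	a{2,3}	aa, aaa
[^]	Matches any one character that is not inside the brackets	[^A-Z]*	apple, mysql
|	Matches any one of the alternatives separated by |	a|e	a, e
.	Matches any single character	b.d	bad, bzd, b@d, b d
^	Anchors the character or subexpression to the start of the string	^a	apple, asdf, a
\	Escape character	\.\|\\	.|\
$	Often used at the end of a regular expression; anchors the match to the end of the string	[A-Z]*[a-z]*$	ABCabc, zzzyx, Bob
?!	"Does not contain": the characters that follow must not appear at that point in the target string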
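
Each row of the table can be verified with re.fullmatch. The sketch below pairs table patterns with one of their listed matches. The last pattern, ^((?![A-Z]).)*$, is not from the table: it is a commonly used negative lookahead idiom added here as an assumption to illustrate the ?! row, together with two made-up test strings.

```python
import re

# (pattern, string that should match) pairs taken from the table,
# plus one illustrative negative lookahead at the end.
examples = [
    (r'a*b*', 'aaabbb'),
    (r'a+b+', 'abbbb'),
    (r'[A-Z]*', 'APPLE'),
    (r'a{2,3}', 'aaa'),
    (r'[^A-Z]*', 'apple'),
    (r'a|e', 'e'),
    (r'b.d', 'b@d'),
    (r'\.\|\\', '.|\\'),            # matches the three characters . | \
    (r'[A-Z]*[a-z]*$', 'ABCabc'),
    (r'^((?![A-Z]).)*$', 'no-caps-here'),
]
all_match = all(re.fullmatch(p, s) for p, s in examples)

# The lookahead rejects any string containing an uppercase letter.
rejected = re.fullmatch(r'^((?![A-Z]).)*$', 'Has-Caps')

print(all_match, rejected)
```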

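Classic application: recognizing email addresses
Rules:
1. Uppercase letters, lowercase letters, digits, periods, plus signs, or underscores, at least one character:
[A-Za-z0-9._+]+
2. The @ symbol: @
3. At least one uppercase or lowercase letter after the @: [A-Za-z]+
4. A period: \.
5. The ending of the address: (com|org|edu|net)
Answer: [A-Za-z0-9._+]+@[A-Za-z]+\.(com|org|edu|net)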
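
A quick way to try the assembled pattern is sketched below. The is_email helper and all of the addresses are hypothetical. The pattern is deliberately simple, so it also rejects many real addresses, for example ones with a subdomain or a digit in the domain.

```python
import re

EMAIL = re.compile(r'[A-Za-z0-9._+]+@[A-Za-z]+\.(com|org|edu|net)')

def is_email(candidate):
    """Return True if the whole string fits the simplified pattern."""
    return EMAIL.fullmatch(candidate) is not None

# Hypothetical addresses used only for illustration.
checks = {
    'first.last+tag@example.com': is_email('first.last+tag@example.com'),
    'user_1@school.edu': is_email('user_1@school.edu'),
    'user@example': is_email('user@example'),            # no ending
    'user@examplexcom': is_email('user@examplexcom'),    # no literal period
    'user@mail.example.com': is_email('user@mail.example.com'),  # subdomain
}

for address, ok in checks.items():
    print(address, ok)
```

Escaping the period as \. matters: an unescaped . would match any character, so user@examplexcom would be accepted.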

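Regular expressions and BeautifulSoup

Page to scrape: http://www.pythonscraping.com/pages/page3.html
The product images on that page have src paths of the form ../img/gifts/img1.jpg.
If you want the URLs of all the product images, the obvious approach is find_all("img"). But besides the target images, modern websites often contain hidden images and blank images used for spacing and aligning elements. So instead we look the product images up directly by their file path: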

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html,'html.parser')
images = bs.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
for image in images:
    print(image['src'])
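
The filtering effect of the regular expression can be checked offline. In this sketch the HTML is made up: two product images mixed with a logo and a spacer image of the kind described above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: only img1.jpg and img2.jpg are product images.
html = ('<div>'
        '<img src="../img/gifts/logo.jpg">'
        '<img src="../img/gifts/img1.jpg">'
        '<img src="../img/spacer.gif">'
        '<img src="../img/gifts/img2.jpg">'
        '</div>')
bs = BeautifulSoup(html, 'html.parser')

# A compiled pattern can be used wherever an attribute value is expected.
gift_images = bs.find_all(
    'img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
srcs = [img['src'] for img in gift_images]

print(srcs)
```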

Accessing attributes

To get all attributes of a tag object as a dictionary:
myTag.attrs
Example: getting the source location (src) of an image:

myImgTag.attrs['src']
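
A short sketch of attribute access on a made-up img tag; the alt text and class names are illustrative. Note that class comes back as a list because an element may carry several classes.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with three attributes.
bs = BeautifulSoup(
    '<img src="../img/gifts/img1.jpg" alt="Vegetable Basket" class="gift thumb">',
    'html.parser')
img = bs.img

all_attrs = img.attrs         # dictionary of every attribute
src = img.attrs['src']        # same as img['src']
classes = img['class']        # multi-valued attribute: a list
width = img.get('width')      # missing attribute: None instead of KeyError

print(all_attrs)
print(src, classes, width)
```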

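Lambda expressions

BeautifulSoup lets you pass certain kinds of functions as arguments to find_all. The only restriction is that the function must take a tag object as its argument and return a Boolean.
BeautifulSoup evaluates every tag object it encounters with this function, keeps the tags for which it returns True, and discards the rest.
Example: getting all tags that have exactly two attributes:
bs.find_all(lambda tag: len(tag.attrs) == 2)
The function passed in here is lambda tag: len(tag.attrs) == 2. Whenever it returns True for a tag, find_all includes that tag in its result, so the call finds all tags with exactly two attributes.
Lambda functions are useful enough that you can even use them in place of existing BeautifulSoup functions:
bs.find_all(lambda tag: tag.get_text() == 'Or maybe he\'s only resting?')
Without a lambda function:
bs.find_all('', text='Or maybe he\'s only resting?')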
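
The sketch below runs both lambda examples on a made-up snippet, and then the text-based search for comparison (written with string=, the newer name of the text argument). One difference worth knowing, as far as I can tell from the library's behavior: the lambda version returns the tag, while the text search returns the string inside it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: div and span have two attributes each, p has one.
html = ('<div id="a" class="x">'
        '<p id="b">Or maybe he\'s only resting?</p>'
        '<span class="y" title="t">other</span>'
        '</div>')
bs = BeautifulSoup(html, 'html.parser')

# Tags with exactly two attributes.
two_attrs = [t.name for t in bs.find_all(lambda tag: len(tag.attrs) == 2)]

# Tags whose full text equals the sentence.
by_lambda = bs.find_all(
    lambda tag: tag.get_text() == "Or maybe he's only resting?")

# Text search: matches the text node, whose parent is the same p tag.
by_string = bs.find_all(string="Or maybe he's only resting?")

print(two_attrs)
print(by_lambda[0].name, by_string[0].parent.name)
```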

Extracting the links on a page

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://cn.bing.com')
bs = BeautifulSoup(html, 'html.parser')
for link in bs.find_all('a'):
    if 'href' in link.attrs:
        print(link.attrs['href'])
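
Many href values are relative, so a crawler usually has to resolve them against the page URL before following them. This is a minimal sketch using urljoin from the standard library; the base URL and the markup are made up, and href=True is used as a shorter way to keep only the a tags that have an href attribute.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical page address and content.
base_url = 'https://example.com/articles/index.html'
html = ('<a href="/about">About</a>'
        '<a href="page2.html">Next</a>'
        '<a name="top">anchor without href</a>'
        '<a href="https://other.example.org/x">External</a>')
bs = BeautifulSoup(html, 'html.parser')

# href=True keeps only tags that actually carry the attribute;
# urljoin turns relative links into absolute ones.
links = [urljoin(base_url, a['href']) for a in bs.find_all('a', href=True)]

for link in links:
    print(link)
```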

Crawling an entire site

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html,'html.parser')
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                #we have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks('')

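To avoid crawling the same page twice, deduplicating links is essential. While the code runs, keep every link found so far in one place, stored in a set for fast lookup. A set has no particular order and stores only unique elements.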
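
The deduplication logic can be tested without touching the network. The sketch below is a variation, not the code above: it crawls a made-up in-memory site (the SITE dictionary stands in for urlopen), extracts links with a plain regular expression instead of BeautifulSoup to stay dependency-free, and uses a queue rather than recursion, since deep recursion on a large site can hit Python's default recursion limit of 1000.

```python
import re
from collections import deque

# Hypothetical site: path -> HTML. Stands in for real HTTP requests.
SITE = {
    '/wiki/A': '<a href="/wiki/B">B</a> <a href="/wiki/C">C</a> <a href="/about">x</a>',
    '/wiki/B': '<a href="/wiki/A">A</a> <a href="/wiki/C">C</a>',
    '/wiki/C': '<a href="/wiki/A">A</a>',
}
LINK = re.compile(r'href="(/wiki/[^"]*)"')

def crawl(start):
    """Visit every /wiki/ page reachable from start exactly once."""
    seen = {start}          # set membership tests are fast
    order = []              # pages in the order they were visited
    queue = deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in LINK.findall(SITE.get(page, '')):
            if link not in seen:   # skip links already discovered
                seen.add(link)
                queue.append(link)
    return order

visited = crawl('/wiki/A')
print(visited)
```

Even though the three pages link to each other in a cycle, each one is visited once and the loop terminates.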

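Planning and defining objects

1. What information do I actually need?
2. Does this information help achieve the goal? What is the impact of collecting it or not?
3. What would be the impact of collecting it later?
4. Is it redundant with data already collected?
5. Does it make sense to store the data this way?

Web crawlers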