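莫烦Python
Having worked through the small crawler exercise (crawling related pages on Baidu Baike), I am writing up a study summary.
The crawl starts from the "web crawler" (网络爬虫) page.
First, import the modules needed below: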
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import random
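Next, set the starting URL. It should preferably contain no Chinese characters, so use the percent-encoded form. The /item/… path goes into his, a list that stores the pages visited so far.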
base_url = "https://baike.baidu.com"
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]
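To get a percent-encoded path without Chinese characters, the standard library's urllib.parse.quote can do the conversion. The sketch below is a minimal illustration and not part of the original tutorial: encode_item_path is a hypothetical helper name, and the title 网络爬虫 with id 5162711 is taken from the crawl output in this post.

```python
from urllib.parse import quote, unquote


# Hypothetical helper; the tutorial code simply hard-codes the encoded path.
def encode_item_path(title, item_id=None):
    """Build a percent-encoded /item/ path from a page title."""
    path = "/item/" + quote(title)  # quote() encodes the title as UTF-8 bytes
    if item_id is not None:
        path += "/" + str(item_id)
    return path


encoded = encode_item_path("网络爬虫", 5162711)
print(encoded)
print(unquote(encoded))  # decode back to the readable form
```

unquote is handy in the other direction, for reading the encoded paths that the crawler prints.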
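Finally, a for loop sets the number of crawl steps.
In each step, urlopen opens the URL and BeautifulSoup parses the page with the lxml parser.
soup.find("h1").get_text() then looks up the h1 tag, which holds the page title; .get_text() returns it as plain text.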
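As a rough illustration of the parse-and-extract step, the sketch below runs the same find("h1").get_text() call on a small inline HTML string instead of a live page. The HTML snippet is made up for the example, and it uses Python's built-in html.parser so that it runs without lxml installed; the crawler itself uses features='lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
html = """
<html><body>
  <h1>Web crawler</h1>
  <p>Some text with a <a href="/item/%E5%90%8E%E5%8F%B0" target="_blank">link</a>.</p>
</body></html>
"""

# html.parser ships with Python; swap in features="lxml" if lxml is installed.
soup = BeautifulSoup(html, features="html.parser")

title = soup.find("h1").get_text()  # text content of the first <h1>
print(title)

# find() returns None when nothing matches, so calling .get_text() on a page
# without an <h1> would raise AttributeError.
missing = soup.find("h2")
print(missing)
```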
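Then filter the links to follow, dropping everything that is not needed: find_all first selects the a tags, keeps only those with target="_blank", and requires href to match the regular expression /item/(%.{2})+$, that is, /item/ followed only by percent-encoded bytes up to the end of the string.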
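The href pattern can be tried on its own with the re module. This is a small sketch: the candidate href strings are picked for illustration, and since BeautifulSoup applies a compiled pattern with its search() method, the sketch does the same.

```python
import re

# Same pattern as in the crawler.
item_pattern = re.compile("/item/(%.{2})+$")

# Illustrative href values, shaped like links found on a Baidu Baike page.
candidates = [
    "/item/%E5%90%8E%E5%8F%B0",                            # only encoded bytes: kept
    "/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711",  # trailing /id: rejected
    "/item/abc",                                           # not percent-encoded: rejected
    "/wikicategory/view?categoryName=x",                   # not an /item/ link: rejected
]

kept = [href for href in candidates if item_pattern.search(href)]
print(kept)
```

Note that a path with a trailing numeric id does not pass this filter, so only the plain /item/<encoded title> links are followed.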
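One problem can come up during the crawl: if a page has no usable links, the crawler has to go back to the previous page, and this continues until all 10 steps are done. This is what the if len(sub_urls) != 0 … else branch handles.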
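The backtracking rule can be sketched without any network access. The example below assumes a tiny hypothetical link graph (LINKS) in place of real pages; walk is an illustrative name, and the check for an empty his is an added guard that the original loop does not have.

```python
import random

# Hypothetical link graph standing in for real pages: path -> outgoing links.
LINKS = {
    "/item/A": ["/item/B"],
    "/item/B": ["/item/C"],
    "/item/C": [],  # dead end: no usable links
}


def walk(start, steps, links):
    """Random walk with backtracking, mirroring the his.append / his.pop logic."""
    his = [start]
    visited = []
    for _ in range(steps):
        if not his:  # added guard: we backtracked past the start page
            break
        current = his[-1]
        visited.append(current)
        sub_urls = links.get(current, [])
        if sub_urls:
            his.append(random.choice(sub_urls))  # follow a random link
        else:
            his.pop()  # dead end: return to the previous page
    return visited, his


visited, his = walk("/item/A", 5, LINKS)
print(visited)
print(his)
```

In this graph every page has at most one link, so the walk is deterministic: it reaches the dead end, steps back, and re-enters it. With several links per page, the random choice may pick a different branch after backtracking.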
for i in range(10):
    url = base_url + his[-1]
    html = urlopen(url).read().decode('utf-8')
    soup = BeautifulSoup(html, features='lxml')
    print(i, soup.find("h1").get_text(), " url:", his[-1])
    # Narrow down the candidate links: keep only <a> tags with target="_blank"
    # whose href matches the /item/ pattern, then pick one at random.
    sub_urls = soup.find_all("a",
                             {"target": "_blank",
                              "href": re.compile("/item/(%.{2})+$")})
    if len(sub_urls) != 0:
        his.append(random.sample(sub_urls, 1)[0]['href'])
    else:
        # No usable links on this page: go back to the previous one.
        his.pop()
    print(his)
The output looks like this (it differs on every run):
0 网络爬虫 url: /item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711
['/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711', '/item/%E5%9B%BE%E5%BD%A2%E7%94%A8%E6%88%B7%E7%95%8C%E9%9D%A2']
1 GUI url: /item/%E5%9B%BE%E5%BD%A2%E7%94%A8%E6%88%B7%E7%95%8C%E9%9D%A2
['/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711', '/item/%E5%9B%BE%E5%BD%A2%E7%94%A8%E6%88%B7%E7%95%8C%E9%9D%A2', '/item/%E7%94%A8%E6%88%B7%E6%8E%A5%E5%8F%A3']
2 用户接口 url: /item/%E7%94%A8%E6%88%B7%E6%8E%A5%E5%8F%A3
['/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711', '/item/%E5%9B%BE%E5%BD%A2%E7%94%A8%E6%88%B7%E7%95%8C%E9%9D%A2', '/item/%E7%94%A8%E6%88%B7%E6%8E%A5%E5%8F%A3', '/item/%E5%90%8E%E5%8F%B0']
3 后台 url: /item/%E5%90%8E%E5%8F%B0
['/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711', '/item/%E5%9B%BE%E5%BD%A2%E7%94%A8%E6%88%B7%E7%95%8C%E9%9D%A2', '/item/%E7%94%A8%E6%88%B7%E6%8E%A5%E5%8F%A3', '/item/%E5%90%8E%E5%8F%B0', '/item/%E4%B9%89%E9%A1%B9']
…
8 回旋奏鸣曲式 url: /item/%E5%9B%9E%E6%97%8B%E5%A5%8F%E9%B8%A3%E6%9B%B2%E5%BC%8F
['/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711', '/item/%E5%9B%BE%E5%BD%A2%E7%94%A8%E6%88%B7%E7%95%8C%E9%9D%A2', '/item/%E7%94%A8%E6%88%B7%E6%8E%A5%E5%8F%A3', '/item/%E5%90%8E%E5%8F%B0', '/item/%E4%B9%89%E9%A1%B9', '/item/%E7%BA%A6%E7%BF%B0%E5%A5%88%E6%96%AF%C2%B7%E5%8B%83%E6%8B%89%E5%A7%86%E6%96%AF', '/item/%E8%B4%9D%E5%A4%9A%E8%8A%AC', '/item/%E5%B0%8F%E6%8F%90%E7%90%B4%E5%8D%8F%E5%A5%8F%E6%9B%B2', '/item/%E5%9B%9E%E6%97%8B%E5%A5%8F%E9%B8%A3%E6%9B%B2%E5%BC%8F', '/item/%E5%9B%9E%E6%97%8B']
9 回旋 url: /item/%E5%9B%9E%E6%97%8B
['/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711', '/item/%E5%9B%BE%E5%BD%A2%E7%94%A8%E6%88%B7%E7%95%8C%E9%9D%A2', '/item/%E7%94%A8%E6%88%B7%E6%8E%A5%E5%8F%A3', '/item/%E5%90%8E%E5%8F%B0', '/item/%E4%B9%89%E9%A1%B9', '/item/%E7%BA%A6%E7%BF%B0%E5%A5%88%E6%96%AF%C2%B7%E5%8B%83%E6%8B%89%E5%A7%86%E6%96%AF', '/item/%E8%B4%9D%E5%A4%9A%E8%8A%AC', '/item/%E5%B0%8F%E6%8F%90%E7%90%B4%E5%8D%8F%E5%A5%8F%E6%9B%B2', '/item/%E5%9B%9E%E6%97%8B%E5%A5%8F%E9%B8%A3%E6%9B%B2%E5%BC%8F']