A Simple BeautifulSoup Project: Scraping View Counts from Liao Xuefeng's Python Tutorial

# -*- coding: utf-8 -*-

"""
A simple BeautifulSoup project: scrape the view counts of Liao Xuefeng's Python tutorial

@author: yunpoyue
"""
import urllib.request
from bs4 import BeautifulSoup

# URLs used below

urlBegin = "http://www.liaoxuefeng.com"
url = "http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000"

# Fetch the page source

html = urllib.request.urlopen(url).read()

# Parse the page with BS4 to get the full-page soup and the part we need, soup2

soup = BeautifulSoup(html, 'html.parser')
menu = soup.find_all(id="x-offcanvas-left")  # left-hand table of contents
values = ','.join(str(v) for v in menu)
soup2 = BeautifulSoup(values, 'html.parser')
soup2 = soup2.find_all("ul", "uk-nav uk-nav-side")
soup2 = soup2[1].find_all('a')  # extract the links in the table of contents

# Collect the chapter titles, chapter links, and view counts

bookMenu = []
bookMenuUrl = []
readnumber = []
for i in range(0, len(soup2) - 1):  # note: stops one short, so the last link is skipped
    bookMenu.append(soup2[i].get_text())
    bookMenuUrl.append(soup2[i].attrs['href'])
for i in range(0, len(bookMenuUrl)):
    # Fetch each chapter page and read its view count
    chapterCode = urllib.request.urlopen(urlBegin + bookMenuUrl[i]).read()
    chapterSoup = BeautifulSoup(chapterCode, 'html.parser')
    chapterResult = chapterSoup.find_all('span')
    readnumber.append(chapterResult[2].get_text())  # assumes the third <span> holds the view count

# Write the results to a local file

with open('c://dev/python教程.txt', 'a', encoding='utf8') as f:
    for i in range(len(bookMenu)):
        f.write(bookMenu[i] + '-' + readnumber[i] + '\n')
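
The script above builds each chapter URL by string concatenation and pairs titles with view counts by index. As a rough sketch of an alternative (not the author's method), the snippet below resolves links with `urllib.parse.urljoin` and pairs the lists with `zip`. The helper names `build_rows` and `format_rows` are hypothetical, and the sample titles, paths, and counts are made-up placeholders, not real scraped values.

```python
from urllib.parse import urljoin

# Same base URL as urlBegin in the script above
URL_BEGIN = "http://www.liaoxuefeng.com"


def build_rows(titles, hrefs, counts, base=URL_BEGIN):
    """Pair each title with its absolute URL and view count (hypothetical helper)."""
    # zip stops at the shortest list, so a missing count cannot raise IndexError
    return [(t, urljoin(base, h), c) for t, h, c in zip(titles, hrefs, counts)]


def format_rows(rows):
    """Render rows in the same 'title-count' line format the script writes."""
    return "".join(f"{title}-{count}\n" for title, _url, count in rows)


# Placeholder data for illustration only
rows = build_rows(
    ["Introduction", "Installing Python"],
    ["/wiki/intro", "/wiki/install"],
    ["1000", "2000"],
)
output_text = format_rows(rows)
```

The string in `output_text` could then be passed to `f.write` in place of the per-line loop. `urljoin` also copes with relative links that lack a leading slash, which plain concatenation would not.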
