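Preparation:
Beautiful Soup library: this library extracts data from HTML or XML files and, through a parser, supports the usual document operations such as navigating, searching, and modifying. It must be installed before use (for example with pip install beautifulsoup4).
Goal:
Write a crawler that fetches the Baidu Baike entry for "网络爬虫" (web crawler) ("http://baike.baidu.com/view/284853.htm") and prints every link containing the keyword "view" in a fixed format.
Implementation:
First, use the urllib.request module to read the HTML file from the given URL.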
>>> import urllib.request
>>> from bs4 import BeautifulSoup
>>> url = "http://baike.baidu.com/view/284853.htm"
>>> response = urllib.request.urlopen(url)
>>> html = response.read()
>>> soup = BeautifulSoup(html, "html.parser")
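BeautifulSoup takes two arguments: the first is the HTML or XML document to extract data from, and the second names the parser to use. Next, call find_all(href=re.compile("view")) to collect every tag whose href attribute contains the keyword "view" (this relies on regular expressions), and iterate over the result with a for loop.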
>>> import re
>>> for each in soup.find_all(href=re.compile("view")):  # every tag whose href contains "view"
...     print(each.text, "->", "".join(["http://baike.baidu.com", each["href"]]))
...
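The filter passed to find_all can be tried offline before pointing it at a live page. The following is a minimal sketch, assuming Beautiful Soup is installed; SAMPLE_HTML is made-up markup for illustration, not the real Baidu Baike page.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
<a href="/view/284853.htm">Web crawler</a>
<a href="/item/other">Other entry</a>
<a href="/view/10812319.htm">Search engine</a>
<p id="preview">a paragraph without any href attribute</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Passing a compiled pattern as href= keeps only tags that have an href
# attribute in which the pattern is found somewhere in the value.
matches = soup.find_all(href=re.compile("view"))
links = [(tag.text, tag["href"]) for tag in matches]

for text, href in links:
    print(text, "->", href)
```

Tags without an href attribute, such as the paragraph in the sample, are skipped, so each["href"] in the loop is safe for every returned tag.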
The final code is:
import urllib.request
import re

from bs4 import BeautifulSoup


def main():
    url = "http://baike.baidu.com/view/284853.htm"
    response = urllib.request.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html, "html.parser")
    # every tag whose href attribute contains "view"
    for each in soup.find_all(href=re.compile("view")):
        print(each.text, "->", "".join(["http://baike.baidu.com", each["href"]]))


if __name__ == "__main__":  # main() runs only when this .py file is executed directly
    main()
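The listing above builds absolute URLs by concatenating the site root and the href with "".join. That works while every href is site-relative, but it would produce a broken address if a link were already absolute. As a possible alternative (not what the original code does), the standard library's urllib.parse.urljoin handles both cases; the helper name absolute_link and the example.com address are hypothetical.

```python
from urllib.parse import urljoin

BASE = "http://baike.baidu.com"


def absolute_link(href):
    """Resolve href against BASE; hrefs that are already absolute pass through."""
    return urljoin(BASE, href)


# A site-relative href is attached to the base.
print(absolute_link("/view/284853.htm"))

# An absolute href is returned unchanged instead of being glued onto BASE.
print(absolute_link("http://example.com/view/1"))
```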
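Goal: the crawler takes a keyword from the user, visits every entry in the search results, and checks whether each entry has a subtitle; if it does, the subtitle is printed.
Code listing: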
import urllib.request
import urllib.parse
import re

from bs4 import BeautifulSoup


def main():
    keyword = input("Enter a keyword: ")
    # urlencode turns the dict into a percent-encoded query string of the form "word=..."
    keyword = urllib.parse.urlencode({"word": keyword})
    response = urllib.request.urlopen(
        "http://baike.baidu.com/search/word?%s" % keyword)
    html = response.read()
    soup = BeautifulSoup(html, "html.parser")  # parse the page so it can be searched
    for each in soup.find_all(href=re.compile("view")):
        content = each.text
        url2 = "".join(["http://baike.baidu.com", each["href"]])
        response2 = urllib.request.urlopen(url2)
        html2 = response2.read()
        soup2 = BeautifulSoup(html2, "html.parser")
        if soup2.h2:  # the entry has a subtitle (an <h2> tag)
            content = "".join([content, soup2.h2.text])
        content = "".join([content, " -> ", url2])
        print(content)


if __name__ == "__main__":  # main() runs only when this .py file is executed directly
    main()
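The urlencode call in the listing above is what turns the user's keyword into something that can safely be appended to the search URL. A small sketch of what it produces, using only the standard library; the keyword values are examples chosen for illustration.

```python
from urllib.parse import parse_qs, urlencode

# urlencode takes a mapping and returns "key=value" pairs with unsafe
# characters escaped; a space becomes "+".
query = urlencode({"word": "web crawler"})
print(query)  # word=web+crawler

# The encoded string is then substituted into the search URL, as in the listing.
search_url = "http://baike.baidu.com/search/word?%s" % query
print(search_url)

# Non-ASCII keywords are percent-encoded as UTF-8 bytes; parse_qs reverses it.
encoded = urlencode({"word": "网络爬虫"})
decoded = parse_qs(encoded)["word"][0]
print(encoded)
print(decoded)
```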
Some of the syntax used in the programs above still needs further study.
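One construct worth a closer look is the soup2.h2 test in the second listing. Accessing a tag name as an attribute of a soup object returns the first tag with that name, or None when the document has no such tag, which is why it can be used as a condition. A minimal sketch, assuming Beautiful Soup is installed; the helper first_h2_text and both HTML strings are made up for illustration.

```python
from bs4 import BeautifulSoup


def first_h2_text(html):
    """Return the text of the first <h2> in html, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.h2 is the first <h2> tag, or None when no <h2> exists.
    if soup.h2 is not None:
        return soup.h2.text
    return None


# Hypothetical pages: one with two subtitles, one with none.
with_h2 = "<html><body><h1>Title</h1><h2>Subtitle</h2><h2>Second</h2></body></html>"
without_h2 = "<html><body><h1>Title</h1></body></html>"

print(first_h2_text(with_h2))
print(first_h2_text(without_h2))
```

Only the first h2 is returned, so an entry with several h2 headings contributes just one of them to the printed line.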