Python Crawlers: Page Downloaders and Page Parsers

1. The page downloader -- three ways to download a page with urllib2


import cookielib
import urllib2

url = "http://www.baidu.com"

print 'first method'
# Make the request directly
response1 = urllib2.urlopen(url)
# Get the status code; 200 means the fetch succeeded
print response1.getcode()
# Read the body with response1.read()
print len(response1.read())

print 'second method'
# Attach data, a URL, and HTTP headers via a Request object
request = urllib2.Request(url)
request.add_header("user-agent", "Mozilla/5.0")
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

print 'third method'
# Install a handler for special scenarios:
# create a cookie container
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print cj
print response3.read()
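The urllib2 and cookielib modules above exist only in Python 2. As a reference, here is a sketch of the same three methods ported to Python 3, where they were merged into urllib.request and http.cookiejar; the actual urlopen() calls are commented out so the sketch runs without network access.

```python
# Python 3 port of the three download methods (urllib2 -> urllib.request,
# cookielib -> http.cookiejar). Network calls are commented out.
import http.cookiejar
import urllib.request

url = "http://www.baidu.com"

# Method 1: direct request
# response1 = urllib.request.urlopen(url)   # response1.getcode() -> 200

# Method 2: attach data, a URL, and HTTP headers via a Request object
request = urllib.request.Request(url)
request.add_header("User-Agent", "Mozilla/5.0")
# response2 = urllib.request.urlopen(request)

# Method 3: install a handler for special scenarios (here, a cookie jar)
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
urllib.request.install_opener(opener)
# response3 = urllib.request.urlopen(url)   # cookies would land in cj

# add_header() capitalizes the header key, so it is read back as "User-agent"
print(request.get_header("User-agent"))
```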

2. Page parsers

Regular expressions (pattern matching on the raw text)

html.parser (structured DOM parsing)

Beautiful Soup

lxml
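The difference between the first two approaches in this list can be seen in a short, stdlib-only Python 3 sketch; the sample HTML below is made up for illustration.

```python
# Contrast: fuzzy regex matching vs. structured parsing of the same markup.
import re
from html.parser import HTMLParser

html = '<a href="http://example.com/elsie">Elsie</a>'

# Approach 1: regular expression -- treats the page as a flat string
links = re.findall(r'href="(.*?)"', html)
print(links)  # ['http://example.com/elsie']

# Approach 2: stdlib html.parser -- walks the document structure
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href")

collector = LinkCollector()
collector.feed(html)
print(collector.hrefs)  # ['http://example.com/elsie']
```

The regex is shorter but brittle (it also matches hrefs inside comments or scripts); the structured parser only reports real `<a>` tags.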


Beautiful Soup

-- A third-party Python library for extracting data from HTML or XML

-- Official site: http://www.crummy.com/software/BeautifulSoup/

Installing and testing beautifulsoup4

-- Install: download the archive into your Python directory, open a cmd window and cd into the unpacked folder (cd beautifulsoup4-4.1.2), then run setup.py build and setup.py install

-- Test: import bs4
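On current systems the same package can also be installed in one step with pip (assuming pip is available on your PATH):

```shell
pip install beautifulsoup4
```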

3. Beautiful Soup syntax

Create a Beautiful Soup object -> search for nodes (find_all(name, attrs, string) or find()) -> access node information

Below is a BeautifulSoup test example.

Parsing an HTML string:

#coding=UTF-8
import re
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')

print 'all links'
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()

print 'the elsie link'
link_node = soup.find('a', href='http://example.com/elsie')
print link_node.name, link_node['href'], link_node.get_text()

print 'regular expression match'
link_node = soup.find('a', href=re.compile(r"sie"))
print link_node.name, link_node['href'], link_node.get_text()

print 'text of the p paragraph'
p_node = soup.find('p', class_='title')
print p_node.name, p_node.get_text()


4. An example crawler

Crawling Baidu Baike pages

-----------Determine the target: crawl the titles and summaries of the Baidu Baike "Python" entry and its related entries

-----------Analyze the target

----------------------URL format: /view/125370.htm

----------------------Data format: title: ***  summary: ***

----------------------Page encoding: UTF-8

------------Write the code

------------Run the crawler
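The steps above can be sketched as a generic crawl loop: a URL manager that de-duplicates links, driven by a downloader and a parser. The UrlManager name and the stub downloader/parser below are illustrative assumptions, not code from the original post.

```python
# Minimal crawl-loop skeleton: URL manager + pluggable downloader/parser.

class UrlManager:
    """Tracks which URLs are waiting to be crawled and which are done."""
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        # Ignore URLs already queued or already crawled
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url

def crawl(root_url, download, parse, max_pages=10):
    """Loop: download a page, parse out new links and data, repeat."""
    urls, results = UrlManager(), []
    urls.add_new_url(root_url)
    while urls.has_new_url() and len(results) < max_pages:
        page_url = urls.get_new_url()
        html = download(page_url)
        new_links, data = parse(page_url, html)
        for link in new_links:
            urls.add_new_url(link)
        results.append(data)
    return results

# Offline demo with stub downloader/parser instead of real HTTP:
site = {"/view/125370.htm": ["/view/1.htm"], "/view/1.htm": []}
out = crawl("/view/125370.htm",
            download=lambda u: site[u],
            parse=lambda u, links: (links, {"url": u}))
print([d["url"] for d in out])  # ['/view/125370.htm', '/view/1.htm']
```

In a real run, download would be one of the urllib2 methods from section 1 and parse would use the BeautifulSoup calls from section 3.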

