Web Scraping, Part 6 -- python/urllib3/BeautifulSoup

1. Introduction

This article shows how to scrape web pages with Python. We use urllib3 to fetch pages (urllib2 also works, but gets blocked more easily) and BeautifulSoup to parse the fetched HTML.

2. Notes

1. When selecting child elements with CSS selectors in BeautifulSoup, rewrite nth-child as nth-of-type. For example, ul:nth-child(1) should be written as ul:nth-of-type(1); otherwise BeautifulSoup raises the error "Only the following pseudo-classes are implemented: nth-of-type." (This limitation applies to older BeautifulSoup releases; versions 4.7+ delegate CSS selection to the soupsieve library, which also supports nth-child.)
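The nth-of-type behavior can be checked on a small made-up HTML fragment (the markup below is illustrative only, not from the target site):

```python
from bs4 import BeautifulSoup

html = """
<div>
  <p>intro</p>
  <ul><li>first list</li></ul>
  <ul><li>second list</li></ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# ul:nth-of-type(1) matches the first <ul> counted among sibling <ul> elements,
# even though the <ul> is not the first child of its parent (<p> comes first).
first_ul = soup.select('ul:nth-of-type(1)')
print(first_ul[0].li.text)  # first list
```

Note that ul:nth-child(1) would match nothing here even where it is supported, because the first child of the div is a p element.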

3. Example code

#!/usr/bin/env python
from bs4 import BeautifulSoup
import urllib3


def get_html(url):
    """Fetch the raw HTML of a page; return None on failure."""
    try:
        user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
        http = urllib3.PoolManager(timeout=2)
        # The standard header name is 'User-Agent' (hyphen), not 'User_Agent'.
        response = http.request('GET', url, headers={'User-Agent': user_agent})
        return response.data
    except Exception as e:
        print(e)
        return None


def get_soup(url):
    """Parse the page at url into a BeautifulSoup tree; return None on failure."""
    html = get_html(url)
    if not html:
        return None
    try:
        return BeautifulSoup(html, 'html.parser')
    except Exception as e:
        print(e)
        return None


def get_ele(soup, selector):
    """Return the list of elements matching a CSS selector; return None on failure."""
    try:
        return soup.select(selector)
    except Exception as e:
        print(e)
        return None


def main():
    url = 'http://www.ifeng.com/'
    soup = get_soup(url)
    # Note nth-of-type rather than nth-child, per the note above.
    ele = get_ele(soup, '#headLineDefault > ul > ul:nth-of-type(1) > li.topNews > h1 > a')
    headline = ele[0].text.strip()
    print(headline)


if __name__ == '__main__':
    main()
