Introduction to BeautifulSoup and Installation

Introduction to Beautiful Soup

BeautifulSoup is a Python library whose main job is extracting data from web pages; it is aimed at quick-turnaround screen-scraping projects.

The official site describes three main features:

Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn’t take much code to write an application.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don’t have to think about encodings, unless the document doesn’t specify an encoding and Beautiful Soup can’t autodetect one. Then you just have to specify the original encoding.

In other words:

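Beautiful Soup offers a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract, and because it is simple, a complete application takes very little code.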

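Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup also fails to detect it automatically; in that case you only need to state the original encoding.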
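
As a small illustration of the two features above, the sketch below parses an invented HTML snippet, navigates, searches, and modifies the tree, and then decodes GBK bytes by naming the encoding explicitly. The markup, the variable names, and the choice of the built-in "html.parser" backend are assumptions made for this example, and the syntax is Python 3.

```python
from bs4 import BeautifulSoup

# Invented markup, used only for illustration.
html = """
<html><head><title>Demo page</title></head>
<body>
<p class="intro">Hello, <b>world</b>!</p>
<a href="/one">First</a>
<a href="/two">Second</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigating: attribute access walks down to the first matching tag.
title = soup.title.string

# Searching: find_all returns every matching tag.
links = [a.get("href") for a in soup.find_all("a")]

# Modifying: replace the text inside the first <b> tag.
soup.b.string = "Beautiful Soup"
intro = soup.find("p", class_="intro").get_text()

# Encodings: these bytes are GBK and carry no charset declaration,
# so the original encoding is stated explicitly via from_encoding.
raw = "<p>\u4e2d\u6587</p>".encode("gbk")
gbk_soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")
decoded = gbk_soup.p.string              # a Unicode string
utf8_bytes = gbk_soup.p.encode("utf-8")  # bytes, encoded as UTF-8

print(title)
print(links)
print(intro)
```

Passing from_encoding is only needed when detection fails; for most pages the plain BeautifulSoup(markup, parser) call is enough.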

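The third feature:

Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.

That is, Beautiful Soup is not a parser itself. It runs on top of Python parsers such as lxml and html5lib, so you can switch parsing strategies or trade speed for flexibility.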
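
One way to picture "sitting on top of parsers" is that the parser is just a name passed as the second argument. The sketch below tries lxml, then html5lib, then Python's built-in html.parser, and reports which one was used. The helper name make_soup and the order of preference are assumptions for this example (Python 3 syntax); lxml and html5lib are separate packages and may not be installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `parsers` that is installed.

    Hypothetical helper: returns a (soup, parser_name) pair.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


# All three parsers close the unterminated <p> tag.
soup, used = make_soup("<p>Unclosed paragraph")
print(used)
print(soup.p.string)
```

The parsers differ in speed and in how they repair badly broken markup, so naming one explicitly keeps results the same from machine to machine.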

Installing Beautiful Soup (Windows)

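1. Download the package from http://www.crummy.com/software/BeautifulSoup/ (the latest version at the time of writing was 4.3.2).
2. Unpack the download; the steps below assume it went under C:/python27/.
3. Open cmd and run "cd c:\python27\beautifulsoup4-4.3.2" to switch to the c:\python27\beautifulsoup4-4.3.2 directory (adjust the path to match where you unpacked it and your version number).
4. Run these commands (prefix each with python if .py files are not associated with the interpreter):
- setup.py build
- setup.py install
5. At the Python prompt, run import bs4; if no error is raised, the installation succeeded.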
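
The check in step 5 can also be written as a short script. This is a minimal sketch in Python 3 syntax; the function name check_bs4 is made up for the example.

```python
import importlib


def check_bs4():
    """Return the installed bs4 version string, or None if bs4 is missing."""
    try:
        bs4 = importlib.import_module("bs4")
    except ImportError:
        return None
    return getattr(bs4, "__version__", "unknown")


version = check_bs4()
if version is None:
    print("bs4 is not installed")
else:
    print("bs4 version: " + version)
```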

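In version 4 the package was renamed from BeautifulSoup to bs4, so this statement no longer works:

from BeautifulSoup import BeautifulSoup

Use this instead (the class name is case-sensitive):

from bs4 import BeautifulSoup

What a trap! I lost more than an hour to this.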

A crawler that lists all domains served by a given NS server

A quick search turns up sitedossier.com, a site that provides nameserver information. The next step is to write a crawler that scrapes its query results.

# Python 2 script: urllib2 and the print statement are not available in Python 3.
import urllib2
import re
import argparse
from bs4 import BeautifulSoup

class Crawler(object):

    def __init__(self, args):
        self.ns = args.ns

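    def _getSoup(self, url):
        # Send a browser-like User-Agent with the request.
        req = urllib2.Request(
            url = url,
            headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4'}
        )
        content = urllib2.urlopen(req).read()
        # Name the parser explicitly so the result does not depend on which parsers are installed.
        return BeautifulSoup(content, 'html.parser')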

    def _isLastPage(self, soup):
        # The last page of results contains the text "End of list."
        return soup.find(text=re.compile(r"End of list\.")) is not None

    def _getItem(self, soup):
        # Each result is an <li> whose <a> text is a domain name.
        for item in soup.findAll('li'):
            link = item.find('a')
            if link is not None:
                print link.string

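    def _getNextPage(self, soup):
        # The first sibling after <ol> is expected to be a whitespace text node
        # and the second the link to the next page.
        nextUrl = 'http://www.sitedossier.com' + soup.ol.nextSibling.nextSibling.get('href')
        self.soup = self._getSoup(nextUrl)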

    def start(self):
        url = 'http://www.sitedossier.com/nameserver/' + self.ns
        self.soup = self._getSoup(url)
        self._getItem(self.soup)
        while not self._isLastPage(self.soup):
            self._getNextPage(self.soup)
            self._getItem(self.soup)

def main():
    parser = argparse.ArgumentParser(description='A crawler for sitedossier.com') 
    parser.add_argument('-ns', type=str, required=True, metavar='NAMESERVER', dest='ns', help='Specify the nameserver')
    args = parser.parse_args()

    crawler = Crawler(args)
    crawler.start()

if __name__ == '__main__':
    main()
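
The page-handling logic of the crawler can be tried offline, without touching the network. The sketch below is written for Python 3, where urllib.request replaces urllib2 and print is a function. PAGE_1 and PAGE_2 are invented markup that only mimics what the crawler expects (an ol of links, a following link to the next page, and the text "End of list." on the last page); the real pages on sitedossier.com may differ, and all function names here are hypothetical.

```python
import re
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# Invented pages that mimic the structure the crawler relies on.
PAGE_1 = """<html><body>
<ol>
<li><a href="/site/a.example">a.example</a></li>
<li><a href="/site/b.example">b.example</a></li>
</ol>
<a href="/nameserver/ns.example/101">Show remaining items</a>
</body></html>"""

PAGE_2 = """<html><body>
<ol start="101">
<li><a href="/site/c.example">c.example</a></li>
</ol>
<i>End of list.</i>
</body></html>"""

PAGES = {
    "/nameserver/ns.example": PAGE_1,
    "/nameserver/ns.example/101": PAGE_2,
}


def fetch_soup(url):
    """Download and parse a real page (defined for reference, not called here)."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req) as resp:
        return BeautifulSoup(resp.read(), "html.parser")


def get_items(soup):
    """Return the link text of every <li> that contains an <a>."""
    return [str(li.a.string) for li in soup.find_all("li") if li.a is not None]


def is_last_page(soup):
    """True when the page contains the end-of-list marker text."""
    return soup.find(string=re.compile(r"End of list\.")) is not None


def next_page_path(soup):
    """Return the href of the first <a> sibling after <ol>, or None."""
    link = soup.ol.find_next_sibling("a")
    return link["href"] if link is not None else None


def crawl(start_path, load):
    """Collect items page by page until the last page is reached."""
    domains = []
    soup = load(start_path)
    while True:
        domains.extend(get_items(soup))
        if is_last_page(soup):
            break
        path = next_page_path(soup)
        if path is None:
            break
        soup = load(path)
    return domains


def load_fake(path):
    """Parse one of the invented pages instead of downloading."""
    return BeautifulSoup(PAGES[path], "html.parser")


domains = crawl("/nameserver/ns.example", load_fake)
print(domains)
```

Using find_next_sibling("a") instead of two chained nextSibling hops skips whitespace text nodes, so the lookup does not depend on the exact spacing of the page.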

Usage:

$ python crawler_ns.py -ns dns.baidu.com

Save the results to a file (>> appends to result.txt):

$ python crawler_ns.py -ns dns.baidu.com >> result.txt
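
To see what the two commands above do, the sketch below parses the same -ns argument, builds the start URL the crawler requests, and appends lines to a file the way >> does. It uses only the standard library in Python 3 syntax; the helper names, the temporary directory, and the sample lines are assumptions for the example.

```python
import argparse
import os
import tempfile


def parse_ns(argv):
    """Parse the -ns option exactly as the crawler's main() declares it."""
    parser = argparse.ArgumentParser(description="A crawler for sitedossier.com")
    parser.add_argument("-ns", type=str, required=True, metavar="NAMESERVER",
                        dest="ns", help="Specify the nameserver")
    return parser.parse_args(argv).ns


def start_url(ns):
    """Build the first URL the crawler requests for a nameserver."""
    return "http://www.sitedossier.com/nameserver/" + ns


def append_lines(path, lines):
    """Append lines to a file; mode "a" behaves like the shell's >>."""
    with open(path, "a") as handle:
        for line in lines:
            handle.write(line + "\n")


ns = parse_ns(["-ns", "dns.baidu.com"])
url = start_url(ns)
print(url)

# Two separate appends accumulate, as two runs with >> would.
result_path = os.path.join(tempfile.mkdtemp(), "result.txt")
append_lines(result_path, ["a.example"])
append_lines(result_path, ["b.example"])
with open(result_path) as handle:
    saved = handle.read().splitlines()
print(saved)
```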
