Parsing web pages with Python and Beautiful Soup

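Beautiful Soup is an HTML/XML parser written in Python. It copes well with malformed markup and builds a parse tree from it. It provides simple, commonly needed operations for navigating, searching, and modifying that parse tree, which can save you a great deal of programming time.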

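When you build web page analysis features in Python, you can rely on this library's parser. It is very convenient, and far easier than writing your own regular expressions.

To use it, import the Beautiful Soup library in your program: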

from BeautifulSoup import BeautifulSoup          # For processing HTML
from BeautifulSoup import BeautifulStoneSoup     # For processing XML
import BeautifulSoup                             # To get everything

 

Beautiful Soup handles HTML well, but its XML handling is less polished. Here is an HTML example:

 

#! /usr/bin/python
#coding:utf-8
from BeautifulSoup import BeautifulSoup
import re

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()

The output is as follows:

# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>
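The navigating, searching, and modifying operations mentioned at the start can be sketched on the same document. This is a minimal sketch under two assumptions that are not part of the original post: it uses Python 3 with the newer bs4 package (the successor to the BeautifulSoup module imported above) and the standard-library "html.parser" backend. The paragraph tags are closed explicitly here so the result does not depend on how a particular parser repairs the markup.

```python
from bs4 import BeautifulSoup  # assumption: the bs4 package, not the older BeautifulSoup module

html = """<html><head><title>Page title</title></head>
<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Navigating: walk from the root to the <title> tag and read its text.
title_text = soup.title.string

# Searching: find one tag by name and attribute, then read another attribute.
second = soup.find("p", id="secondpara")
second_align = second["align"]

# Searching: collect the text of every <b> tag in document order.
bold_texts = [b.get_text() for b in soup.find_all("b")]

# Modifying: assign to an attribute; the change is visible in the tree.
second["align"] = "left"
new_align = soup.find("p", id="secondpara")["align"]

print(title_text, second_align, bold_texts, new_align)
```

In bs4 the search method is spelled find_all; the findAll spelling used elsewhere in this post is the older name.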


It can do much more, of course. The following example extracts the title from several web pages:

 

#!/usr/bin/env python
#coding:utf-8
import Queue
import threading
import urllib2
import time
from BeautifulSoup import BeautifulSoup

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com","http://ibm.com"]
queue = Queue.Queue()
out_queue = Queue.Queue()

class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            #grabs host from queue
            host = self.queue.get()
            #grabs urls of hosts and then grabs chunk of webpage
            url = urllib2.urlopen(host)
            chunk = url.read()
            #place chunk into out queue
            self.out_queue.put(chunk)
            #signals to queue job is done
            self.queue.task_done()

class DatamineThread(threading.Thread):
    """Threaded title extraction"""
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            #grabs chunk from out queue
            chunk = self.out_queue.get()
            #parse the chunk
            soup = BeautifulSoup(chunk)
            print soup.findAll(['title'])
            #signals to queue job is done
            self.out_queue.task_done()

start = time.time()
def main():
    #spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()
    #populate queue with data
    for host in hosts:
        queue.put(host)

    for i in range(5):
        dt = DatamineThread(out_queue)
        dt.setDaemon(True)
        dt.start()


    #wait on the queue until everything has been processed
    queue.join()
    out_queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)


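This example uses multiple threads and queues. Queues simplify multithreaded development by applying divide and conquer: each thread does one independent job, and the threads share data only through the queues, which keeps the program logic simple. The output is as follows (the order of the titles depends on which download finishes first):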

 

[<title>IBM - United States</title>]
[<title>Google</title>]
[<title>Yahoo!</title>]
[<title>Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more</title>]
Elapsed Time: 12.5929999352
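The same two-stage queue pattern can be sketched so that it runs without network access. This is only an illustration, written for Python 3 using the standard library alone. The PAGES dictionary, the example URLs, and the helper names (extract_title, fetch_worker, parse_worker, run_pipeline) are hypothetical, and the simple string search stands in for the Beautiful Soup parsing step so the sketch has no third-party dependency.

```python
import queue
import threading

# Hypothetical in-memory "pages" standing in for real downloads.
PAGES = {
    "http://a.example": "<html><head><title>Site A</title></head></html>",
    "http://b.example": "<html><head><title>Site B</title></head></html>",
    "http://c.example": "<html><head><title>Site C</title></head></html>",
}

def extract_title(html):
    """Toy stand-in for the parsing step: text between <title> and </title>."""
    start = html.find("<title>")
    end = html.find("</title>")
    if start == -1 or end == -1:
        return None
    return html[start + len("<title>"):end]

def fetch_worker(url_queue, page_queue):
    """Stage 1: take a URL, 'fetch' the page, pass it to the next stage."""
    while True:
        url = url_queue.get()
        try:
            page_queue.put(PAGES[url])
        finally:
            url_queue.task_done()

def parse_worker(page_queue, results, lock):
    """Stage 2: take a page, extract its title, record the result."""
    while True:
        page = page_queue.get()
        try:
            title = extract_title(page)
            with lock:
                results.append(title)
        finally:
            page_queue.task_done()

def run_pipeline(urls, workers=2):
    url_queue = queue.Queue()
    page_queue = queue.Queue()
    results = []
    lock = threading.Lock()
    for _ in range(workers):
        threading.Thread(target=fetch_worker, args=(url_queue, page_queue),
                         daemon=True).start()
        threading.Thread(target=parse_worker, args=(page_queue, results, lock),
                         daemon=True).start()
    for url in urls:
        url_queue.put(url)
    url_queue.join()   # every URL handled, so every page is already queued
    page_queue.join()  # every queued page has been parsed
    return sorted(results)  # sorted because completion order is not fixed

titles = run_pipeline(list(PAGES))
print(titles)
```

Each stage only touches its own queues, so swapping the fake fetch for a real download, or the string search for a real parser, would not change the other stage.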


Chinese documentation: http://www.crummy.com/software/BeautifulSoup/documentation.zh.html

Official site: http://www.crummy.com/software/BeautifulSoup/#Download/
 
