Parsing HTML Data with lxml

Parsing HTML data

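In scenarios such as web crawling, we need to parse the fetched HTML and extract the content we care about. The Python standard library offers HTMLParser and SGMLParser for parsing HTML (SGMLParser lives in the Python 2 sgmllib module, which was removed in Python 3), but both are event-driven, hard to follow, and unfriendly for traversal and lookup. The third-party library lxml is much simpler and easier to reason about, and it parses both HTML and XML.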

In the words of the official documentation:

“lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).”
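
For contrast, here is a minimal sketch of what event-driven parsing looks like with the standard library, written for Python 3 (where the module is named html.parser). The class name QuoteTextParser and the sample markup are made up for illustration; the point is only that the caller has to track parser state by hand.

```python
from html.parser import HTMLParser


class QuoteTextParser(HTMLParser):
    """Collect the text of every <span itemprop="text"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_quote = False  # state we have to maintain ourselves
        self.quotes = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "span" and ("itemprop", "text") in attrs:
            self.in_quote = True
            self.quotes.append("")

    def handle_endtag(self, tag):
        # Simplification: assumes the quote span contains no nested <span>
        if tag == "span":
            self.in_quote = False

    def handle_data(self, data):
        if self.in_quote:
            self.quotes[-1] += data


# Made-up markup, not copied from any real page
sample = (
    '<div class="quote">'
    '<span itemprop="text">First quote</span>'
    '<small itemprop="author">Author A</small>'
    '</div>'
)
parser = QuoteTextParser()
parser.feed(sample)
parser.close()
print(parser.quotes)
```

Every extra field (author, tags) would need another flag and more branches in the three handlers, which is the kind of bookkeeping a tree plus a query language avoids.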

Parsing HTML with lxml

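One way to extract content from HTML is regular expressions, but they are painful to write and best kept as a last resort. HTML can instead be treated as a hierarchical, XML-like language: parse it into a tree, then use XPath to locate and extract the data.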
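
A minimal sketch of the tree-plus-XPath idea, run on an inline string so no network is needed. The markup is made up; its attribute names are chosen to match the XPath expressions used in the script later in this article.

```python
from lxml import html

# Made-up document with two entries
sample = """
<html><body>
  <div class="quote">
    <span itemprop="text">First quote</span>
    <small itemprop="author">Author A</small>
    <meta content="tag1,tag2">
  </div>
  <div class="quote">
    <span itemprop="text">Second quote</span>
    <small itemprop="author">Author B</small>
    <meta content="tag3">
  </div>
</body></html>
"""

# Parse the string into an element tree
tree = html.fromstring(sample)

# Each XPath query returns a list of matches in document order
texts = tree.xpath('//span[@itemprop="text"]/text()')
authors = tree.xpath('//small[@itemprop="author"]/text()')
tags = tree.xpath('//meta/@content')

for row in zip(texts, authors, tags):
    print(row)
```

Each field is one declarative query, and the three result lists line up by position as long as every entry contains all three elements.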

Approach for a small crawler

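The Example in the HTMLParser chapter of the Python Documentation uses a website to demonstrate parsing HTML with HTMLParser, and I borrow the same site here. The site has 10 pages in total, each with a “Next” link at the bottom pointing to the following page, and its content is a list of quotes from famous people. The key fields are the quote text, the author, and the tags; I walk through all 10 pages and extract these three fields.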

http://quotes.toscrape.com

  • Fetch the page behind a link with requests.get(url);
  • Feed the fetched page, as text, to lxml.html.fromstring();
  • Locate and extract the content of interest with XPath;
  • Write the data to MySQL.
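
The first three steps can be sketched as one loop. This is an illustration under assumptions, not the script shown in this article: the names crawl and fetch_with_requests are hypothetical, the fetch parameter exists only so the loop can be exercised without network access, and the code targets Python 3. Each page is extracted before its “Next” link is followed, so a page without a “Next” link is still collected.

```python
from urllib.parse import urljoin

from lxml import html


def fetch_with_requests(url):
    """Default fetcher: download a page and return its text."""
    import requests  # imported lazily so the loop can be tested offline
    return requests.get(url, timeout=10).text


def crawl(start_url, fetch=fetch_with_requests):
    """Follow "Next" links from start_url; return (text, author, tags) rows."""
    rows = []
    url = start_url
    while url:
        tree = html.fromstring(fetch(url))
        texts = tree.xpath('//span[@itemprop="text"]/text()')
        authors = tree.xpath('//small[@itemprop="author"]/text()')
        tags = tree.xpath('//meta/@content')
        # Pair the three lists up by position, one row per quote
        rows.extend(zip(texts, authors, tags))
        # Resolve the relative "Next" href against the current page URL
        next_href = tree.xpath('//nav/ul/li[@class="next"]/a/@href')
        url = urljoin(url, next_href[0]) if next_href else None
    return rows
```

Calling crawl('http://quotes.toscrape.com') would walk the site with requests; passing a different fetch callable, for example a dictionary lookup over saved pages, runs the same loop offline.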

Code & Walk Through


#-*- coding:utf-8 -*-
'''
Created on July 3, 2017

@author: will
'''
import MySQLdb
from lxml import html
import requests

class Pipeline():
    '''
    Database connection; the database Locust was created on the MySQL Server beforehand
    '''
    connDB = MySQLdb.connect(
        host = '192.168.8.82',
        port = 3306,
        user = 'willyan',
        passwd = 'will392891',
        db = 'Locust',
        charset = 'utf8'
        )
    cur = connDB.cursor()
    
class HtmlPar():
    '''
    Parse the HTML pages and extract the fields of interest.
    '''
    def myPar(self, start_url):
        # urls holds the links still to be crawled. It starts with start_url;
        # the href of each page's "Next" link is appended as it is found.
        urls = [start_url]
        # Three lists collect the extracted quote text, authors, and tags
        text = []
        author = []
        tags = []

        # Loop: take the next link from urls, fetch it, and feed the page text
        # to the parser. If the page has a "Next" element, append its href to
        # urls as the next target and collect the page's text, author, and
        # tags. When no "Next" element is found, the last page has been
        # reached, so leave the loop and return the three lists.
        # NOTE: as written, the last page itself is not extracted, because the
        # function returns before collecting that page's quotes.
        i = 0
        while True:
            # Fetch the next pending link and hand the page text to the parser
            page = requests.get(urls[i])
            tree = html.fromstring(page.text)
            # The "Next" href at the bottom of the page decides whether to continue
            nextPage = tree.xpath('//nav/ul/li[@class="next"]/a/@href')
            # An empty list means this is the last page
            if nextPage:
                # Queue the next page as the target of the following iteration
                urls.append(urls[0] + nextPage[0])
                # Each XPath query returns a list; evaluate each query once and
                # append the results item by item, in matching order
                pageText = tree.xpath('//span[@itemprop="text"]/text()')
                pageAuthor = tree.xpath('//small[@itemprop="author"]/text()')
                pageTags = tree.xpath('//meta/@content')
                for x in range(len(pageText)):
                    text.append(pageText[x])
                    author.append(pageAuthor[x])
                    tags.append(pageTags[x])
            else:
                return text, author, tags
            i += 1
        
if __name__ == '__main__':
    # Create the table quotes in the database to hold the scraped data
    # ("if not exists" lets the script be rerun without failing here)
    db = Pipeline()
    conn = db.connDB
    cur = db.cur
    dbCreateCMD = 'create table if not exists quotes(quoteID varchar(10), quoteText varchar(600), author varchar(20), tags varchar(20), primary key (quoteID), unique(quoteID)) ENGINE=InnoDB DEFAULT CHARSET=utf8'
    cur.execute(dbCreateCMD)

    # Starting page of the crawl
    start_url = 'http://quotes.toscrape.com'
    quotes = HtmlPar()
    result = quotes.myPar(start_url)
    # result is a tuple of three lists: (text[...], author[...], tags[...]);
    # write it to the database row by row
    for y in range(len(result[0])):
        # Some quote texts are too long for the MySQL column and could not be
        # written, so the text field is not stored.
        # Values are passed as parameters so the driver handles the quoting.
        cmd = "insert ignore into quotes(quoteID, author, tags) values(%s, %s, %s)"
        cur.execute(cmd, (str(y+1), result[1][y], result[2][y]))
        conn.commit()

    cur.close()
    conn.close()
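
A note on the database write: passing values as query parameters, rather than concatenating them into the SQL string, lets the driver handle quoting, so a value containing a single quote cannot break the statement. The sketch below uses the standard library's sqlite3 with an in-memory database purely as a stand-in so that it runs without a MySQL server; the rows are made up. With MySQLdb the placeholder is %s instead of ?.

```python
import sqlite3

# Made-up rows; the apostrophe would break a hand-concatenated statement
rows = [
    ("1", "Author O'Name", "tag1,tag2"),
    ("2", "Author B", ""),
]

conn = sqlite3.connect(":memory:")  # sqlite3 stands in for the MySQL connection
cur = conn.cursor()
cur.execute(
    "create table quotes("
    "quoteID varchar(10) primary key, author varchar(20), tags varchar(20))"
)

# The driver substitutes and escapes each value;
# sqlite3 uses "?" where MySQLdb uses "%s"
cur.executemany(
    "insert into quotes(quoteID, author, tags) values (?, ?, ?)", rows
)
conn.commit()

stored = cur.execute(
    "select quoteID, author, tags from quotes order by quoteID"
).fetchall()
print(stored)
conn.close()
```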

Results

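Checking the Locust database shows 90 rows captured in total. The tenth page is missing because the loop returns as soon as a page has no “Next” link, before that page's quotes are collected. quoteID is a varchar column, so the rows below appear in string order (1, 10, 11, ...), and tags longer than 20 characters are cut off at the column width.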

mysql> select * from quotes;
Empty set (0.01 sec)

mysql> select * from quotes;
+---------+-----------+----------------------+----------------------+
| quoteID | quoteText | author               | tags                 |
+---------+-----------+----------------------+----------------------+
| 1       | NULL      | Albert Einstein      | change,deep-thoughts |
| 10      | NULL      | Steve Martin         | humor,obvious,simile |
| 11      | NULL      | Marilyn Monroe       | friends,heartbreak,i |
| 12      | NULL      | J.K. Rowling         | courage,friends      |
| 13      | NULL      | Albert Einstein      | simplicity,understan |
| 14      | NULL      | Bob Marley           | love                 |
| 15      | NULL      | Dr. Seuss            | fantasy              |
| 16      | NULL      | Douglas Adams        | life,navigation      |
| 17      | NULL      | Elie Wiesel          | activism,apathy,hate |
| 18      | NULL      | Friedrich Nietzsche  | friendship,lack-of-f |
| 19      | NULL      | Mark Twain           | books,contentment,fr |
| 2       | NULL      | J.K. Rowling         | abilities,choices    |
| 20      | NULL      | Allen Saunders       | fate,life,misattribu |
| 21      | NULL      | Pablo Neruda         | love,poetry          |
| 22      | NULL      | Ralph Waldo Emerson  | happiness            |
| 23      | NULL      | Mother Teresa        | attributed-no-source |
| 24      | NULL      | Garrison Keillor     | humor,religion       |
| 25      | NULL      | Jim Henson           | humor                |
| 26      | NULL      | Dr. Seuss            | comedy,life,yourself |
| 27      | NULL      | Albert Einstein      | children,fairy-tales |
| 28      | NULL      | J.K. Rowling         |                      |
| 29      | NULL      | Albert Einstein      | imagination          |
| 3       | NULL      | Albert Einstein      | inspirational,life,l |
| 30      | NULL      | Bob Marley           | music                |
| 31      | NULL      | Dr. Seuss            | learning,reading,seu |
| 32      | NULL      | J.K. Rowling         | dumbledore           |
| 33      | NULL      | Bob Marley           | friendship           |
| 34      | NULL      | Mother Teresa        | misattributed-to-mot |
| 35      | NULL      | J.K. Rowling         | death,inspirational  |
| 36      | NULL      | Charles M. Schulz    | chocolate,food,humor |
| 37      | NULL      | William Nicholson    | misattributed-to-c-s |
| 38      | NULL      | Albert Einstein      | knowledge,learning,u |
| 39      | NULL      | Jorge Luis Borges    | books,library        |
| 4       | NULL      | Jane Austen          | aliteracy,books,clas |
| 40      | NULL      | George Eliot         | inspirational        |
| 41      | NULL      | George R.R. Martin   | read,readers,reading |
| 42      | NULL      | C.S. Lewis           | books,inspirational, |
| 43      | NULL      | Marilyn Monroe       |                      |
| 44      | NULL      | Marilyn Monroe       | girls,love           |
| 45      | NULL      | Albert Einstein      | life,simile          |
| 46      | NULL      | Marilyn Monroe       | love                 |
| 47      | NULL      | Marilyn Monroe       | attributed-no-source |
| 48      | NULL      | Martin Luther King J | hope,inspirational   |
| 49      | NULL      | J.K. Rowling         | dumbledore           |
| 5       | NULL      | Marilyn Monroe       | be-yourself,inspirat |
| 50      | NULL      | James Baldwin        | love                 |
| 51      | NULL      | Jane Austen          | friendship,love      |
| 52      | NULL      | Eleanor Roosevelt    | attributed,fear,insp |
| 53      | NULL      | Marilyn Monroe       | attributed-no-source |
| 54      | NULL      | Albert Einstein      | music                |
| 55      | NULL      | Haruki Murakami      | books,thought        |
| 56      | NULL      | Alexandre Dumas fils | misattributed-to-ein |
| 57      | NULL      | Stephenie Meyer      | drug,romance,simile  |
| 58      | NULL      | Ernest Hemingway     | books,friends,noveli |
| 59      | NULL      | Helen Keller         | inspirational        |
| 6       | NULL      | Albert Einstein      | adulthood,success,va |
| 60      | NULL      | George Bernard Shaw  | inspirational,life,y |
| 61      | NULL      | Charles Bukowski     | alcohol              |
| 62      | NULL      | Suzanne Collins      | the-hunger-games     |
| 63      | NULL      | Suzanne Collins      | humor                |
| 64      | NULL      | C.S. Lewis           | love                 |
| 65      | NULL      | J.R.R. Tolkien       | bilbo,journey,lost,q |
| 66      | NULL      | J.K. Rowling         | live-death-love      |
| 67      | NULL      | Ernest Hemingway     | good,writing         |
| 68      | NULL      | Ralph Waldo Emerson  | life,regrets         |
| 69      | NULL      | Mark Twain           | education            |
| 7       | NULL      | André Gide           | life,love            |
| 70      | NULL      | Dr. Seuss            | troubles             |
| 71      | NULL      | Alfred Tennyson      | friendship,love      |
| 72      | NULL      | Charles Bukowski     | humor                |
| 73      | NULL      | Terry Pratchett      | humor,open-mind,thin |
| 74      | NULL      | Dr. Seuss            | humor,philosophy     |
| 75      | NULL      | J.D. Salinger        | authors,books,litera |
| 76      | NULL      | George Carlin        | humor,insanity,lies, |
| 77      | NULL      | John Lennon          | beatles,connection,d |
| 78      | NULL      | W.C. Fields          | humor,sinister       |
| 79      | NULL      | Ayn Rand             |                      |
| 8       | NULL      | Thomas A. Edison     | edison,failure,inspi |
| 80      | NULL      | Mark Twain           | books,classic,readin |
| 81      | NULL      | Albert Einstein      | mistakes             |
| 82      | NULL      | Jane Austen          | humor,love,romantic, |
| 83      | NULL      | J.K. Rowling         | integrity            |
| 84      | NULL      | Jane Austen          | books,library,readin |
| 85      | NULL      | Jane Austen          | elizabeth-bennet,jan |
| 86      | NULL      | C.S. Lewis           | age,fairytales,growi |
| 87      | NULL      | C.S. Lewis           | god                  |
| 88      | NULL      | Mark Twain           | death,life           |
| 89      | NULL      | Mark Twain           | misattributed-mark-t |
| 9       | NULL      | Eleanor Roosevelt    | misattributed-eleano |
| 90      | NULL      | C.S. Lewis           | christianity,faith,r |
+---------+-----------+----------------------+----------------------+
90 rows in set (0.01 sec)

Multithreading optimization

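For sites with many pages, consider running the crawl concurrently: first collect the links of all pages, then fetch the pages and extract the data in parallel to improve throughput. Note that the multiprocessing library itself runs work in separate processes; its multiprocessing.dummy module offers the same Pool API backed by threads.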
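
A minimal sketch of that idea with a thread pool, for Python 3. The function name crawl_pages and its parameters are hypothetical, and the /page/N/ URL pattern is an assumption about the site rather than something taken from the script above. The worker is passed in as a callable so the pool logic can be tried without network access.

```python
from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, but thread-backed


def crawl_pages(urls, fetch_and_parse, workers=4):
    """Apply fetch_and_parse to every URL on a pool of threads.

    Results are returned in the same order as urls.
    """
    with Pool(workers) as pool:
        return pool.map(fetch_and_parse, urls)


# Assumed URL pattern for the ten pages of the site
urls = ["http://quotes.toscrape.com/page/%d/" % n for n in range(1, 11)]
```

In real use, fetch_and_parse would download one page and return its extracted rows. Threads suit this workload because most of the time is spent waiting on the network; the database writes are best done afterwards from a single thread.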
