Hands-On Spider Practice: A Summary of Scraping Dangdang by Hand

A manual implementation of a Dangdang book-search spider

1. Analyzing the URL

The Dangdang search URL for the keyword "computer networks" (计算机网络) is:

http://search.dangdang.com/?key=%BC%C6%CB%E3%BB%FA%CD%F8%C2%E7&page_index=1

Analysis: key is the URL-encoded keyword and page_index is the page number. The site shows 100 result pages, so page is set to 100 in the main script, and the getUrls function returns the list of URLs to crawl.

 

def getUrls(url, page=''):
    # Build the full list of page URLs, starting from the page-1 URL passed in
    urls = []
    urls.append(url)
    page = int(page) + 1
    for i in range(2, page):
        # The URL ends with the page number, so replace the trailing "1" with i
        urls.append(url[0:len(url)-1] + str(i))
        print(urls[i-1])
    return urls
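
The key parameter in the URL is simply the search keyword encoded in GBK and then percent-encoded. As a small illustrative sketch (not part of the original script, and assuming it runs next to the geturl.py module shown below), the key and the URL list can be produced like this:

import urllib.parse
import geturl  # the module below that defines getUrls

# '计算机网络' in GBK, percent-encoded, gives %BC%C6%CB%E3%BB%FA%CD%F8%C2%E7
key = urllib.parse.quote('计算机网络', encoding='gbk')
url = 'http://search.dangdang.com/?key=' + key + '&page_index=1'

# getUrls swaps the trailing page number to generate pages 2 through 100
urls = geturl.getUrls(url, 100)
print(len(urls))  # 100 URLs in total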

 
  


2. Crawling

Here I fetch each URL directly with the Requests library and return the response to the main function.

import requests

def getResopnse(url, opener=''):
    response = ''
    try:
        response = requests.get(url)
    except:
        print("Error 00")
    return response
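
In practice it also helps to send browser-like headers and a timeout with every request. The following optional variant is only a sketch (the helper name getResponseSafe and the header values are examples, not part of the final code):

import requests

def getResponseSafe(url):
    # Example headers; any realistic User-Agent will do
    headers = {"User-Agent": "Mozilla/5.0", "Accept-Language": "zh-CN,zh;q=0.8"}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raise on HTTP 4xx/5xx
        return response
    except requests.RequestException as e:
        print("Request failed:", e)
        return None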


3. Data Analysis

 

1. Data to extract: the book title, the book link, and the number of comments.

2. Create the data-definition file items.py with a class DDItems: title for the book title, link for the URL, and commentCount for the comment count.

3. Match the fields with XPath. The expressions used are listed below:

xpatTitle       =   '//a[@name="itemlist-title"]/@title'
xpatLink        =   '//a[@name="itemlist-title"]/@href'
xpatComCount    =   '//a[@class="search_comment_num"]/text()'
 
4. Use the getPathList function to return the list of matches and store it in the data structure defined above (see the sketch right after this list).
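
To see what these expressions return, here is a small self-contained sketch run against a hand-written HTML fragment that imitates Dangdang's result markup (the fragment and its values are made up for illustration):

import lxml.etree as etree

# Hand-written fragment imitating one search-result entry (illustrative only)
html = '''
<ul>
  <li>
    <a name="itemlist-title" title="Computer Networks" href="http://product.dangdang.com/123.html"></a>
    <a class="search_comment_num">1234 comments</a>
  </li>
</ul>
'''

tree = etree.HTML(html)
print(tree.xpath('//a[@name="itemlist-title"]/@title'))       # ['Computer Networks']
print(tree.xpath('//a[@name="itemlist-title"]/@href'))        # ['http://product.dangdang.com/123.html']
print(tree.xpath('//a[@class="search_comment_num"]/text()'))  # ['1234 comments']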
 

4. Data Processing

Iterate over each book's information stored in the item and save it to the database; the database is dd and the table is books7.

The full code follows:

Main.py: 

import geturl
from items import DDItems
import re
import pymysql
import piplimens
import getOpener
import getRequest
import lxml
import lxml.html as HTML
import lxml.etree as etree

# Main script

# Build the list of URLs to crawl
url = 'http://search.dangdang.com/?key=%BC%C6%CB%E3%BB%FA%CD%F8%C2%E7&page_index=1'
page = 100
urls = geturl.getUrls(url, page)

# Connect to MySQL and create the target table
conn = pymysql.connect(host='localhost', user="root", password="a123456789", database='dd')
sql = "create table books7(title VARCHAR(1000), link VARCHAR(1000), bookcomment VARCHAR(1000))"
conn.query(sql)
conn.commit()

# Extract a list of matches from a response with an XPath expression
def getPathList(response, xpat=''):
    list = []
    data = response.text
    tree = etree.HTML(data)
    try:
        list = tree.xpath(xpat)
        # list = re.compile(xpat, re.S).findall(data)  # earlier regex-based version
        print(list)
    except:
        print("Error: the XPath expression is empty or invalid!")
    return list



for url in urls:
    response = getRequest.getResopnse(url)

    # Parse the response into an item
    item = DDItems()

    xpatTitle       = '//a[@name="itemlist-title"]/@title'
    xpatLink        = '//a[@name="itemlist-title"]/@href'
    xpatComCount    = '//a[@class="search_comment_num"]/text()'
    """
    Earlier regex-based patterns:
    xpatTitle = 'alt=\".+?\"'
    xpatLink = 'href=\".+?\" traget=\"_blank\"'
    xpatComCount = 'dd_name=\"单品评论\".+?'
    """
    item.title = getPathList(response, xpatTitle)
    item.link = getPathList(response, xpatLink)
    item.commentCount = getPathList(response, xpatComCount)

    # Hand the item to the pipeline, which writes it to the database
    piplimens.process_item(item)

 
  


items.py:

class DDItems():
    def __init__(self):
        # One list per field: book titles, links, and comment counts
        self.title = []
        self.link = []
        self.commentCount = []
        print("Initializing the item")

piplimens.py:

import pymysql

def process_item(item):
    # Connect to the database
    conn = pymysql.connect(host='localhost', user="root", password="a123456789", database='dd')
    print(len(item.title))
    for i in range(0, len(item.title)):
        title   = item.title[i]
        link    = item.link[i]
        comment = item.commentCount[i]

        # Build the insert statement by string concatenation
        sql = "insert into books7(title,link,bookcomment) values"
        values = "(" + '"' + title + '"' + "," + '"' + link + '"' + "," + '"' + comment + '"' + ")"
        sql = sql + values
        print(sql)
        try:
            conn.query(sql)
            conn.commit()
            print("Insert succeeded")
        except:
            print("Insert failed")
    return item
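
One caveat with process_item above: because the SQL string is built by concatenation, a title or link that contains a double quote breaks the statement. A minimal alternative sketch using pymysql's parameterized cursor.execute (same dd database and books7 table; the name process_item_safe is hypothetical) avoids that:

import pymysql

def process_item_safe(item):
    # Same connection settings as the version above
    conn = pymysql.connect(host='localhost', user="root", password="a123456789", database='dd')
    cursor = conn.cursor()
    for title, link, comment in zip(item.title, item.link, item.commentCount):
        # Placeholders let pymysql handle quoting and escaping
        cursor.execute(
            "insert into books7(title, link, bookcomment) values (%s, %s, %s)",
            (title, link, comment),
        )
    conn.commit()
    conn.close()
    return item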

geturl.py:

def getUrls(url,page=''):
    urls = []
    urls.append(url)
    page = int(page)+1
    for i in range(2,page):
        urls.append(url[0:len(url)-1]+str(i))
        print(urls[i-1])
    return urls

getRequest.py:

import urllib.request
import getOpener
import requests

def getResopnse(url, opener=''):
    # Install a urllib opener carrying the custom headers; note that this
    # only affects urllib.request, not the requests.get call below
    if opener == '':
        opener = getOpener.getOpenerf()
    urllib.request.install_opener(opener)
    response = ''
    try:
        response = requests.get(url)
    except:
        print("Error 00")
    return response

getOpener.py:

import urllib.request

def getOpenerf(header=''):
    # Browser-like request headers
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
        "User-Agent": "Fiddler/5.0.20181.14850 (.NET 4.6.2; WinNT 10.0.17134.0; zh-CN; 4xAMD64; Auto Update; Full Instance; Extensions: APITesting, AutoSaveExt, EventLog, FiddlerOrchestraAddon, HostsFile, RulesTab2, SAZClipboardFactory, SimpleFilter, Timeline)",
        "Connection": "keep-alive",
        "referer": "http://www.163.com/"
    }
    # build_opener expects headers as a list of (key, value) tuples
    headll = []
    for key, value in headers.items():
        item = (key, value)
        headll.append(item)

    opener = urllib.request.build_opener()
    opener.addheaders = headll
    return opener
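
Because requests ignores the opener installed through urllib, the same headers have to be passed to requests.get explicitly to actually be sent. A minimal sketch (the helper name getResponseWithHeaders is hypothetical; it just reuses the header definitions above):

import requests
import getOpener

def getResponseWithHeaders(url):
    # Rebuild the header dict from the opener's (key, value) tuples
    opener = getOpener.getOpenerf()
    headers = dict(opener.addheaders)
    # requests sends these headers directly; no install_opener needed
    return requests.get(url, headers=headers, timeout=10)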



5. Summary and Reflections

1. With requests you can inspect the response headers of a crawled page (the Response object exposes them as .headers):

Re = requests.get(url)
Info = Re.headers

2. To use XPath, the crawled content must first be parsed into an HTML/XML tree. The flow:

import lxml
import lxml.html as HTML
import lxml.etree as etree

list = []
data = response.text
tree = etree.HTML(data)
list = tree.xpath(xpat)

where response is the response returned by requests.get(url).
3. Some crawled pages may be compressed in a particular format and have to be decompressed before they can be read (see the sketch after this list).
4. When working with the database, a SQL statement only takes effect once the transaction is committed; without the commit, an unexpected program crash leaves the database untouched.
5. Become fluent with both the urllib and requests libraries; combining the two is what solves real problems.
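
For point 3, here is a minimal sketch of decompressing a gzip-encoded page fetched with urllib (requests normally decompresses gzip/deflate transparently); fetch_and_decompress is just an illustrative helper name:

import gzip
import urllib.request

def fetch_and_decompress(url):
    # Fetch the raw bytes with urllib (the installed opener's headers apply here)
    resp = urllib.request.urlopen(url)
    data = resp.read()
    # Decompress only if the server actually returned gzip-compressed content
    if resp.headers.get("Content-Encoding") == "gzip":
        data = gzip.decompress(data)
    # Dangdang search pages use GBK encoding (the key parameter above is GBK too)
    return data.decode("gbk", errors="ignore")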

 

