The Dangdang search URL for computer-networking books is:
http://search.dangdang.com/?key=%BC%C6%CB%E3%BB%FA%CD%F8%C2%E7&page_index=1
Analysis: key is the URL-encoded search keyword and page_index is the page number. The site shows 100 result pages, so page is set to 100 in main, and the getUrls function returns the list of URLs to crawl:
def getUrls(url, page=''):
    urls = []
    urls.append(url)                              # page 1 is the original URL
    page = int(page) + 1
    for i in range(2, page):                      # pages 2 .. page
        urls.append(url[0:len(url)-1] + str(i))   # replace the trailing page number
        print(urls[i-1])
    return urls
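As a quick sanity check (a sketch; page=3 is used here only for illustration), the trailing "1" in page_index=1 is swapped for 2, 3, … on the following pages:
url = 'http://search.dangdang.com/?key=%BC%C6%CB%E3%BB%FA%CD%F8%C2%E7&page_index=1'
urls = getUrls(url, 3)   # returns 3 URLs: page_index=1, 2 and 3
print(len(urls))         # 3
print(urls[-1])          # ...&page_index=3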
Here each URL is fetched directly with the Requests library, and the response is returned to the main script:
def getResopnse(url, opener=''):
    response = ''
    try:
        response = requests.get(url)
    except:
        print("error 00")
    return response
1. Data to extract: the book title, the book link, and the number of comments.
2. Define the data container in items.py: a class DDItems with the fields title (book name), link (book URL), and commentCount (number of comments).
3. Match the fields with XPath. The expressions used are listed below, and a short standalone illustration follows after this list.
xpatTitle = '//a[@name="itemlist-title"]/@title'
xpatLink = '//a[@name="itemlist-title"]/@href'
xpatComCount = '//a[@class="search_comment_num"]/text()'
4. The getPathList function returns the list of XPath matches, which is stored in the item structure defined above.
The main loop then walks through each book's entries in the item and saves them to the database; the database is dd and the table name is books7.
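To show how the three XPath expressions from step 3 line up, here is a minimal standalone sketch; the HTML fragment is a simplified stand-in for the real search-result markup, not copied from the site:
import lxml.etree as etree

sample = '''
<li>
  <a name="itemlist-title" title="sample book title" href="http://product.dangdang.com/0000000.html">sample book title</a>
  <a class="search_comment_num" href="#">999 comments</a>
</li>'''
tree = etree.HTML(sample)
print(tree.xpath('//a[@name="itemlist-title"]/@title'))       # ['sample book title']
print(tree.xpath('//a[@name="itemlist-title"]/@href'))        # ['http://product.dangdang.com/0000000.html']
print(tree.xpath('//a[@class="search_comment_num"]/text()'))  # ['999 comments']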
Full code:
Main.py:
import geturl
from items import DDItems
import re
import pymysql
import piplimens
import getOpener
import getRequest
import lxml
import lxml.html as HTML
import lxml.etree as etree
# main script
# build the list of URLs to crawl
url = 'http://search.dangdang.com/?key=%BC%C6%CB%E3%BB%FA%CD%F8%C2%E7&page_index=1'
page = 100
urls = geturl.getUrls(url,page)
conn = pymysql.connect(host='localhost', user="root", password="a123456789", database='dd')
sql = "create table books7(tilte VARCHAR(1000),link VARCHAR (1000),bookcomment VARCHAR (1000))"
conn.query(sql)
conn.commit()
# extract the XPath matches from a response
def getPathList(response, xpat=''):
    list = []
    data = response.text
    try:
        tree = etree.HTML(data)
        list = tree.xpath(xpat)
        # list = re.compile(xpat,re.S).findall(data)
        print(list)
    except:
        print("error: the XPath expression is empty or invalid!")
    return list
for url in urls:
    response = getRequest.getResopnse(url)
    # turn the raw response into structured data
    item = DDItems()
    xpatTitle = '//a[@name="itemlist-title"]/@title'
    xpatLink = '//a[@name="itemlist-title"]/@href'
    xpatComCount = '//a[@class="search_comment_num"]/text()'
    """
    xpatTitle = 'alt=\".+?\"'
    xpatLink = 'href=\".+?\" traget=\"_blank\"'
    xpatComCount = 'dd_name=\"单品评论\".+?'
    """
    item.title = getPathList(response, xpatTitle)
    item.link = getPathList(response, xpatLink)
    item.commentCount = getPathList(response, xpatComCount)
    # print("processing")
    piplimens.process_item(item)
    # print("saved to database")
items.py:
class DDItems():
    title = []
    link = []
    commentCount = []
    def __init__(self):
        print("initializing item")
piplimens.py:
import pymysql
def process_item(item):
    # database connection
    conn = pymysql.connect(host='localhost', user="root", password="a123456789", database='dd')
    print(len(item.title))
    for i in range(0, len(item.title)):
        title = item.title[i]
        link = item.link[i]
        comment = item.commentCount[i]
        # parameterized insert, so quotes inside a title cannot break the statement
        sql = "insert into books7(title, link, bookcomment) values(%s, %s, %s)"
        print(sql)
        try:
            with conn.cursor() as cursor:
                cursor.execute(sql, (title, link, comment))
            conn.commit()
            print("insert succeeded")
        except:
            print("insert failed")
    return item
geturl.py:
def getUrls(url, page=''):
    urls = []
    urls.append(url)                              # page 1 is the original URL
    page = int(page) + 1
    for i in range(2, page):                      # pages 2 .. page
        urls.append(url[0:len(url)-1] + str(i))   # replace the trailing page number
        print(urls[i-1])
    return urls
getRequest.py:
import urllib.request
import getOpener
import zlib
import codecs
import requests
from xml import etree
def getResopnse(url, opener=''):
    if(opener == ''):
        opener = getOpener.getOpenerf()
    # note: the installed opener only affects urllib.request.urlopen; requests.get sends its own headers
    urllib.request.install_opener(opener)
    response = ''
    try:
        response = requests.get(url)
    except:
        print("error 00")
    return response
getOpener.py:
import urllib.request
def getOpenerf(header=''):
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
        "User-Agent": "Fiddler/5.0.20181.14850 (.NET 4.6.2; WinNT 10.0.17134.0; zh-CN; 4xAMD64; Auto Update; Full Instance; Extensions: APITesting, AutoSaveExt, EventLog, FiddlerOrchestraAddon, HostsFile, RulesTab2, SAZClipboardFactory, SimpleFilter, Timeline)",
        "Connection": "keep-alive",
        "referer": "http://www.163.com/"
    }
    # build_opener expects addheaders as a list of (name, value) tuples
    headll = []
    for key, value in headers.items():
        item = (key, value)
        headll.append(item)
    opener = urllib.request.build_opener()
    opener.addheaders = headll
    return opener
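One thing worth noting: an opener installed with install_opener only changes the headers of urllib requests, not of requests.get calls, which is why getResopnse above still sends requests' default headers. A minimal usage sketch with urllib:
import urllib.request
import getOpener

urllib.request.install_opener(getOpener.getOpenerf())   # the custom headers now apply to every urlopen call
with urllib.request.urlopen('http://search.dangdang.com/?key=%BC%C6%CB%E3%BB%FA%CD%F8%C2%E7&page_index=1') as resp:
    print(resp.status, resp.headers.get('Content-Type'))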
1. With requests you can inspect the response headers of a crawled page:
Re = requests.get(url)
Info = Re.headers    # requests exposes response headers via .headers (a requests Response has no .info() method)
2. To use XPath, the fetched content first has to be parsed into an element tree.
Workflow:
Imports:
import lxml
import lxml.html as HTML
import lxml.etree as etree
list = []
data = response.text
tree = etree.HTML(data)
list = tree.xpath(xpat)
Here response is the object returned by requests.get(url).
3. Some crawled pages come back compressed (gzip, for example) and have to be decompressed before they can be read.
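requests handles gzip/deflate transparently, but when fetching with urllib the body has to be decompressed by hand. A minimal sketch, assuming the server honors Accept-Encoding: gzip and that the page uses a GBK-family charset (as the GBK-encoded key parameter above suggests):
import gzip
import urllib.request

url = 'http://search.dangdang.com/?key=%BC%C6%CB%E3%BB%FA%CD%F8%C2%E7&page_index=1'
req = urllib.request.Request(url, headers={'Accept-Encoding': 'gzip'})
with urllib.request.urlopen(req) as resp:
    raw = resp.read()
    # only decompress if the server actually compressed the body
    if resp.headers.get('Content-Encoding') == 'gzip':
        raw = gzip.decompress(raw)
html = raw.decode('gbk', errors='replace')   # adjust to the page's real charset if it differs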
4. When working with the database, SQL statements only take effect once the transaction is committed; if the program dies unexpectedly before the commit, the database is left untouched.
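A small pymysql sketch of that point, reusing the books7 table from above (the row values are placeholders):
import pymysql

conn = pymysql.connect(host='localhost', user="root", password="a123456789", database='dd')
try:
    with conn.cursor() as cursor:
        cursor.execute("insert into books7(title, link, bookcomment) values(%s, %s, %s)",
                       ("demo title", "http://example.com/", "0 comments"))
    conn.commit()      # the row only becomes permanent here
except Exception:
    conn.rollback()    # on an error (or a crash before commit) the insert is discarded
finally:
    conn.close()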
5. Get comfortable with both the urllib and requests libraries; combining the two is often what it takes to solve real problems.