In the project factory I came across a crawler exercise targeting 58同城 (58.com), but after analysing the sample code I found that it can no longer search the current 58 site.
The sample page is here:
http://www.xmgc360.com:3000/#/spider/docs/urllib_project_58fang
After a site update the HTML layout changed, so the original XPath expressions no longer match anything, and I decided to write my own.
The XPath expressions I ended up with are:
//ul[@class="house-list"]/li
./div[@class="des"]/h2/a/text()
./div[@class="list-li-right"]/div[@class="money"]/b/text()
./div[@class="des"]/p[@class="room"]/text()
With the XPath ready, I then set up a proxy (a guilty conscience helps) and the request headers for the page:
from urllib import request

def addHeaders(user_agent, referer, requestURL, con, addr):
    # route the request through a proxy, e.g. con='http', addr='ip:port'
    proxy = request.ProxyHandler({con: addr})
    opener = request.build_opener(proxy)
    request.install_opener(opener)
    # spoof a normal browser visit with User-Agent and Referer
    headers = {'User-Agent': user_agent, 'referer': referer}
    rq = request.Request(requestURL, headers=headers)
    response = request.urlopen(rq)
    data = response.read().decode("utf-8")
    return data
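Called like this (the proxy address is only an example and will almost certainly be dead by the time you read this):

page = addHeaders(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',   # any real browser User-Agent works
    'https://su.58.com/zufang/',                        # referer
    'https://su.58.com/zufang/',                        # page to fetch
    'http', '120.25.219.228:80')                        # proxy scheme and address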
But just when everything seemed to be going smoothly, I spotted this line in the page source.
emmm... You can feel the hostility towards beginners here; the intent to stomp on rookies is plain to see.
But I wasn't about to go down that easily.
After scraping the data I found that the site obfuscates the digits in fields such as the price and the floor area (they come back as garbled characters).
Reading the page carefully, it apparently applies some digit-shifting operation to produce a new character mapping table, delivered as a web font.
Once we get hold of that mapping table, we can build a mapping dict and simply replace the garbled characters.
With the idea in place, let's get to work.
First, the character mapping table is embedded in the JS code as a base64 string; copy it out (as shown in the figure below).
Next, write some Python to decode the base64 and save it as a TTF file:
import base64

font_face = 'AAEAAAALAIAAAwAwR1N........AAA'   # the base64-encoded font file copied from the page
b = base64.b64decode(font_face)                # decode the base64 mapping table from the page
with open('F:/house/58.ttf', 'wb') as f:       # and save it as a TTF font file
    f.write(b)
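Copying the string by hand gets old quickly. Since the page embeds it in an @font-face declaration, a regex along these lines can pull it out of the fetched HTML automatically; this is a sketch that assumes page_source holds the HTML returned by addHeaders and that the declaration uses the usual "base64," data-URL prefix.

import re
import base64

# assumption: the font is declared as src:url('data:application/font-ttf;...;base64,XXXX')
match = re.search(r"base64,([A-Za-z0-9+/=]+)", page_source)
if match:
    font_face = match.group(1)
    with open('F:/house/58.ttf', 'wb') as f:
        f.write(base64.b64decode(font_face))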
Open the TTF with FontCreator (download it from Baidu) and read off the mapping table (as shown in the figure below).
Then we can move the mapping into a dict; note that the digits 0-9 also have to be written as Unicode escapes:
diction = {'\u9476': '\u0030', '\u9fa5': '\u0031', '\u9f92': '\u0032', '\u9ea3': '\u0033', '\u9fa4': '\u0034', '\u9a4b': '\u0035', '\u958f': '\u0036',
           '\u993c': '\u0037', '\u9e3a': '\u0038', '\u9f64': '\u0039'}  # the mapping table written out as a dict
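If you would rather not install FontCreator, the fontTools package (pip install fonttools) can dump the same information from the TTF; you still have to look at which digit each glyph actually draws to pair codepoints with digits, since the glyph names alone do not tell you.

from fontTools.ttLib import TTFont

font = TTFont('F:/house/58.ttf')
cmap = font['cmap'].getBestCmap()        # maps unicode codepoint -> glyph name
for code, glyph_name in sorted(cmap.items()):
    print(hex(code), glyph_name)         # pair each codepoint with the digit its glyph renders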
from urllib import request
import json
from lxml import etree
import base64

font_face = 'AAEAAAA.......AAA'    # the base64-encoded font copied from the page
b = base64.b64decode(font_face)    # decode the base64 mapping table from the page
with open('F:/house/58.ttf', 'wb') as f:   # and save it as a TTF font file
    f.write(b)

# the mapping table read out of the TTF, written as a dict
diction = {'\u9476': '\u0030', '\u9fa5': '\u0031', '\u9f92': '\u0032', '\u9ea3': '\u0033', '\u9fa4': '\u0034',
           '\u9a4b': '\u0035', '\u958f': '\u0036', '\u993c': '\u0037', '\u9e3a': '\u0038', '\u9f64': '\u0039'}

def addHeaders(user_agent, referer, requestURL, con, addr):
    # route the request through a proxy and spoof a normal browser visit
    proxy = request.ProxyHandler({con: addr})
    opener = request.build_opener(proxy)
    request.install_opener(opener)
    headers = {'User-Agent': user_agent, 'referer': referer}
    rq = request.Request(requestURL, headers=headers)
    response = request.urlopen(rq)
    data = response.read().decode("utf-8")
    return data

userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
referer = 'https://su.58.com/zufang/pn2/?PGTID=0d300008-0000-5386-6138-sss&ClickID=4'
requestURL = 'https://su.58.com/zufang/pn2/?PGTID=0d300008-0000-5386-6138-sss&ClickID=4'
ip_addr = '120.25.219.228:80'
result = []

pageSource = addHeaders(userAgent, referer, requestURL, 'http', ip_addr)   # lowercase scheme for ProxyHandler
selector = etree.HTML(pageSource)
items = selector.xpath('//ul[@class="house-list"]/li')
for item in items:
    title = item.xpath('./div[@class="des"]/h2/a/text()')
    price = item.xpath('./div[@class="list-li-right"]/div[@class="money"]/b/text()')
    room = item.xpath('./div[@class="des"]/p[@class="room"]/text()')
    if title and price and room:
        title = title[0].replace("\n", "").strip()
        price = price[0]
        room = room[0]
        for key, value in diction.items():
            # replace the obfuscated characters with real digits via the mapping table
            price = price.replace(key, value).strip()
            room = room.replace(key, value).strip()
        result.append({'title': title, 'price': price, 'room': room})

with open("F:/house/house.txt", 'w', encoding='utf-8') as file:
    file.write(json.dumps(result, ensure_ascii=False))
print(result)
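As a side note, the replace loop can be collapsed into a single str.translate call, which is the more idiomatic way to do bulk character substitution in Python; a minimal sketch using the same diction table:

# build a translation table once, then decode any obfuscated string in one call
trans = str.maketrans(diction)
price = price.translate(trans)
room = room.translate(trans)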