Having some spare time, I wrote a crawler for 51job. It crawls the details of every recruiting company in selected regions and finally exports them to Excel as a data report.
The crawler is built on Python's Scrapy framework, which is simple and efficient. You will also need some XPath knowledge to read the extraction rules, so a little study is required before the rule syntax makes sense.
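If XPath is new to you, a tiny warm-up may help. The snippet below is a minimal sketch using Scrapy's Selector on a made-up HTML fragment that mirrors 51job's structure; the tag names match what we will meet later, but the fragment itself is invented for illustration.

from scrapy.selector import Selector

html = '''
<div id="resultList">
    <div class="el">
        <span class="t2"><a href="http://example.com/co1">Company A</a></span>
        <span class="t3">Shanghai</span>
    </div>
</div>
'''
sel = Selector(text=html)
# //div[@id="resultList"] finds the div whose id is "resultList",
# /div[@class="el"] steps into each row, a/text() takes the link text
print(sel.xpath('//div[@id="resultList"]/div[@class="el"]/span[@class="t2"]/a/text()').extract())
# ['Company A']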
The most important part of writing a web crawler is analyzing the HTML structure of the target site. Open 51job's search page, press F12 to open the browser's developer tools, refresh the page, and you can see the URL requested from the backend.
In the screenshot above, the Request URL is the link we need, and it carries a curr_page=1 parameter, which means this parameter lets us jump to any page number.
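To make that concrete, here is a minimal sketch of building page URLs from curr_page; base_url is abbreviated here, and the full parameter list appears in the spider code further down.

base_url = ("http://search.51job.com/jobsearch/search_result.php"
            "?fromJs=1&keywordtype=1&curr_page={page}&lang=c&stype=1")

def page_url(page):
    # only curr_page varies; every other query parameter stays fixed
    return base_url.format(page=page)

print(page_url(1))   # first results page
print(page_url(25))  # jump straight to page 25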
Analyzing the second screenshot: since we want to grab the information of every company in the list, we should first locate the head of that list. There is a div tag with id="resultList", so we can use it as the entry point. Starting from id="resultList", the whole HTML structure becomes visible: each company's information is hidden under a div with class="el". Finally, XPath syntax lets us extract exactly the information we want.
For example:
for sel in response.xpath('//div[@id="resultList"]/div[@class="el"]'):
    try:
        # company link
        item['link'] = sel.xpath('span[@class="t2"]/a/@href').extract()[0]
        # company name
        item['company'] = sel.xpath('span[@class="t2"]/a/text()').extract()[0]
        # location
        item['address'] = sel.xpath('span[@class="t3"]/text()').extract()[0]
    except:
        pass
Finally: following this line of thinking, you can crawl whatever results you want; adapt it to your own situation.
The code
Programmers are used to understanding intent straight from the code, so the core code is pasted below.
# -*- coding: utf-8 -*-
import logging
import scrapy
import urllib
import codecs
import os
from scrapy.selector import Selector
from ..items import Scrapy51JobItem

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

keyword = "Python"
# encode the keyword so it is safe to embed in a URL
keywordcode = urllib.quote(keyword)
is_start_page = True


class TestfollowSpider(scrapy.Spider):
    name = "scrapy_51job"
    allowed_domains = ["51job.com"]
    start_urls = [
        # begin on page 1
        "http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=030000%2C140000%2C100000%2C110000%2C00&district=000000&funtype=2500&industrytype=32&issuedate=9&providesalary=99&keywordtype=1&curr_page=1&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&fromType=14&dibiaoid=0&confirmdate=9"
    ]

    def parse(self, response):
        global is_start_page
        url = ""
        # on the first call, parse the start page itself (start_urls);
        # afterwards follow the "next page" link in the pager
        if is_start_page:
            url = self.start_urls[0]
            is_start_page = False
        else:
            href = response.xpath('//div[@class="dw_page"]/div[@class="p_box"]/div[@class="p_wp"]/div[@class="p_in"]/ul/li[@class="bk"]/a/@href')
            url = href.extract()[0]
        # print url
        yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//div[@id="resultList"]/div[@class="el"]'):
            item = Scrapy51JobItem()
            try:
                # company link
                item['link'] = sel.xpath('span[@class="t2"]/a/@href').extract()[0]
                # company name
                item['company'] = sel.xpath('span[@class="t2"]/a/text()').extract()[0]
                # location
                item['address'] = sel.xpath('span[@class="t3"]/text()').extract()[0]
                yield item
            except:
                # dump the page that failed to parse, for debugging
                fileObj = open("/tmp/error_page.html", 'w', buffering=-1)
                fileObj.write(response.body)
                fileObj.close()
        # keep a copy of the last page that parsed normally
        fileObj = open("/tmp/normal_page.html", 'w', buffering=-1)
        fileObj.write(response.body)
        fileObj.close()
        try:
            # read the current page number from the pager, then build the
            # next page's URL by bumping curr_page
            now_page = response.xpath('//div[@class="dw_page"]/div[@class="p_box"]/div[@class="p_wp"]/div[@class="p_in"]/ul/li[@class="on"]/text()').extract()[0]
            print now_page
            next_page = """http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=030000%2C140000%2C100000%2C110000%2C00&district=000000&funtype=2500&industrytype=32&issuedate=9&providesalary=99&keywordtype=1&curr_page=""" + str(int(now_page) + 1) + """&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&fromType=14&dibiaoid=0&confirmdate=9"""
        except:
            next_page = False
            fileObj = open("/tmp/error_page.html", 'w', buffering=-1)
            fileObj.write(response.body)
            fileObj.close()
        if next_page:
            print 'Crawling next page....'
            url = next_page
            yield scrapy.Request(url, callback=self.parse_dir_contents)
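Once the pipeline and item below are in place, the spider is started from the project root with scrapy crawl scrapy_51job (matching the name attribute defined above).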
Next comes the class in pipelines.py:
import json
import codecs

class Scrapy51JobPipeline(object):
    def __init__(self):
        # one JSON object per line, written as the items come in
        self.file = codecs.open('/tmp/51job.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # called by Scrapy on every item pipeline when the spider finishes
        self.file.close()
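For Scrapy to route items through this pipeline it must be registered in settings.py. A minimal sketch; the dotted path assumes the project package is called scrapy_51job, so adjust it to your own layout:

ITEM_PIPELINES = {
    # lower numbers run earlier when several pipelines are registered
    'scrapy_51job.pipelines.Scrapy51JobPipeline': 300,
}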
The item also needs its fields defined:
import scrapy
from scrapy import Field

class Scrapy51JobItem(scrapy.Item):
    # define the fields the spider fills in
    link = Field()
    company = Field()
    address = Field()
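Finally, to get the Excel report mentioned at the beginning, the JSON lines file can be converted after the crawl finishes. This is only a sketch under the assumption that pandas (with the openpyxl writer) is installed; neither is part of the original code.

import pandas as pd

# each line of /tmp/51job.json is one JSON object written by the pipeline
df = pd.read_json('/tmp/51job.json', lines=True)

# one row per company: link, company, address
df.to_excel('/tmp/51job.xlsx', index=False)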
Some of the ideas and code here are adapted from the article Scrapy爬虫实践之搜索并获取前程无忧职位信息(基础篇).