Preface: to get a picture of how software-testing positions are being recruited across Chongqing's districts, we scrape the job postings returned by a search on 前程无忧 (51job), write the information into a database, and then analyze the data.
1. Create the Scrapy project:
scrapy startproject counter
2. Generate the spider:
cd counter
scrapy genspider cqtester www.51job.com
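genspider only writes a skeleton; counter/spiders/cqtester.py starts out roughly like this (step 6 below fills it in):
# -*- coding: utf-8 -*-
import scrapy

class CqtesterSpider(scrapy.Spider):
    name = 'cqtester'
    allowed_domains = ['www.51job.com']
    start_urls = ['http://www.51job.com/']

    def parse(self, response):
        pass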
3. Define the fields we need -- items.py
import scrapy

class CounterItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()        # job title
    company = scrapy.Field()      # company name
    desc = scrapy.Field()         # job description
    salary = scrapy.Field()       # salary range
    location = scrapy.Field()     # work location
    date = scrapy.Field()         # posting date
    datasource = scrapy.Field()   # data source
4. Configure MySQL in the settings
# MySQL configuration
MYSQL_HOST = '192.168.25.214'
MYSQL_DBNAME = 'test'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'Soft.rz'
CHARSET = 'utf8'
MYSQL_PORT = 3306

ITEM_PIPELINES = {
    'counter.pipelines.CounterPipeline': 300,
}
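The pipeline below inserts into a table named cqtester, so that table has to exist before the first crawl. A minimal one-off creation script, assuming plain VARCHAR columns (the column types and lengths are my guesses, not from the original project):
import pymysql

# One-off helper: create the target table. Column types/lengths are assumptions.
con = pymysql.connect(host='192.168.25.214', user='root', passwd='Soft.rz',
                      db='test', charset='utf8', port=3306)
with con.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS cqtester (
            Id         INT AUTO_INCREMENT PRIMARY KEY,
            Title      VARCHAR(255),
            Company    VARCHAR(255),
            Salary     VARCHAR(64),
            Location   VARCHAR(128),
            `date`     VARCHAR(32),
            DataSource VARCHAR(32)
        ) DEFAULT CHARSET=utf8
    """)
con.commit()
con.close()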
5. Assemble the data and save it to the MySQL database
import pymysql
# scrapy.conf was removed; this is the supported way to read settings.py
from scrapy.utils.project import get_project_settings

settings = get_project_settings()

class CounterPipeline(object):
    def process_item(self, item, spider):
        host = settings['MYSQL_HOST']
        user = settings['MYSQL_USER']
        pwd = settings['MYSQL_PASSWD']
        db = settings['MYSQL_DBNAME']
        c = settings['CHARSET']
        port = settings['MYSQL_PORT']
        # open the database connection
        con = pymysql.connect(host=host, user=user, passwd=pwd, db=db, charset=c, port=port)
        # database cursor
        cur = con.cursor()
        print("MySQL connect success!")
        sqls = "insert into cqtester(Title,Company,Salary,Location,date,DataSource) values(%s,%s,%s,%s,%s,%s)"
        paras = (item['title'], item['company'], item['salary'],
                 item['location'], item['date'], item['datasource'])
        try:
            cur.execute(sqls, paras)
            print("insert success")
        except Exception as e:
            print("Insert error:", e)
            con.rollback()
        else:
            con.commit()
        con.close()
        return item
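Opening and closing a connection for every single item works but is wasteful. Scrapy pipelines also provide open_spider/close_spider hooks, so one shared connection can be reused for the whole crawl; a sketch of that variant, using the same settings keys as above:
import pymysql

class CounterPipeline(object):
    def open_spider(self, spider):
        # called once when the crawl starts: open a single shared connection
        s = spider.settings
        self.con = pymysql.connect(host=s['MYSQL_HOST'], user=s['MYSQL_USER'],
                                   passwd=s['MYSQL_PASSWD'], db=s['MYSQL_DBNAME'],
                                   charset=s['CHARSET'], port=s['MYSQL_PORT'])
        self.cur = self.con.cursor()

    def close_spider(self, spider):
        # called once when the crawl ends: release the connection
        self.cur.close()
        self.con.close()

    def process_item(self, item, spider):
        sqls = ("insert into cqtester(Title,Company,Salary,Location,date,DataSource) "
                "values(%s,%s,%s,%s,%s,%s)")
        try:
            self.cur.execute(sqls, (item['title'], item['company'], item['salary'],
                                    item['location'], item['date'], item['datasource']))
            self.con.commit()
        except Exception as e:
            print("Insert error:", e)
            self.con.rollback()
        return item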
6. Implementing the spider
# -*- coding: utf-8 -*-
import scrapy
from counter.items import CounterItem
from scrapy.http import Request

class CqtesterSpider(scrapy.Spider):
    name = 'cqtester'
    # use the bare domain so requests to jobs.51job.com are not filtered as offsite
    allowed_domains = ['51job.com']
    start_urls = ['https://jobs.51job.com/chongqing/ruanjianceshi/p1/']

    def parse(self, response):
        # total number of result pages, read from the hidden input on the page
        pages = int(response.xpath('//input[@id="hidTotalPage"]/@value').extract_first())
        for p in range(1, pages + 1):
            yield Request("https://jobs.51job.com/chongqing/ruanjianceshi/p" + str(p) + "/",
                          callback=self.parsecontent, dont_filter=True)

    def parsecontent(self, response):
        contents = response.xpath('//p[@class="info"]')
        for content in contents:
            item = CounterItem()
            # extract_first() returns a string (or None) instead of a list,
            # so the pipeline can pass the values straight to MySQL
            item['title'] = content.xpath('span/a/@title').extract_first()
            item['company'] = content.xpath('a/@title').extract_first()
            pays = content.xpath('span[@class="location"]/text()').extract_first()
            if not pays:
                pays = '面议'  # "salary negotiable" when no figure is listed
            item['salary'] = pays
            item['location'] = content.xpath('span[@class="location name"]/text()').extract_first()
            pushdate = content.xpath('span[@class="time"]/text()').extract_first()
            # pushdate = "2018-" + pushdate  # the site omits the year
            item['date'] = pushdate
            item['datasource'] = '51Job'
            yield item
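With the items, settings, pipeline and spider in place, the crawl is started from the project directory:
scrapy crawl cqtester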
7. Results of the run:
8. Analyze the data (FIC software)
So far only static pages have been scraped. The dynamic pages on Baidu Baipin (http://zhaopin.baidu.com/quanzhi?tid=4139&ie=utf8&oe=utf8&query=%E9%87%8D%E5%BA%86+%E8%BD%AF%E4%BB%B6%E6%B5%8B%E8%AF%95+%E6%8B%9B%E8%81%98&city_sug=%E9%87%8D%E5%BA%86) are still unhandled; their pagination is driven by JavaScript, and I will look into it later.
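For JS-driven pagination, one common approach is to open the browser's developer tools, find the XHR request that fetches each page of results, and call that endpoint directly from Scrapy. A rough sketch of the pattern only; the endpoint, its pn parameter, and the JSON layout below are hypothetical placeholders, not Baidu Baipin's actual API:
import json
import scrapy

class BaipinSpider(scrapy.Spider):
    name = 'baipin'
    # Hypothetical JSON endpoint, as one would discover via devtools; not verified.
    api = 'http://zhaopin.baidu.com/api/quanzhi?query=chongqing-ruanjianceshi&pn=%d'

    def start_requests(self):
        yield scrapy.Request(self.api % 0, callback=self.parse)

    def parse(self, response):
        data = json.loads(response.text)
        for job in data.get('data', []):   # hypothetical JSON layout
            yield {'title': job.get('title')}
        next_pn = data.get('next_pn')      # hypothetical pagination field
        if next_pn is not None:
            yield scrapy.Request(self.api % next_pn, callback=self.parse)
If no clean endpoint exists, rendering the page with a headless browser (e.g. Selenium or scrapy-splash) is the usual fallback.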