A note up front: the Zhilian Zhaopin (智联招聘) crawler itself is secondary here; the real goal is a small data-mining project built on the job data from Zhilian. This post covers how to scrape the job listings from Zhilian in one go.
(I) Using the Scrapy framework
Scrapy is one of the more polished Python crawling frameworks. It supports distributed crawling (through extensions such as scrapy_redis) and already implements the whole chain from fetching and parsing to storing the results, so building on it spares you a lot of unnecessary bugs, provided of course that you understand it and know how to use it. I won't go over installing and getting started with Scrapy here; a quick search will turn up plenty of tutorials.
(II) Creating the project
Pick a suitable working directory and generate a Scrapy project from the command line; I chose the E: drive.
If you can't remember the Scrapy commands, just type scrapy at the prompt and it will print the available subcommands.
Command 1:
scrapy startproject zhilianspider
This creates the project itself; next we generate a spider inside it.
Command 2:
scrapy genspider zhilian "https://m.zhaopin.com/beijing"
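If the two commands succeed, genspider drops a skeleton spider into zhilianspider/spiders/zhilian.py. It looks roughly like this (the exact allowed_domains and start_urls values depend on the argument you pass, so treat this as a sketch rather than the literal output):

# -*- coding: utf-8 -*-
import scrapy


class ZhilianSpider(scrapy.Spider):
    # the name used by "scrapy crawl zhilian"
    name = 'zhilian'
    allowed_domains = ['m.zhaopin.com']
    start_urls = ['https://m.zhaopin.com/beijing']

    def parse(self, response):
        # the parsing logic is filled in later in this post
        pass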
(III) Opening the project in PyCharm
Try to open it like this to keep the friction low. In the screenshot, the blurred files are ones I created myself (they are shown below); the unblurred ones are the files generated by the project template.
(IV) Writing the spider
(1) items.py
import scrapy


class ZhilianspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    job_name = scrapy.Field()
    job_link = scrapy.Field()
    job_info = scrapy.Field()
    job_tags = scrapy.Field()
    company = scrapy.Field()
    address = scrapy.Field()
    salary = scrapy.Field()
The fields collected are the job title, the detail-page link, the job description, the job tags, the company, the address, and the salary.
(2) pipelines.py (storing the data in MongoDB)
import pymongo


class ZhilianspiderPipeline(object):

    def __init__(self):
        # connect=False defers the actual connection until first use
        self.client = pymongo.MongoClient("localhost", connect=False)
        db = self.client["zhilian"]
        self.collection = db["python"]

    def process_item(self, item, spider):
        content = dict(item)
        # insert_one() instead of the deprecated insert()
        self.collection.insert_one(content)
        print("################### saved to MongoDB ########################")
        return item

    def close_spider(self, spider):
        self.client.close()
A local MongoDB instance is used here; the items end up in the python collection of the zhilian database.
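Once a handful of items have been stored, you can sanity-check the collection with a few lines of pymongo. This is a standalone helper added here purely for illustration, not part of the generated project; it assumes MongoDB is listening on the default port 27017 and that pymongo is at least 3.7 (for count_documents):

import pymongo

# connect to the same database/collection the pipeline writes to
client = pymongo.MongoClient("localhost", 27017)
collection = client["zhilian"]["python"]

print("documents stored:", collection.count_documents({}))
for doc in collection.find().limit(3):
    print(doc.get("job_name"), doc.get("salary"))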
(3) middlewares.py (mainly for dealing with anti-crawling measures)
import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

from zhilianspider.ua_phone import ua_list


class UserAgentmiddleware(UserAgentMiddleware):
    """Attach a random mobile User-Agent header to every outgoing request."""

    def process_request(self, request, spider):
        agent = random.choice(ua_list)
        request.headers['User-Agent'] = agent
Contents of ua_phone.py:
ua_list = [
"HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
"Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522 (KHTML, like Gecko) Safari/419.3",
"Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
]
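To double-check that the middleware really is swapping the header, one option (my own debugging habit, nothing from the original post) is to log the header from inside the spider's parse method shown further below:

# drop this line into ZhilianSpider.parse(), purely for debugging
self.logger.info("User-Agent used: %s", response.request.headers.get('User-Agent'))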
(4) settings.py
The main changes: turn off robots.txt compliance, delay requests by 0.5 seconds, register the User-Agent middleware, and enable the item pipeline.
# do not obey robots.txt and pause 0.5 s between requests
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0.5

# the UA class overrides process_request, a downloader-middleware hook,
# so it belongs in DOWNLOADER_MIDDLEWARES rather than SPIDER_MIDDLEWARES
DOWNLOADER_MIDDLEWARES = {
    'zhilianspider.middlewares.UserAgentmiddleware': 400,
}

ITEM_PIPELINES = {
    'zhilianspider.pipelines.ZhilianspiderPipeline': 300,
}
(5) zhilian.py (the spider's parsing logic)
# -*- coding: utf-8 -*-
import scrapy
from zhilianspider.items import ZhilianspiderItem
from bs4 import BeautifulSoup


class ZhilianSpider(scrapy.Spider):
    name = 'zhilian'
    allowed_domains = ['m.zhaopin.com']
    # start_urls = ['https://m.zhaopin.com/hangzhou/']
    start_urls = ['https://m.zhaopin.com/beijing-530/?keyword=python&pageindex=1&maprange=3&islocation=0']
    base_url = 'https://m.zhaopin.com/'

    def parse(self, response):
        print(response.url)
        # note: response.body (bytes), not response.text
        soup = BeautifulSoup(response.body, 'lxml')
        all_sec = soup.find('div', class_='r_searchlist positiolist').find_all('section')
        for sec in all_sec:
            d_link = sec.find('a', class_='boxsizing')['data-link']
            detail_link = self.base_url + d_link
            if detail_link:
                yield scrapy.Request(detail_link, callback=self.parse_detail)
        # follow the link to the next page, if there is one
        if soup.find('a', class_='nextpage'):
            next_url = self.base_url + soup.find('a', class_='nextpage')['href']
            print('next_url ', next_url)
            # dont_filter=True: do not drop the request even if the URL was seen before
            yield scrapy.Request(next_url, callback=self.parse, dont_filter=True)

    def parse_detail(self, response):
        item = ZhilianspiderItem()
        item['job_link'] = response.url
        item['job_name'] = response.xpath('//*[@class="job-name fl"]/text()')[0].extract()
        item['company'] = response.xpath('//*[@class="comp-name"]/text()')[0].extract()
        item['address'] = response.xpath('//*[@class="add"]/text()').extract_first()
        item['job_info'] = ''.join(response.xpath('//*[@class="about-main"]/p/text()').extract())
        item['salary'] = response.xpath('//*[@class="job-sal fr"]/text()')[0].extract()
        item['job_tags'] = ';'.join(response.xpath("//*[@class='tag']/text()").extract())
        yield item
(V) Running the spider
- Option 1: scrapy crawl zhilian
- Option 2 (recommended): create a run.py file and run that instead, as sketched below.
Here I use the second option; you can simply right-click the file in PyCharm and run it, which is a lot more convenient.
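A run.py only needs a couple of lines. Here is a sketch, assuming the file sits in the project root next to scrapy.cfg (the original post showed this code only as a screenshot):

# run.py: start the spider from inside PyCharm with a right click
from scrapy import cmdline

# equivalent to typing "scrapy crawl zhilian" in a terminal
cmdline.execute("scrapy crawl zhilian".split())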
(VI) The data in MongoDB
In total, 4,541 Python job postings in Beijing were crawled.
Next up, I will use scrapy_redis to crawl rental listings from Fang.com (房天下) in a distributed fashion.