Python crawler: scraping Tencent job postings with Scrapy

Step 1: Create the project:

See https://blog.csdn.net/mgmgzm/article/details/85849918 for how to create the project.
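
For reference, creating the project and the spider looks roughly like this (the project name day9 and the spider name mg_tencent match the code below):

scrapy startproject day9
cd day9
scrapy genspider mg_tencent hr.tencent.com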

Step 2: Requirements analysis:

1) Fetch the Tencent recruitment search results pages

2) Fetch the detail information for each result

3) Parse each detail page in a second request (see the URL sketch below)
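
Roughly, the URLs involved look like this (the patterns are taken from the spider code below):

# search results pages: 10 postings per page, paginated via the start offset (page * 10)
https://hr.tencent.com/position.php?keywords=Python&tid=0&start=0     # page 1
https://hr.tencent.com/position.php?keywords=Python&tid=0&start=10    # page 2
# each row's job title links to a relative detail URL, which is later joined as
# 'https://hr.tencent.com/' + detail_url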


Step 3: Enough talk, on to the code:

settings.py configuration:

# Enable the USER_AGENT request header
USER_AGENT = 'day9 (+http://www.yourdomain.com)'

# Change ROBOTSTXT_OBEY to False
ROBOTSTXT_OBEY = False

# Enable ITEM_PIPELINES
ITEM_PIPELINES = {
   'day9.pipelines.Day9Pipeline_tengxun': 300,
}

# Logging: add these lines at the end of settings.py
LOG_FILE = 'meiju.log'
LOG_ENABLED = True
LOG_ENCODING = 'utf-8'
LOG_LEVEL = 'DEBUG'
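
One quick way to verify that these values took effect is Scrapy's settings command, run from the project directory:

scrapy settings --get ROBOTSTXT_OBEY
scrapy settings --get USER_AGENT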

The spider file:

# -*- coding: utf-8 -*-
import scrapy
from lxml import etree
from ..items import Day9TengXun
import datetime,time


class MgTencentSpider(scrapy.Spider):
    name = 'mg_tencent'
    allowed_domains = ['hr.tencent.com']
    start_urls = []
    # Target URL (search results page, paginated via the start offset)
    base_url = 'https://hr.tencent.com/position.php?keywords=Python&tid=0&start=%d'
    # The search results span 55 pages in total
    for page in range(0,55):
        url = base_url%(page*10)
        # Add each URL to start_urls
        start_urls.append(url)

    # Parse the search results page
    def parse(self, response):
        html = response.body.decode('utf-8')
        # Parse the page with lxml XPath
        tree = etree.HTML(html)
        tr_list = tree.xpath('//table[@class="tablelist"]/tr')
        # Drop the header row and the trailing footer row
        tr_list.pop(0)
        tr_list.pop()
        # Extract data from each remaining row
        for tr in tr_list:
            # Item object (from items.py) used to hold the data
            item = Day9TengXun()
            td = tr.xpath('.//td')

            # Job title
            position1 = td[0].xpath('./a/text()')[0]
            item['position']=position1

            # Job category
            types1 = td[1].xpath('./text()')[0]
            item['types']=types1

            # Number of openings
            num1 = td[2].xpath('./text()')[0]
            item['num']=num1

            # Work location
            address1 = td[3].xpath('./text()')[0]
            item['address']=address1

            # Publish date
            time1 = td[4].xpath('./text()')[0]
            item['time']=time1

            # Detail page link (relative href)
            detail_url = td[0].xpath('./a/@href')[0]
            item['detail_url']=detail_url

            # Timestamp recording when the item was scraped
            item['log_time'] = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')

            # Absolute URL for the second (detail page) request
            detail_urls = 'https://hr.tencent.com/' + detail_url
            # dont_filter=False keeps Scrapy's duplicate filter on; the item is passed to the detail callback via meta
            yield scrapy.Request(detail_urls,callback=self.parse_detail,meta={'data':item},dont_filter=False)

    # Parse the detail page
    def parse_detail(self, response):
        # Retrieve the item passed in via meta
        item = response.meta['data']
        # Decode the response body
        content = response.body.decode('utf-8')
        # Parse the page with lxml XPath
        tree1 = etree.HTML(content)
        # Extract the duty and requirement rows
        duty_list = tree1.xpath('//table[@class="tablelist textl"]/tr[3]//text()')
        duty_list1 = tree1.xpath('//table[@class="tablelist textl"]/tr[4]//text()')

        duty1=''
        for duty in duty_list:
            # Strip extra whitespace from each text fragment
            duty1 = duty1 + duty.strip()

        duty2=''
        for dutys in duty_list1:
            duty2 = duty2 + dutys.strip()

        # Job duties
        item['duty1'] = duty1
        # Job requirements
        item['duty2'] = duty2

        yield item
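
The parse() above decodes response.body and re-parses it with lxml; Scrapy's built-in selectors can do the same extraction directly from the response. A minimal sketch of the row loop written that way (extract_first() and response.follow, which resolves the relative detail link, are standard Scrapy APIs; this is an alternative, not the author's code):

    def parse(self, response):
        # position()>1 and position()<last() skips the header and footer rows
        rows = response.xpath('//table[@class="tablelist"]/tr[position()>1 and position()<last()]')
        for tr in rows:
            item = Day9TengXun()
            item['position'] = tr.xpath('./td[1]/a/text()').extract_first()
            item['detail_url'] = tr.xpath('./td[1]/a/@href').extract_first()
            item['types'] = tr.xpath('./td[2]/text()').extract_first()
            item['num'] = tr.xpath('./td[3]/text()').extract_first()
            item['address'] = tr.xpath('./td[4]/text()').extract_first()
            item['time'] = tr.xpath('./td[5]/text()').extract_first()
            # response.follow resolves the relative href against the current page URL
            yield response.follow(item['detail_url'], callback=self.parse_detail, meta={'data': item})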

The items file:

The fields declared in items.py must cover everything the spider stores; the spider must not store any field that is not declared in items.py.

import scrapy


class Day9TengXun(scrapy.Item):
    position = scrapy.Field()
    types = scrapy.Field()
    num = scrapy.Field()
    address = scrapy.Field()
    detail_url = scrapy.Field()    
    time = scrapy.Field()
    duty1 = scrapy.Field()
    duty2 = scrapy.Field()
    log_time = scrapy.Field()
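
Assigning a field that is not declared on the Item raises a KeyError, which is why the fields above have to cover everything the spider stores, e.g.:

item = Day9TengXun()
item['position'] = 'Python Engineer'   # fine: declared above
item['salary'] = '20k'                 # KeyError: Day9TengXun does not support field: salary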

The pipelines file:

import pymongo,json

class Day9Pipeline_tengxun(object):
    def __init__(self):
        # Connect to MongoDB
        self.client = pymongo.MongoClient('localhost')
        # Select (create) the database
        self.db = self.client['tencent']
        # Select (create) the collection
        self.table = self.db['tengxun']

    def process_item(self, item, spider):
        # Insert the item into MongoDB (insert_one replaces the deprecated insert())
        self.table.insert_one(dict(item))
        # Also append the item to a JSON-lines file
        with open('tx.json','a',encoding='utf-8') as fp:
            json.dump(dict(item),fp,ensure_ascii=False)
            fp.write('\n')

        return item
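
Reopening tx.json for every item works but is wasteful, and the MongoClient is never closed. A common refinement is to use the pipeline's open_spider/close_spider hooks so both resources are opened once per crawl; a minimal sketch:

import pymongo,json

class Day9Pipeline_tengxun(object):
    def open_spider(self, spider):
        # Open the MongoDB connection and the output file once per crawl
        self.client = pymongo.MongoClient('localhost')
        self.table = self.client['tencent']['tengxun']
        self.fp = open('tx.json', 'a', encoding='utf-8')

    def close_spider(self, spider):
        # Release both resources when the crawl finishes
        self.fp.close()
        self.client.close()

    def process_item(self, item, spider):
        self.table.insert_one(dict(item))
        json.dump(dict(item), self.fp, ensure_ascii=False)
        self.fp.write('\n')
        return item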

Run the spider: scrapy crawl mg_tencent (the crawl command takes the name defined on the spider class)

Export the MongoDB data:

From the MongoDB bin directory (on MongoDB 3.0 and later the CSV format is selected with --type=csv; older releases used --csv):

mongoexport -d tencent -c tengxun --type=csv -f position,time -o d:/tencent.csv
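
To spot-check the stored data before exporting, a quick pymongo query works (a minimal sketch, assuming MongoDB is running on localhost as in the pipeline above; count_documents requires pymongo 3.7+):

import pymongo

client = pymongo.MongoClient('localhost')
table = client['tencent']['tengxun']
print(table.count_documents({}))        # number of stored postings
print(table.find_one({}, {'_id': 0}))   # one sample document, without the _id field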

 
