Scraping 51job (前程无忧) with Python and Scrapy


I. Preparation
Python: 3.x
Scrapy
PyCharm
II. Scraping target
Scrape job listings from 51job (前程无忧). This example uses "python" as the search keyword, crawls the matching listings with Scrapy, and saves the scraped data to a CSV file.
III. Scraping steps
1. Create a new Scrapy project.
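The original post shows this step as a screenshot. For reference, the equivalent terminal commands would be roughly the following (assuming the project name qcwy used throughout this post):

scrapy startproject qcwy
cd qcwy
scrapy genspider main 51job.com
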
2. Define an Item class for the fields we want to scrape.

import scrapy

class QcwyItem(scrapy.Item):

    job_name = scrapy.Field()      # job title
    company = scrapy.Field()       # company name
    area = scrapy.Field()          # work location
    salary = scrapy.Field()        # salary range
    publish_time = scrapy.Field()  # posting date
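
Scrapy items behave like dicts, which is how the spider below fills them in. A quick illustration (the value here is made up):

item = QcwyItem()
item['job_name'] = 'Python工程师'  # hypothetical value, just to show dict-style access
print(item['job_name'])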

3. Configure settings.py
1) Disable robots.txt compliance

ROBOTSTXT_OBEY = False

2) Set the default request headers

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36',
}
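
Two optional settings worth adding here (my own suggestions, not part of the original post): a download delay keeps the request rate polite, and a UTF-8 BOM lets Excel display the Chinese text in the exported CSV correctly.

DOWNLOAD_DELAY = 1                  # wait 1 second between requests to reduce the risk of being blocked
FEED_EXPORT_ENCODING = 'utf-8-sig'  # BOM so Excel renders Chinese characters in the CSV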

4. The spider's main crawl and parse functions

# -*- coding: utf-8 -*-
import scrapy
from qcwy.items import QcwyItem
from scrapy.http import Request


class MainSpider(scrapy.Spider):
    name = 'main'
    # allowed_domains = ['51job.com']
    start_urls = ['https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,1.html']


    # Generate the page URLs to crawl, pages 1 through 723.
    '''
    If following the "next page" link via a callback does not work,
    you can generate the page URLs yourself, as done here.
    '''
    def start_requests(self):
        for i in range(1, 724):
            url = 'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,%d.html' % i
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # print(response.body)  # debug: inspect the raw HTML if the XPaths return nothing
        jobs = response.xpath('//div[@class="el"]')
        for job in jobs:
            # Each listing row holds the title, company, location, salary and posting date.
            item = QcwyItem()
            item['job_name'] = job.xpath('p/span/a/@title').extract_first()
            item['company'] = job.xpath('span/a/@title').extract_first()
            item['area'] = job.xpath('span[@class="t3"]/text()').extract_first()
            item['salary'] = job.xpath('span[@class="t4"]/text()').extract_first()
            item['publish_time'] = job.xpath('span[@class="t5"]/text()').extract_first()
            yield item

        '''
        If the "next page" link can be followed directly, or joined into an
        absolute URL, the approach below works instead of start_requests.
        '''
        # nextpage = response.xpath('//ul/li[@class="bk"]/a/@href').extract_first()
        # if nextpage:
        #     url = response.urljoin(nextpage)  # build an absolute URL from the relative href
        #     yield scrapy.Request(url=url, callback=self.parse)
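
A side note (my addition, not from the original post): Scrapy 1.4+ also provides response.follow, which resolves relative hrefs for you, so the same next-page logic can be written as a shorter sketch:

next_href = response.xpath('//ul/li[@class="bk"]/a/@href').extract_first()
if next_href:
    # response.follow joins the relative href against the current page URL automatically
    yield response.follow(next_href, callback=self.parse)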

5. Add a run script, start.py, in the project root.

from scrapy import cmdline

# Equivalent to running "scrapy crawl main -o qcwy.csv" in a terminal
cmdline.execute("scrapy crawl main -o qcwy.csv".split())
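
One caveat worth knowing (my note, not from the original post): -o appends to an existing file, so rerunning the crawl duplicates rows; on Scrapy 2.0+ you can pass -O instead to overwrite:

scrapy crawl main -O qcwy.csv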

IV. The final data saved to the CSV file
(screenshot in the original post: the scraped job data in qcwy.csv)
V. Closing thoughts
This is my first blog post on the Python journey; the road ahead isn't easy, so let's encourage one another.
