Inspired by 挖掘机小王子

PS: For setting up the environment there are plenty of blog posts to follow; this post only covers the overall approach and my own notes.

Using the Scrapy framework boils down to a few steps (the generated project layout is sketched right after this list):

  • scrapy startproject jobSpider
  • cd jobSpider
  • scrapy genspider job 51job.com  (genspider needs both a spider name and a domain)
  • edit job.py and write the spider logic
  • scrapy crawl job
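
After startproject and genspider, the project layout looks roughly like this; the files edited below are spiders/job.py, items.py, pipelines.py and settings.py:

jobSpider/
    scrapy.cfg
    jobSpider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            job.py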

We start from this search results page: start_urls = ['https://search.51job.com/list/020000,000000,0000,00,9,99,Python%2520%25E9%25AB%2598%25E7%25BA%25A7,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=']
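
For context, once that URL is in place the top of spiders/job.py looks roughly like this (the class name and the name attribute are what genspider generates; allowed_domains is an assumption based on the target site), and the two parse callbacks below are its methods:

import scrapy
from jobSpider.items import JobspiderItem


class JobSpider(scrapy.Spider):
    name = 'job'
    allowed_domains = ['51job.com']  # assumption: keep the crawl on the target site
    start_urls = ['https://search.51job.com/list/...']  # the full search URL shown above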

    def parse(self, response):
        # each job posting on the results page is a div with class "el"
        selectors = response.xpath('//div[@class="el"]')
        for selector in selectors:
            # link to the posting's detail page
            url = selector.xpath('./p/span/a/@href').get(default='')
            if url:
                print(url)
                # follow the link; the detail page is handled by parseDetail
                yield scrapy.Request(url, callback=self.parseDetail)

    def parseDetail(self, response):
        # company name, job title and salary from the detail page
        corporation_name = response.xpath('//p[@class="cname"]/a/@title').get(default='')
        post_name = response.xpath('//div[@class="cn"]/h1/@title').get(default='')
        post_wage = response.xpath('//div[@class="cn"]/strong/text()').get(default='')

        # wrap the fields in the Item defined in items.py and hand it to the pipeline
        items = JobspiderItem(name=corporation_name, post=post_name, wage=post_wage)
        print(items)
        yield items




Modify the request headers in settings.py so the requests look like they come from a browser (a minimal header example follows the items.py code below), and define items.py according to the data you want to collect:

import scrapy


class JobspiderItem(scrapy.Item):
    # fields for the data we want to collect
    name = scrapy.Field()  # company name
    post = scrapy.Field()  # job title
    wage = scrapy.Field()  # salary
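
As for the header change mentioned above, the simplest version is to set a browser-like User-Agent in settings.py (the UA string below is only an example, not necessarily the one used originally):

# pretend to be a regular desktop browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36'

# or supply full default headers for every request
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}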

Since we want to store the data in a database, modify pipelines.py as follows:

from pymongo import MongoClient
from jobSpider.settings import MONGO_HOST, MONGO_PORT, MONGO_DB, MONGO_COLLECTION


class JobspiderPipeline(object):
    def __init__(self):
        # connect to MongoDB with the values defined in settings.py
        self.client = MongoClient(MONGO_HOST, MONGO_PORT)
        self.db = self.client[MONGO_DB]
        self.collection = self.db[MONGO_COLLECTION]

    def process_item(self, item, spider):
        # scrapy Items are not plain dicts, so convert before inserting
        if not isinstance(item, dict):
            item = dict(item)
        self.collection.insert_one(item)
        return item
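
Optionally, the connection can be released when the crawl finishes; Scrapy calls close_spider() on every enabled pipeline, so a small addition to the class above is enough (a sketch):

    def close_spider(self, spider):
        # close the MongoDB client when the spider shuts down
        self.client.close()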

**The following was added to settings.py so that the connection details (host, port, database, etc.) don't have to be edited inside pipelines.py every time:**


# enable the Item Pipeline component
ITEM_PIPELINES = {
    'jobSpider.pipelines.JobspiderPipeline': 10,
}
# MongoDB connection details used by the pipeline
MONGO_HOST = '127.0.0.1'
MONGO_PORT = 27017
MONGO_DB = 'Job'
MONGO_COLLECTION = 'job'

With that in place, the scraped data goes straight into MongoDB. Alternatively, you can run scrapy crawl job -o job.csv or scrapy crawl job -o job.json; with this second, file-export approach there is no need to touch pipelines.py at all, yielding the items from job.py is enough.
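
To double-check that the documents actually reached MongoDB, a quick standalone script like the following works (it assumes the same connection values as in settings.py):

from pymongo import MongoClient

client = MongoClient('127.0.0.1', 27017)
collection = client['Job']['job']

# number of stored postings, plus a peek at one document
print(collection.count_documents({}))
print(collection.find_one())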
