A Scrapy spider for 51job

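PS: I ran into a few pitfalls. 1. A regex I wrote and assumed was fine did not actually match anything. 2. allowed_domains = ['51job.com']. At first it was search.51job.com, but one of my rules targets links on a different domain, so all of those requests were filtered out. I only found this by reading the DEBUG log. 3. An indentation problem that was invisible in vim; I spotted it when editing the file in nano.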
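
The second pitfall comes down to how a request's host is compared with allowed_domains. Below is a minimal sketch, assuming the rule is "the host equals an allowed domain or is a subdomain of it". That is my understanding of Scrapy's offsite filtering, not its actual code. The helper name is_allowed and the host jobs.51job.com are illustrative only.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host equals an allowed domain or is a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# With the narrow setting, a link on another subdomain is treated as offsite
print(is_allowed("http://jobs.51job.com/x.html", ["search.51job.com"]))
# With the broader setting, the same link is allowed
print(is_allowed("http://jobs.51job.com/x.html", ["51job.com"]))
```

Listing the broadest domain that covers every rule, here 51job.com, avoids silently dropped requests.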

The main files I changed are listed below:
1) pipelines.py

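import json


class WyjobPipeline(object):
    def __init__(self):
        # Python 3: open in append mode with an explicit encoding so
        # non-ASCII text is written as UTF-8 without a manual encode()
        self.file = open("hz.json", "a", encoding="utf-8")

    def process_item(self, item, spider):
        # One JSON object per line, each followed by a comma
        text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.file.write(text)
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()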
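
Because every line ends with a comma, hz.json as a whole is not valid JSON. A minimal sketch for reading it back, assuming each line holds exactly one JSON object followed by a trailing comma, as the pipeline above writes it. The function name load_items is my own.

```python
import json


def load_items(path):
    """Parse a file in which every line is a JSON object followed by a comma."""
    items = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Drop the newline and the trailing comma the pipeline appends
            line = line.strip().rstrip(",")
            if line:
                items.append(json.loads(line))
    return items
```

Calling load_items("hz.json") after a crawl should return a list of dicts, one per scraped job.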

2) items.py

import scrapy


class WyjobItem(scrapy.Item):
    # Field names must match the keys assigned in parse_item in hz.py
    position = scrapy.Field()
    info = scrapy.Field()
    welfare = scrapy.Field()
    duty = scrapy.Field()
    address = scrapy.Field()

3) settings.py

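# Assumes the project package is named wyjob, to match WyjobPipeline
# and WyjobItem; adjust the names if your project differs
BOT_NAME = 'wyjob'

SPIDER_MODULES = ['wyjob.spiders']
NEWSPIDER_MODULE = 'wyjob.spiders'
ITEM_PIPELINES = {
    'wyjob.pipelines.WyjobPipeline': 300,
}

# DEBUG-level logging to a file; filtered requests are reported here
LOG_FILE = "dg.log"
LOG_LEVEL = "DEBUG"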
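
The DEBUG log is what revealed the allowed_domains problem. Below is a minimal sketch for pulling the filtered domains out of dg.log, assuming the log contains messages of the form "Filtered offsite request to '<domain>'", which is the wording I have seen from Scrapy's offsite filtering and may differ between versions. The function name offsite_domains is my own.

```python
import re

# Assumed wording of the offsite message; check it against your own log
OFFSITE_RE = re.compile(r"Filtered offsite request to '([^']+)'")


def offsite_domains(log_lines):
    """Return the set of domains whose requests were reported as filtered."""
    domains = set()
    for line in log_lines:
        match = OFFSITE_RE.search(line)
        if match:
            domains.add(match.group(1))
    return domains
```

A file object can be passed directly, for example offsite_domains(open("dg.log", encoding="utf-8")). A non-empty result means some rule is producing links outside allowed_domains.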

4) hz.py

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import WyjobItem


class HzSpider(CrawlSpider):
    name = 'hz'
    allowed_domains = ['51job.com']
    start_urls = ['http://search.51job.com/list/080200,000000,0000,00,9,99,python,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=1&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=']

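    rules = (
        # Job detail pages: each response is passed to parse_item
        Rule(LinkExtractor(allow=(r's=\d+&t=\d+',)), callback='parse_item'),
        # Listing pages found under the given list items; with no callback
        # these links are only followed, not parsed
        Rule(LxmlLinkExtractor(allow=(r'lang=c&stype',),
                               restrict_xpaths=("/html/body/div[2]/div[4]/div[54]/div/div/div/ul/li",))),
    )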

    def parse_item(self, response):
        i = WyjobItem()
        # Take the first <h1> text node; raises IndexError if the page has none
        i["position"] = response.xpath("//div//h1/text()").extract()[0]
        # Join multiple text nodes into a single comma-separated string
        i["info"] = ",".join(response.xpath('//div[@class="t1"]//span/text()').extract())
        i["welfare"] = ",".join(response.xpath('//div[@class="jtag inbox"]//p/span/text()').extract())
        # Strip tabs and newlines left over from the page layout
        i["duty"] = " ".join(response.xpath('//div//div//div[@class="bmsg job_msg inbox"]/text()').extract()).replace(u"\t", '').replace(u"\n", '')
        i["address"] = " ".join(response.xpath('//div/div[@class="bmsg inbox"]/p/text()').extract()).replace(u"\t", '').replace(u"\n", '')
        yield i
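
Given the first pitfall, it can help to check a rule's pattern against a few URLs before starting the crawl. Below is a minimal sketch using only the re module. The sample URLs are made up for illustration and should be replaced with links copied from the real listing page. As far as I know, LinkExtractor applies its allow patterns as a search over the absolute URL, so re.search semantics are assumed here.

```python
import re

# Same pattern as the first rule in hz.py
DETAIL_RE = re.compile(r's=\d+&t=\d+')


def matching_urls(pattern, urls):
    """Return the URLs in which the pattern is found anywhere."""
    return [u for u in urls if pattern.search(u)]


# Hypothetical sample URLs; replace with links copied from the real page
samples = [
    "http://jobs.51job.com/hangzhou/123456.html?s=01&t=0",
    "http://search.51job.com/list/080200,000000,0000,00,9,99,python,2,2.html?lang=c&stype=1",
]
print(matching_urls(DETAIL_RE, samples))
```

If the list printed is empty for URLs that should match, the pattern is the problem rather than the spider.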
