P.S. For environment setup, refer to any of the many existing blog posts; this one only covers the approach and my own summary.
Using the Scrapy framework boils down to four steps, covered in order below: write the spider, define the items, configure settings.py, and write the storage pipeline.
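For orientation, these are the files touched below inside a standard Scrapy project layout (assuming the project is named jobSpider, which matches the imports used later):

```
jobSpider/
├── scrapy.cfg
└── jobSpider/
    ├── items.py       # item field definitions
    ├── pipelines.py   # MongoDB storage
    ├── settings.py    # headers, pipeline switch, DB config
    └── spiders/
        └── job.py     # the spider itself
```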
We start from this search-results page as `start_urls`. The spider (job.py) looks like this; the class wrapper, the `JobspiderItem` import, and the spider name `job` (matching the `scrapy crawl job` command at the end) are filled in so the methods are runnable:

```python
import scrapy

from jobSpider.items import JobspiderItem


class JobSpider(scrapy.Spider):
    name = 'job'
    start_urls = ['https://search.51job.com/list/020000,000000,0000,00,9,99,Python%2520%25E9%25AB%2598%25E7%25BA%25A7,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=']

    def parse(self, response):
        # Each posting on the results page sits in a div with class "el"
        selectors = response.xpath('//div[@class="el"]')
        for selector in selectors:
            url = selector.xpath('./p/span/a/@href').get(default='')
            if url:
                print(url)
                # Follow the link to the job's detail page
                yield scrapy.Request(url, callback=self.parseDetail)

    def parseDetail(self, response):
        corporation_name = response.xpath('//p[@class="cname"]/a/@title').get(default='')
        post_name = response.xpath('//div[@class="cn"]/h1/@title').get(default='')
        post_wage = response.xpath('//div[@class="cn"]/strong/text()').get(default='')
        # A plain dict would also work:
        # items = {
        #     'company': corporation_name,
        #     'post': post_name,
        #     'wage': post_wage
        # }
        items = JobspiderItem(name=corporation_name, post=post_name, wage=post_wage)
        print(items)
        yield items
```
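The start URL above only covers the first results page. If you want more, one approach is to generate `start_urls` for a range of pages. This is an untested sketch: it assumes the final number in the `...,2,1.html` segment of 51job's URL is the page index, and it trims the query string for brevity:

```python
# Hypothetical pagination sketch: assumes the trailing ",2,<page>.html"
# number in the 51job URL is the page index (query string shortened here).
base = ('https://search.51job.com/list/020000,000000,0000,00,9,99,'
        'Python%2520%25E9%25AB%2598%25E7%25BA%25A7,2,{page}.html'
        '?lang=c&stype=&postchannel=0000')

start_urls = [base.format(page=n) for n in range(1, 6)]  # pages 1-5
```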
Modify the request headers in settings.py to disguise the spider's requests as ordinary browser traffic (a sketch follows the items.py code below), and define the fields you want to collect in items.py:
```python
import scrapy


class JobspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    post = scrapy.Field()
    wage = scrapy.Field()
```
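As for the header disguise, a minimal sketch of the change in settings.py (the User-Agent string here is just an example; any current browser's UA works):

```python
# settings.py -- make requests look like they come from a regular browser
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36')

# Depending on the site you may also need to stop obeying robots.txt:
ROBOTSTXT_OBEY = False
```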
Since we want to store the data in a database, modify pipelines.py as follows:
```python
from pymongo import MongoClient as mc

from jobSpider.settings import *


class JobspiderPipeline(object):
    def __init__(self):
        # Connection details come from settings.py (see below)
        self.host = MONGO_HOST
        self.port = MONGO_PORT
        self.client = mc(self.host, self.port)
        self.db = self.client[MONGO_DB]
        self.collection = self.db[MONGO_COLLECTION]

    def process_item(self, item, spider):
        # scrapy Items are not dicts; convert before inserting
        if not isinstance(item, dict):
            item = dict(item)
        self.collection.insert_one(item)
        return item
```
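Once a crawl has run, a quick standalone pymongo check (connection details the same as in settings.py below) confirms the pipeline worked:

```python
from pymongo import MongoClient

client = MongoClient('127.0.0.1', 27017)
collection = client['Job']['job']

print(collection.count_documents({}))   # how many postings were stored
for doc in collection.find().limit(3):  # peek at a few records
    print(doc)
```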
**The following was added to settings.py, so that the database connection details (host, port, and so on) don't have to be edited inside pipelines.py every time:**
```python
# Enable an Item Pipeline component (the number is its order among
# pipelines; lower runs first, valid range 0-1000)
ITEM_PIPELINES = {
    'jobSpider.pipelines.JobspiderPipeline': 10,
}

# MongoDB connection details used by the pipeline
MONGO_HOST = '127.0.0.1'
MONGO_PORT = 27017
MONGO_DB = 'Job'
MONGO_COLLECTION = 'job'
```
With that, the scraped data ends up in MongoDB. Alternatively, you can export straight to a file with `scrapy crawl job -o job.csv` or `scrapy crawl job -o job.json`; with this second approach there is no need to touch pipelines.py at all — the spider (job.py) just has to yield the items.
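On Scrapy 2.1 and later, the same file export can also be configured once in settings.py via the FEEDS setting instead of passing `-o` on every run — a minimal sketch:

```python
# settings.py (Scrapy >= 2.1): export every yielded item to job.json
FEEDS = {
    'job.json': {'format': 'json', 'encoding': 'utf8'},
}
```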