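Having already scraped the major Chinese e-commerce sites (Taobao, JD.com, and Pinduoduo), today we turn to Amazon, a platform that matters a great deal for cross-border e-commerce.
Amazon's full product directory is at: https://www.amazon.cn/gp/site-directory/ref=nav_deepshopall_variant_fullstore_l1 . From this page we can get the category information we need.
As before, the categories come in three levels (top-level, mid-level, and subcategory). We drill down level by level until we reach the product listings of each subcategory.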
This crawler uses the Scrapy framework. The item defines the following fields to scrape:
import scrapy


class AmazonItem(scrapy.Item):
    # define the fields for your item here like:
    collection = 'amazon'  # MongoDB collection name
    b_cate = scrapy.Field()  # top-level category
    m_cate = scrapy.Field()  # mid-level category
    s_cate = scrapy.Field()  # subcategory
    s_href = scrapy.Field()  # subcategory URL
    name = scrapy.Field()  # product name
    goods_url = scrapy.Field()  # product URL
    brand = scrapy.Field()  # product brand
    price = scrapy.Field()  # product price
    freight = scrapy.Field()  # shipping fee
    grade = scrapy.Field()  # rating (out of 5)
    comment_count = scrapy.Field()  # number of reviews
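Other information could be extracted as well, such as the full review content. I will cover that later if there is demand.
The main crawling logic lives in the spider.
One thing to watch: because the categories are nested, the item must be deep-copied before each follow-up request is issued, otherwise the category fields of different requests get mixed up.
meta={"item": deepcopy(item)} passes a copy of the item, with all fields filled in so far, to the callback through meta.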
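To see why the deep copy matters, here is a minimal sketch in plain Python, with no Scrapy involved. It assumes a simplified model of the scheduler: requests are queued while the loop runs, and their callbacks only read `meta` afterwards. The `schedule` function and the category names are made up for illustration.

```python
from copy import deepcopy


def schedule(categories, copy_item):
    """Simulate a parse() loop that queues one request per subcategory."""
    item = {}    # a single item object reused across the loop, as in a spider
    queued = []  # stands in for the scheduler: callbacks run after the loop
    for big, smalls in categories.items():
        item["b_cate"] = big
        for small in smalls:
            item["s_cate"] = small
            # Either hand over a snapshot or the shared, still-mutating object
            meta_item = deepcopy(item) if copy_item else item
            queued.append({"item": meta_item})
    # The simulated callbacks read meta only once the loop has finished
    return [(m["item"]["b_cate"], m["item"]["s_cate"]) for m in queued]


# Hypothetical category tree, not taken from the real site
categories = {"Books": ["Fiction", "Science"], "Toys": ["Puzzles"]}

shared = schedule(categories, copy_item=False)
copied = schedule(categories, copy_item=True)
print(shared)  # every entry carries the last category written
print(copied)  # every entry keeps the category it was queued with
```

Without the copy, all three queued requests point at the same dictionary, so every callback sees the values from the final loop iteration.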
# -*- coding: utf-8 -*-
import scrapy
from amazon.items import AmazonItem
from copy import deepcopy


class YmxSpider(scrapy.Spider):
    name = 'ymx'
    allowed_domains = ['amazon.cn']
    start_urls = ['https://www.amazon.cn/gp/site-directory/ref=nav_deepshopall_variant_fullstore_l1']

    def parse(self, response):
        # print(response.text)
        item = AmazonItem()
        # Crawl only the 7th category block; remove [7] to crawl all of them
        div_list = response.xpath("//div[contains(@class,'a-spacing-top-medium')][7]")
        # Top-level categories
        for div in div_list:
            # extract() returns a list of every matching text node
            item['b_cate'] = div.xpath(".//span[contains(@class,'sd-fontSizeL1')]/a/text()").extract()
            # print(item['b_cate'])
            m_list = div.xpath(".//div[contains(@class,'sd-columnSize')]")
            # Mid-level categories
            for m in m_list:
                item['m_cate'] = m.xpath(".//span[@class='sd-fontSizeL2 a-text-bold']/a/text()").extract_first()
                # print(item['m_cate'])
                ul_list = m.xpath(".//div[@class='a-row']/ul//span[@class='sd-fontSizeL2']")
                # Subcategories
                for ul in ul_list:
                    item['s_cate'] = ul.xpath("./a/text()").extract_first()
                    s_href = ul.xpath("./a/@href").extract_first()
                    # print(item['s_cate'])
                    # Check for a missing link before building the absolute URL
                    if s_href is not None:
                        item['s_href'] = response.urljoin(s_href)
                        yield scrapy.Request(
                            item['s_href'],
                            callback=self.parse_detail,
                            meta={"item": deepcopy(item)}
                        )

    def parse_detail(self, response):
        base_item = response.meta["item"]
        # Match either results container; both ids must be compared explicitly
        li_list = response.xpath("//div[@id='mainResults' or @id='atfResults']/ul/li")
        for li in li_list:
            # Copy per product so the yielded items do not share state
            item = deepcopy(base_item)
            item["name"] = li.xpath(".//a[contains(@class,'s-access-detail-page')]/@title").extract_first()
            item["goods_url"] = li.xpath(".//a[contains(@class,'s-access-detail-page')]/@href").extract_first()
            item["brand"] = li.xpath(".//span[contains(@class,'a-size-small')][2]/text()").extract_first()
            item["price"] = li.xpath(".//span[contains(@class,'s-price')]/text()").extract_first()
            freight = li.xpath(".//div[contains(@class,'a-spacing-mini')][2]/div/span[2]/text()").extract_first()
            item["freight"] = freight if freight is not None else 'free shipping'
            grade = li.xpath(".//span[contains(@class,'a-declarative')]//span/text()").extract_first()
            item["grade"] = grade if grade is not None else 'no rating yet'
            comment_count = li.xpath(".//span[contains(@class,'a-declarative')]/../../a[contains(@class,'a-size-small')]/text()").extract_first()
            item["comment_count"] = comment_count if comment_count is not None else 'no reviews yet'
            # print(item)
            yield item
        # Next page
        next_url = response.xpath("//a[@id='pagnNextLink']/@href").extract_first()
        if next_url is not None:
            yield scrapy.Request(
                response.urljoin(next_url),
                callback=self.parse_detail,
                meta={"item": base_item}
            )
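The spider turns the relative links found on each page into absolute URLs before requesting them. As a rough illustration of that step outside Scrapy, the sketch below uses the standard library's `urljoin` and guards against a missing link. The `absolute` helper and the sample hrefs are hypothetical, chosen only to show the behaviour.

```python
from urllib.parse import urljoin

# The directory page used as the crawl's starting point
BASE = "https://www.amazon.cn/gp/site-directory/ref=nav_deepshopall_variant_fullstore_l1"


def absolute(href, base=BASE):
    """Return an absolute URL, or None when the page had no link."""
    if href is None:
        return None
    return urljoin(base, href)


# Hypothetical hrefs, not copied from the real site
relative = absolute("/s/ref=example?node=1")
already_absolute = absolute("https://www.amazon.cn/dp/EXAMPLE")
missing = absolute(None)

print(relative)
print(already_absolute)
print(missing)
```

Joining against the page URL handles root-relative and already-absolute links alike, whereas concatenating a fixed prefix with a missing link (`None`) would raise a `TypeError`.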
The pipeline handles saving the items returned by the spider. Here the data is stored in MongoDB.
import pymongo


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from the project's settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.collection
        # self.db[name].insert_one(dict(item))
        # Upsert keyed on the product URL, so re-crawled products are
        # overwritten instead of duplicated. replace_one is used because
        # Collection.update() was removed in PyMongo 4.
        self.db[name].replace_one({'goods_url': item['goods_url']}, dict(item), upsert=True)
        return item

    def close_spider(self, spider):
        self.client.close()
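The pipeline reads `MONGO_URI` and `MONGO_DB` from the project settings, so it only takes effect once those are defined and the pipeline is enabled. A possible `settings.py` fragment might look like the following. The module path assumes the pipeline lives in `amazon/pipelines.py` (inferred from the `amazon.items` import in the spider), and the URI, database name, and priority `300` are placeholder assumptions for a local MongoDB instance.

```python
# settings.py (fragment): every value below is an assumption, adjust as needed

# Enable the MongoDB pipeline; lower numbers run earlier (valid range 0-1000)
ITEM_PIPELINES = {
    "amazon.pipelines.MongoPipeline": 300,
}

# Connection details read by MongoPipeline.from_crawler()
MONGO_URI = "mongodb://localhost:27017"  # assumed local instance
MONGO_DB = "amazon"                      # assumed database name
```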
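Run start.py and the crawler starts up, saving what it scrapes to MongoDB.
The fields saved in MongoDB are the ones defined in the item above. Note that the crawl was not run to completion here.
The full reference code is on GitHub; see the link at the end of the blog.
If you have any questions, please leave a comment. Thanks!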