Target URL: http://bj.58.com/sale.shtml
Crawling task: scrape the product URLs from the first-level listing page, follow each one into its second-level detail page, extract the product information, and save the data.
First we need the product URLs from the listing page. The listings are rendered as `li` elements, so the front-end structure was worked out with the XPath Helper browser extension.
!!?? Problem: during testing only the first value could be extracted.
**Solution:** fetch the page with Selenium + Chrome and the full listing becomes available.
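A minimal sketch of that workaround, assuming chromedriver is installed and on the PATH (the XPath mirrors the one used in the spider below; headless mode is optional):

```python
# Hedged sketch: fetch the listing page with Selenium + Chrome and pull the product links.
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # optional: run without a visible browser window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://bj.58.com/sale.shtml")
    # Same XPath the spider uses; Selenium returns each href fully resolved by the browser.
    links = driver.find_elements(By.XPATH, "//div/ul/li/a")
    hrefs = [a.get_attribute("href") for a in links]
    print(len(hrefs), hrefs[:5])
finally:
    driver.quit()
```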
!!?? Problem: when following into the second-level pages, only one URL could be crawled.
Error log:

```
//bj.58.com/shouji/38674092781339x.shtml
2019-07-13 09:58:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bj.58.com/sale.shtml>
{'goods_url': '//bj.58.com/shouji/38674092781339x.shtml'}
2019-07-13 09:58:39 [scrapy.core.scraper] ERROR: Spider error processing (referer: None)
```

Solution:
- The URL is wrong: the extracted href is protocol-relative (`//bj.58.com/...`), so the scheme has to be prepended to build a complete address (see the sketch below).
- `allowed_domains = ['allowed domains']`: if a parsed URL's domain is not in this list, Scrapy will not send a request for that URL.
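A hedged sketch of the first fix, shown both as the plain string concatenation the spider below uses and with `response.urljoin()`, which resolves protocol-relative hrefs against the page URL inside a Scrapy callback:

```python
# Sketch of building a full URL from the protocol-relative href extracted above.
href = "//bj.58.com/shouji/38674092781339x.shtml"

# What the spider below does: prepend the scheme by hand.
goods_url = "https:" + href            # -> https://bj.58.com/shouji/38674092781339x.shtml

# Equivalent inside a Scrapy callback: let the response resolve it.
# goods_url = response.urljoin(href)   # same result
# yield scrapy.Request(url=goods_url, callback=self.goods_info)
```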
Two parsing approaches are used, bs4 (BeautifulSoup) and XPath. After extraction, watch out for string format conversion: some parsed values are lists and need to be turned into strings before saving.
spider.py is as follows:
```python
# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup

from city58_second_goods.items import City58SecondGoodsItem, GoodsItem


class SecondGoodsSpider(scrapy.Spider):
    name = 'second_goods'
    # allowed_domains = ['allowed domains']: if a parsed URL's domain is not
    # listed here, Scrapy will not send a request for that URL.
    allowed_domains = ['bj.58.com']
    start_urls = ['https://bj.58.com/sale.shtml']

    def parse(self, response):
        print("*" * 80)
        html = response.xpath('//div/ul/li/a/@href').extract()
        html.pop(0)
        # Collect the second-level (detail page) URLs
        item = City58SecondGoodsItem()
        # print("1", html)
        for i in html[:4]:
            item["goods_url"] = i
            # The extracted href is protocol-relative (//bj.58.com/...),
            # so prepend the scheme to build a complete URL.
            goods_url = "https:" + i
            print(goods_url, "jjjjjjjjjjjj")
            yield item
            yield scrapy.Request(url=goods_url, callback=self.goods_info)

    def goods_info(self, response):
        print("+" * 50)
        soup = BeautifulSoup(response.text, 'lxml')
        temp_title = soup.title.get_text()
        print("1" * 50, temp_title)
        title = temp_title.split(" - ")[0]
        try:
            temp_time = soup.select("div.detail-title__info > div")[0].get_text()
            # [0].get_text() extracts the text content of the first match
            time = temp_time.split(" ")[0]
            temp_price = soup.select("span.infocard__container__item__main__text--price")[0].get_text()
            price = temp_price.split()[0]
            # The info card layout shifts depending on whether a '成新'
            # (condition, "xx% new") row is present.
            temp = soup.select("div.infocard__container > div:nth-of-type(2) > div:nth-of-type(2)")[0].get_text()
            if '成新' in temp:
                color = temp
                temp_area = soup.select("div.infocard__container > div:nth-of-type(3) > div:nth-of-type(2)")[0]
            else:
                color = None
                temp_area = soup.select("div.infocard__container > div:nth-of-type(2) > div:nth-of-type(2)")[0]
            temp_area = list(temp_area.stripped_strings)
            # Drop the '-' separator entries, keep the real area names
            area = list(filter(lambda x: x.replace("-", ''), temp_area))
            temp_cate = list(soup.select("div.nav")[0].stripped_strings)
            # Drop the '>' breadcrumb separators, keep the category names
            cate = list(filter(lambda x: x.replace(">", ''), temp_cate))
            item = GoodsItem()
            item['goods_title'] = title
            item['goods_time'] = time
            item['goods_price'] = price
            item['goods_color'] = color
            item['goods_area'] = str(area)  # list fields are converted to strings before saving
            item['goods_cate'] = str(cate)
            yield item
        except Exception:
            # The selectors raise IndexError when the detail page has a different
            # layout (e.g. a removed listing / 404 page), so just skip it.
            print("Error 404!")
```
The data is saved through pipelines. I wrote four pipeline classes (MongoDB, MySQL, xlsx, CSV); pick whichever one you prefer~ Enable the ones you want in `ITEM_PIPELINES` in settings.py:

```python
ITEM_PIPELINES = {
    'city58_second_goods.pipelines.MongodbSecondGoodsPipeline': 400,  # save to MongoDB
    'city58_second_goods.pipelines.MysqlSecondGoodsPipeline': 400,    # save to MySQL
    'city58_second_goods.pipelines.XslxSecondGoodsPipeline': 400,     # save to xlsx
    'city58_second_goods.pipelines.CsvSecondGoodsPipeline': 400,      # save to CSV
}
```
pipelines.py is as follows:
```python
import json

import pymysql
from openpyxl import Workbook
from pymongo import MongoClient

from city58_second_goods.items import City58SecondGoodsItem, GoodsItem


class MongodbSecondGoodsPipeline(object):
    '''Save to MongoDB'''

    def __init__(self,
                 databaseIp='127.0.0.1',
                 databasePort=27017,
                 # user="mongo",
                 # password=None,  # no user/password configured
                 mongodbName='second_goods'):
        client = MongoClient(databaseIp, databasePort)
        self.db = client[mongodbName]
        # self.db.authenticate(user, password)

    def process_item(self, item, spider):
        if isinstance(item, GoodsItem):
            postItem = dict(item)  # convert the item into a plain dict
            self.db.scrapy.insert(postItem)  # insert one record (use insert_one() on pymongo 4+)
        return item
```
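A quick way to check what this pipeline wrote, assuming the defaults above (local MongoDB, database `second_goods`, collection `scrapy`):

```python
# Hedged verification sketch: count and peek at the documents the pipeline inserted.
from pymongo import MongoClient

db = MongoClient('127.0.0.1', 27017)['second_goods']
print(db.scrapy.count_documents({}))  # how many goods records were saved
print(db.scrapy.find_one())           # look at one of them
```

Continuing pipelines.py with the MySQL pipeline: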
```python
class MysqlSecondGoodsPipeline(object):
    '''Save to MySQL'''

    def __init__(self):
        dbparams = {
            'host': '127.0.0.1',
            'port': 3306,
            'user': 'root',
            'password': '123456',
            'database': 'second_goods',
            'charset': 'utf8'
        }
        self.conn = pymysql.connect(**dbparams)
        self.cursor = self.conn.cursor()
        self._sql = None

    def process_item(self, item, spider):
        self.cursor.execute("""
            insert into goods(goods_title, goods_time, goods_price, goods_color, goods_area, goods_cate)
            values(%s, %s, %s, %s, %s, %s)
        """, (
            item['goods_title'], item['goods_time'], item['goods_price'],
            item['goods_color'], item['goods_area'], item['goods_cate']))
        self.conn.commit()
        return item
```
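This pipeline assumes a `goods` table already exists in the `second_goods` database. A minimal setup sketch with pymysql; the column names come from the insert statement above, but the types and sizes are my assumptions:

```python
# Hedged setup sketch: create the database/table the MySQL pipeline writes to.
# Column names match the insert above; the VARCHAR sizes are assumptions.
import pymysql

conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       password='123456', charset='utf8')
with conn.cursor() as cur:
    cur.execute("CREATE DATABASE IF NOT EXISTS second_goods DEFAULT CHARSET utf8")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS second_goods.goods (
            id INT AUTO_INCREMENT PRIMARY KEY,
            goods_title VARCHAR(255),
            goods_time  VARCHAR(64),
            goods_price VARCHAR(64),
            goods_color VARCHAR(64),
            goods_area  VARCHAR(255),
            goods_cate  VARCHAR(255)
        ) DEFAULT CHARSET utf8
    """)
conn.commit()
conn.close()
```

Back to pipelines.py, the xlsx pipeline: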
```python
class XslxSecondGoodsPipeline(object):
    '''Save to xlsx'''

    def open_spider(self, spider):
        # Create the workbook when the spider starts
        self.wb = Workbook()
        self.ws = self.wb.active
        # Header row: title, time, price, color/condition, area, category
        self.ws.append(['标题', '时间', '价格', '颜色', '地区', '备注'])

    def process_item(self, item, spider):
        # Mind the order: the columns must match the header row above
        line = [item['goods_title'], item['goods_time'], item['goods_price'],
                item['goods_color'], item['goods_area'], item['goods_cate']]
        self.ws.append(line)
        return item

    def close_spider(self, spider):
        self.wb.save('goods.xlsx')
```
?? Problem: an error was raised when creating the xlsx sheet:

```
AttributeError: 'City58SecondGoodsPipeline' object has no attribute 'ws'
```

Solution: upgrading the package version did not help. Restructure the class so the workbook is created in open_spider and saved in close_spider, like this:

```python
class City58SecondGoodsPipeline(object):
    def open_spider(self, spider):
        ...  # create the workbook / worksheet here

    def process_item(self, item, spider):
        ...  # append one row per item here

    def close_spider(self, spider):
        self.wb.save('goods.xlsx')
```
?? Problem: some fields could not be written to the xlsx sheet:

```
raise ValueError("Cannot convert {0!r} to Excel".format(value))
ValueError: Cannot convert ['朝阳'] to Excel
```

Solution: xlsx cells can only hold values in string (scalar) form, and two of the fields are not strings.
Check a field's type with print(type(project)), then convert it to a string, for example with escape(project).encode('utf-8'); in this spider the two list fields are simply wrapped in str() (str(area), str(cate)) before being assigned to the item.
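A tiny sketch of the check and of the str() conversion actually used above:

```python
# Hedged sketch: lists cannot be written to an xlsx cell, strings can.
area = ['朝阳']
print(type(area))  # <class 'list'>  -> ws.append(...) would raise ValueError
print(str(area))   # "['朝阳']"      -> plain string, safe to write to a cell
```

The last pipeline writes the items out as JSON lines: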
```python
class CsvSecondGoodsPipeline(object):
    '''Save to CSV or JSON'''

    def process_item(self, item, spider):
        if isinstance(item, City58SecondGoodsItem):  # check which item type arrived
            json_str = json.dumps(dict(item), ensure_ascii=False)
            with open("url.csv", "a", encoding="utf-8") as f:
                f.write(json_str + '\n')
        elif isinstance(item, GoodsItem):
            json_str = json.dumps(dict(item), ensure_ascii=False)
            with open("goods.csv", "a", encoding="utf-8") as f:
                f.write(json_str + '\n')
        return item
```
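Since each line written above is really a JSON object, reading the results back is straightforward; a small sketch:

```python
# Hedged sketch: the .csv files above hold one JSON object per line,
# so read them back with json rather than the csv module.
import json

with open("goods.csv", encoding="utf-8") as f:
    goods = [json.loads(line) for line in f if line.strip()]
print(len(goods), goods[:2])
```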
That's it for today's hands-on spider project. There are plenty of other ways to implement the pieces above, and I won't walk through every one of them. The full source code is here:
https://github.com/NicolasAcci/Python-Spider/tree/master/city58_second_goods