Python Crawler Project: 58同城 (58.com) Second-Hand Goods Crawler

Python crawler in practice: second-hand goods on 58同城

Target URL: http://bj.58.com/sale.shtml

Task: crawl the product URLs from the first-level (listing) page, follow them into the second-level (detail) pages to scrape the product information, and save the data.

Step 1: Page Parsing

(Figure 1: structure of the first-level listing page)

First we need to extract the product URLs from the first-level page. The listings are rendered as li elements, and the front-end structure was worked out with the XPath Helper extension.

Problem: during testing, only the first value could be captured.

**Solution:** fetch the page with Selenium + Chrome instead (a minimal sketch follows).
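A minimal standalone sketch of fetching the listing page with Selenium + Chrome (assumes chromedriver is installed and on PATH; this is an illustration, not the project's actual code):

# minimal sketch: render the listing page in headless Chrome, then extract hrefs
from selenium import webdriver
from lxml import etree

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://bj.58.com/sale.shtml')

tree = etree.HTML(driver.page_source)
urls = tree.xpath('//div/ul/li/a/@href')  # same XPath used in the spider below
driver.quit()

print(urls[:5])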

Problem: when following into the second-level pages, only one URL could be crawled.

Error message:
//bj.58.com/shouji/38674092781339x.shtml
2019-07-13 09:58:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bj.58.com/sale.shtml>
{'goods_url': '//bj.58.com/shouji/38674092781339x.shtml'}
2019-07-13 09:58:39 [scrapy.core.scraper] ERROR: Spider error processing (referer: None)

Solution:
- The URL was incomplete; it has to be joined into a full address (a sketch using response.urljoin follows below).
- allowed_domains = ['allowed domains']: if a parsed URL's domain is not in this list, the request for it will not be sent.
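The spider below handles the joining by prepending "https:" to the protocol-relative href; an equivalent sketch using Scrapy's response.urljoin would look like this:

    # sketch: inside the spider's parse() method
    def parse(self, response):
        for href in response.xpath('//div/ul/li/a/@href').extract():
            goods_url = response.urljoin(href)  # '//bj.58.com/...' -> 'https://bj.58.com/...'
            yield scrapy.Request(url=goods_url, callback=self.goods_info)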

Step 2: Data Crawling

Two parsing approaches are used: bs4 (BeautifulSoup) and XPath. After extracting values, pay attention to converting them to strings.
The spider.py code is as follows:

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
from city58_second_goods.items import City58SecondGoodsItem, GoodsItem


class SecondGoodsSpider(scrapy.Spider):
    name = 'second_goods'
    # allowed_domains = ['allowed domains']: if a parsed URL's domain is not listed here, the request will not be sent
    allowed_domains = ['bj.58.com']
    start_urls = ['https://bj.58.com/sale.shtml']

    def parse(self, response):
        print("*" * 80)
        html = response.xpath('//div/ul/li/a/@href').extract()
        html.pop(0)
        # collect the second-level (detail page) URLs
        item = City58SecondGoodsItem()
        for i in html[:4]:  # only take the first few links while testing
            item["goods_url"] = i
            goods_url = "https:" + i  # the hrefs are protocol-relative, so prepend the scheme
            print("goods url:", goods_url)
            yield item
            yield scrapy.Request(url=goods_url, callback=self.goods_info)

    def goods_info(self, response):
        print("+" * 50)
        soup = BeautifulSoup(response.text, 'lxml')
        temp_title = soup.title.get_text()
        print("1" * 50, temp_title)
        title = temp_title.split(" - ")[0]
        try:
            temp_time = soup.select("div.detail-title__info > div")[0].get_text()
            # [0].get_text() extracts the text content
            time = temp_time.split(" ")[0]
            temp_price = soup.select("span.infocard__container__item__main__text--price")[0].get_text()
            price = temp_price.split()[0]
            temp = soup.select("div.infocard__container > div:nth-of-type(2) > div:nth-of-type(2)")[0].get_text()
            if '成新' in temp:
                color = temp
                temp_area = soup.select("div.infocard__container > div:nth-of-type(3) > div:nth-of-type(2)")[0]
            else:
                color = None
                temp_area = soup.select("div.infocard__container > div:nth-of-type(2) > div:nth-of-type(2)")[0]
            temp_area = list(temp_area.stripped_strings)
            area = list(filter(lambda x: x.replace("-", ''), temp_area))
            temp_cate = list(soup.select("div.nav")[0].stripped_strings)
            cate = list(filter(lambda x: x.replace(">", ''), temp_cate))
            item = GoodsItem()
            item['goods_title'] = title
            item['goods_time'] = time
            item['goods_price'] = price
            item['goods_color'] = color
            item['goods_area'] = str(area)
            item['goods_cate'] = str(cate)
            yield item
        except Exception as e:
            print("failed to parse goods page:", e)

Step 3: Saving the Data

The data is saved through item pipelines. Four pipeline implementations are provided (MongoDB, MySQL, xlsx, CSV); pick whichever you like. They are enabled in settings.py via ITEM_PIPELINES (every enabled pipeline receives every item; the number only sets the order in which they run):

ITEM_PIPELINES = {
    'city58_second_goods.pipelines.MongodbSecondGoodsPipeline': 400,  # save to MongoDB
    'city58_second_goods.pipelines.MysqlSecondGoodsPipeline': 400,    # save to MySQL
    'city58_second_goods.pipelines.XslxSecondGoodsPipeline': 400,     # save to xlsx
    'city58_second_goods.pipelines.CsvSecondGoodsPipeline': 400,      # save to CSV
}

pipelines.py (the imports used by the pipelines below):

import json

import pymysql
from pymongo import MongoClient
from openpyxl import Workbook

from city58_second_goods.items import City58SecondGoodsItem, GoodsItem

  • Save to MongoDB
class MongodbSecondGoodsPipeline(object):
    '''Save to MongoDB'''
    def __init__(self,
                 databaseIp='127.0.0.1',
                 databasePort=27017,
                 # user="mongo",
                 # password=None,  # no username/password configured
                 mongodbName='second_goods'):
        client = MongoClient(databaseIp, databasePort)
        self.db = client[mongodbName]
        # self.db.authenticate(user, password)

    def process_item(self, item, spider):
        if isinstance(item, GoodsItem):
            postItem = dict(item)  # convert the item to a plain dict
            self.db.scrapy.insert_one(postItem)  # insert one record (insert_one replaces insert(), which was removed in pymongo 4)
        return item 
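To confirm that records actually landed in MongoDB, a quick check from a Python shell (same connection parameters as the pipeline above; it writes into the scrapy collection) might look like:

from pymongo import MongoClient

client = MongoClient('127.0.0.1', 27017)
collection = client['second_goods'].scrapy
print(collection.count_documents({}))   # number of saved goods
for doc in collection.find().limit(3):  # peek at a few records
    print(doc)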
  • Save to MySQL
class MysqlSecondGoodsPipeline(object):
    '''Save to MySQL'''

    def __init__(self):
        dbparams = {
            'host': '127.0.0.1',
            'port': 3306,
            'user': 'root',
            'password': '123456',
            'database': 'second_goods',
            'charset': 'utf8'
        }
        self.conn = pymysql.connect(**dbparams)
        self.cursor = self.conn.cursor()
        self._sql = None

    def process_item(self, item, spider):
        if not isinstance(item, GoodsItem):  # only the detail-page items carry these fields
            return item
        self.cursor.execute("""
                insert into goods(goods_title,goods_time,goods_price,goods_color,goods_area,goods_cate) values(%s,%s,%s,%s,%s,%s)
                """, (
            item['goods_title'], item['goods_time'], item['goods_price'], item['goods_color'], item['goods_area'],
            item['goods_cate']))
        self.conn.commit()
        return item

The MySQL table structure I created is shown below:
(Figure 2: screenshot of the goods table structure in MySQL)
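Since the screenshot is not reproduced here, this is a rough sketch of creating an equivalent table from Python (the column types and sizes are assumptions; every field is stored as text, matching the str() conversions in the spider):

import pymysql

# hypothetical reconstruction of the goods table used by the MySQL pipeline above
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       password='123456', database='second_goods', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS goods (
            id INT AUTO_INCREMENT PRIMARY KEY,
            goods_title VARCHAR(255),
            goods_time  VARCHAR(64),
            goods_price VARCHAR(64),
            goods_color VARCHAR(64),
            goods_area  VARCHAR(255),
            goods_cate  VARCHAR(255)
        ) DEFAULT CHARSET = utf8
    """)
conn.commit()
conn.close()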

  • Save to xlsx
class XslxSecondGoodsPipeline(object):
    '''Save to xlsx'''

    def open_spider(self, spider):
        self.wb = Workbook()
        # create the workbook
        self.ws = self.wb.active
        # write the header row
        self.ws.append(['标题', '时间', '价格', '颜色', '地区', '备注'])

    def process_item(self, item, spider):
        if not isinstance(item, GoodsItem):  # skip the url-only items
            return item
        line = [item['goods_title'], item['goods_time'], item['goods_price'], item['goods_color'], item['goods_area'],
                item['goods_cate']]
        # note: the order of this list must match the header row above
        self.ws.append(line)
        return item

    def close_spider(self, spider):
        self.wb.save('goods.xlsx')

Problem: an error was raised when creating the xlsx workbook:

AttributeError: 'City58SecondGoodsPipeline' object has no attribute 'ws'

Solution: upgrading the package did not help. Restructure the pipeline class so the workbook is created in open_spider and saved in close_spider, as follows:

class City58SecondGoodsPipeline(object):
    def open_spider(self, spider):
        ...  # create the Workbook / worksheet here

    def process_item(self, item, spider):
        ...  # append rows here
        return item

    def close_spider(self, spider):
        self.wb.save('goods.xlsx')

Problem: some field values cannot be written to the xlsx sheet:

raise ValueError("Cannot convert {0!r} to Excel".format(value))
ValueError: Cannot convert ['朝阳'] to Excel

Solution: openpyxl can only write scalar values (strings, numbers, dates) into a cell, not Python lists, and two of the fields here are lists.
Check a field's type with print(type(...)) and convert it to a string before appending; in the spider above this is done with str(area) and str(cate).
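A minimal illustration of that check-and-convert step (the variable and field names match the spider above):

print(type(area))               # <class 'list'> -- openpyxl cannot write this into a cell
item['goods_area'] = str(area)  # store the list's string representation instead
item['goods_cate'] = str(cate)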

  • Save to CSV/JSON
class CsvSecondGoodsPipeline(object):
    '''Save to CSV or JSON'''

    def process_item(self, item, spider):
        if isinstance(item, City58SecondGoodsItem):  # check which item type this is
            json_str = json.dumps(dict(item), ensure_ascii=False)
            with open("url.csv", "a", encoding="utf-8") as f:
                f.write(json_str + '\n')
        elif isinstance(item, GoodsItem):
            json_str = json.dumps(dict(item), ensure_ascii=False)
            with open("goods.csv", "a", encoding="utf-8") as f:
                f.write(json_str + '\n')
        return item
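Note that this pipeline actually writes one JSON object per line (JSON Lines) rather than comma-separated columns, despite the .csv extension. A short sketch of reading goods.csv back:

import json

with open("goods.csv", encoding="utf-8") as f:
    goods = [json.loads(line) for line in f if line.strip()]

print(len(goods), "records")
print(goods[:1])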

That wraps up today's crawler project walkthrough. There are more ways parts of the code could be implemented, which I won't go through one by one. The full source code is here:
https://github.com/NicolasAcci/Python-Spider/tree/master/city58_second_goods
