Python Crawler Practice Notes 1: Hands-on Assignment

Scraping product listings

Since 58.com moved its second-hand goods to the new Zhuanzhuan platform, the scraping approach differs somewhat from the instructor's walkthrough:

  • On the new Zhuanzhuan platform, every listing is a Zhuanzhuan item
  • Personal and business sellers are no longer distinguished
  • The view count is loaded together with the page, not fetched via a separate request
  • The new detail page has no posting-time field, so that field is not scraped
#!/usr/bin/env python
# -*- coding: utf-8 -*-

#  works on python3.5 and python2.7
#  58 zhuanzhuan scraper

from bs4 import BeautifulSoup
import requests
import time


def geturls(urls):
    """Scrape every listing linked from each list page in `urls`."""
    for url in urls:
        webdata = requests.get(url)
        soup = BeautifulSoup(webdata.text, 'lxml')
        itemlist = soup.select('tr.zzinfo > td.t > a.t')
        # the last breadcrumb entry is the category of the current list page
        nav = getemtext(soup.select('div.nav a')[-1])
        for item in itemlist:
            itemurl = item.get('href')
            title = getemtext(item)
            get_target_info(itemurl, title, nav)
        time.sleep(1)  # be polite: pause between list pages

def getemtext(element):
    return element.get_text().strip()

def get_target_info(url, title='', nav=''):
    """Fetch one detail page and print its price, view count and location."""
    wbdata = requests.get(url)
    soup = BeautifulSoup(wbdata.text, 'lxml')
    #title = soup.select('div.box_left > div > div > h1')
    looktime = soup.select('span.look_time')[0]
    price = soup.select('span.price_now i')[0]
    place = soup.select('div.palce_li i')[0]  # 'palce_li' is the site's own class name
    data = {
        'title': title,
        'nav': nav,
        # strip(u'次浏览') removes the "... views" characters around the number
        'looktime': getemtext(looktime).strip(u'次浏览'),
        'price': getemtext(price),
        'place': getemtext(place)
    }
    #print(data)
    print(data['title'])
    print('price: ' + data['price'] + ', view: ' + data['looktime'] + ' times' + ', area: ' + data['place'])

if __name__ == "__main__":
    urls = ["http://bj.58.com/pbdn/0/pn{}/".format(pageid) for pageid in range(1, 14)]
    geturls(urls)
    #http://bj.58.com/tushu/pn2
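One subtlety in get_target_info() above: str.strip(u'次浏览') treats its argument as a set of characters to remove from both ends, not as a literal suffix. It happens to work for text like "2560次浏览" because the digits sit in the middle, but a regex states the intent more explicitly. A minimal sketch, using a hard-coded sample string in place of scraped text:

```python
import re

raw = u'2560次浏览'  # sample of the scraped view-count text

# strip() removes any of the three characters from both ends of the
# string; this works only because no digit is in that character set
views_by_strip = raw.strip(u'次浏览')

# the regex takes the leading run of digits directly
match = re.search(r'\d+', raw)
views_by_regex = match.group(0) if match else '0'

print(views_by_strip, views_by_regex)
```

Both approaches yield "2560" here, but the regex version keeps working if the surrounding text ever changes.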
Partial run results
微软平板SURFACE RT
price: 1500, view: 2560 times, area: 北京-丰台
三星超薄平板,
price: 1200, view: 801 times, area: 北京-通州
iPad1代
price: 680, view: 1333 times, area: 北京-朝阳
转让iPadmini2带发票和包装盒子16G配件齐全体大
price: 1512, view: 355 times, area: 北京-海淀
95成新16G IPAD4(the new ipad) 第一代高清屏的ipad,现使用无卡顿...
price: 1299, view: 1998 times, area: 北京-通州
全新ipad 没有注册的 零磨损 看图吧
price: 1599, view: 1400 times, area: 北京-大兴
苹果iPad4代贱卖
price: 1200, view: 114 times, area: 北京-顺义
Summary
  • The category and title are taken from the list page and passed to get_target_info() as arguments, which saves an extra round of extraction on the detail page
  • When printing the results, print(data) renders Chinese as unicode escapes, while print(data['title']) displays the characters correctly
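The unicode-escape issue in the second point can also be worked around by serializing the dict with json.dumps(..., ensure_ascii=False), which keeps Chinese characters readable when printing the whole record. A small sketch with hard-coded sample data:

```python
import json

data = {'title': u'微软平板SURFACE RT', 'price': '1500'}

# ensure_ascii=False emits non-ASCII characters as-is
# instead of \uXXXX escapes
print(json.dumps(data, ensure_ascii=False))
```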