Python web scraper: second-hand housing listings in Qingdao

I have recently been preparing a project on housing-price analysis, both to refresh the scraping skills I picked up earlier and to practice the Tableau charting techniques I have been learning. The project is for learning and exchange only and has no commercial purpose.
To make sure the data gives a faithful picture of regional prices, this project scrapes second-hand housing listings from Lianjia (链家), starting with the Qingdao area as an example.

Step 1: import the libraries and modules we need. This project uses the urllib library to fetch pages and XPath (via lxml's etree) to parse them; since I am used to working with data in DataFrame form, pandas is imported as well.
import urllib.request   # fetch the listing pages over HTTP
from lxml import etree  # parse the HTML and run XPath queries
import pandas as pd     # organize the scraped rows as a DataFrame
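Before launching the full 100-page loop below, it can be worth fetching a single page and checking that the XPath selector still matches anything, since Lianjia adjusts its markup from time to time. A minimal sanity check, reusing the same URL pattern and class name as the scraping code that follows:
# Fetch page 1 and count how many listing nodes the selector matches.
test_url = 'https://qd.lianjia.com/ershoufang/pg1'
test_html = urllib.request.urlopen(test_url).read().decode('utf-8', 'ignore')
test_selector = etree.HTML(test_html)
listings = test_selector.xpath('//li[@class="clear LOGCLICKDATA"]')
print(len(listings))  # a few dozen listings per page is expected; 0 means the markup changed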
Step 2: to make the later DataFrame conversion go smoothly, the parsing below is written out field by field, which is admittedly quite verbose; if you are not used to working with a DataFrame, any other data structure will do. Each XPath result is collapsed into a single string (with a '.' placeholder when nothing matches) so that every row ends up with the same 13 fields.
def join_text(nodes):
    # Join the matched text nodes into one string, or return a '.' placeholder
    # when an XPath query matched nothing, so every row has the same 13 fields.
    text = ' '.join(node.strip() for node in nodes if node.strip())
    return text if text else '.'

house_info = []
for page in range(1, 101):
    url = 'https://qd.lianjia.com/ershoufang/pg' + str(page)
    html = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
    selector = etree.HTML(html)
    page_info = selector.xpath('//li[@class="clear LOGCLICKDATA"]')
    print('Scraping page ' + str(page))
    for item in page_info:
        house_infor_one = []
        # Listing title and the source tag next to it
        title = item.xpath('div[@class="info clear"]/div[@class="title"]/a/text()')
        house_infor_one.append(join_text(title))
        way = item.xpath('div[@class="info clear"]/div[@class="title"]/span/text()')
        house_infor_one.append(join_text(way))
        # Road and community name
        road = item.xpath('div[@class="info clear"]/div[@class="flood"]/div/a/text()')
        house_infor_one.append(join_text(road))
        community = item.xpath('div[@class="info clear"]/div[@class="address"]/div/a/text()')
        house_infor_one.append(join_text(community))
        # Unit details (layout, area, orientation) and floor information
        house_des = item.xpath('div[@class="info clear"]/div[@class="address"]/div/text()')
        house_infor_one.append(join_text(house_des))
        floor = item.xpath('div[@class="info clear"]/div[@class="flood"]/div/text()')
        house_infor_one.append(join_text(floor))
        # Follow counts and the subway / tax-free / key-available tags
        popularity = item.xpath('div[@class="info clear"]/div[@class="followInfo"]/text()')
        house_infor_one.append(join_text(popularity))
        subway = item.xpath('div[@class="info clear"]/div[@class="tag"]/span[@class="subway"]/text()')
        house_infor_one.append(join_text(subway))
        taxfree = item.xpath('div[@class="info clear"]/div[@class="tag"]/span[@class="taxfree"]/text()')
        house_infor_one.append(join_text(taxfree))
        haskey = item.xpath('div[@class="info clear"]/div[@class="tag"]/span[@class="haskey"]/text()')
        house_infor_one.append(join_text(haskey))
        # Total price (value and unit) and price per square meter
        total_price = item.xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[1]/span/text()')
        house_infor_one.append(join_text(total_price))
        price_unit = item.xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[1]/text()')
        house_infor_one.append(join_text(price_unit))
        per_price = item.xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[2]/span/text()')
        house_infor_one.append(join_text(per_price))
        house_info.append(house_infor_one)
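The loop above sends bare urllib requests with no headers and no delay. If the site starts rejecting requests or returning empty pages, one common adjustment is to send a browser-like User-Agent and pause briefly between pages. This is only a sketch of that idea, not part of the original script; the header string and the one-second delay are assumptions:
import time

def fetch(url):
    # Hypothetical helper: wrap the request with a browser-like User-Agent
    # and sleep briefly so consecutive pages are not requested back to back.
    req = urllib.request.Request(
        url,
        headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    )
    html = urllib.request.urlopen(req).read().decode('utf-8', 'ignore')
    time.sleep(1)  # assumed one-second pause between pages
    return html
Swapping the html = urllib.request.urlopen(url)... line in the loop for html = fetch(url) is enough to switch over.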
Step 3: convert the collected rows into a DataFrame, name the columns, and save the result to a local file. With that, the scraping is done.
house_df = pd.DataFrame(house_info)
# One name per field appended above; the order must match the appends.
house_df.columns = ['Title', 'Source', 'Address (road)', 'Community',
                    'Unit details', 'Floor', 'Popularity', 'Near subway',
                    'Tax-free', 'Viewing (key)', 'Total price',
                    'Total price unit', 'Price per sqm']
house_df.to_excel('D:/Tsingtao.xls')
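Since the end goal is a price analysis in Tableau, the price columns usually need to be numeric. A small cleanup sketch, assuming the column names assigned above and that the per-square-meter field is text with the number embedded in it (these steps are my own suggestion, not part of the original workflow):
# Coerce the price fields to numbers; rows that only got the '.' placeholder become NaN.
house_df['Total price'] = pd.to_numeric(house_df['Total price'], errors='coerce')
house_df['Price per sqm'] = pd.to_numeric(
    house_df['Price per sqm'].astype(str).str.extract(r'(\d+\.?\d*)', expand=False),
    errors='coerce'
)
print(house_df[['Total price', 'Price per sqm']].describe())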
