Scraping Anjuke New-Home Listings with Python

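I have only just started learning web scraping with Python, so this is a simple crawler meant to show one way of approaching the problem.
My regular expressions are still clumsy, so I went straight to BeautifulSoup instead.
For learning BeautifulSoup I relied on http://cuiqingcai.com/1319.html, which summarizes it very clearly; thanks to the author.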
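How the crawler works:
1. Choose the URL to crawl (base_url in this post's code);
2. Set headers so the request looks like it comes from a browser;
3. Build the request with urllib2's Request class;
4. Send it with urllib2's urlopen function, which returns the server's response containing the HTML of the requested page;
5. Parse the page with BeautifulSoup, passing a parser such as 'lxml' explicitly, otherwise BeautifulSoup emits a warning;
6. The rest is ordinary Python code plus BeautifulSoup's methods for extracting data from the page; see the code for details.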
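Steps 1 to 3 can be tried on their own, without touching the network. The following is a minimal sketch, not the code used in this post: the helper name build_requests and the idea of generating the page suffixes from a range are assumptions made for illustration, and the import falls back to urllib.request so the snippet also runs on Python 3.

```python
try:
    from urllib2 import Request          # Python 2, as used in this post
except ImportError:
    from urllib.request import Request   # Python 3 home of the same class

BASE_URL = "http://ly.fang.anjuke.com/loupan/all/"
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/55.0.2859.0 Safari/537.36'}


def build_requests(first_page=1, last_page=6):
    """Hypothetical helper: one Request per listing page, nothing is sent yet."""
    result = []
    for n in range(first_page, last_page + 1):
        # Page suffixes follow the p1/, p2/, ... pattern used in this post
        url = '{0}p{1}/'.format(BASE_URL, n)
        result.append(Request(url, headers=HEADERS))
    return result


reqs = build_requests()
```

Creating a Request object does not contact the server; the request is only sent when it is passed to urlopen (step 4).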

# coding:utf-8
"""
Purpose: scrape new-home listings from Anjuke
"""
import urllib2
import bs4

pages = ['p1/', 'p2/', 'p3/', 'p4/', 'p5/', 'p6/']
base_url = "http://ly.fang.anjuke.com/loupan/all/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2859.0 Safari/537.36'}

# File that stores the scraped results (opened in append mode)
house_file = open('anjuke.txt', 'a+')

for obj in pages:
    url = base_url + obj
    print url
    # Build the request with the browser-like headers
    req = urllib2.Request(url=url, headers=headers)
    # Send the request and open the HTML page returned by the server
    response = urllib2.urlopen(req)
    # Parse the returned page with BeautifulSoup
    soup = bs4.BeautifulSoup(response, 'lxml')
    # Select the listing blocks on the page
    data_list = soup.select('.item-mod')
    # Extract the details of each listing in data_list
    for data in data_list:
        try:
            # Listing name
            item_name = data.select('div[class="infos"] > div[class="lp-name"] > h3 > a')
            if len(item_name) > 0:
                item = item_name[0].get_text()
                house_file.write(item.encode('utf-8') + ':')

            # Sales status
            rec_status = data.select('div[class="infos"] > div[class="lp-name"] > i')
            if len(rec_status) > 0:
                rec = rec_status[0].get_text()
                house_file.write(rec.encode('utf-8') + ',')

            # Address
            address_list = data.select('div[class="infos"] > p[class="address"] > a')
            if len(address_list) > 0:
                address = address_list[0].get_text()
                house_file.write(address.encode('utf-8') + ',')

            # Price
            price_list = data.select('div[class="favor-pos"] > p[class="price"] > span')
            if len(price_list) > 0:
                price = price_list[0].get_text()
                house_file.write(price.encode('utf-8') + '\n')

        except Exception as e:
            print e

# Close the output file so buffered writes reach the disk
house_file.close()
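
One thing to watch in the loop above: each field is written as soon as it is found, so a listing without a price never gets its trailing newline and the next listing runs onto the same line. A possible way around this is to collect all four fields first and write one complete line per listing. The sketch below assumes Python 3; first_text, format_record and FakeTag are hypothetical names introduced only for illustration, and FakeTag merely stands in for a BeautifulSoup tag so the snippet runs without any third-party package.

```python
def first_text(nodes):
    """Return the stripped text of the first node, or '' when nothing matched."""
    return nodes[0].get_text().strip() if nodes else ''


def format_record(name, status, address, price):
    """Build one complete output line laid out as name:status,address,price."""
    return '{0}:{1},{2},{3}\n'.format(name, status, address, price)


class FakeTag(object):
    """Stand-in for a BeautifulSoup tag, used only to demo the helpers."""
    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


# A listing whose status and price were not found on the page
line = format_record(
    first_text([FakeTag(' Sample Garden ')]),
    first_text([]),
    first_text([FakeTag('Sample Road 1')]),
    first_text([]),
)
```

In the crawler, each argument to first_text would be the list returned by the corresponding data.select(...) call, and line would be written to the file once per listing. Under Python 2 the format string would need to be a unicode literal for non-ASCII text.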
