Crawling Lianjia Residential Community Data with a Web Scraper

Lianjia community listings: https://m.lianjia.com/bj/xiaoqu/
Github: https://github.com/why19970628/Python_Crawler/tree/master/LianJia
Goal: tally the residential communities (小区) in each district of Beijing

1. Crawl the link for each district:

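The district links sit in the mobile site's filter list; the extraction can be sketched against a minimal HTML fragment (the fragment itself is invented for illustration, but the `level2 active` class name and the slicing match the full script further down):

```python
from lxml import etree

# Invented fragment mimicking the mobile site's district filter list;
# the class name "level2 active" matches the XPath in the full script.
sample = '''
<ul class="level2 active">
  <li><a href="/bj/xiaoqu/">不限</a></li>
  <li><a href="/bj/xiaoqu/dongcheng/">东城</a></li>
  <li><a href="/bj/xiaoqu/xicheng/">西城</a></li>
  <li><a href="/bj/xiaoqu/">更多</a></li>
</ul>
'''

html = etree.HTML(sample)
# Drop the first ("不限"/no limit) and last ("更多"/more) entries,
# keeping only the real districts
links = html.xpath('//ul[@class="level2 active"]/li/a/@href')[1:-1]
areas = html.xpath('//ul[@class="level2 active"]/li/a/text()')[1:-1]
print(list(zip(areas, links)))
```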

2. Crawl the links of the communities in each district:


3. Enter and crawl each detail page

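The detail page yields the name, address, and price fields. A sketch on an invented fragment; the class names (`xiaoqu_head_title lazyload_ulog`, `xiaoqu_basic`, `xiaoqu_price`) mirror the XPaths in the full script, while the content is made up:

```python
from lxml import etree

# Invented fragment mimicking a community detail page
sample = '''
<div class="xiaoqu_head_title lazyload_ulog">
  <h1>示例小区</h1>
  <p class="xiaoqu_basic"><span>朝阳 某某街道</span></p>
</div>
<div class="xiaoqu_price"><p><span>65000</span></p></div>
'''

html = etree.HTML(sample)
name = html.xpath('//div[@class="xiaoqu_head_title lazyload_ulog"]/h1/text()')[0]
address = html.xpath('//div[@class="xiaoqu_head_title lazyload_ulog"]'
                     '/p[@class="xiaoqu_basic"]/span/text()')[0]
price = html.xpath('//div[@class="xiaoqu_price"]/p/span/text()')[0]
print(name, address, price)
```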

4. Running the crawl

Crawling Lianjia is fairly slow, roughly one item per second; we can try multithreading or multiprocessing to speed it up.

  • Multithreading
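A minimal sketch of the multithreaded split: one thread per district, each running what would be the per-district page loop from `run()` below (here replaced by a stand-in function, with invented district data):

```python
import threading

results = []

def crawl_district(area, link):
    # Stand-in for the per-district page loop in run();
    # in the real crawler each thread fetches its own district's pages.
    # list.append is atomic under CPython's GIL, so no lock is needed here.
    results.append((area, link))

districts = [('东城', '/bj/xiaoqu/dongcheng/'), ('西城', '/bj/xiaoqu/xicheng/')]
threads = [threading.Thread(target=crawl_district, args=d) for d in districts]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Because the work is network-bound, threads help despite the GIL; for CPU-bound post-processing, `multiprocessing` would be the better fit.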

5. Save the results

from urllib import request
import time

import pandas
from lxml import etree

housedetail = []

def run():
    url = "https://m.lianjia.com/bj/xiaoqu/"
    headers = {
        'Referer': url,
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    }
    req = request.Request(url, headers=headers)
    html = etree.HTML(request.urlopen(req).read().decode('utf-8'))
    # District filter list; drop the first ("不限") and last entries
    links = html.xpath('//ul[@class="level2 active"]/li/a/@href')[1:-1]
    areas = html.xpath('//ul[@class="level2 active"]/li/a/text()')[1:-1]
    for area, link in zip(areas, links):
        for n in range(1, 3):
            start = time.time()
            print('Crawling ' + area + ', page ' + str(n) + ' ...')
            # Build the paginated URL from the district's own link, so each
            # area is actually crawled (not the same citywide pages each time)
            base = link if link.startswith('http') else 'https://m.lianjia.com' + link
            url = base + 'pg' + str(n) + '/'
            req = request.Request(url, headers=headers)
            html = etree.HTML(request.urlopen(req).read().decode('utf-8'))
            detail_links = html.xpath('//li[@class="pictext"]/a/@href')
            for i in detail_links:
                result = {'area': area, 'link': i}
                req = request.Request(i, headers=headers)
                html = etree.HTML(request.urlopen(req).read().decode('utf-8'))
                result['name'] = html.xpath(
                    '//div[@class="xiaoqu_head_title lazyload_ulog"]/h1/text()')[0]
                result['address'] = html.xpath(
                    '//div[@class="xiaoqu_head_title lazyload_ulog"]'
                    '/p[@class="xiaoqu_basic"]/span/text()')[0]
                result['price'] = html.xpath('//div[@class="xiaoqu_price"]/p/span/text()')[0]
                # Building info is split across two elements; join the halves
                a = html.xpath('//p[@class="text_cut"]/span[@class="sub_title"]/text()')[0]
                b = html.xpath('//p[@class="text_cut"]/em/text()')[0]
                result['jianzhu'] = str(a) + str(b)
                result['num'] = html.xpath(
                    '//div[@class="worth_card"]/div[@class="worth_guide"]/ul/li/text()')[3]
                housedetail.append(result)
                time.sleep(0.5)  # throttle requests to stay polite
            print('Finished ' + area + ' page ' + str(n) + ', took:', time.time() - start)
    df = pandas.DataFrame(housedetail)
    df.to_csv('housedata5.csv', index=False)
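With the CSV written, the per-district tally stated as the goal can be produced with a pandas `groupby`. A sketch on an invented in-memory sample rather than the real `housedata5.csv`:

```python
import pandas

# Invented sample rows with the same columns the crawler collects
rows = [
    {'area': '东城', 'name': '小区A', 'price': '90000'},
    {'area': '东城', 'name': '小区B', 'price': '85000'},
    {'area': '西城', 'name': '小区C', 'price': '95000'},
]
df = pandas.DataFrame(rows)

# Number of communities crawled per district
counts = df.groupby('area')['name'].count()
print(counts)
```

Against the real output file, replace the sample with `df = pandas.read_csv('housedata5.csv')`.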

Threading reference: https://blog.csdn.net/yexudengzhidao/article/details/86750810
