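Python Practice Plan study notes: scraping rental listing information

Learning to scrape a rental website with Python, collecting each listing's rent, address, the landlord's nickname and gender, and the listing's images.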

My code:

import bs4
import requests
import time

heads = {
    "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}

house_list_urls = ["http://sh.xiaozhu.com/search-duanzufang-p{}-0/".format(str(i)) for i in range(1,12)]



def get_house_info(url):
    # Fetch one listing's detail page, then pause 2 seconds before the next request.
    response = requests.get(url,headers = heads)
    time.sleep(2)
    soup = bs4.BeautifulSoup(response.text,"lxml")

    # select() returns a list, so take the first match of each selector.
    title = soup.select('div.pho_info > h4 > em')[0].get_text()
    address = soup.select('div.pho_info > p')[0].get('title')
    price = soup.select('div.day_l > span')[0].get_text()
    avatar = soup.select('div.member_pic > a > img')[0].get('src')
    # get('class') returns a list of class names; the first one is used to tell the gender.
    sex = soup.select('div.member_pic > div')[0].get('class')[0]
    sex = "male" if sex == "member_ico" else "female"
    lord = soup.select("a.lorder_name")[0].get_text()

    print(title,address,price,avatar,sex,lord)

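def get_houses(url):
    # Fetch one page of search results and visit every listing linked from it.
    response = requests.get(url,headers = heads)
    soup = bs4.BeautifulSoup(response.text,'lxml')
    # Each listing image sits inside an <a> tag; the parent's href is the detail page URL.
    house_list = [i.parent.get('href') for i in soup.select('img.lodgeunitpic')]
    for i in house_list:
        get_house_info(i)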

for i in house_list_urls:
    get_houses(i)

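Summary:

  • select() returns a list, even when only one element matches.
  • The call is requests.get(url,headers = xxx); note that the keyword is headers, with an "s".
  • Calling .get("class") on a tag also returns a list.
  • When collecting listing links from the search results page, you can locate the img first and then use its parent attribute to reach the enclosing a tag.
  • In bs4.BeautifulSoup(response.text,'lxml'), do not forget the .text attribute.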
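
The points above can be checked offline with a small sketch. The HTML fragment, the example.com URLs, and the extra class name are made up for illustration; only the selectors come from the code above. It uses the built-in "html.parser" instead of "lxml" so that it runs without an extra install.

```python
import bs4

# Hypothetical HTML fragment that mimics the structure the selectors expect.
html = """
<ul>
  <li><a href="http://example.com/house/1.html"><img class="lodgeunitpic" src="1.jpg"></a></li>
  <li><a href="http://example.com/house/2.html"><img class="lodgeunitpic" src="2.jpg"></a></li>
</ul>
<div class="member_pic"><div class="member_ico extra"></div></div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# select() returns a list even though only one element matches here.
badges = soup.select("div.member_pic > div")

# get("class") on a tag returns a list with one entry per class name.
classes = badges[0].get("class")

# Locate each img first, then step up to its parent <a> tag to read href.
links = [img.parent.get("href") for img in soup.select("img.lodgeunitpic")]

# A selector with no match gives an empty list, so indexing [0] on it
# would raise IndexError; checking the length first avoids that.
missing = soup.select("div.no_such_class")

print(len(badges), classes, links, len(missing))
```

Because an unmatched selector returns an empty list, a scraper that indexes [0] directly will stop with IndexError as soon as one page lacks an element, so checking the list before indexing may be worth considering.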

Questions:

  • Why can't the scraped image links be opened? They are exactly the image links that appear in the page source.
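
The notes do not say why the links fail. One possible cause, offered purely as an assumption, is that the image server rejects requests that lack a Referer header. The sketch below is a way to test that hypothesis rather than a confirmed fix; the helper names image_request_headers and probe_image are hypothetical, and the shortened User-Agent string is a placeholder.

```python
import requests


def image_request_headers(page_url):
    """Build headers that make an image request look like it came from the listing page."""
    return {
        "User-Agent": "Mozilla/5.0",  # placeholder; the full string from heads can be used instead
        "Referer": page_url,          # assumption: the server may check this header
    }


def probe_image(img_url, page_url):
    """Request the image and report its status code and Content-Type (needs network access)."""
    response = requests.get(img_url, headers=image_request_headers(page_url), timeout=10)
    return response.status_code, response.headers.get("Content-Type")
```

Calling probe_image with a scraped image URL and the URL of the page it came from, and comparing the result with a plain requests.get on the same image URL, would show whether the Referer makes a difference. If both calls fail in the same way, the cause lies elsewhere.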
