Continuing from the previous chapter, let's look at the rental listings. First, load the data:
import pandas as pd
import numpy as np
import pymongo
client = pymongo.MongoClient("mongodb://xx:[email protected]:2018",connect=False)
db = client["test"]
table = db["zufang"]
df = pd.DataFrame(list(table.find()))
del df["_id"]
df.head()
del df["house_info"]
del df["house_tags"]
df.tail()
After deduplicating, it turns out the data contains a large number of duplicates:
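The deduplication step itself is not shown, so here is a minimal sketch of how the duplicate count could be checked with pandas. The toy DataFrame and its column names are made up for illustration; on the real data the same two calls would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the scraped listings; values and column names are illustrative only.
df_demo = pd.DataFrame({
    "title": ["A", "A", "B", "C", "C", "C"],
    "rent": [3000, 3000, 4500, 5200, 5200, 5200],
})

n_total = len(df_demo)
# drop_duplicates keeps the first occurrence of each fully identical row
df_unique = df_demo.drop_duplicates().reset_index(drop=True)
n_unique = len(df_unique)

print(f"{n_total} rows, {n_unique} unique, {n_total - n_unique} duplicates")
```

Comparing `len(df)` before and after `drop_duplicates()` is usually the quickest way to see how much of a crawl is redundant.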
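Most of the scraped data is unusable, which shows how important data cleaning is. Looking into the cause: once the page number goes past 100 or so, no new listings appear, so page 1000 and page 2000 return the same content.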
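A cap like this can be caught during the crawl rather than afterwards. The sketch below is one possible approach, not what the original crawler did: it hashes each page body and reports the first page whose content has already been seen. The helper name `first_repeated_page` and the simulated page bodies are hypothetical.

```python
import hashlib

def first_repeated_page(pages):
    """Return the 1-based number of the first page whose body was seen before, or None."""
    seen = set()
    for number, body in enumerate(pages, start=1):
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest in seen:
            return number
        seen.add(digest)
    return None

# Simulated responses: three distinct pages, then the site keeps repeating the last one.
fake_pages = ["page-1", "page-2", "page-3", "page-3", "page-3"]
repeat_at = first_repeated_page(fake_pages)
print(repeat_at)
```

In a real crawler the loop could stop requesting further pages as soon as a repeat is detected.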
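So we need to crawl again in a different way. I chose to crawl by subway line, 100 pages per line. This still does not capture every listing, but the result is reasonably complete.
Get the subway line names: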
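import requests
from bs4 import BeautifulSoup

url = 'https://m.fang.com/zf/bj/r9/?jhtype=zf'
response = requests.get(url)
# The page is GBK-encoded
soup = BeautifulSoup(response.content.decode("gbk"), "lxml")
all_dd = soup.find("section", id="railway_section").find_all("dd")
# Open the file once, skip the first <dd>, and write one line name per row.
# The with block closes the file, so no explicit f.close() is needed.
with open("subway.txt", "a", encoding="utf-8") as f:
    for dd in all_dd[1:]:
        f.write(dd.get_text() + "\n")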
Build all the URLs:
import urllib.parse

base_url = 'https://m.fang.com/zf/?purpose=%D7%A1%D5%AC&railway={}&jhtype=zf&renttype=cz&c=zf&a=ajaxGetList&city=bj&page={}'
with open("subway.txt", encoding="utf-8") as f:
    for line in f:
        # Strip the trailing newline first, otherwise it ends up in the URL as %0A
        subway = urllib.parse.quote(line.strip().encode('gbk', "ignore"))
        for i in range(1, 101):
            start_url = base_url.format(subway, i)
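The `railway` parameter is percent-encoded from GBK bytes rather than UTF-8, and each line read from `subway.txt` still carries its newline. The sketch below shows both points in isolation; the helper name `encode_gbk` and the sample line `"1号线"` are illustrative assumptions, not values taken from the site.

```python
import urllib.parse

def encode_gbk(text):
    """Percent-encode text from its GBK bytes, ignoring surrounding whitespace."""
    return urllib.parse.quote(text.strip().encode("gbk", "ignore"))

line = "1号线\n"  # a line as read from the file, newline included (sample value)

encoded = encode_gbk(line)
# Decoding with the same codec should give back the original name
decoded = urllib.parse.unquote(encoded, encoding="gbk")

# Without strip(), the newline is encoded too and would end up inside the URL
raw_encoded = urllib.parse.quote(line.encode("gbk", "ignore"))

print(encoded, decoded, raw_encoded.endswith("%0A"))
```

Round-tripping through `unquote(..., encoding="gbk")` is a cheap way to confirm that the encoded value is what the server expects before starting a long crawl.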
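Then comes a long wait...
This round yielded only about 18,000 listings, but at least there are no duplicates:
That will do for now, but it is still unsatisfying. So I analyzed the site again and found that every residential community has its own community ID, which means we can crawl by community:
Another long wait... In the meantime, let's process the 18,000 listings and build a cleaning template, so that it can be applied directly once the new data arrives.
First export the community details and get the coordinates...
Convert rent and floor area to floats: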
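# astype returns a new Series, so assign the result back.
# np.float was removed in NumPy 1.24; the builtin float does the same job.
df["floor_area"] = df["floor_area"].astype(float)
df["rent"] = df["rent"].astype(float)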
df.head()
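Scraped columns often contain values that cannot be parsed as numbers, in which case `astype(float)` raises an error. A more forgiving option is `pd.to_numeric` with `errors="coerce"`, sketched here on made-up data; whether the real data needs this is an assumption.

```python
import pandas as pd

# Made-up raw values, including two that are not numeric
raw = pd.DataFrame({
    "floor_area": ["89", "60.5", "n/a"],
    "rent": ["5200", "3800", ""],
})

clean = raw.copy()
for col in ["floor_area", "rent"]:
    # Unparseable values become NaN instead of raising
    clean[col] = pd.to_numeric(clean[col], errors="coerce")

# Drop rows where either value could not be parsed
clean = clean.dropna(subset=["floor_area", "rent"]).reset_index(drop=True)

print(clean.dtypes)
print(clean)
```

The trade-off is that bad rows are dropped silently, so it is worth comparing row counts before and after.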
df2= df.iloc[:,[3,6,8]]
df2.head()
# Name the count column "number", which is how the plotting code below refers to it
df1 = df2.groupby("location").count().reset_index().iloc[:,[0,1]].rename(columns={"floor_area":"number"})
df1.head()
df3 = df2.groupby("location").mean()
df3.head()
Merge:
dff = pd.merge(df1,df3,left_on="location",right_index=True)
dff.head()
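The count, mean, and merge steps can also be written as a single named aggregation (available in pandas 0.25 and later). This is an alternative sketch on made-up data, not the approach used above; the output column names mirror the ones the plotting code expects.

```python
import pandas as pd

# Made-up listings for two placeholder locations
listings = pd.DataFrame({
    "location": ["A", "A", "B"],
    "floor_area": [60.0, 80.0, 100.0],
    "rent": [4000.0, 6000.0, 9000.0],
})

summary = (
    listings.groupby("location")
    .agg(
        number=("rent", "count"),          # listings per location
        floor_area=("floor_area", "mean"), # average floor area
        rent=("rent", "mean"),             # average rent
    )
    .reset_index()
)
print(summary)
```

Doing it in one pass avoids the positional `iloc` selection and the later merge on mismatched keys.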
# Note: in plotly >= 4 this module moved to the separate chart_studio package (chart_studio.plotly)
import plotly.plotly as py
import plotly
import plotly.graph_objs as go

trace = go.Table(
    header=dict(values=dff.columns,
                fill=dict(color='#8B5A00'),
                align=['center'] * 5,
                font=dict(color='white', size=16),
                height=40),
    cells=dict(values=[dff.location, dff.number, dff.floor_area, dff.rent],
               fill=dict(color='#FAFAD2'),
               align=['center'] * 5,
               height=30))
data = [trace]
# Outside an IPython notebook, use plotly.offline.plot(data, filename="xx") instead
py.iplot(data, filename='house_table')
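The color scheme can be changed to taste:
Now plot floor area on the x-axis and rent on the y-axis, with the number of listings as the marker size: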
import plotly.graph_objs as go
import plotly.plotly as py
import numpy as np

trace1 = go.Scatter(
    x=dff.floor_area,
    y=dff.rent,
    mode='markers',
    marker=dict(
        size=dff.number,
        color="#1C86EE",
        sizemode="area"
    ),
    text=dff.location
)
data = [trace1]
py.iplot(data, filename='house_scatter')
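With `sizemode="area"` and raw counts as sizes, markers for busy locations can become very large. The Plotly bubble chart documentation suggests scaling them with `sizeref = 2 * max(sizes) / (desired_max_px ** 2)`. The sketch below only computes that value; the helper name `area_sizeref`, the sample counts, and the 40 px target are assumptions for illustration.

```python
def area_sizeref(sizes, max_marker_px=40):
    """sizeref for sizemode='area' so that the largest marker is about max_marker_px wide."""
    return 2.0 * max(sizes) / (max_marker_px ** 2)

# Hypothetical listing counts per location
counts = [5, 120, 800]
sizeref = area_sizeref(counts, max_marker_px=40)
print(sizeref)

# The value would then be passed alongside the sizes, for example:
# marker=dict(size=counts, sizemode="area", sizeref=sizeref, sizemin=4)
```

`sizemin` keeps locations with very few listings from shrinking to invisible dots.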
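A single color is a bit dull, so let's add some more:
In the chart above, rent and floor area do not show a positive correlation. That is because the locations differ. Let's pick a single location, Huilongguan (回龙观), and analyze it on its own: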
dfa = df2[df2["location"].isin(["回龙观"])]
dfa.head()
trace1 = go.Scatter(
    x=dfa.floor_area,
    y=dfa.rent,
    mode='markers',
    marker=dict(
        size=10,
        # color = "#1C86EE",
        color=np.random.randn(len(dfa)),
        colorscale='Viridis',
        sizemode="area"
    ),
    # text =dff.location
)
data = [trace1]
py.iplot(data, filename='house_scatter1')
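The effect described above, where mixing locations hides the relationship between floor area and rent, can also be checked numerically with a Pearson correlation. The numbers below are invented purely to illustrate the idea: two placeholder locations with very different price levels.

```python
import pandas as pd

# Invented data: within each location rent rises with floor area,
# but location X is far more expensive than location Y.
toy = pd.DataFrame({
    "location": ["X", "X", "X", "Y", "Y", "Y"],
    "floor_area": [50.0, 70.0, 90.0, 50.0, 70.0, 90.0],
    "rent": [8000.0, 10000.0, 12000.0, 2000.0, 3000.0, 4000.0],
})

# Correlation over all listings, ignoring location
overall = toy["floor_area"].corr(toy["rent"])

# Correlation computed separately inside each location
within = {
    name: group["floor_area"].corr(group["rent"])
    for name, group in toy.groupby("location")
}

print(round(overall, 2), within)
```

On the real data the same comparison could be made between `df2` as a whole and the Huilongguan subset `dfa`.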
# dfi is not defined in this section; it is presumably a summary table with `area` and `rent` columns
data = [go.Bar(
    x=dfi.rent,
    y=dfi.area,
    orientation='h',
    marker=dict(
        color='rgba(55, 128, 191, 0.7)',
        line=dict(
            color='rgba(55, 128, 191, 1.0)',
            width=2,
        )
    )
)]
py.iplot(data, filename='house-bar')
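Number of listings and rent by district at a glance:
Table and chart side by side:
That's it for this chapter. The next chapter will add the rental data to the map built earlier.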