Occupancy Detection, data from UCI
Determine whether a room is occupied from measurements of indoor light, temperature, humidity, and CO2.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
date: time, year-month-day hour:minute:second
Temperature: in Celsius
Relative Humidity: in %
Light: in Lux
CO2: in ppm
Humidity Ratio: derived quantity from temperature and relative humidity, in kgwater-vapor/kg-air
Occupancy: 0 or 1, 0 for not occupied, 1 for occupied status
Load the data
In [11]:
train = pd.read_csv("E:/figure/occupancy_data/datatraining.txt")
test = pd.read_csv("E:/figure/occupancy_data/datatest.txt")
train.head()
Out[11]:
Explore the data
In [4]:
train.info()
Int64Index: 8143 entries, 1 to 8143
Data columns (total 7 columns):
date 8143 non-null object
Temperature 8143 non-null float64
Humidity 8143 non-null float64
Light 8143 non-null float64
CO2 8143 non-null float64
HumidityRatio 8143 non-null float64
Occupancy 8143 non-null int64
dtypes: float64(5), int64(1), object(1)
memory usage: 477.1+ KB
There are 8143 rows in total. Next, check whether any values are missing (NA).
In [5]:
train.isnull().sum()
Out[5]:
date 0
Temperature 0
Humidity 0
Light 0
CO2 0
HumidityRatio 0
Occupancy 0
dtype: int64
In [6]:
# Summary statistics, to check for abnormal values
train.describe()
Out[6]:
In [7]:
# Feature correlations (date is dropped because it is not numeric)
train.drop("date", axis=1).corr()
Out[7]:
HumidityRatio and Humidity have a correlation coefficient above 0.95.
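Reading a correlation matrix by eye gets harder as the number of features grows. The following is a minimal sketch of listing strongly correlated column pairs programmatically. The data is synthetic, and the helper name `correlated_pairs` and the 0.95 threshold are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sensor table (values are made up for illustration).
rng = np.random.default_rng(0)
humidity = rng.uniform(20, 40, size=200)
demo = pd.DataFrame({
    "Humidity": humidity,
    # Nearly proportional to Humidity, so the two should correlate strongly.
    "HumidityRatio": humidity * 1.5e-4 + rng.normal(0, 1e-5, size=200),
    "Light": rng.uniform(0, 500, size=200),
})


def correlated_pairs(df, threshold=0.95):
    """Return (col_a, col_b, r) for column pairs with |r| above threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


pairs = correlated_pairs(demo)
print(pairs)
```

On the real data the same helper could be applied to `train.drop("date", axis=1)`.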
In [8]:
# Convert date to datetime format
from datetime import datetime
b = []
for i in train["date"]:
    b.append(datetime.strptime(i, "%Y-%m-%d %H:%M:%S"))
train["date"] = np.array(b)
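The same conversion can probably be done without an explicit loop. This is a small sketch using `pd.to_datetime` with the same format string; the two timestamps in `demo` are made-up sample values, not rows copied from the dataset.

```python
import pandas as pd

# Made-up sample timestamps in the same layout as the date column.
demo = pd.DataFrame({"date": ["2015-02-04 17:51:00", "2015-02-04 17:52:00"]})

# Vectorized parsing: the whole column is converted in one call.
demo["date"] = pd.to_datetime(demo["date"], format="%Y-%m-%d %H:%M:%S")
print(demo["date"].dtype)
```

Passing an explicit `format` avoids format guessing and raises an error if a value does not match the expected layout.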
Plot each variable against Occupancy
In [72]:
plt.style.use("classic")
ax1= plt.subplot2grid((4,1),(0,0),rowspan=1,colspan=1)
ax2= plt.subplot2grid((4,1),(1,0),rowspan=1,colspan=1,sharex=ax1)
ax3= plt.subplot2grid((4,1),(2,0),rowspan=1,colspan=1,sharex=ax1)
ax4= plt.subplot2grid((4,1),(3,0),rowspan=1,colspan=1,sharex=ax1)
ax1.plot(train["date"],train["Temperature"])
ax1.set_ylabel("Temperature")
ax5=ax1.twinx()
ax5.plot(train["date"],train["Occupancy"],color="g")
ax2.plot(train["date"],train["Humidity"])
ax2.set_ylabel("Humidity")
ax6=ax2.twinx()
ax6.plot(train["date"],train["Occupancy"],color="g")
ax3.plot(train["date"],train["Light"])
ax3.set_ylabel("Light")
ax7=ax3.twinx()
ax7.plot(train["date"],train["Occupancy"],color="g")
ax4.plot(train["date"],train["CO2"])
ax4.set_ylabel("CO2")
ax8=ax4.twinx()
ax8.plot(train["date"],train["Occupancy"],color="g")
Out[72]:
Compare Humidity and HumidityRatio over time
In [78]:
plt.style.use("ggplot")
fig=plt.figure()
ax1 = fig.add_subplot(111)
ax1.plot(train["date"],train["Humidity"])
ax2=ax1.twinx()
ax2.plot(train["date"],train["HumidityRatio"],color="b")
Out[78]:
Modeling
In [22]:
x_train = train.drop(["Occupancy","date"],axis=1)
y_train = train["Occupancy"]
x_test = test.drop(["Occupancy","date"],axis=1)
y_test = test["Occupancy"]
In [24]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_curve
Standardize the features
In [25]:
from sklearn.preprocessing import StandardScaler
std = StandardScaler()
std.fit(x_train)
x_train_std = std.transform(x_train)
x_test_std = std.transform(x_test)
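To make the scaling step concrete, here is a small sketch of what `StandardScaler` computes: the mean and standard deviation are estimated on the training rows only and then reused for the test rows. The arrays `x_tr` and `x_te` are made-up toy values, not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up): two features on very different scales.
x_tr = np.array([[20.0, 400.0], [22.0, 600.0], [24.0, 800.0]])
x_te = np.array([[21.0, 500.0]])

scaler = StandardScaler().fit(x_tr)

# Manual equivalent: statistics come from the training data only.
mean = x_tr.mean(axis=0)
std = x_tr.std(axis=0)  # population std (ddof=0), which StandardScaler uses
manual = (x_te - mean) / std

print(scaler.transform(x_te))
print(manual)
```

Fitting on the training set and only transforming the test set keeps test-set statistics out of the model.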
In [31]:
# Logistic regression model
logreg =LogisticRegression()
logreg.fit(x_train_std,y_train)
y_predlog = logreg.predict(x_test_std)
logreg.score(x_train_std,y_train)
Out[31]:
0.9860002456097261
Model evaluation
In [28]:
# F1 score on the test set
f1_score(y_test,y_predlog)
Out[28]:
0.9709418837675351
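For reference, the F1 score is the harmonic mean of precision and recall. The sketch below checks that relationship on a few made-up labels (`y_true` and `y_pred` here are illustrative, not the notebook's predictions).

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 4 true positives, 1 false negative, 1 false positive.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

p = precision_score(y_true, y_pred)  # TP / (TP + FP)
r = recall_score(y_true, y_pred)     # TP / (TP + FN)
manual_f1 = 2 * p * r / (p + r)      # harmonic mean of precision and recall

print(p, r, manual_f1, f1_score(y_true, y_pred))
```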
Plot the precision-recall (PR) curve
In [32]:
prob = logreg.predict_proba(x_test_std)
precision, recall, thresholds = precision_recall_curve(y_test,prob[:,1])
plt.plot(precision,recall)
plt.xlabel("precision")
plt.ylabel("recall")
Out[32]:
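To see what `precision_recall_curve` returns, a toy example can help. The labels and scores below are made up; the point is only that each threshold yields one (precision, recall) pair, and that the returned precision and recall arrays have one more entry than `thresholds`.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# zip stops at the shortest array, so the final (precision=1, recall=0)
# end point, which has no threshold, is not printed here.
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```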
Generate the classification report
In [33]:
from sklearn.metrics import classification_report
print(classification_report(y_test,logreg.predict(x_test_std),target_names=["unoccupancy","occupancy"]))
             precision    recall  f1-score   support
unoccupancy       1.00      0.97      0.98      1693
  occupancy       0.95      0.99      0.97       972
avg / total       0.98      0.98      0.98      2665
In [34]:
# SVM model
svc =SVC()
svc.fit(x_train_std,y_train)
y_predsvc=svc.predict(x_test_std)
svc.score(x_train_std,y_train)
Out[34]:
0.98882475746039544
In [35]:
f1_score(y_test,y_predsvc)
Out[35]:
0.96035678889990084
In [39]:
# Random forest model
rf =RandomForestClassifier()
rf.fit(x_train_std,y_train)
y_predrf = rf.predict(x_test_std)
rf.score(x_train_std,y_train)
Out[39]:
0.99987719513692741
In [44]:
f1_score(y_test, y_predrf)
Out[44]:
0.90160427807486643
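The random forest reaches 99.99% accuracy on the training set, but its test-set F1 score is only 0.90, which suggests overfitting.
Comparing the three models, logistic regression is chosen because it has the highest test-set F1 score (0.97, versus 0.96 for the SVM and 0.90 for the random forest).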
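The comparison above can be sketched as a single loop that records training accuracy and test-set F1 for each model. This is an illustrative sketch only: it uses synthetic data from `make_classification` and a random split via `train_test_split` (the notebook instead uses separate training and test files), so the numbers it prints will not match the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (assumption: stands in for the sensor features).
X, y = make_classification(n_samples=600, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_tr)
X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)

models = {
    "logistic regression": LogisticRegression(),
    "SVM": SVC(),
    "random forest": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr_std, y_tr)
    train_acc = model.score(X_tr_std, y_tr)
    test_f1 = f1_score(y_te, model.predict(X_te_std))
    results[name] = (train_acc, test_f1)
    print(f"{name}: train accuracy={train_acc:.3f}, test F1={test_f1:.3f}")

# Pick the model with the highest test-set F1 score.
best = max(results, key=lambda n: results[n][1])
print("best by test F1:", best)
```

A large gap between training accuracy and test F1 for one model, as seen with the random forest above, is the pattern this table is meant to expose.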