Modeling Occupancy Detection from Sensor Data

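Occupancy Detection, using the dataset from the UCI Machine Learning Repository.
The goal is to determine whether a room is occupied from indoor light, temperature, humidity, and CO2 measurements.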

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

Attribute information:
date: time, year-month-day hour:minute:second
Temperature: in Celsius
Relative Humidity: in %
Light: in Lux
CO2: in ppm
Humidity Ratio: derived quantity from temperature and relative humidity, in kg water-vapor/kg-air
Occupancy: 0 or 1, 0 for not occupied, 1 for occupied status

Load the data

In [11]:

train = pd.read_csv("E:/figure/occupancy_data/datatraining.txt")
test = pd.read_csv("E:/figure/occupancy_data/datatest.txt")
train.head()

Out[11]:

train.png

Explore the data

In [4]:

train.info()


Int64Index: 8143 entries, 1 to 8143
Data columns (total 7 columns):
date 8143 non-null object
Temperature 8143 non-null float64
Humidity 8143 non-null float64
Light 8143 non-null float64
CO2 8143 non-null float64
HumidityRatio 8143 non-null float64
Occupancy 8143 non-null int64
dtypes: float64(5), int64(1), object(1)
memory usage: 477.1+ KB

There are 8143 rows in total. Next, check whether there are any NA values.

In [5]:

train.isnull().sum()

Out[5]:

date 0
Temperature 0
Humidity 0
Light 0
CO2 0
HumidityRatio 0
Occupancy 0
dtype: int64

In [6]:

# Look at the summary statistics to spot any abnormal values
train.describe()

Out[6]:

describe.png

In [7]:
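Feature correlations

# Drop the non-numeric date column so corr() also works on recent pandas versions
train.drop("date", axis=1).corr()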

Out[7]:

correlation.png

The correlation between HumidityRatio and Humidity is above 0.95.
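Reading the correlation table by eye works for seven columns, but the same check can be automated. The sketch below is only an illustration: the helper name `high_corr_pairs`, the 0.95 threshold default, and the synthetic stand-in data are assumptions, not part of the original notebook.

```python
import numpy as np

def high_corr_pairs(names, X, threshold=0.95):
    """Return (name_a, name_b, r) for column pairs whose |Pearson r| exceeds threshold."""
    corr = np.corrcoef(X, rowvar=False)  # rowvar=False: each column is one variable
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Synthetic stand-in data (NOT the UCI measurements): the ratio column is
# built as a nearly linear function of humidity, temperature is independent.
rng = np.random.default_rng(0)
humidity = rng.normal(25.0, 5.0, size=200)
humidity_ratio = 0.00015 * humidity + rng.normal(0.0, 0.00001, size=200)
temperature = rng.normal(21.0, 1.0, size=200)

X = np.column_stack([temperature, humidity, humidity_ratio])
pairs = high_corr_pairs(["Temperature", "Humidity", "HumidityRatio"], X)
print(pairs)
```

With the notebook's data, the same helper could presumably be called on the numeric columns of `train` after converting them with `to_numpy()`.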

In [8]:

Convert date to datetime format

# Parse the date strings into pandas datetime values
train["date"] = pd.to_datetime(train["date"], format="%Y-%m-%d %H:%M:%S")

Plot each variable against Occupancy

In [72]:

plt.style.use("classic")
# Four stacked panels sharing the same time axis
ax1 = plt.subplot2grid((4, 1), (0, 0), rowspan=1, colspan=1)
ax2 = plt.subplot2grid((4, 1), (1, 0), rowspan=1, colspan=1, sharex=ax1)
ax3 = plt.subplot2grid((4, 1), (2, 0), rowspan=1, colspan=1, sharex=ax1)
ax4 = plt.subplot2grid((4, 1), (3, 0), rowspan=1, colspan=1, sharex=ax1)

# In each panel, Occupancy is overlaid in green on a secondary y-axis
ax1.plot(train["date"], train["Temperature"])
ax1.set_ylabel("Temperature")
ax5 = ax1.twinx()
ax5.plot(train["date"], train["Occupancy"], color="g")

ax2.plot(train["date"], train["Humidity"])
ax2.set_ylabel("Humidity")
ax6 = ax2.twinx()
ax6.plot(train["date"], train["Occupancy"], color="g")

ax3.plot(train["date"], train["Light"])
ax3.set_ylabel("Light")
ax7 = ax3.twinx()
ax7.plot(train["date"], train["Occupancy"], color="g")

ax4.plot(train["date"], train["CO2"])
ax4.set_ylabel("CO2")
ax8 = ax4.twinx()
ax8.plot(train["date"], train["Occupancy"], color="g")

Out[72]:
[]

variables vs. Occupancy.png

Compare Humidity and HumidityRatio to see how closely they track each other

In [78]:

plt.style.use("ggplot")
fig=plt.figure()
ax1 = fig.add_subplot(111)
ax1.plot(train["date"],train["Humidity"])
ax2=ax1.twinx()
ax2.plot(train["date"],train["HumidityRatio"],color="b")

Out[78]:
[]

HumidityRatio/Humidity.png

Modeling

In [22]:

x_train = train.drop(["Occupancy","date"],axis=1)
y_train = train["Occupancy"]
x_test = test.drop(["Occupancy","date"],axis=1)
y_test = test["Occupancy"]

In [24]:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_curve

Standardize the features

In [25]:

from sklearn.preprocessing import StandardScaler
std = StandardScaler()
std.fit(x_train)
x_train_std = std.transform(x_train)
x_test_std = std.transform(x_test)
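Note that the scaler is fitted on the training set only, and the test set is transformed with the training statistics. The NumPy sketch below spells out that arithmetic on a tiny made-up array. The helper names `fit_standardizer` and `apply_standardizer` and the demo values are assumptions for illustration. It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-column mean and population std (ddof=0) from training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale any data set with statistics learned from the training set."""
    return (X - mean) / std

# Made-up demo values (columns: temperature, CO2), not the UCI data
X_train_demo = np.array([[20.0, 400.0], [22.0, 600.0], [24.0, 800.0]])
X_test_demo = np.array([[22.0, 1000.0]])

mean, std = fit_standardizer(X_train_demo)
train_scaled = apply_standardizer(X_train_demo, mean, std)
test_scaled = apply_standardizer(X_test_demo, mean, std)

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

The scaled training columns have zero mean and unit variance by construction. The test row does not have to, because it is measured against the training distribution.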

In [31]:

Logistic regression model

logreg = LogisticRegression()
logreg.fit(x_train_std, y_train)
y_predlog = logreg.predict(x_test_std)
# Accuracy on the training set
logreg.score(x_train_std, y_train)

Out[31]:

0.9860002456097261

Model evaluation

In [28]:

Check the F1 score on the test set

f1_score(y_test,y_predlog)

Out[28]:

0.9709418837675351

Plot the precision-recall (PR) curve

In [32]:

prob = logreg.predict_proba(x_test_std)
precision, recall, thresholds = precision_recall_curve(y_test,prob[:,1])
plt.plot(precision,recall)
plt.xlabel("precision")
plt.ylabel("recall")

Out[32]:

precision recall.png
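Each point on a PR curve corresponds to one decision threshold applied to the predicted probabilities. The sketch below computes a single such point by hand. The function name `precision_recall_at` and the demo labels and scores are made up for illustration. Returning a precision of 1.0 when nothing is predicted positive is a convention chosen here, not something taken from the notebook.

```python
import numpy as np

def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for scores >= threshold."""
    y_true = np.asarray(y_true)
    pred = np.asarray(scores) >= threshold
    tp = int(np.sum(pred & (y_true == 1)))   # true positives
    fp = int(np.sum(pred & (y_true == 0)))   # false positives
    fn = int(np.sum(~pred & (y_true == 1)))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Made-up labels and probability scores, not model output from the notebook
y_demo = [0, 0, 1, 1, 0, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]

points = {t: precision_recall_at(y_demo, scores_demo, t) for t in (0.3, 0.5, 0.7)}
for t, (p, r) in points.items():
    print(f"threshold={t}: precision={p:.3f}, recall={r:.3f}")
```

Raising the threshold generally trades recall for precision, which is the trade-off the curve visualizes.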

Generate the classification report

In [33]:

from sklearn.metrics import classification_report
print(classification_report(y_test,logreg.predict(x_test_std),target_names=["unoccupancy","occupancy"]))
              precision    recall  f1-score   support

 unoccupancy       1.00      0.97      0.98      1693
   occupancy       0.95      0.99      0.97       972

 avg / total       0.98      0.98      0.98      2665
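The f1-score column of the report is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it from the rounded precision and recall printed for the occupancy class. The function names and the small demo label lists are assumptions for illustration.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the positive class from label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f1_from_pr(precision, recall)

# Rounded values from the occupancy row of the report
occupancy_f1 = f1_from_pr(0.95, 0.99)
print(round(occupancy_f1, 2))

# Made-up labels to exercise the counting logic
demo = precision_recall_f1([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
print(demo)
```

The recomputed value rounds to 0.97, which agrees with the f1-score reported for the occupancy class.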

In [34]:

SVM model

svc = SVC()
svc.fit(x_train_std, y_train)
y_predsvc = svc.predict(x_test_std)
# Accuracy on the training set
svc.score(x_train_std, y_train)

Out[34]:

0.98882475746039544

In [35]:

f1_score(y_test,y_predsvc)

Out[35]:

0.96035678889990084

In [39]:

Random forest model

rf = RandomForestClassifier()
rf.fit(x_train_std, y_train)
y_predrf = rf.predict(x_test_std)
# Accuracy on the training set
rf.score(x_train_std, y_train)

Out[39]:

0.99987719513692741

In [44]:

f1_score(y_test, y_predrf)

Out[44]:

0.90160427807486643

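The random forest reaches about 99.99% accuracy on the training set, but its test-set F1 score (about 0.90) is clearly lower, which suggests overfitting.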

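Comparing the three models, logistic regression is chosen: it has the highest test-set F1 score (about 0.971, versus 0.960 for the SVM and 0.902 for the random forest).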
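The comparison can be written down compactly. The sketch below simply collects the numbers reported in the cells above, rounded to four decimals, and picks the model with the highest test-set F1. The dictionary layout and variable names are assumptions for illustration. Training accuracy and test F1 are different metrics, so they are shown side by side rather than subtracted.

```python
# Scores copied from the notebook outputs above, rounded to four decimals
results = {
    "LogisticRegression": {"train_accuracy": 0.9860, "test_f1": 0.9709},
    "SVC": {"train_accuracy": 0.9888, "test_f1": 0.9604},
    "RandomForestClassifier": {"train_accuracy": 0.9999, "test_f1": 0.9016},
}

# Rank models by the metric measured on held-out data
ranking = sorted(results, key=lambda name: results[name]["test_f1"], reverse=True)
best = ranking[0]

for name in ranking:
    r = results[name]
    print(f"{name}: train accuracy={r['train_accuracy']:.4f}, test F1={r['test_f1']:.4f}")
print("selected:", best)
```

Ranking by the held-out metric rather than by training accuracy is what keeps the random forest, which fits the training set almost perfectly, from being selected.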
