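Keywords
K-means, ARIMA
Preface
The main work in January was as follows:
Finer-grained data preprocessing
Filter out MACs seen at only a single location, filter out MACs that appear on fewer than 10 days, and further subdivide the location lists.
Data indexing
Keep two copies of the raw data, stored under different indexes, to make later lookups easier:
a. timestamp, place -> mac
b. date, mac -> time range: place
Headcount distribution statistics
Clustering preparation
Represent each person's time distribution over places as an ndarray (after data processing)
1. Data preprocessing and indexing
The first part of the work was only a small modification of earlier code and adds little that is new, so it is not recorded in detail here.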
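The filtering code itself is not shown in these notes. As a rough sketch only, assuming the records sit in a pandas DataFrame with the column names used later in this post (timestamp, pid, mac), the two MAC filters from the preface might look like this. The function filter_macs, its min_days parameter and the toy rows are all made up for illustration.

```python
import pandas as pd

def filter_macs(records, min_days=10):
    """Keep MACs seen at more than one place and on at least min_days distinct days."""
    day = records['timestamp'].str[:10]                       # 'YYYY-MM-DD' part
    places_per_mac = records.groupby('mac')['pid'].nunique()  # distinct place ids per MAC
    days_per_mac = day.groupby(records['mac']).nunique()      # distinct days per MAC
    mask = (places_per_mac > 1) & (days_per_mac >= min_days)
    return records[records['mac'].isin(mask[mask].index)]

# Invented toy data: 'aa' passes, 'bb' has a single place, 'cc' has a single day.
toy = pd.DataFrame({
    'timestamp': ['2017-09-11 00:00:00', '2017-09-12 00:00:00',
                  '2017-09-11 00:00:00', '2017-09-12 00:00:00',
                  '2017-09-11 00:00:00', '2017-09-11 00:01:00'],
    'pid': [141, 181, 141, 141, 141, 181],
    'mac': ['aa', 'aa', 'bb', 'bb', 'cc', 'cc'],
})
kept = filter_macs(toy, min_days=2)  # min_days lowered only to fit the toy data
```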
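For the data indexing, here is a detailed record of how a date range and a MAC are mapped to place ids:
Input parameters:
start_time  start time
end_time  end time
mac  the MAC address to look up
Output:
stime1,etime1,pid1  stay interval 1
stime2,etime2,pid2  stay interval 2
...
stimen,etimen,pidn  stay interval n
Data fragment: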
2017-09-11 00:00:00,0,141,dormitory,382dd1da2381,N
2017-09-11 00:00:00,0,142,dormitory,5844988f54a5,N
2017-09-11 00:00:00,0,145,dormitory,c8f23075fa06,N
2017-09-11 00:00:00,0,148,dormitory,1c77f6ab931e,N
2017-09-11 00:00:00,0,149,dormitory,10f681e38ca9,N
2017-09-11 00:00:00,0,149,dormitory,4c49e3406f61,N
2017-09-11 00:00:00,0,150,dormitory,bc201040b118,N
2017-09-11 00:00:00,0,150,dormitory,6021013f5b85,N
2017-09-11 00:00:00,0,150,dormitory,a444d1108e48,N
2017-09-11 00:00:00,0,151,dormitory,c8f230a5c86f,N
2017-09-11 00:00:00,0,151,dormitory,483c0cc230cc,N
2017-09-11 00:00:00,0,151,dormitory,bc7574a0e1fa,N
2017-09-11 00:00:00,0,158,edu,8056f2ea0cd9,N
2017-09-11 00:00:00,0,168,edu,74042bcb3a77,N
2017-09-11 00:00:00,0,181,canteen,40f02f4c670d,N
2017-09-11 00:00:00,0,193,edu,8844773c62e3,N
2017-09-11 00:00:00,0,240,canteen,4c1a3d3f0f21,N
2017-09-11 00:01:00,0,141,dormitory,382dd1da2381,N
2017-09-11 00:01:00,0,142,dormitory,5844988f54a5,N
2017-09-11 00:01:00,0,145,dormitory,c8f23075fa06,N
2017-09-11 00:01:00,0,148,dormitory,1c77f6ab931e,N
2017-09-11 00:01:00,0,149,dormitory,10f681e38ca9,N
2017-09-11 00:01:00,0,150,dormitory,bc201040b118,N
2017-09-11 00:01:00,0,150,dormitory,6021013f5b85,N
2017-09-11 00:01:00,0,150,dormitory,a444d1108e48,N
2017-09-11 00:01:00,0,151,dormitory,483c0cc230cc,N
2017-09-11 00:01:00,0,151,dormitory,bc7574a0e1fa,N
2017-09-11 00:01:00,0,151,dormitory,c8f230a5c86f,N
2017-09-11 00:01:00,0,158,edu,8056f2ea0cd9,N
2017-09-11 00:01:00,0,168,edu,74042bcb3a77,N
Python code:
# -*- coding: UTF-8 -*-
import csv
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from dateutil.parser import parse
import datetime
import time
__author__ = 'SuZibo'
"""
Data re-indexing 2
date, mac -> time range: place
Data format after re-indexing:
start time 1, end time 1, place id 1
start time 2, end time 2, place id 2
start time 3, end time 3, place id 3
...
It also yields the trajectory array for the given period:
[142, 202, 142, 202, 200, 202, 200, 142, 142](example)
Input parameters: MAC address, start time, end time
"""
start_time ='2017-09-11 00:00:00'
end_time ='2017-09-18 00:00:00'
mac ='205d4717e6de'
def findpathByMacDate(mac,start_time,end_time):
    records = pd.read_csv('./macdata/normalinfo_trans.txt',names=['timestamp','timerange','pid','ptype','mac','isholiday'])
    # read the source data and name the columns (timestamp, time range, place id, place type, mac, holiday flag)
    records_select = records[(records['mac']==mac) &(records['timestamp'] >start_time) &(records['timestamp']
Re-indexing the data looks tedious at first, but with pandas aggregation and filtering it takes very little code and is easy to implement.
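Since the listing above is cut off, here is a minimal sketch of how the stay intervals and the trajectory array could be derived with pandas. It is not the original implementation: find_path_by_mac_date is a hypothetical name, the interval is assumed to be half-open [start_time, end_time), and the third sample row (the canteen record) is invented so that the place actually changes.

```python
import io
import pandas as pd

COLUMNS = ['timestamp', 'timerange', 'pid', 'ptype', 'mac', 'isholiday']

def find_path_by_mac_date(records, mac, start_time, end_time):
    """Return (segments, path) for one MAC; segments are (stime, etime, pid) tuples."""
    selected = records[(records['mac'] == mac)
                       & (records['timestamp'] >= start_time)
                       & (records['timestamp'] < end_time)].sort_values('timestamp')
    if selected.empty:
        return [], []
    # A new stay starts whenever the place id differs from the previous row.
    stay_id = (selected['pid'] != selected['pid'].shift()).cumsum()
    segments = []
    for _, stay in selected.groupby(stay_id):
        segments.append((stay['timestamp'].iloc[0],
                         stay['timestamp'].iloc[-1],
                         int(stay['pid'].iloc[0])))
    path = [pid for _, _, pid in segments]
    return segments, path

sample = io.StringIO(
    "2017-09-11 00:00:00,0,141,dormitory,382dd1da2381,N\n"
    "2017-09-11 00:01:00,0,141,dormitory,382dd1da2381,N\n"
    "2017-09-11 00:02:00,0,181,canteen,382dd1da2381,N\n"   # invented row
    "2017-09-11 00:00:00,0,142,dormitory,5844988f54a5,N\n"
)
# dtype=str keeps all-digit MACs from being parsed as numbers.
records = pd.read_csv(sample, names=COLUMNS, dtype={'mac': str})
segments, path = find_path_by_mac_date(
    records, '382dd1da2381', '2017-09-11 00:00:00', '2017-09-12 00:00:00')
```

Timestamps are compared as strings, which works because the 'YYYY-MM-DD HH:MM:SS' format sorts lexicographically. This sketch merges consecutive records at the same place even if minutes are missing in between; whether such gaps should split a stay is a design decision the original notes do not settle.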
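2. Headcount distribution statistics
Task:
Count MACs along three dimensions: date, time slot, and place type (place). A bar chart shows two dimensions at a time with the third held fixed, and the third dimension can be switched to make patterns easier to spot.
Input: start_time, end_time
Per-day output: number of MACs for each place type
Per-time-slot output: number of MACs for each place type
Columns of the returned file:
dormitory, canteen, teaching building, stadium / student activity center
Python code: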
# -*- coding: UTF-8 -*-
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from dateutil.parser import parse
import datetime
__author__ = 'SuZibo'
"""
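Count MACs along three dimensions: date, time slot, and place type (place). A bar chart shows two dimensions at a time with the third held fixed, and the third dimension can be switched to make patterns easier to spot.
Input: start_time, end_time
Per-day output: number of MACs for each place type
Per-time-slot output: number of MACs for each place type
Columns of the returned file:
dormitory, canteen, teaching building, stadium / student activity center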
"""
dormitory =[141,142,145,146,148,149,150,151,152,153]
canteen =[171,172,173,174,175,176,177,178,179,180,181,182,240]
edu =[54,60,133,134,136,154,155,156,157,158,159,160,161,162,164,165,166,167,168,169,193,194,195,196,197,198,199,200,201,202,203,204,205,207,208,209,213,214,215,216,218,219,227,230,231,233,234,235,236,237,238,239]
stadium =[183,184,185,186,187,188,189,190,191,232]
stime ='2017-09-11 00:00:00'
# etime ='2017-09-12 00:00:00'
etime ='2017-09-12 00:00:00'
# lower and upper time bounds for the inner loop
weekdaylist =[]
start_date = '2017-09-11'
# end_date = '2017-11-13'
end_date='2017-11-13'
# lower and upper date bounds for the outer loop
sdate = datetime.datetime.strptime(start_date,'%Y-%m-%d')
edate = datetime.datetime.strptime(end_date,'%Y-%m-%d')
while sdate
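The headcount statistics rely heavily on the get method of Python dictionaries.
A brief summary of dict.get:
Syntax
Syntax of the get() method:
dict.get(key, default=None)
Parameters
key -- the key to look up in the dictionary.
default -- the value to return when the key is not present.
Return value
Returns the value stored under the key, or the default (None unless specified) if the key is not in the dictionary.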
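To make the use of get concrete, here is a small sketch of counting distinct MACs per day and place type. The five rows come from the data fragment earlier in this post, reduced to (timestamp, place type, mac). The variable names and the set-based deduplication are assumptions for illustration, not the original statistics code.

```python
rows = [
    ('2017-09-11 00:00:00', 'dormitory', '382dd1da2381'),
    ('2017-09-11 00:01:00', 'dormitory', '382dd1da2381'),  # same MAC one minute later
    ('2017-09-11 00:00:00', 'dormitory', '5844988f54a5'),
    ('2017-09-11 00:00:00', 'edu', '8056f2ea0cd9'),
    ('2017-09-11 00:00:00', 'canteen', '40f02f4c670d'),
]

seen = set()     # (day, place type, mac) triples already counted
mac_counts = {}  # {day: {place type: number of distinct MACs}}
for timestamp, ptype, mac in rows:
    day = timestamp[:10]
    if (day, ptype, mac) in seen:
        continue  # count each MAC once per day and place type
    seen.add((day, ptype, mac))
    per_day = mac_counts.setdefault(day, {})
    # get() returns 0 the first time a place type is seen, so no key check is needed.
    per_day[ptype] = per_day.get(ptype, 0) + 1

missing = mac_counts['2017-09-11'].get('stadium')     # no default given -> None
missing_zero = mac_counts['2017-09-11'].get('stadium', 0)
```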
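3. Building the per-person time distribution matrix
Task:
Using male_dor, famale_dor, postgraduate_dor, net, hospital, canteen, edu, lab, stadium, activity, administration and library as attributes,
build a matrix of how long each person was present at each place type (indexed by mac).
Python code: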
# -*- coding: UTF-8 -*-
import csv
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from dateutil.parser import parse
import datetime
import time
__author__ = 'SuZibo'
"""
Per-person time feature matrix (distribution over places)
Place lists
male_dor=[141,145,146,149,151]
# male dormitories
famale_dor=[148,150,152,153]
# female dormitories
postgraduate_dor=[142]
# postgraduate dormitory
net=[217,229]
# network center
hospital=[192]
# campus hospital
canteen=[171,172,173,174,175,176,177,178,179,180,181,182,240]
# canteens
edu=[54,60,133,134,136,196,197,198,199,200,201,202,203,204,205,207,208,209,213,214,215,216,218,219,230,231,239]
# teaching buildings
lab=[155,156,157,158,159,160,161,162,164,165,166,167,168,169,233,234,235,236,237,238]
# laboratories
stadium=[189,190,191]
# stadium
activity=[183,184,185,186,187,188,232]
# student activity center
administration=[221,222,223]
# administration building
library=[193,194,195,227]
# library
"""
mac_time_dic =dict()
# dictionary holding the time statistics for each mac; the source data has one record per minute, so summing the records gives the duration directly (in minutes)
# start_time ='2017-09-11 00:00:00'
# end_time ='2017-11-13 00:00:00'
# frame_data = pd.read_csv('../macinfo/macdata/normalinfo_trans_v2.txt',header=None)
# print frame_data.tail()
with open('../macinfo/macdata/normalinfo_trans_v2.txt') as file:
    for line in file:
        # print line
        line = line.split(',')
        line[-1] = line[-1].strip('\n')
        # line[4] is the mac, line[3] is the place type
        if line[4] not in mac_time_dic:
            mac_time_dic[line[4]] = dict()
        mac_time_dic[line[4]][line[3]] = mac_time_dic[line[4]].get(line[3], 0) + 1
        #{mac1:{place1:m,place2:n,...},...}
        #{'10b1f8f3a4d0': {'famale_dor': 10, 'male_dor': 507, 'hospital': 10, 'activity': 10, 'library': 4, 'edu': 41, 'canteen': 86...},...}
# print mac_time_dic
# print list(mac_time_dic.iteritems())
# print list(mac_time_dic.values())
# list1 = list(mac_time_dic.values())
# print list(mac_time_dic.keys())
frame = DataFrame(list(mac_time_dic.values()),columns=['male_dor','famale_dor','postgraduate_dor','net','hospital','canteen','edu','lab','stadium','activity','administration','library'],index=list(mac_time_dic.keys()))
# convert to a DataFrame indexed by mac
frame = frame.dropna(how='all')
# drop rows in which every column is NA
frame = frame.fillna(0)
# fill the remaining NA values with 0
frame.to_csv('./data/user_time_array_includex.csv')
frame.to_csv('./data/user_time_array.csv',index=False,header=False)
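As an aside, the same minutes-per-place table can probably be produced directly in pandas. The sketch below uses pd.crosstab on invented toy rows and is offered as an alternative under that assumption, not as the method used above. PLACE_TYPES simply repeats the attribute list from the task description.

```python
import pandas as pd

PLACE_TYPES = ['male_dor', 'famale_dor', 'postgraduate_dor', 'net', 'hospital', 'canteen',
               'edu', 'lab', 'stadium', 'activity', 'administration', 'library']

# Invented toy records: one row per minute a MAC was seen at a place type.
records = pd.DataFrame({
    'mac':   ['aa', 'aa', 'aa', 'bb'],
    'ptype': ['male_dor', 'male_dor', 'canteen', 'library'],
})

# crosstab counts rows per (mac, ptype); reindex fixes the column order and
# adds place types that never occur as all-zero columns.
minutes = pd.crosstab(records['mac'], records['ptype']).reindex(
    columns=PLACE_TYPES, fill_value=0)
```

Because every record is one minute, each cell is already a duration in minutes, and missing combinations come out as 0 without a separate fillna step.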
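4. Building the per-person frequency distribution matrix
Continuing from 3: Android and iOS behave differently. The former stays connected to Wi-Fi after the screen locks, while the latter drops the wireless connection shortly after locking. Measuring a person's features by duration is therefore not accurate enough, so the person-by-place vector matrix is built from visit frequency instead.
PS: for certain areas the day should also be split into time windows to tell groups of people apart, for example 07:00-22:00 versus all other hours for teaching buildings.
Some extra data operations were therefore added on top of the work above.
Python code 1:
Place frequency counts that need no time-window split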
# -*- coding: UTF-8 -*-
import csv
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from dateutil.parser import parse
import datetime
import time
__author__ = 'SuZibo'
"""
Per-person time feature matrix (distribution over places)
Place lists
male_dor=[141,145,146,149,151]
famale_dor=[148,150,152,153]
postgraduate_dor=[142]
net=[217,229]
hospital=[192]
canteen=[171,172,173,174,175,176,177,178,179,180,181,182,240]
edu=[54,60,133,134,136,196,197,198,199,200,201,202,203,204,205,207,208,209,213,214,215,216,218,219,230,231,239]
lab=[155,156,157,158,159,160,161,162,164,165,166,167,168,169,233,234,235,236,237,238]
stadium=[189,190,191]
activity=[183,184,185,186,187,188,232]
administration=[221,222,223]
library=[193,194,195,227]
Final data structure: {'teaching building (07:00-22:00)': 1, 'teaching building (other hours)': 0, 'male dormitory': 0, 'postgraduate dormitory': 0, 'female dormitory': 0, 'student activity center (07:00-21:00)': 0, 'student activity center (other hours)': 0, 'administration building (07:00-21:00)': 0, 'administration building (other hours)': 0, 'laboratory building (07:00-21:00)': 0, 'laboratory building (other hours)': 0, 'canteen (07:00-23:00)': 0, 'canteen (other hours)': 0}
edu,edu1,male_dor,postgraduate_dor,famale_dor,activity,activity1,administration,administration1,lab,lab1,canteen,canteen1,library,hospital,stadium
"""
mac_count_dic = dict()
with open('../macinfo/macdata/normalinfo_trans_v2.txt') as file:
    for line in file:
        # print line
        line = line.split(',')
        line[-1] = line[-1].strip('\n')
        # 'MM-DD' part of the timestamp
        day = line[0][5:10]
        if line[4] not in mac_count_dic:
            mac_count_dic[line[4]] = dict()
        if line[3] not in mac_count_dic[line[4]]:
            mac_count_dic[line[4]][line[3]] = dict()
        mac_count_dic[line[4]][line[3]][day] = mac_count_dic[line[4]][line[3]].get(day,0)+1
        # build the nested dictionary
        # mac_count_dic[mac][place][day] -> number of records on that day
# print mac_count_dic
rs = open('./data/user_count_array_includex.csv','w')
for key in mac_count_dic:
    # iterate over the dictionary built above
    mac = key
    dis = mac_count_dic[key]
    # un-nest: dis maps place type -> {day: record count}
    # the frequency is the number of distinct days the mac was seen at a place type;
    # dict.get with an empty dict as default yields 0 when the place type is absent
    male_dor_count = len(dis.get('male_dor', {}))
    famale_dor_count = len(dis.get('famale_dor', {}))
    postgraduate_dor_count = len(dis.get('postgraduate_dor', {}))
    net_count = len(dis.get('net', {}))
    hospital_count = len(dis.get('hospital', {}))
    stadium_count = len(dis.get('stadium', {}))
    rs.write(str(mac)+','+str(male_dor_count)+','+str(famale_dor_count)+','+str(postgraduate_dor_count)+','+str(net_count)+','+str(hospital_count)+','+str(stadium_count).strip('\n')+'\n')
rs.close()
# mac,male_count,famale_count,...
# mac is the index
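In the same way, build one frequency dictionary for the 07:00-22:00 window and another for the remaining (extra) hours.
Then create three DataFrame objects named df1, df2 and df3.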
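The windowed counting code is not included in these notes. A minimal sketch of the idea, assuming the frequency is the number of distinct days as in the listing above, could look like this. The helper names, the windows mapping, the '_extra' label suffix and the toy rows are illustrative assumptions.

```python
def in_window(timestamp, start, end):
    """True if the HH:MM part of 'YYYY-MM-DD HH:MM:SS' lies in [start, end)."""
    return start <= timestamp[11:16] < end

def count_days(rows, windows):
    """rows: (timestamp, ptype, mac) tuples; windows: {ptype: (start, end)}.

    Returns {mac: {label: number of distinct days}}, where records outside
    a place type's window are counted under the label ptype + '_extra'.
    """
    days = {}
    for timestamp, ptype, mac in rows:
        label = ptype
        if ptype in windows and not in_window(timestamp, *windows[ptype]):
            label = ptype + '_extra'
        # store the 'MM-DD' day in a set so repeated minutes count once
        days.setdefault(mac, {}).setdefault(label, set()).add(timestamp[5:10])
    return {mac: {label: len(seen) for label, seen in places.items()}
            for mac, places in days.items()}

# Invented toy rows for a single MAC.
rows = [
    ('2017-09-11 08:00:00', 'edu', 'aa'),
    ('2017-09-11 08:01:00', 'edu', 'aa'),
    ('2017-09-11 23:30:00', 'edu', 'aa'),       # outside 07:00-22:00
    ('2017-09-12 09:00:00', 'edu', 'aa'),
    ('2017-09-12 09:00:00', 'male_dor', 'aa'),  # place type without a window
]
counts = count_days(rows, windows={'edu': ('07:00', '22:00')})
```

Comparing the HH:MM substrings as text works because the times are zero-padded.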
Python code 2:
Merging the DataFrame objects
# -*- coding: UTF-8 -*-
import csv
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from dateutil.parser import parse
import datetime
import time
__author__ = 'SuZibo'
df1 = pd.read_csv('./data/user_count_array_includex_1.csv',names=['canteen','edu','lab','activity','administration','library'])
df2 = pd.read_csv('./data/user_count_array_includex_1_extra.csv',names=['canteen_extra','edu_extra','lab_extra','activity_extra','administration_extra','library_extra'])
df3 = pd.read_csv('./data/user_count_array_includex.csv',names=['male_dor','famale_dor','postgraduate_dor','net','hospital','stadium'])
# print len(df1)
# print len(df2)
# print len(df3)
df = df2.join(df1)
# print df
df = df.join(df3)
df = df.dropna(how='all')
df = df.fillna(0)
# print df
df.to_csv('./data/user_TimeArray_includex.csv')
# write a CSV that keeps the index and header
df.to_csv('./data/user_TimeArray.csv',index=False,header=False)
# write a CSV without index or header
This completes the per-person frequency vector matrix.
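One detail of the merge is worth illustrating: DataFrame.join aligns on the index and defaults to a left join, so only MACs present in the left frame survive, and values missing from the right frame become NaN before fillna(0) turns them into 0.0. The tiny frames below are invented to show this and do not come from the real files.

```python
import pandas as pd

df1 = pd.DataFrame({'canteen': [15, 9]},
                   index=['483b38cac86d', '786256354ae3'])
df2 = pd.DataFrame({'canteen_extra': [15, 7]},
                   index=['483b38cac86d', '908d6c7faa0c'])

# Default (left) join: keeps only the MACs that appear in df2.
left = df2.join(df1).fillna(0)

# Outer join: keeps every MAC that appears in either frame.
outer = df2.join(df1, how='outer').fillna(0)
```

If MACs that appear only in the later frames should be kept, how='outer' is the option to consider. Whether dropping them is intended here is not stated in the notes.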
Matrix fragment:
,canteen_extra,edu_extra,lab_extra,activity_extra,administration_extra,library_extra,canteen,edu,lab,activity,administration,library,male_dor,famale_dor,postgraduate_dor,net,hospital,stadium
483b38cac86d,15,10,0,0,0,3,15.0,9.0,0.0,0.0,0.0,3.0,0,12,0,3,1,1
786256354ae3,9,1,1,0,0,0,9.0,1.0,1.0,0.0,0.0,0.0,5,0,0,0,1,0
908d6c7faa0c,7,13,0,0,0,2,7.0,13.0,0.0,0.0,0.0,2.0,0,6,0,0,0,0
4c49e31c7c69,20,10,3,6,13,19,20.0,10.0,3.0,6.0,13.0,19.0,0,1,22,0,4,3
58449877c1c5,3,7,8,0,2,1,3.0,7.0,8.0,0.0,2.0,1.0,0,0,4,0,0,2
64cc2e771dd3,21,10,6,0,2,6,21.0,10.0,6.0,0.0,2.0,6.0,3,38,0,0,4,1
9cb2b2c7ad65,3,10,2,0,10,0,3.0,10.0,2.0,0.0,10.0,0.0,0,0,0,1,0,0
742344e4ff39,10,3,1,0,0,1,10.0,3.0,1.0,0.0,0.0,1.0,5,0,0,0,0,0
1ccde57a678a,7,4,0,0,0,6,7.0,4.0,0.0,0.0,0.0,6.0,11,0,0,0,1,0
ecdf3ad00c44,15,9,3,0,0,0,15.0,9.0,3.0,0.0,0.0,0.0,0,13,0,1,0,0
f431c39cf8cc,8,4,0,0,0,0,8.0,4.0,0.0,0.0,0.0,0.0,12,0,0,2,1,1
f40e22420be9,18,32,14,0,11,8,17.0,32.0,12.0,0.0,11.0,8.0,5,0,0,0,0,1
68fb7eee63e9,13,6,0,0,1,1,13.0,6.0,0.0,0.0,1.0,1.0,0,15,0,0,2,0
205d47642a4c,17,12,4,2,0,1,17.0,12.0,4.0,2.0,0.0,1.0,27,4,0,0,2,12
b0e235c341d5,13,11,0,1,0,1,13.0,11.0,0.0,1.0,0.0,1.0,0,18,0,0,0,0
The next post will describe and study the ARIMA model.