January Work Log


Keywords
K_means, ARIMA


Preface

The main work in January was as follows:

Refined data preprocessing
Filter out macs seen at only a single place, filter out macs that appear on fewer than 10 days, and further subdivide the place list;

Data indexing
Keep two copies of the raw data under different indexes for later lookup:
a. timestamp, place -> mac
b. date, mac -> time range: place

Person-count distribution statistics

Clustering preparation
Represent each person's time distribution over places as an ndarray (after data processing)


1. Data indexing

The first part of the work was only a light revision of earlier code and not very substantial, so it is not recorded in detail here.
For the data indexing, here is a detailed record of looking up place ids by date and mac:

Input parameters:

start_time  start time
end_time    end time
mac         the mac address to look up

Output:

stime1,etime1,pid1  stay interval 1
stime2,etime2,pid2  stay interval 2
...
stimen,etimen,pidn  stay interval n

Data fragment:

2017-09-11 00:00:00,0,141,dormitory,382dd1da2381,N
2017-09-11 00:00:00,0,142,dormitory,5844988f54a5,N
2017-09-11 00:00:00,0,145,dormitory,c8f23075fa06,N
2017-09-11 00:00:00,0,148,dormitory,1c77f6ab931e,N
2017-09-11 00:00:00,0,149,dormitory,10f681e38ca9,N
2017-09-11 00:00:00,0,149,dormitory,4c49e3406f61,N
2017-09-11 00:00:00,0,150,dormitory,bc201040b118,N
2017-09-11 00:00:00,0,150,dormitory,6021013f5b85,N
2017-09-11 00:00:00,0,150,dormitory,a444d1108e48,N
2017-09-11 00:00:00,0,151,dormitory,c8f230a5c86f,N
2017-09-11 00:00:00,0,151,dormitory,483c0cc230cc,N
2017-09-11 00:00:00,0,151,dormitory,bc7574a0e1fa,N
2017-09-11 00:00:00,0,158,edu,8056f2ea0cd9,N
2017-09-11 00:00:00,0,168,edu,74042bcb3a77,N
2017-09-11 00:00:00,0,181,canteen,40f02f4c670d,N
2017-09-11 00:00:00,0,193,edu,8844773c62e3,N
2017-09-11 00:00:00,0,240,canteen,4c1a3d3f0f21,N
2017-09-11 00:01:00,0,141,dormitory,382dd1da2381,N
2017-09-11 00:01:00,0,142,dormitory,5844988f54a5,N
2017-09-11 00:01:00,0,145,dormitory,c8f23075fa06,N
2017-09-11 00:01:00,0,148,dormitory,1c77f6ab931e,N
2017-09-11 00:01:00,0,149,dormitory,10f681e38ca9,N
2017-09-11 00:01:00,0,150,dormitory,bc201040b118,N
2017-09-11 00:01:00,0,150,dormitory,6021013f5b85,N
2017-09-11 00:01:00,0,150,dormitory,a444d1108e48,N
2017-09-11 00:01:00,0,151,dormitory,483c0cc230cc,N
2017-09-11 00:01:00,0,151,dormitory,bc7574a0e1fa,N
2017-09-11 00:01:00,0,151,dormitory,c8f230a5c86f,N
2017-09-11 00:01:00,0,158,edu,8056f2ea0cd9,N
2017-09-11 00:01:00,0,168,edu,74042bcb3a77,N

Python code:

# -*- coding: UTF-8 -*-

import csv
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from dateutil.parser import parse
import datetime
import time

__author__ = 'SuZibo'
"""
Data reindexing, part 2
date, mac -> time range: place

Data format after reindexing:
start time 1, end time 1, place id 1
start time 2, end time 2, place id 2
start time 3, end time 3, place id 3
...
Also yields the trajectory array within the given time window:
[142, 202, 142, 202, 200, 202, 200, 142, 142] (example)
Input parameters: mac address, start time, end time
"""

start_time ='2017-09-11 00:00:00'
end_time ='2017-09-18 00:00:00'
mac ='205d4717e6de'

def findpathByMacDate(mac,start_time,end_time):

    records = pd.read_csv('./macdata/normalinfo_trans.txt',names=['timestamp','timerange','pid','ptype','mac','isholiday'])
    # Read the source data and name the columns (timestamp, time range, place id, place type, mac, holiday flag)
    records_select = records[(records['mac']==mac) & (records['timestamp'] > start_time) & (records['timestamp'] < end_time)]
    # Keep this mac's rows inside the window; "< end_time" is the assumed close of the
    # condition that was truncated here (string comparison is chronological for this
    # timestamp format)
    return records_select

Reindexing the data looked fairly fiddly, but with pandas aggregation and filtering applied well, the code turned out short and easy to write.
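As a sketch of the interval grouping idea (a toy DataFrame stands in for the real source data; column names follow the log above), consecutive minute-level rows with the same pid can be collapsed into (stime, etime, pid) stays with pandas:

```python
import pandas as pd

# Toy stand-in for one mac's filtered records (column names as in the log)
records = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2017-09-11 00:00:00', '2017-09-11 00:01:00',
        '2017-09-11 00:02:00', '2017-09-11 00:03:00']),
    'pid': [141, 141, 202, 202],
})

# A new stay starts whenever pid differs from the previous row
stay_id = (records['pid'] != records['pid'].shift()).cumsum()
stays = records.groupby(stay_id).agg(
    stime=('timestamp', 'min'),
    etime=('timestamp', 'max'),
    pid=('pid', 'first'),
)
print(stays)  # two stays: (00:00-00:01, 141) and (00:02-00:03, 202)
```

The shift/cumsum trick assigns one id per run of identical pids, so a single groupby yields all stay intervals at once.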


2. Person distribution statistics

Work content:

Count the number of macs along three dimensions: date, time period, and place type (place). A bar chart shows two dimensions at a time (with the third fixed); the displayed third dimension can be switched to make features easier to observe.

Input: start_time, end_time
Output per day: mac counts per place type
Output per time period: mac counts per place type

Returned file attributes:
dormitory, canteen, teaching building, stadium / student activity center

Python code:

# -*- coding: UTF-8 -*-

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from dateutil.parser import parse
import datetime

__author__ = 'SuZibo'

"""
Count the number of macs along three dimensions: date, time period, and place type (place).
A bar chart shows two dimensions at a time (with the third fixed); the displayed third
dimension can be switched to make features easier to observe.

Input: start_time, end_time
Output per day: mac counts per place type
Output per time period: mac counts per place type

Returned file attributes:
dormitory, canteen, teaching building, stadium / student activity center
"""

dormitory =[141,142,145,146,148,149,150,151,152,153]
canteen =[171,172,173,174,175,176,177,178,179,180,181,182,240]
edu =[54,60,133,134,136,154,155,156,157,158,159,160,161,162,164,165,166,167,168,169,193,194,195,196,197,198,199,200,201,202,203,204,205,207,208,209,213,214,215,216,218,219,227,230,231,233,234,235,236,237,238,239]
stadium =[183,184,185,186,187,188,189,190,191,232]

stime ='2017-09-11 00:00:00'
etime ='2017-09-12 00:00:00'
# Time bounds for the inner (per-day) loop

weekdaylist =[]
start_date = '2017-09-11'
end_date = '2017-11-13'
# Time bounds for the outer loop

sdate = datetime.datetime.strptime(start_date,'%Y-%m-%d')
edate = datetime.datetime.strptime(end_date,'%Y-%m-%d')
while sdate < edate:
    # Assumed reconstruction of the truncated loop body: collect each date
    # string in the range, then advance one day
    weekdaylist.append(sdate.strftime('%Y-%m-%d'))
    sdate += datetime.timedelta(days=1)

The person-count statistics rely on the get method of Python dictionaries.
A brief description of dict.get:

Syntax:

dict.get(key, default=None)

Parameters:
key -- the key to look up in the dictionary.
default -- the value returned when the key is not present.

Return value:
The value for the given key; if the key is not in the dictionary, the default (None unless specified).
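The counting pattern used throughout these scripts can be shown in isolation (the place-type list here is made up for the example):

```python
# Hypothetical fragment: tally records per place type. dict.get(key, 0)
# returns 0 for a key not seen yet, so no membership test is needed.
place_types = ['dormitory', 'edu', 'dormitory', 'canteen', 'dormitory']

counts = {}
for p in place_types:
    counts[p] = counts.get(p, 0) + 1

print(counts)  # {'dormitory': 3, 'edu': 1, 'canteen': 1}
```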


3. Building the person time-distribution matrix

Work content:
With male_dor, famale_dor, postgraduate_dor, net, hospital, canteen, edu, lab, stadium, activity, administration, library as attributes,
build a matrix of each person's presence duration (indexed by mac).

Python code:

# -*- coding: UTF-8 -*-

import csv
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from dateutil.parser import parse
import datetime
import time

__author__ = 'SuZibo'

"""
Per-person time feature matrix (distribution over places)

Place lists
male_dor=[141,145,146,149,151]
# Male dormitories
famale_dor=[148,150,152,153]
# Female dormitories
postgraduate_dor=[142]
# Postgraduate dormitories
net=[217,229]
# Network center
hospital=[192]
# Campus hospital
canteen=[171,172,173,174,175,176,177,178,179,180,181,182,240]
# Canteens
edu=[54,60,133,134,136,196,197,198,199,200,201,202,203,204,205,207,208,209,213,214,215,216,218,219,230,231,239]
# Teaching buildings
lab=[155,156,157,158,159,160,161,162,164,165,166,167,168,169,233,234,235,236,237,238]
# Laboratories
stadium=[189,190,191]
# Stadium
activity=[183,184,185,186,187,188,232]
# Student activity center
administration=[221,222,223]
# Administration building
library=[193,194,195,227]
# Library
"""
mac_time_dic =dict()
# Dictionary of per-mac time statistics: the source data has a 1-minute period,
# so simply accumulating matching rows yields the stay duration in minutes

# start_time ='2017-09-11 00:00:00'
# end_time ='2017-11-13 00:00:00'

# frame_data = pd.read_csv('../macinfo/macdata/normalinfo_trans_v2.txt',header=None)
# print frame_data.tail()

with open('../macinfo/macdata/normalinfo_trans_v2.txt') as file:

    for line in file:
        line = line.split(',')
        line[-1] = line[-1].strip('\n')

        if line[4] not in mac_time_dic:
            mac_time_dic[line[4]] = dict()
        mac_time_dic[line[4]][line[3]] = mac_time_dic[line[4]].get(line[3], 0) + 1
# Result shape: {mac1:{place1:m,place2:n,...},...}
# e.g. {'10b1f8f3a4d0': {'famale_dor': 10, 'male_dor': 507, 'hospital': 10, 'activity': 10, 'library': 4, 'edu': 41, 'canteen': 86...},...}

# print mac_time_dic
# print list(mac_time_dic.iteritems())
# print list(mac_time_dic.values())
# list1 = list(mac_time_dic.values())
# print list(mac_time_dic.keys())

frame = DataFrame(list(mac_time_dic.values()),columns=['male_dor','famale_dor','postgraduate_dor','net','hospital','canteen','edu','lab','stadium','activity','administration','library'],index=list(mac_time_dic.keys()))
# Convert to a DataFrame indexed by mac, one column per place attribute
# ('lab' included, matching the attribute list in the work description)
frame = frame.dropna(how='all')
# Drop rows that are entirely NA
frame = frame.fillna(0)
# Fill remaining NA entries with 0
frame.to_csv('./data/user_time_array_includex.csv')
frame.to_csv('./data/user_time_array.csv',index=False,header=False)

4. Generating the person frequency-distribution matrix

Following on from section 3: Android and iOS behave differently here. An Android phone with wifi on stays connected after the screen locks, while an iOS device drops the wireless connection a short while after locking. Measuring person features by duration is therefore not accurate enough, so instead we build the person-place vector matrix in units of visit frequency.

P.S.: We would also like to split certain areas into time periods to separate groups of people, e.g. the teaching buildings into 7:00-22:00 and all other times.
Some extra data operations were therefore added on top of the above.
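As a rough illustration of the time-period split (the function name and slicing are made up here, not the original code), each record can be routed to a window by the hour field of its timestamp:

```python
# in_day_window decides whether a record's timestamp falls in the daytime
# window; the default 07:00-22:00 matches the teaching-building split above.
def in_day_window(timestamp, start_hour=7, end_hour=22):
    hour = int(timestamp[11:13])  # hour field of 'YYYY-MM-DD HH:MM:SS'
    return start_hour <= hour < end_hour

print(in_day_window('2017-09-11 08:30:00'))  # True  -> daytime count
print(in_day_window('2017-09-11 23:10:00'))  # False -> "extra" count
```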

Python code 1:
Place-frequency counts for places that do not need time-period splitting

# -*- coding: UTF-8 -*-

import csv
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from dateutil.parser import parse
import datetime
import time

__author__ = 'SuZibo'

"""
Per-person frequency feature matrix (distribution over places)

Place lists
male_dor=[141,145,146,149,151]
famale_dor=[148,150,152,153]
postgraduate_dor=[142]
net=[217,229]
hospital=[192]
canteen=[171,172,173,174,175,176,177,178,179,180,181,182,240]
edu=[54,60,133,134,136,196,197,198,199,200,201,202,203,204,205,207,208,209,213,214,215,216,218,219,230,231,239]
lab=[155,156,157,158,159,160,161,162,164,165,166,167,168,169,233,234,235,236,237,238]
stadium=[189,190,191]
activity=[183,184,185,186,187,188,232]
administration=[221,222,223]
library=[193,194,195,227]


Final data structure: {'edu (07:00-22:00)': 1, 'edu (other times)': 0, 'male_dor': 0, 'postgraduate_dor': 0, 'famale_dor': 0, 'activity (07:00-21:00)': 0, 'activity (other times)': 0, 'administration (07:00-21:00)': 0, 'administration (other times)': 0, 'lab (07:00-21:00)': 0, 'lab (other times)': 0, 'canteen (07:00-23:00)': 0, 'canteen (other times)': 0}
edu,edu1,male_dor,postgraduate_dor,famale_dor,activity,activity1,administration,administration1,lab,lab1,canteen,canteen1,library,hospital,stadium
"""

mac_count_dic = dict()
with open('../macinfo/macdata/normalinfo_trans_v2.txt') as file:
    for line in file:
        line = line.split(',')
        line[-1] = line[-1].strip('\n')
        day = line[0][5:10]
        # Extract the 'MM-DD' part of the timestamp

        if line[4] not in mac_count_dic:
            mac_count_dic[line[4]] = dict()

        if line[3] not in mac_count_dic[line[4]]:
            mac_count_dic[line[4]][line[3]] = dict()

        mac_count_dic[line[4]][line[3]][day] = mac_count_dic[line[4]][line[3]].get(day,0)+1
# Nested dict keyed by mac:
# mac_count_dic['mac']['place'] holds the per-day counts, so its keys are the days seen

rs = open('./data/user_count_array_includex.csv','w')

for key in mac_count_dic:
    # Iterate over the nested dict built above
    mac = key
    dis = mac_count_dic[key]
    # For each place type, count the distinct days this mac appeared there;
    # dict.get with an empty-dict default replaces the Python-2-only has_key checks
    male_dor_count = len(dis.get('male_dor', {}))
    famale_dor_count = len(dis.get('famale_dor', {}))
    postgraduate_dor_count = len(dis.get('postgraduate_dor', {}))
    net_count = len(dis.get('net', {}))
    hospital_count = len(dis.get('hospital', {}))
    stadium_count = len(dis.get('stadium', {}))

    rs.write(str(mac)+','+str(male_dor_count)+','+str(famale_dor_count)+','+str(postgraduate_dor_count)+','+str(net_count)+','+str(hospital_count)+','+str(stadium_count)+'\n')
    # Written inside the loop so every mac gets a row
rs.close()
# One row per mac: mac,male_dor_count,famale_dor_count,...
# mac serves as the index

In the same way, obtain the frequency dicts for the 7:00-22:00 window and for the extra (remaining) window.
Build three DataFrame objects, named df1, df2, df3.

Python code 2:
Merging the DataFrame objects

# -*- coding: UTF-8 -*-

import csv
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from dateutil.parser import parse
import datetime
import time

__author__ = 'SuZibo'

df1 = pd.read_csv('./data/user_count_array_includex_1.csv',names=['canteen','edu','lab','activity','administration','library'])
df2 = pd.read_csv('./data/user_count_array_includex_1_extra.csv',names=['canteen_extra','edu_extra','lab_extra','activity_extra','administration_extra','library_extra'])
df3 = pd.read_csv('./data/user_count_array_includex.csv',names=['male_dor','famale_dor','postgraduate_dor','net','hospital','stadium'])

# print len(df1)
# print len(df2)
# print len(df3)

df = df2.join(df1)
df = df.join(df3)
df = df.dropna(how='all')
df = df.fillna(0)
df.to_csv('./data/user_TimeArray_includex.csv')
# csv with the mac index
df.to_csv('./data/user_TimeArray.csv',index=False,header=False)
# csv without the index

This completes the generation of the person frequency vector matrix.

Matrix fragment:

,canteen_extra,edu_extra,lab_extra,activity_extra,administration_extra,library_extra,canteen,edu,lab,activity,administration,library,male_dor,famale_dor,postgraduate_dor,net,hospital,stadium
483b38cac86d,15,10,0,0,0,3,15.0,9.0,0.0,0.0,0.0,3.0,0,12,0,3,1,1
786256354ae3,9,1,1,0,0,0,9.0,1.0,1.0,0.0,0.0,0.0,5,0,0,0,1,0
908d6c7faa0c,7,13,0,0,0,2,7.0,13.0,0.0,0.0,0.0,2.0,0,6,0,0,0,0
4c49e31c7c69,20,10,3,6,13,19,20.0,10.0,3.0,6.0,13.0,19.0,0,1,22,0,4,3
58449877c1c5,3,7,8,0,2,1,3.0,7.0,8.0,0.0,2.0,1.0,0,0,4,0,0,2
64cc2e771dd3,21,10,6,0,2,6,21.0,10.0,6.0,0.0,2.0,6.0,3,38,0,0,4,1
9cb2b2c7ad65,3,10,2,0,10,0,3.0,10.0,2.0,0.0,10.0,0.0,0,0,0,1,0,0
742344e4ff39,10,3,1,0,0,1,10.0,3.0,1.0,0.0,0.0,1.0,5,0,0,0,0,0
1ccde57a678a,7,4,0,0,0,6,7.0,4.0,0.0,0.0,0.0,6.0,11,0,0,0,1,0
ecdf3ad00c44,15,9,3,0,0,0,15.0,9.0,3.0,0.0,0.0,0.0,0,13,0,1,0,0
f431c39cf8cc,8,4,0,0,0,0,8.0,4.0,0.0,0.0,0.0,0.0,12,0,0,2,1,1
f40e22420be9,18,32,14,0,11,8,17.0,32.0,12.0,0.0,11.0,8.0,5,0,0,0,0,1
68fb7eee63e9,13,6,0,0,1,1,13.0,6.0,0.0,0.0,1.0,1.0,0,15,0,0,2,0
205d47642a4c,17,12,4,2,0,1,17.0,12.0,4.0,2.0,0.0,1.0,27,4,0,0,2,12
b0e235c341d5,13,11,0,1,0,1,13.0,11.0,0.0,1.0,0.0,1.0,0,18,0,0,0,0
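For the clustering preparation mentioned in the preface, the header-free CSV written above loads straight into an ndarray, one row per mac. A small inline sample (numeric columns taken from the fragment above) stands in for the real './data/user_TimeArray.csv' file here:

```python
import numpy as np
from io import StringIO

# Inline stand-in for the header-free frequency csv: first six numeric
# columns of three rows from the matrix fragment
sample = StringIO("15,10,0,0,0,3\n9,1,1,0,0,0\n7,13,0,0,0,2\n")
X = np.loadtxt(sample, delimiter=',')
print(X.shape)  # (3, 6): one row per mac, one column per place attribute
```

With the real file, `np.loadtxt('./data/user_TimeArray.csv', delimiter=',')` would give the full matrix ready for K_means.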

The next post will describe and investigate the ARIMA model.
