It's been quite a while since my last update, because I've been busy with my graduate school re-examination. The result turned out fine: although I was admitted through the adjustment process, I'm reasonably happy with the school. The road there was just a bit too bumpy, which I suppose counts as growth~ Today let's continue with the taxi data processing from before!
With preprocessing done, it's time for an actual algorithm. After mulling it over for a while I studied clustering, and that's what I'm sharing today~
Starting from the top: since the full dataset is huge, we only keep the weekend records. The code is as follows:
fpath = "C:/Users/PC-2019-3-24/Desktop/文件/数据/taix/000000001.csv"#打开神秘的文件
data = pd.read_csv(fpath,names=["id","jingdu","weidu","date","time"])#给每个列取个名字,经度纬度用拼音代替
#print(data['date'])
data1=[]#设置空数组
data2=[]
data3=[]
data4=[]
n=0
for i in data['date']:
if i == '2008-02-02'or i=='2008-02-03':
print(i)
n=n+1
data1.append(data.loc[n]) #将合格的宝贝加入新数组
data1=pd.DataFrame(data1)
#print(data1['jingdu'])
It looks a bit crude, but at least it does what we want O(∩_∩)O haha~ There is a loc call in there; we'll come back to it in a moment.
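Incidentally, the counter loop above can be replaced with a single boolean filter. Here is a minimal sketch using pandas' isin (assuming the same data DataFrame and the two weekend dates as above):

weekend = ['2008-02-02', '2008-02-03']
# keep only the rows whose date falls on one of the two weekend days
data1 = data[data['date'].isin(weekend)].reset_index(drop=True)

This should give the same rows as the loop, just with a clean 0..N-1 index.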
Next, let's pull out the longitude and latitude for clustering:
X1 = data1['jingdu']  # use the filtered weekend data from above
X2 = data1['weidu']
X = np.vstack((X1, X2)).T  # stack into an (n_samples, 2) array, one row per GPS point
The next step is standardization (my advisor says my English is a bit weak, so from here on the comments are written in English for practice, hahahaha):
from sklearn import preprocessing

X = preprocessing.scale(X)  # standardize each column to zero mean and unit variance
X = pd.DataFrame(X)
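If you want to double-check the scaling, a quick optional sanity check is to look at the column means and standard deviations, which should come out roughly 0 and 1:

print(X.mean(axis=0))  # both columns should be approximately 0
print(X.std(axis=0))   # both columns should be approximately 1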
I suddenly lost my enthusiasm and don't feel like writing up the rest in detail, so please bear with the code as it is:
from sklearn import cluster

# Build an empty list to save the results under different parameter combinations
res = []
for eps in np.arange(0.001, 1, 0.05):
    # Iterate over different values of min_samples
    for min_samples in range(2, 10):
        dbscan = cluster.DBSCAN(eps=eps, min_samples=min_samples)
        dbscan.fit(X)
        # Count the number of clusters under each parameter combination (label -1 marks noise points)
        n_clusters = len([i for i in set(dbscan.labels_) if i != -1])
        # Number of outliers
        outliers = np.sum(np.where(dbscan.labels_ == -1, 1, 0))
        # Count the number of samples in each cluster
        stats = str(pd.Series([i for i in dbscan.labels_ if i != -1]).value_counts().values)
        res.append({
            'eps': eps, 'min_samples': min_samples,
            'n_clusters': n_clusters, 'outliers': outliers, 'stats': stats})
df = pd.DataFrame(res)
# Select a reasonable parameter combination according to conditions
print(df.loc[df.n_clusters == 8, :])
# df.loc[df.n_clusters, :]
The cluster count to filter on is adjustable; pick whatever you prefer. You could also print the whole table, but it's a lot of output.
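For instance, rather than fixing n_clusters == 8, you could rank the parameter grid by how many points end up as noise, or look at a whole range of cluster counts. Just a sketch of two other ways to slice the same df from above:

# combinations that leave the fewest points as noise, best first
print(df.sort_values('outliers').head(10))
# all combinations that produce between 5 and 10 clusters
print(df[(df.n_clusters >= 5) & (df.n_clusters <= 10)])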
Finally, the plot!
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

y_pred = DBSCAN(eps=0.151, min_samples=4, algorithm='ball_tree', metric='euclidean').fit_predict(X)
plt.scatter(X[0], X[1], c=y_pred)  # columns 0 and 1 of the scaled frame are jingdu and weidu
plt.show()
loc selects by label: data.loc[n] in the loop above pulls out the row whose index label is n, and you can grab columns by name the same way.
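A tiny made-up example (not the taxi data) just to show both uses of loc:

demo = pd.DataFrame({'jingdu': [116.3, 116.4], 'weidu': [39.9, 40.0]})
print(demo.loc[0])            # row selection: the row whose index label is 0
print(demo.loc[:, 'jingdu'])  # column selection: the 'jingdu' column, by name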