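Having read quite a few books, I noticed that different algorithms often use different distance functions. However, each source file usually hard-codes one particular distance function, and there is no single distance package that can be called directly, which is inconvenient. So I wrote a package that puts several common distance functions together: Euclidean distance, Manhattan distance, Chebyshev distance, and cosine similarity. Mahalanobis distance is left out for now; it is not used all that often in recommendation or medical data mining, and scipy already offers it in scipy.spatial.distance. The code in this post is mainly based on the book 《写给程序员的数据挖掘实践指南》 (A Programmer's Guide to Data Mining).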
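Euclidean Distance:
This one is easy to understand: it is the length of the straight line between two points. Take two n-dimensional vectors rating1 and rating2 (the book computes distances in order to make recommendations from ratings, hence the variable names), each stored as a dictionary that maps an item to its rating. Only items that appear in both dictionaries are compared. The Euclidean distance can then be computed as follows (sqrt has already been imported from math):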
def euclidean(rating1, rating2):
    distance = 0
    for key in rating1:
        if key in rating2:
            temp = (rating1[key] - rating2[key]) ** 2
            distance += temp
    distance = sqrt(distance)
    return distance
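Manhattan Distance:
Also known as city block distance. As the name suggests, it is like walking through the blocks of Manhattan: you can only go straight, turn a corner, and go straight again, rather than follow the straight line between the two points, because there are buildings in the way.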
def manhattan(rating1, rating2):
    distance = 0
    for key in rating1:
        if key in rating2:
            distance += abs(rating1[key] - rating2[key])  # sum of the abs value of (a - b)
    return distance
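To make the difference between the two distances concrete, here is a small self-contained check on a 3-4-5 right triangle. It is only a sketch: the keys "x" and "y" and the two points are made up for illustration, and the functions are compact restatements of the definitions above so the snippet runs by itself.

```python
from math import sqrt

def euclidean(rating1, rating2):
    # Straight-line distance over the keys both dictionaries share.
    return sqrt(sum((rating1[k] - rating2[k]) ** 2
                    for k in rating1 if k in rating2))

def manhattan(rating1, rating2):
    # Sum of absolute differences over the keys both dictionaries share.
    return sum(abs(rating1[k] - rating2[k])
               for k in rating1 if k in rating2)

# Illustrative points, not taken from the book's data.
origin = {"x": 0, "y": 0}
corner = {"x": 3, "y": 4}

straight = euclidean(origin, corner)  # hypotenuse of the 3-4-5 triangle
blocks = manhattan(origin, corner)    # 3 blocks one way plus 4 the other
print(straight, blocks)
```

The Manhattan distance is never smaller than the Euclidean one, since walking along the grid cannot beat the straight line.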
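Chebyshev Distance:
A chessboard illustrates this distance. The king moves one square per move in any direction, straight or diagonal (in image-processing terms, every position in its 8-neighborhood is one step away). The Chebyshev distance is the minimum number of moves the king needs to get from one square to another. For example, to go from (1,1) to (4,5), the king can move diagonally to (2,2), then (3,3), then (4,4), and finally one step straight to (4,5). That is 4 moves in total, so the Chebyshev distance between the two points is 4, which equals max(|4-1|, |5-1|). The code: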
def chebyshev(rating1, rating2):
    distance = 0
    for key in rating1:
        if key in rating2:
            diff = abs(rating2[key] - rating1[key])  # |x_i - y_i| for this key
            if diff > distance:
                distance = diff  # keep the largest difference seen so far
    return distance
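The chessboard example can be checked directly. The snippet below is a minimal sketch: the keys "file" and "rank" are names chosen here for the two board coordinates, and the function is a compact version of chebyshev that uses the built-in max.

```python
def chebyshev(rating1, rating2):
    # Largest absolute difference over the keys both dictionaries share.
    distance = 0
    for key in rating1:
        if key in rating2:
            distance = max(distance, abs(rating1[key] - rating2[key]))
    return distance

# The two squares from the chessboard example; key names are illustrative.
start = {"file": 1, "rank": 1}
end = {"file": 4, "rank": 5}

steps = chebyshev(start, end)  # max(|4 - 1|, |5 - 1|)
print(steps)
```

The result matches the step count worked out by hand: three diagonal steps plus one straight step.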
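All three distances above are special cases of the Minkowski distance of order p, D(x, y) = (sum_i |x_i - y_i|^p)^(1/p): Manhattan distance is p = 1, Euclidean distance is p = 2, and Chebyshev distance is the limit as p goes to infinity.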
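The relationship between the three distances can be sketched as one function. This is an illustration under assumptions rather than part of the original package: the name minkowski, the parameter p for the order, and the convention of returning 0 when no keys are shared are all choices made here.

```python
def minkowski(rating1, rating2, p):
    # Absolute differences over the keys both dictionaries share.
    diffs = [abs(rating1[k] - rating2[k]) for k in rating1 if k in rating2]
    if not diffs:
        return 0  # assumption: no shared keys means distance 0
    if p == float("inf"):
        return max(diffs)  # limiting case: Chebyshev distance
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Illustrative points; the keys "x" and "y" are made up.
a = {"x": 1, "y": 1}
b = {"x": 4, "y": 5}

d1 = minkowski(a, b, 1)               # same as Manhattan: 3 + 4
d2 = minkowski(a, b, 2)               # same as Euclidean: sqrt(9 + 16)
dinf = minkowski(a, b, float("inf"))  # same as Chebyshev: max(3, 4)
print(d1, d2, dinf)
```

As the order grows, the largest single difference dominates the sum, which is why the limit picks out the maximum.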
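Cosine Similarity:
In short, this is the cosine of the angle between two vectors. Because the value depends only on the angle and is bounded to [-1, 1], it works well when the vectors are very sparse, since it avoids the effect that differing sparsity has on the absolute distance. It is therefore widely used in recommendation and text analysis. Note that it is a similarity rather than a distance: a larger value means the two vectors are more alike.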
def cosine(rating1, rating2):
    dot = 0
    dis1 = 0
    dis2 = 0
    for key in rating1:
        if key in rating2:
            dot += rating1[key] * rating2[key]
            dis1 += rating1[key] ** 2
            dis2 += rating2[key] ** 2
    dis = sqrt(dis1) * sqrt(dis2)  # in MATLAB terms: dot(x, y) / sqrt(dot(x, x)) / sqrt(dot(y, y))
    if dis == 0:
        return 0  # no shared keys or an all-zero vector: the cosine is undefined, 0 is returned by convention
    return dot / dis
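The point that cosine similarity looks at direction rather than magnitude can be seen in a small experiment. The snippet is a sketch with made-up vectors (the keys "a", "b", "c" and the negative values are purely illustrative, since real ratings are not negative), and its cosine function restates the one above in compact form, returning 0.0 when a norm is zero as an assumed convention.

```python
from math import sqrt

def cosine(rating1, rating2):
    # Cosine of the angle between the vectors, over shared keys only.
    shared = [k for k in rating1 if k in rating2]
    dot = sum(rating1[k] * rating2[k] for k in shared)
    norm1 = sqrt(sum(rating1[k] ** 2 for k in shared))
    norm2 = sqrt(sum(rating2[k] ** 2 for k in shared))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention for the undefined case
    return dot / (norm1 * norm2)

low = {"a": 1.0, "b": 2.0, "c": 3.0}
high = {"a": 2.0, "b": 4.0, "c": 6.0}         # same direction, twice as long
opposite = {"a": -1.0, "b": -2.0, "c": -3.0}  # exactly the reverse direction

same_direction = cosine(low, high)
reversed_direction = cosine(low, opposite)
print(same_direction, reversed_direction)
```

Up to floating-point rounding, the first value is 1 even though the two vectors differ in length, and the second is -1; a Euclidean or Manhattan distance between low and high would be nonzero.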
The complete code:
# __author__ = 'lzmutd'
# __date__ = May 17th, 2016
from math import sqrt
def manhattan(rating1, rating2):
    distance = 0
    for key in rating1:
        if key in rating2:
            distance += abs(rating1[key] - rating2[key])  # sum of the abs value of (a - b)
    return distance
def euclidean(rating1, rating2):
    distance = 0
    for key in rating1:
        if key in rating2:
            temp = (rating1[key] - rating2[key]) ** 2
            distance += temp
    distance = sqrt(distance)
    return distance
def chebyshev(rating1, rating2):
    distance = 0  # must be initialized before the comparison below
    for key in rating1:
        if key in rating2:
            diff = abs(rating2[key] - rating1[key])  # |x_i - y_i| for this key
            if diff > distance:
                distance = diff  # keep the largest difference seen so far
    return distance
def cosine(rating1, rating2):
    dot = 0
    dis1 = 0
    dis2 = 0
    for key in rating1:
        if key in rating2:
            dot += rating1[key] * rating2[key]
            dis1 += rating1[key] ** 2
            dis2 += rating2[key] ** 2
    dis = sqrt(dis1) * sqrt(dis2)  # in MATLAB terms: dot(x, y) / sqrt(dot(x, x)) / sqrt(dot(y, y))
    if dis == 0:
        return 0  # no shared keys or an all-zero vector: the cosine is undefined, 0 is returned by convention
    return dot / dis
# the following is a small demo for testing
users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0,
                      "Norah Jones": 4.5, "Phoenix": 5.0,
                      "Slightly Stoopid": 1.5,
                      "The Strokes": 2.5, "Vampire Weekend": 2.0},
         "Bill": {"Blues Traveler": 2.0, "Broken Bells": 3.5,
                  "Deadmau5": 4.0, "Phoenix": 2.0,
                  "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0,
                  "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5,
                  "Slightly Stoopid": 1.0},
         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0,
                 "Deadmau5": 4.5, "Phoenix": 3.0,
                 "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                 "Vampire Weekend": 2.0},
         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,
                    "Norah Jones": 4.0, "The Strokes": 4.0,
                    "Vampire Weekend": 1.0},
         "Jordyn": {"Broken Bells": 4.5, "Deadmau5": 4.0,
                    "Norah Jones": 5.0, "Phoenix": 5.0,
                    "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                    "Vampire Weekend": 4.0},
         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0,
                 "Norah Jones": 3.0, "Phoenix": 5.0,
                 "Slightly Stoopid": 4.0, "The Strokes": 5.0},
         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0,
                      "Phoenix": 4.0, "Slightly Stoopid": 2.5,
                      "The Strokes": 3.0}
         }
d = euclidean(users['Hailey'], users['Veronica'])
print(d)  # sqrt(2), about 1.414: the two share only "Norah Jones" and "The Strokes"
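Since the distances are meant for rating-based recommendation, one possible next step is to rank the other users by their distance to a given user. The snippet below is only a sketch of that idea: nearest_neighbors is a hypothetical helper that is not part of the package above, the data is a three-user excerpt of the demo dictionary, and euclidean is restated compactly so the example runs by itself.

```python
from math import sqrt

def euclidean(rating1, rating2):
    # Straight-line distance over the items both users have rated.
    return sqrt(sum((rating1[k] - rating2[k]) ** 2
                    for k in rating1 if k in rating2))

# Three users copied from the demo data.
users = {
    "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,
               "Norah Jones": 4.0, "The Strokes": 4.0,
               "Vampire Weekend": 1.0},
    "Jordyn": {"Broken Bells": 4.5, "Deadmau5": 4.0,
               "Norah Jones": 5.0, "Phoenix": 5.0,
               "Slightly Stoopid": 4.5, "The Strokes": 4.0,
               "Vampire Weekend": 4.0},
    "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0,
                 "Phoenix": 4.0, "Slightly Stoopid": 2.5,
                 "The Strokes": 3.0},
}

def nearest_neighbors(name, users, metric):
    # Hypothetical helper: (distance, other_user) pairs, closest first.
    pairs = [(metric(users[name], users[other]), other)
             for other in users if other != name]
    return sorted(pairs)

ranked = nearest_neighbors("Hailey", users, euclidean)
print(ranked)
```

Because the metric is passed in as an argument, manhattan or chebyshev could be swapped in without changing the helper. For cosine the order would have to be reversed, since a larger similarity means a closer match.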