Implementing a Distance Package in Python

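Having read quite a few books, I noticed that different algorithms tend to use different distance functions. In most cases, though, a given file uses only one specific distance function, and there is no single distance package that can be called directly, which is not very convenient. So I wrote a package that puts several commonly used distance functions together: Euclidean distance, Manhattan distance, Chebyshev distance, and cosine similarity. I am leaving Mahalanobis distance aside for now: it is not used all that often in recommendation or medical data mining, and SciPy already provides it in the scipy.spatial.distance module. The code in this post is mainly based on the book "A Programmer's Guide to Data Mining" (《写给程序员的数据挖掘实践指南》).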


Euclidean Distance:

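This one is fairly easy to understand: it is the length of the straight line segment joining two points. For two n-dimensional vectors rating1 and rating2, stored here as dictionaries that map each item to its rating (the book computes distances mainly to make recommendations from ratings, hence the variable names), the Euclidean distance can be computed as follows (assuming sqrt has already been imported from the math module):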

def euclidean(rating1, rating2):
    distance = 0
    for key in rating1:
        if key in rating2:
            temp = (rating1[key] - rating2[key]) ** 2
            distance += temp
    distance = sqrt(distance)
    return distance

Manhattan Distance:

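Also known as city block distance. As the name suggests, it is like walking through the blocks of Manhattan: you can only go straight, turn a corner, and then go straight again, rather than taking the straight line between the two points. After all, there are buildings in the way, right...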

def manhattan(rating1, rating2):
    distance = 0
    for key in rating1:
        if key in rating2:
            distance += abs(rating1[key] - rating2[key]) # sum of the abs value of (a - b)
    return distance

Chebyshev Distance:

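A chessboard is a good way to explain this distance. As everyone knows, the queen in chess is very free and can move any number of squares in any direction. The Chebyshev distance is the answer to the question: what is the minimum number of moves needed to get from one square to another if each move covers only one square, in any direction, straight or diagonal? This is exactly how the king moves, and in image processing terms it means that any position in the 8-neighborhood can be reached in one step. For example, to go from (1,1) to (4,5), the piece can move diagonally to (2,2), then (3,3), then (4,4), and finally one step straight to (4,5). That is 4 moves in total, so the Chebyshev distance between these two points is 4, which equals max(|4-1|, |5-1|). Here is the code: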

def chebyshev(rating1, rating2):
    distance = 0
    for key in rating1:
        if key in rating2:
            diff = abs(rating2[key] - rating1[key]) # max(|x1-x2|, |y1-y2|, ... ,|z1-z2|)
            if diff > distance:
                distance = diff     # keep the largest difference seen so far
    return distance


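All three distances above are special cases of the Minkowski distance of order p: Manhattan distance is the case p = 1, Euclidean distance is the case p = 2, and Chebyshev distance is the limit as p goes to infinity.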
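To make this relationship concrete, here is a minimal sketch of a general Minkowski distance over the same dictionary representation. The function name minkowski and its parameter p are my own choices for illustration and are not part of the package; the infinite order is handled as a special case by taking the largest difference. The points (1,1) and (4,5) from the chessboard example serve as test data.

```python
def minkowski(rating1, rating2, p):
    """Minkowski distance of order p over the keys both dicts share.

    Illustrative sketch only: p = 1 gives Manhattan, p = 2 gives Euclidean,
    and p = float("inf") gives Chebyshev.
    """
    diffs = [abs(rating1[key] - rating2[key]) for key in rating1 if key in rating2]
    if not diffs:
        return 0  # assumption: no shared keys is treated as distance 0
    if p == float("inf"):
        return max(diffs)  # limit of the p-norm as p grows without bound
    return sum(d ** p for d in diffs) ** (1.0 / p)


a = {"x": 1, "y": 1}
b = {"x": 4, "y": 5}
print(minkowski(a, b, 1))             # Manhattan: 3 + 4
print(minkowski(a, b, 2))             # Euclidean: sqrt(9 + 16)
print(minkowski(a, b, float("inf")))  # Chebyshev: max(3, 4)
```

Passing the order as a parameter keeps one code path for every finite p; only the infinite case needs its own branch.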


Cosine Similarity:

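In short, it is the cosine of the angle between two vectors. Its value always lies in [-1, 1] and depends only on the direction of the vectors, not on their length, so it works well when the vectors are very sparse and avoids the effect that differing sparsity has on absolute distances. For this reason it is very widely used in recommendation and text analysis.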

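def cosine(rating1, rating2):
    dot = 0     # dot product over the shared keys
    dis1 = 0    # squared length of rating1 over the shared keys
    dis2 = 0    # squared length of rating2 over the shared keys
    for key in rating1:
        if key in rating2:
            dot += rating1[key] * rating2[key]
            dis1 += rating1[key] ** 2
            dis2 += rating2[key] ** 2
    dis = sqrt(dis1) * sqrt(dis2)   #  in matlab: sum(x .* y) / sqrt(sum(x .* x)) / sqrt(sum(y .* y))
    if dis == 0:
        return 0    # no shared keys or an all-zero vector; returning 0 is a convention chosen here
    return dot / dis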
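The following small sketch illustrates why cosine similarity ignores vector length. It is self-contained, so it redefines the similarity under the hypothetical name cosine_similarity, and the three vectors are made-up test data: b points the same way as a but is twice as long, and c points the opposite way.

```python
from math import sqrt


def cosine_similarity(rating1, rating2):
    """Cosine of the angle between two dict vectors, over their shared keys."""
    shared = [key for key in rating1 if key in rating2]
    dot = sum(rating1[k] * rating2[k] for k in shared)
    norm1 = sqrt(sum(rating1[k] ** 2 for k in shared))
    norm2 = sqrt(sum(rating2[k] ** 2 for k in shared))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: undefined cases are reported as 0
    return dot / (norm1 * norm2)


a = {"x": 1.0, "y": 2.0, "z": 3.0}
b = {k: 2 * v for k, v in a.items()}   # same direction, twice the length
c = {k: -v for k, v in a.items()}      # opposite direction

print(cosine_similarity(a, b))  # very close to 1: scaling does not change the angle
print(cosine_similarity(a, c))  # very close to -1: the vectors point in opposite directions
```

The Euclidean distance between a and b, by contrast, is clearly nonzero even though the two users have the same taste profile up to scale.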


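As you can see, the code in this post handles elements that do not appear in both vectors by discarding them entirely. In a recommender system, this is the case where one person has rated an item and the other has not. The complete code for the distance package follows. At the end I have attached a small (really, tiny) dictionary dataset for testing and a demo call; when you want to call the distance functions from elsewhere, just comment out the data and the demo call.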
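To see what discarding means in practice, here is a short sketch using the ratings of Hailey and Veronica from the test data in this post. The variable names shared and manhattan_distance are only for illustration.

```python
hailey = {"Broken Bells": 4.0, "Deadmau5": 1.0,
          "Norah Jones": 4.0, "The Strokes": 4.0,
          "Vampire Weekend": 1.0}
veronica = {"Blues Traveler": 3.0, "Norah Jones": 5.0,
            "Phoenix": 4.0, "Slightly Stoopid": 2.5,
            "The Strokes": 3.0}

# Only items rated by both users take part in the distance.
shared = sorted(key for key in hailey if key in veronica)
print(shared)  # ['Norah Jones', 'The Strokes']

# Manhattan distance over the shared items: |4.0 - 5.0| + |4.0 - 3.0|
manhattan_distance = sum(abs(hailey[k] - veronica[k]) for k in shared)
print(manhattan_distance)  # 2.0
```

Only two of the eight distinct items are compared here, so a small distance may rest on very little evidence. It can be worth checking the number of shared items alongside the distance itself.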


# __author__ = 'lzmutd'
# __date__ = May 17th, 2016

from math import sqrt
def manhattan(rating1, rating2):
    distance = 0
    for key in rating1:
        if key in rating2:
            distance += abs(rating1[key] - rating2[key]) # sum of the abs value of (a - b)
    return distance

def euclidean(rating1, rating2):
    distance = 0
    for key in rating1:
        if key in rating2:
            temp = (rating1[key] - rating2[key]) ** 2
            distance += temp
    distance = sqrt(distance)   # take the square root once, after summing all squared differences
    return distance

def chebyshev(rating1, rating2):
    distance = 0    # must be initialized before the comparison below
    for key in rating1:
        if key in rating2:
            diff = abs(rating2[key] - rating1[key]) # max(|x1-x2|, |y1-y2|, ... ,|z1-z2|)
            if diff > distance:
                distance = diff     # keep the largest difference seen so far
    return distance

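def cosine(rating1, rating2):
    dot = 0     # dot product over the shared keys
    dis1 = 0    # squared length of rating1 over the shared keys
    dis2 = 0    # squared length of rating2 over the shared keys
    for key in rating1:
        if key in rating2:
            dot += rating1[key] * rating2[key]
            dis1 += rating1[key] ** 2
            dis2 += rating2[key] ** 2
    dis = sqrt(dis1) * sqrt(dis2)   #  in matlab: sum(x .* y) / sqrt(sum(x .* x)) / sqrt(sum(y .* y))
    if dis == 0:
        return 0    # no shared keys or an all-zero vector; returning 0 is a convention chosen here
    return dot / dis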

# following is a demo for test

users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0,
                      "Norah Jones": 4.5, "Phoenix": 5.0,
                      "Slightly Stoopid": 1.5,
                      "The Strokes": 2.5, "Vampire Weekend": 2.0},

         "Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5,
                 "Deadmau5": 4.0, "Phoenix": 2.0,
                 "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},

         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0,
                  "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5,
                  "Slightly Stoopid": 1.0},

         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0,
                 "Deadmau5": 4.5, "Phoenix": 3.0,
                 "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                 "Vampire Weekend": 2.0},

         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,
                    "Norah Jones": 4.0, "The Strokes": 4.0,
                    "Vampire Weekend": 1.0},

         "Jordyn":  {"Broken Bells": 4.5, "Deadmau5": 4.0,
                     "Norah Jones": 5.0, "Phoenix": 5.0,
                     "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                     "Vampire Weekend": 4.0},

         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0,
                 "Norah Jones": 3.0, "Phoenix": 5.0,
                 "Slightly Stoopid": 4.0, "The Strokes": 5.0},

         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0,
                      "Phoenix": 4.0, "Slightly Stoopid": 2.5,
                      "The Strokes": 3.0}
        }

d = euclidean(users['Hailey'], users['Veronica'])
print(d)
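Since the point of these distances is recommendation, one natural next step is to rank the other users by their distance to a given user. The sketch below is only an illustration under my own assumptions: the helper name nearest_neighbors is hypothetical, and the data is a three-user subset of the dictionary above so that the block runs on its own.

```python
def manhattan(rating1, rating2):
    """Sum of absolute rating differences over the items both users rated."""
    return sum(abs(rating1[key] - rating2[key])
               for key in rating1 if key in rating2)


def nearest_neighbors(name, users, distance_fn):
    """Return (distance, other_user) pairs sorted from closest to farthest."""
    distances = [(distance_fn(users[name], users[other]), other)
                 for other in users if other != name]
    return sorted(distances)


subset = {
    "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,
               "Norah Jones": 4.0, "The Strokes": 4.0,
               "Vampire Weekend": 1.0},
    "Jordyn": {"Broken Bells": 4.5, "Deadmau5": 4.0,
               "Norah Jones": 5.0, "Phoenix": 5.0,
               "Slightly Stoopid": 4.5, "The Strokes": 4.0,
               "Vampire Weekend": 4.0},
    "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0,
                 "Phoenix": 4.0, "Slightly Stoopid": 2.5,
                 "The Strokes": 3.0},
}

print(nearest_neighbors("Hailey", subset, manhattan))
# [(2.0, 'Veronica'), (7.5, 'Jordyn')]
```

Passing the distance function as an argument means any of the functions in the package can be swapped in without changing the ranking code. Keep in mind that cosine is a similarity, so with it larger values mean closer users and the sort order would need to be reversed.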




