Computing pairwise distances with the R package rdist and Python's sklearn

Computing pairwise distances with the R package rdist

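I recently needed to compute pairwise distances. Nested loops do the job, but once the matrix gets large they become far slower (the difference between about 0.5 h and several days).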
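
To make the cost of nested loops concrete, here is a minimal Python sketch that computes the same Euclidean distance matrix twice: once with two explicit loops and once with array broadcasting. It assumes NumPy is installed; the function names `pairwise_loops` and `pairwise_vectorized` are made up for this illustration and do not come from rdist or sklearn.

```python
import numpy as np

def pairwise_loops(X):
    """Euclidean distances between all rows of X, using two nested loops."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = np.sqrt(np.sum((X[i] - X[j]) ** 2))
    return D

def pairwise_vectorized(X):
    """Same distances, computed in one step with broadcasting."""
    diff = X[:, None, :] - X[None, :, :]  # shape (n, n, d)
    return np.sqrt(np.sum(diff ** 2, axis=-1))

rng = np.random.default_rng(0)
X = rng.random((50, 8))  # 50 toy rows with 8 features each
D_loop = pairwise_loops(X)
D_vec = pairwise_vectorized(X)
print(np.allclose(D_loop, D_vec))
```

The broadcasting version trades memory for speed: it builds an (n, n, d) array, so for very large matrices a library routine is the safer choice.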

1. Introduction to the rdist package

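The rdist package computes pairwise distances. It expects numeric data by default; character data needs a user-defined function, which can slow the computation down considerably. The supported distance metrics are: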

"euclidean": sqrt(sum_i((v_i - w_i)^2))
"minkowski": (sum_i(|v_i - w_i|^p))^{1/p}
"manhattan": sum_i(|v_i-w_i|)
"maximum" or "chebyshev": max_i(|v_i-w_i|)
"canberra": sum_i(|v_i-w_i|/(|v_i|+|w_i|))
"angular": arccos(cor(v, w))
"correlation": sqrt((1-cor(v, w))/2)
"absolute_correlation": sqrt((1-|cor(v, w)|^2))
"hamming": sum_i(v_i != w_i)/sum_i(1)
"jaccard": sum_i(v_i != w_i)/sum_i(v_i != 0 or w_i != 0)
Any function that defines a distance between two vectors.
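
The formulas above can be checked by hand on two short vectors. The sketch below is plain Python written directly from the listed definitions; it is not rdist's implementation, and the helper names and the example vectors `v` and `w` are chosen only for illustration.

```python
import math

v = [1, 0, 3, 4]
w = [2, 0, 3, 0]

def euclidean(v, w):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

def minkowski(v, w, p):
    return sum(abs(a - b) ** p for a, b in zip(v, w)) ** (1 / p)

def manhattan(v, w):
    return sum(abs(a - b) for a, b in zip(v, w))

def chebyshev(v, w):
    return max(abs(a - b) for a, b in zip(v, w))

def hamming(v, w):
    # Fraction of positions at which the two vectors differ.
    return sum(a != b for a, b in zip(v, w)) / len(v)

def jaccard(v, w):
    # Differing positions divided by positions where either entry is non-zero.
    differ = sum(a != b for a, b in zip(v, w))
    nonzero = sum(a != 0 or b != 0 for a, b in zip(v, w))
    return differ / nonzero

print(manhattan(v, w), chebyshev(v, w), hamming(v, w))  # 5 4 0.5
```

The vectors differ at the first and last positions only, so the Hamming value is 2/4 and the Jaccard value is 2/3 (the second position is zero in both vectors and is excluded from the denominator).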

2. Numeric data, and a custom function for character data

## Numeric example
require(rdist)
a=matrix(c(1,2,3,4,2,3,3,4,3,3,3,3),3)
1-pdist(a,metric="hamming")
1-pdist(a,metric="jaccard")

## Custom function for character data (proportion of matching entries, skipping positions where either value is NN)
b=t(matrix(c("CC","DD","CC","DD","NN","AA","CC","NN","BB","CC","DD","AA"),2,byrow = T))
b
myfun=function(v,w){
  idx=intersect(which(v!="NN"),which(w!="NN"))
  x1=v[idx]
  x2=w[idx]
  dis=sum(x1==x2)/length(x1)
  return(dis)
}

pdist(t(b),metric = myfun)
     [,1] [,2]
[1,]  1.0  0.5
[2,]  0.5  1.0

b
     [,1] [,2]
[1,] "CC" "CC"
[2,] "DD" "NN"
[3,] "CC" "BB"
[4,] "DD" "CC"
[5,] "NN" "DD"
[6,] "AA" "AA"
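
As a cross-check of the 0.5 off-diagonal value, the same calculation can be written out in plain Python. This is only a re-implementation of `myfun` for illustration; the name `similarity` and its `missing` parameter are made up here.

```python
# The two columns of b from the R example.
col1 = ["CC", "DD", "CC", "DD", "NN", "AA"]
col2 = ["CC", "NN", "BB", "CC", "DD", "AA"]

def similarity(v, w, missing="NN"):
    """Share of matching positions among those where neither value is missing."""
    pairs = [(a, b) for a, b in zip(v, w) if a != missing and b != missing]
    if not pairs:
        return float("nan")  # no comparable positions
    return sum(a == b for a, b in pairs) / len(pairs)

sim_12 = similarity(col1, col2)
sim_11 = similarity(col1, col1)
print(sim_12, sim_11)  # 0.5 1.0
```

Four positions are free of NN in both columns, and two of them match, hence 0.5. Note that the function returns a similarity (1 for identical columns), so `1 - similarity` would be the corresponding distance.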

3. Calling pairwise_distances from Python's sklearn

import pandas as pd
from sklearn.metrics.pairwise import pairwise_distances

##### pairwise distance example in sklearn
data = pd.read_csv("filename.hmp.txt", header=0, sep="\t")  ## hmp genotype data
print(data.shape)
print(data.iloc[1:5,10:14])

data[data=="AA"]=1
data[data=="GG"]=2
data[data=="CC"]=3
data[data=="TT"]=4

data[data=="AG"]=5
data[data=="GA"]=5
data[data=="AC"]=6
data[data=="CA"]=6

data[data=="AT"]=7
data[data=="TA"]=7
data[data=="TC"]=8
data[data=="CT"]=8

data[data=="TG"]=9
data[data=="GT"]=9
data[data=="CG"]=10
data[data=="GC"]=10

data[data=="NN"]=11

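def cal_dis2(n1,n2): ## proportion of matching calls, skipping sites where either sample is NN (code 11)
    keep=(n1 != 11) & (n2 != 11)
    x1=n1[keep]
    x2=n2[keep]
    x3=sum(x1==x2)/len(x1)
    return x3

## NOTE: e1 is never defined in this snippet; it is presumably a DataFrame holding
## only the recoded genotype columns of data (one column per sample).
d=pairwise_distances(e1.values.T, metric=cal_dis2,n_jobs=10)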
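
The long run of assignments in the script above can also be expressed as a lookup table. Here is a small sketch in plain Python that reuses the same numeric codes; `CODES` and `encode_genotype` are hypothetical names, and sorting the two letters is just one way to give both allele orders of a heterozygous call the same code.

```python
# Codes copied from the assignments above; heterozygous calls share one code
# regardless of allele order (for example "AG" and "GA" are both 5).
CODES = {
    "AA": 1, "GG": 2, "CC": 3, "TT": 4,
    "AG": 5, "AC": 6, "AT": 7, "CT": 8, "GT": 9, "CG": 10,
    "NN": 11,
}

def encode_genotype(call):
    """Map a two-letter genotype call to its numeric code."""
    key = "".join(sorted(call))  # "GA" -> "AG", "TC" -> "CT"
    return CODES[key]

sample = ["AA", "GA", "TC", "GC", "NN"]
encoded = [encode_genotype(g) for g in sample]
print(encoded)  # [1, 5, 8, 10, 11]
```

A call that is not in the table raises a `KeyError`, which makes unexpected values in the input file visible instead of leaving them as strings.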

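Problems encountered:

1. The server would not use the number of CPUs requested through n_jobs.
2. Although the custom function is simple, passing it to pairwise_distances is still very slow, whereas the built-in metrics run quickly.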
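
A likely reason for the second problem is that a Python callable is invoked once for every pair of rows, while the built-in metrics run in compiled code. One possible workaround, sketched here under the assumption that the data are already coded as integers with 11 meaning missing (as in the script above), is to compute the whole similarity matrix with a few matrix products. `similarity_matrix` is a hypothetical helper, not part of sklearn, and no timing was measured for it.

```python
import numpy as np

def similarity_matrix(X, missing=11):
    """Pairwise share of matching codes between rows of X, skipping missing values.

    X is an (n_samples, n_sites) integer array.
    """
    valid = (X != missing).astype(np.int64)
    # Number of sites at which both rows are non-missing.
    both_valid = valid @ valid.T
    matches = np.zeros_like(both_valid)
    for code in np.unique(X):
        if code == missing:
            continue
        indicator = (X == code).astype(np.int64)
        # Number of sites at which both rows carry this code.
        matches += indicator @ indicator.T
    with np.errstate(divide="ignore", invalid="ignore"):
        return matches / both_valid  # NaN where two rows share no valid site

# Toy data: two samples, six sites, using the codes AA=1, GG=2, CC=3, TT=4, NN=11.
X = np.array([
    [3, 4, 3, 4, 11, 1],
    [3, 11, 2, 3, 4, 1],
])
S = similarity_matrix(X)
print(S)
```

For the toy data the off-diagonal entry is 0.5: four sites are valid in both samples and two of them match. Rows are samples here, which corresponds to the transposed layout passed to `pairwise_distances` above.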

Noted here for the record only.

References:
https://www.rdocumentation.org/packages/fields/versions/13.3/topics/rdist
https://scikit-learn.org/0.15/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html
