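I recently needed to compute pairwise distances. Nested loops get the job done, but once the matrix becomes large they are far slower than a dedicated routine (roughly half an hour versus several days).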
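For reference, the nested-loop baseline can be sketched in plain Python as follows. The names `euclidean` and `pairwise_nested` and the toy data are hypothetical and not part of any library. The sketch only illustrates why the cost grows quadratically: n rows need n*(n-1)/2 distance evaluations.

```python
import math

def euclidean(v, w):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

def pairwise_nested(rows, dist=euclidean):
    """Full symmetric distance matrix built with two nested loops."""
    n = len(rows)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only, then mirror
            d = dist(rows[i], rows[j])
            out[i][j] = d
            out[j][i] = d
    return out

rows = [[0, 0], [3, 4], [6, 8]]
print(pairwise_nested(rows))
# [[0.0, 5.0, 10.0], [5.0, 0.0, 5.0], [10.0, 5.0, 0.0]]
```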
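The rdist package can compute pairwise distances. It works on numeric data by default; character data needs a custom function, which may slow the run down considerably. The supported distance metrics are: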
"euclidean": sqrt(sum_i((v_i - w_i)^2))
"minkowski": (sum_i(|v_i - w_i|^p))^{1/p}
"manhattan": sum_i(|v_i-w_i|)
"maximum" or "chebyshev": max_i(|v_i-w_i|)
"canberra": sum_i(|v_i-w_i|/(|v_i|+|w_i|))
"angular": arccos(cor(v, w))
"correlation": sqrt((1-cor(v, w))/2)
"absolute_correlation": sqrt((1-|cor(v, w)|^2))
"hamming": sum_i(v_i != w_i)/sum_i(1)
"jaccard": sum_i(v_i != w_i)/sum_i(v_i != 0 or w_i != 0)
Any function that defines a distance between two vectors.
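To make a few of the formulas above concrete, here is a small plain-Python sketch of some of them. It is not rdist's implementation. The function names are hypothetical, and skipping 0/0 terms in `canberra` is an assumption of this sketch.

```python
def manhattan(v, w):
    return sum(abs(a - b) for a, b in zip(v, w))

def chebyshev(v, w):
    return max(abs(a - b) for a, b in zip(v, w))

def minkowski(v, w, p):
    return sum(abs(a - b) ** p for a, b in zip(v, w)) ** (1 / p)

def canberra(v, w):
    # assumption: terms with |v_i| + |w_i| == 0 are skipped to avoid 0/0
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(v, w) if abs(a) + abs(b) > 0)

def hamming(v, w):
    # share of positions where the two vectors differ
    return sum(a != b for a, b in zip(v, w)) / len(v)

def jaccard(v, w):
    # mismatches divided by positions where at least one value is non-zero
    mismatches = sum(a != b for a, b in zip(v, w))
    nonzero = sum(a != 0 or b != 0 for a, b in zip(v, w))
    return mismatches / nonzero  # undefined if both vectors are all zero

v = [1, 4, 3, 3]
w = [2, 2, 4, 3]
print(manhattan(v, w), chebyshev(v, w), hamming(v, w), jaccard(v, w))
# 4 2 0.75 0.75
```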
## Numeric example
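library(rdist)
# 3 x 4 matrix filled column by column; each row is one observation
a = matrix(c(1,2,3,4,2,3,3,4,3,3,3,3), 3)
# 1 - distance turns the hamming / jaccard distance into a similarity
1 - pdist(a, metric = "hamming")
1 - pdist(a, metric = "jaccard")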
## Custom function for character data (proportion of matching positions, ignoring positions where either value is NN)
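# two samples with six genotype calls each; after t(), each column of b is one sample
b = t(matrix(c("CC","DD","CC","DD","NN","AA","CC","NN","BB","CC","DD","AA"), 2, byrow = TRUE))
b
# proportion of matching genotypes between v and w,
# using only the positions where neither vector is "NN"
myfun = function(v, w) {
  idx = intersect(which(v != "NN"), which(w != "NN"))
  x1 = v[idx]
  x2 = w[idx]
  dis = sum(x1 == x2) / length(x1)
  return(dis)
}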
pdist(t(b),metric = myfun)
     [,1] [,2]
[1,]  1.0  0.5
[2,]  0.5  1.0
b
     [,1] [,2]
[1,] "CC" "CC"
[2,] "DD" "NN"
[3,] "CC" "BB"
[4,] "DD" "CC"
[5,] "NN" "DD"
[6,] "AA" "AA"
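The same similarity can be checked by hand in plain Python. This is only a sketch of the logic in `myfun`: the name `match_ratio` is hypothetical, and returning NaN when no position is usable is an assumption made here.

```python
def match_ratio(v, w, missing="NN"):
    """Share of matching calls among positions where neither value is missing."""
    pairs = [(a, b) for a, b in zip(v, w) if a != missing and b != missing]
    if not pairs:
        return float("nan")  # assumption: no comparable position
    return sum(a == b for a, b in pairs) / len(pairs)

s1 = ["CC", "DD", "CC", "DD", "NN", "AA"]
s2 = ["CC", "NN", "BB", "CC", "DD", "AA"]
print(match_ratio(s1, s2))  # 0.5
print(match_ratio(s1, s1))  # 1.0
```

Positions 1, 3, 4 and 6 are usable for the two samples, and two of them match, which gives the 0.5 seen in the R output.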
## pairwise_distances in sklearn (Python)
import pandas as pd
from sklearn.metrics.pairwise import pairwise_distances
##### pairwise distance example in sklearn
data = pd.read_csv("filename.hmp.txt", header=0, sep="\t")  ## hmp genotype data
print(data.shape)
print(data.iloc[1:5, 10:14])
## encode genotypes as integers: both orders of a heterozygote share one code, NN (missing) is 11
codes = {"AA": 1, "GG": 2, "CC": 3, "TT": 4,
         "AG": 5, "GA": 5, "AC": 6, "CA": 6,
         "AT": 7, "TA": 7, "TC": 8, "CT": 8,
         "TG": 9, "GT": 9, "CG": 10, "GC": 10,
         "NN": 11}
data = data.replace(codes)
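def cal_dis2(n1, n2):  ## n1, n2: two rows of the array passed as X
    ## keep only the sites where neither sample is missing (code 11)
    mask = (n1 != 11) & (n2 != 11)
    ## proportion of matching genotypes (nan if no site is usable)
    return (n1[mask] == n2[mask]).mean()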
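## e1 is not defined in these notes; it is assumed here to be the genotype-only part
## of data (HapMap files carry 11 metadata columns before the sample columns)
e1 = data.iloc[:, 11:]
## after the transpose each row is one sample, so d[i, j] compares sample i with sample j
d = pairwise_distances(e1.values.T, metric=cal_dis2, n_jobs=10)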
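Problems I ran into:
1. The server did not use the number of CPUs I specified.
2. Even though the custom function is simple, pairwise_distances is still very slow when it is passed in, whereas the built-in metrics are fast.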
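One possible way around the second problem is sketched below. A callable metric is invoked from Python once per pair of rows, which is a likely cause of the slowdown. The matching proportion can instead be built from a handful of matrix products in NumPy. This is an untested-at-scale sketch, not what these notes actually ran: the name `match_ratio_matrix` is hypothetical, and it assumes the genotypes are already coded as integers 1..10 with 11 for missing, as above.

```python
import numpy as np

MISSING = 11  # code used for "NN" in these notes

def match_ratio_matrix(G, n_codes=10, missing=MISSING):
    """G: (n_samples, n_sites) integer array of codes 1..n_codes or `missing`.

    Returns an (n_samples, n_samples) matrix with the share of matching
    genotypes over the sites where both samples are non-missing.
    """
    G = np.asarray(G)
    valid = (G != missing).astype(np.float64)
    shared = valid @ valid.T              # sites usable for each pair
    matches = np.zeros_like(shared)
    for k in range(1, n_codes + 1):
        is_k = (G == k).astype(np.float64)
        matches += is_k @ is_k.T          # sites where both samples carry code k
    with np.errstate(invalid="ignore", divide="ignore"):
        return matches / shared           # nan where a pair shares no usable site

G = np.array([[3, 1, 3, 1, 11, 2],
              [3, 11, 4, 3, 1, 2]])
print(match_ratio_matrix(G))
```

Each indicator matrix has the same shape as `G` in float64, so for very large inputs the sites may need to be processed in chunks and the `matches` and `shared` totals accumulated across chunks.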
Just for the record.
References:
https://www.rdocumentation.org/packages/fields/versions/13.3/topics/rdist
https://scikit-learn.org/0.15/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html