参考:R document
Seurat识别细胞类群的原理FindNeighbors和FindClusters
Jaccard系数_百度百科
Annoy解析
https://blog.csdn.net/qq_37696858/article/details/88143156
(Shared) Nearest-neighbor graph construction
- Description
Computes the k.param nearest neighbors for a given dataset. Can also optionally (via compute.SNN), construct a shared nearest neighbor graph by calculating the neighborhood overlap (Jaccard index) between every cell and its k.param nearest neighbors.
首先计算每个细胞的KNN,也就是计算每个细胞之间的相互距离,依据细胞之间邻居的overlap来构建snn graph。
计算给定数据集的k.param最近邻。也可以选择(通过compute.SNN),通过计算每个细胞最近邻之间的邻域重叠(Jaccard索引)和其邻近的k.param来构造SNN。 - Usage
FindNeighbors(object, ...)
Default S3 method:
FindNeighbors(
object,
query = NULL,
distance.matrix = FALSE,
k.param = 20,
return.neighbor = FALSE,
compute.SNN = !return.neighbor,
prune.SNN = 1/15,
nn.method = "annoy",
n.trees = 50,
annoy.metric = "euclidean",
nn.eps = 0,
verbose = TRUE,
force.recalc = FALSE,
l2.norm = FALSE,
cache.index = FALSE,
index = NULL,
...
)
S3 method for class 'Assay'
FindNeighbors(
object,
features = NULL,
k.param = 20,
return.neighbor = FALSE,
compute.SNN = !return.neighbor,
prune.SNN = 1/15,
nn.method = "annoy",
n.trees = 50,
annoy.metric = "euclidean",
nn.eps = 0,
verbose = TRUE,
force.recalc = FALSE,
l2.norm = FALSE,
cache.index = FALSE,
...
)
S3 method for class 'dist'
FindNeighbors(
object,
k.param = 20,
return.neighbor = FALSE,
compute.SNN = !return.neighbor,
prune.SNN = 1/15,
nn.method = "annoy",
n.trees = 50,
annoy.metric = "euclidean",
nn.eps = 0,
verbose = TRUE,
force.recalc = FALSE,
l2.norm = FALSE,
cache.index = FALSE,
...
)
S3 method for class 'Seurat'
FindNeighbors(
object,
reduction = "pca",
dims = 1:10,
assay = NULL,
features = NULL,
k.param = 20,
return.neighbor = FALSE,
compute.SNN = !return.neighbor,
prune.SNN = 1/15,
nn.method = "annoy",
n.trees = 50,
annoy.metric = "euclidean",
nn.eps = 0,
verbose = TRUE,
force.recalc = FALSE,
do.plot = FALSE,
graph.name = NULL,
l2.norm = FALSE,
cache.index = FALSE,
...
)
- Arguments
object
An object
...
Arguments passed to other methods
query
Matrix of data to query against object. If missing, defaults to object.
询问
根据对象查询的数据矩阵。如果缺少,默认为对象。
distance.matrix
Boolean value of whether the provided matrix is a distance matrix; note, for objects of class dist, this parameter will be set automatically
距离矩阵
提供的矩阵是否为距离矩阵的布尔值;注意,对于dist类的对象,此参数设置为 automatically
k.param
Defines k for the k-nearest neighbor algorithm
这个参数用于设置 KNN 算法中最近邻的个数,默认20
return.neighbor
Return result as Neighbor object. Not used with distance matrix input.
返回的邻居
将结果作为邻居对象返回。不适用于距离矩阵输入。
compute.SNN
also compute the shared nearest neighbor graph
计算共享邻居的数量,一般不设置。
prune.SNN
Sets the cutoff for acceptable Jaccard index when computing the neighborhood overlap for the SNN construction. Any edges with values less than or equal to this will be set to 0 and removed from the SNN graph. Essentially sets the stringency of pruning (0 — no pruning, 1 — prune everything).
计算SNN构造的领域重叠时,设置可接受的Jaccard指数的截止值。任何小于或者等于此值的任何边将被设置为0并从SNN图中删除。本质上设置修剪的严格性(0-不修剪,1-修剪所有内容)。
nn.method
Method for nearest neighbor finding. Options include: rann, annoy
这个参数提供了如何判断邻居的方法,提供的可选是rann, annoy
n.trees
More trees gives higher precision when using annoy approximate nearest neighbor search
树的数量
当使用annoy的近似最近邻搜索时,树越多,给出的精度越高。
annoy.metric
Distance metric for annoy. Options include: euclidean, cosine, manhattan, and hamming
annoy的距离,选项包括:欧几里德、余弦、曼哈顿和汉明
nn.eps
Error bound when performing nearest neighbor seach using RANN; default of 0.0 implies exact nearest neighbor search
使用RANN执行最近邻搜索时的错误界限;默认值0.0表示精确最近邻搜索
verbose
Whether or not to print output to the console
是否将输出打印到控制台
force.recalc
Force recalculation of (S)NN.
SNN强制重新计算,一般不设置
l2.norm
Take L2Norm of the data
L2正则化
cache.index
Include cached index in returned Neighbor object (only relevant if return.neighbor = TRUE)
在返回的邻居对象中包含缓存的索引(仅当return.neighbor = TRUE时相关)
index
Precomputed index. Useful if querying new data against existing index to avoid recomputing.
索引
预计算索引。如果根据现有索引查询新数据以避免重新计算,则非常有用。
features
Features to use as input for building the (S)NN; used only when dims is NULL
特征
用作构建SNN神经网络输入的特征;仅当dims为空时使用
reduction
Reduction to use as input for building the (S)NN
输入的降维方法,用于构建SNN神经网络
dims
Dimensions of reduction to use as input
输入的降维的维度
assay
Assay to use in construction of (S)NN; used only when dims is NULL
用于构造SNN神经网络的分析;仅当dims为空时使用
do.plot
Plot SNN graph on tSNE coordinates
在tSNE坐标上绘制SNN图
graph.name
Optional naming parameter for stored (S)NN graph (or Neighbor object, if return.neighbor = TRUE). Default is assay.name_(s)nn. To store both the neighbor graph and the shared nearest neighbor (SNN) graph, you must supply a vector containing two names to the graph.name parameter. The first element in the vector will be used to store the nearest neighbor (NN) graph, and the second element used to store the SNN graph. If only one name is supplied, only the NN graph is stored.
可选命名存储的神经网络图的参数(或邻居对象,如果返回的邻居=真)。默认为assay.name_(s)nn。要存储邻居图和共享最近邻居(SNN)图,必须向graph.name参数提供一个包含两个名称的向量。向量中的第一个元素将用于存储最近邻图,第二个元素用于存储SNN图。如果只提供一个名称,则只存储神经网络图。
- Value
This function can either return a Neighbor object with the KNN information or a list of Graph objects with the KNN and SNN depending on the settings of return.neighbor and compute.SNN. When running on a Seurat object, this returns the Seurat object with the Graphs or Neighbor objects stored in their respective slots. Names of the Graph or Neighbor object can be found with Graphs or Neighbors.
值
根据return.neighbor和compute.SNN的设置,该函数可以返回带有KNN信息的neighbor对象,也可以返回带有KNN和SNN的Graph对象列表。当在Seurat 对象上运行时,这将返回带有存储在各自slots中的Graph或Neighbor对象的Seurat对象。图形或邻居对象的名称可以在图形或邻居中找到。 - Examples
data("pbmc_small")
pbmc_small
Compute an SNN on the gene expression level
pbmc_small <- FindNeighbors(pbmc_small, features = VariableFeatures(object = pbmc_small))
More commonly, we build the SNN on a dimensionally reduced form of the data
such as the first 10 principle components.
pbmc_small <- FindNeighbors(pbmc_small, reduction = "pca", dims = 1:10)
1.KNN是什么?
所谓K最近邻,就是K个最近的邻居的意思,说的是每个样本都可以用它最接近的K个邻近值来代表。近邻算法就是将数据集合中每一个记录进行分类的方法。
2.jaccard index是什么?
雅卡尔指数 (Jaccard index)或者称为交并比、雅卡尔相似系数
可以用于比较样本集的相似性与多样性。其定义为两个集合交集大小与并集大小之间的比例:
定义
给定两个集合A,B,Jaccard 系数定义为A与B交集的大小与A与B并集的大小的比值,定义如下:
当A和B是空集时,定义jaccard index为1。
雅卡尔距离(Jaccard distance)
则用于量度样本集之间的不相似度,其定义为1减去雅卡尔系数。Jaccard 距离越大,样本相似度越低。公式定义如下:其中对参差(symmetric difference)
性质
-
相似性
非对称二元属性的相似性
在数据挖掘领域,常常需要比较两个具有布尔值属性的对象之间的距离,Jaccard距离就是常用的一种方法。给定两个比较对象A,B。A, B 均有n个二元属性,即每个属性取值为{0,1}。定义如下4个统计量:
如图1数示:
显然有
Jaccard 系数:
Jaccard距离:
对于非对称二元属性而言(比如说对于患癌症和不患癌症的属性而言,不患癌症是0,患癌症是1,那么0的数量远远大于1,但是我们却更关注1的数量):
M11 - A和B中都是1;
M01 - A中是0,B中是1;
M10 - A中是1,B中是0;
M00 - 两者都是0.
3.如何邻居判定?annoy、rann
-
Annoy是高维空间求近似最近邻的一个开源模块。
Annoy构建一棵二叉树,查询时间为O(logn)。
Annoy通过随机挑选两个点,并使用垂直于这个点的等距离超平面将集合划分为两部分。
如图所示,图中灰色线是连接两个点,超平面是加粗的黑线。按照这个方法在每个子集上迭代进行划分。
依此类推,直到每个集合最多剩余k个点,下图是一个k = 10 的情况。
相应的完整二叉树结构:
随机投影森林。
一个思想依据是:在原空间中相邻的点,在树结构上也表现出相互靠近的特点,也就是说,如果两个点在空间上相互靠近,那么他们很可能被树结构划分到一起。
如果要在空间中查找临近点,我们可以在这个二叉树中搜索。上图中每个节点用超平面来定义,所以我们可以计算出该节点往哪个方向遍历,搜索时间 log n
技巧1:使用优先队列
如果一个划分的两边“靠得足够近”(量化方式在后面介绍),我们就两边都遍历。这样就不只是遍历一个节点的一边,我们将遍历更多的点
我们可以设置一个阈值,用来表示是否愿意搜索划分“错”的一遍。如果设置为0,我们将总是遍历“对”的一片。但是如果设置成0.5,就按照上面的搜索路径。
这个技巧实际上是利用优先级队列,依据两边的最大距离。好处是我们能够设置比0大的阈值,逐渐增加搜索范围。
技巧2:构建一个森林
我们能够用一个优先级队列,同时搜索所有的树。这样有另外一个好处,搜索会聚焦到那些与已知点靠得最近的那些树——能够把距离最远的空间划分出去
每棵树都包含所有的点,所以当我们搜索多棵树的时候,将找到多棵树上的多个点。如果我们把所有的搜索结果的叶子节点都合在一起,那么得到的最近邻就非常符合要求。依照上述方法,我们找到一个近邻的集合,接下来就是计算所有的距离和对这些点进行排序,找到最近的k个点。
很明显,我们会丢掉一些最近的点,这也是为什么叫近似最近邻的原因。
Annoy在实际使用的时候,提供了一种机制可以调整(搜索k),你能够根据它来权衡性能(时间)和准确度(质量)。
tips:
1.距离计算,采用归一化的欧氏距离:vectors = sqrt(2-2*cos(u, v)) 即余弦距离
2.向量维度较小(<100),即使维度到达1000变现也不错
3.内存占用小
4.索引创建与查找分离(特别是一旦树已经创建,就不能添加更多项)
5.有两个参数可以用来调节Annoy 树的数量n_trees和搜索期间检查的节点数量search_k
n_trees在构建时提供,并影响构建时间和索引大小。 较大的值将给出更准确的结果,但更大的索引。
search_k在运行时提供,并影响搜索性能。 较大的值将给出更准确的结果,但将需要更长的时间返回。
如果不提供search_k,它将默认为n * n_trees,其中n是近似最近邻的数目。 否则,search_k和n_tree大致是独立的,即如果search_k保持不变,n_tree的值不会影响搜索时间,反之亦然。 基本上,建议在可用负载量的情况下尽可能大地设置n_trees,并且考虑到查询的时间限制,建议将search_k设置为尽可能大。
from annoy import AnnoyIndex
import random
f = 40 #维度
t = AnnoyIndex(f) # Length of item vector that will be indexed
for i in xrange(1000):
v = [random.gauss(0, 1) for z in xrange(f)]
t.add_item(i, v) #添加向量
t.build(10) # 10 trees
t.save('test.ann')
u = AnnoyIndex(f)
u.load('test.ann') # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000)) # will find the 1000 nearest neighbors of the first(0) vec
seurat4的annoy的源代码:
# Run annoy
#
# @param data Data to build the index with
# @param query A set of data to be queried against data
# @param metric Distance metric; can be one of "euclidean", "cosine", "manhattan",
# "hamming"
# @param n.trees More trees gives higher precision when querying
# @param k Number of neighbors
# @param search.k During the query it will inspect up to search_k nodes which
# gives you a run-time tradeoff between better accuracy and speed.
# @param include.distance Include the corresponding distances
# @param index optional index object, will be recomputed if not provided
#
AnnoyNN <- function(data,
query = data,
metric = "euclidean",
n.trees = 50,
k,
search.k = -1,
include.distance = TRUE,
index = NULL
) {
idx <- index %||% AnnoyBuildIndex(
data = data,
metric = metric,
n.trees = n.trees)
nn <- AnnoySearch(
index = idx,
query = query,
k = k,
search.k = search.k,
include.distance = include.distance)
nn$idx <- idx
nn$alg.info <- list(metric = metric, ndim = ncol(x = data))
return(nn)
}
# Build the annoy index
#
# @param data Data to build the index with
# @param metric Distance metric; can be one of "euclidean", "cosine", "manhattan",
# "hamming"
# @param n.trees More trees gives higher precision when querying
#
#' @importFrom RcppAnnoy AnnoyEuclidean AnnoyAngular AnnoyManhattan AnnoyHamming
#
AnnoyBuildIndex <- function(data, metric = "euclidean", n.trees = 50) {
f <- ncol(x = data)
a <- switch(
EXPR = metric,
"euclidean" = new(Class = RcppAnnoy::AnnoyEuclidean, f),
"cosine" = new(Class = RcppAnnoy::AnnoyAngular, f),
"manhattan" = new(Class = RcppAnnoy::AnnoyManhattan, f),
"hamming" = new(Class = RcppAnnoy::AnnoyHamming, f),
stop ("Invalid metric")
)
for (ii in seq(nrow(x = data))) {
a$addItem(ii - 1, data[ii, ])
}
a$build(n.trees)
return(a)
}
# Search an Annoy approximate nearest neighbor index
#
# @param Annoy index, built with AnnoyBuildIndex
# @param query A set of data to be queried against the index
# @param k Number of neighbors
# @param search.k During the query it will inspect up to search_k nodes which
# gives you a run-time tradeoff between better accuracy and speed.
# @param include.distance Include the corresponding distances in the result
#
# @return A list with 'nn.idx' (for each element in 'query', the index of the
# nearest k elements in the index) and 'nn.dists' (the distances of the nearest
# k elements)
#
#' @importFrom future plan
#' @importFrom future.apply future_lapply
#
AnnoySearch <- function(index, query, k, search.k = -1, include.distance = TRUE) {
n <- nrow(x = query)
idx <- matrix(nrow = n, ncol = k)
dist <- matrix(nrow = n, ncol = k)
convert <- methods::is(index, "Rcpp_AnnoyAngular")
if (!inherits(x = plan(), what = "multicore")) {
oplan <- plan(strategy = "sequential")
on.exit(plan(oplan), add = TRUE)
}
res <- future_lapply(X = 1:n, FUN = function(x) {
res <- index$getNNsByVectorList(query[x, ], k, search.k, include.distance)
# Convert from Angular to Cosine distance
if (convert) {
res$dist <- 0.5 * (res$dist * res$dist)
}
list(res$item + 1, res$distance)
})
for (i in 1:n) {
idx[i, ] <- res[[i]][[1]]
if (include.distance) {
dist[i, ] <- res[[i]][[2]]
}
}
return(list(nn.idx = idx, nn.dists = dist))
}
- rann包
KD树法—大规模点集
这个包只有一个函数nn2。功能十分强大,支持很高的维数(文档里说20维以上可能不太好了,但还是可以试试),返回值包括了k级最邻近点的序号和k级最邻近点距离两个内容,非常实用。由于实际上调用的是c++库,所以速度非常快,我用了30w+点的点集,瞬间完成。
这包说明详见RANN package - RDocumentation
Finds the k nearest neighbours for every point in a given dataset in O(N log N) time using Arya and Mount's ANN library (v1.1.3). There is support for approximate as well as exact searches, fixed radius searches and bd as well as kd trees.
使用Arya和Mount的ANN库(v1.1.3)在O(N log N)时间内为给定数据集中的每个点查找k个最近的邻居。支持近似和精确搜索,固定半径搜索,bd和kd树。
This package implements nearest neighbors for the Euclidean (L2) metric. For the Manhattan (L1) metric, install the RANN1 package.
这个包实现了欧几里德(L2)度量的最近邻。对于曼哈顿(L1)指标,安装RANN1软件包。
library(RANN)
p_co=st_coordinates(point_set)
#获取点坐标。由于这个包只返回欧式距离,
不是专门针对地理空间开发的,
所以在计算前还是先把点坐标转换成投影坐标系为宜。
d=nn2(p_co,p_co,k=2)
#k=2即为设定K级最邻近点,
因为这个包只支持计算两个点集之间的最邻近点关系,所以只能计算包含自身的2级最邻近点。
该函数还有其他参数可以设定搜寻方法、查寻树的类型、搜索半径等。
d_indice=d[[1]][,2]
#返回值包括两个list元素,
第一个元素就是最邻近点序号矩阵。
第二列就是我们要找的最邻近点序号。
'RANN'提供一个快速的最近邻搜索算法,能够将算法的时间复杂度缩短为O(n log n)。
L2正则化:
L1正则化就是在loss function后边所加正则项为L1范数,加上L1范数容易得到稀疏解(0比较多)。L2正则化就是loss function后边所加正则项为L2范数的平方,加上L2正则相比于L1正则来说,得到的解比较平滑(不是稀疏),但是同样能够保证解中接近于0(但不是等于0,所以相对平滑)的维度比较多,降低模型的复杂度。