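Hello! This post supplements the earlier introduction to the Change-O software. There are many details worth understanding in more depth.
First up is the AIRR Community Rearrangement standard. "Rearrangement" here refers to V(D)J gene rearrangement. Let's take a look.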
One of the primary initiatives of the Adaptive Immune Receptor Repertoire (AIRR) Community has been to develop a set of metadata standards for the submission of AIRR sequencing datasets. (Put plainly: what the main structure of adaptive immune receptor data looks like.)
This work has been carried out by the AIRR Community Minimal Standards Working Group. In order to support reproducibility, standard quality control, and data deposition in a common repository, the AIRR Community has agreed to six high-level data sets that will guide the publication, curation and sharing of AIRR-Seq data and metadata: Study and subject, sample collection, sample processing and sequencing, raw sequences, processing of sequence data, and processed AIRR sequences. The detailed data elements within these sets are defined here (Download as TSV).
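Of course, this part deserves more annotation, but a brief overview is enough for now.
Part two: choosing the BCR clustering threshold
First, distance to nearest neighbor
Estimating the optimal distance threshold for partitioning clonally related sequences is accomplished by calculating the distance from each sequence in the data set to its nearest neighbor, and then finding the break point in the resulting bimodal distribution that separates clonally related sequences from unrelated sequences. This is done via the following steps: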
- Calculating the nearest neighbor distances for each sequence.
- Generating a histogram of the nearest neighbor distances, followed by either manual inspection for the threshold separating the two modes or automated threshold detection.
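To make the first step concrete, here is a minimal base R sketch of a nearest neighbor calculation using a length-normalized Hamming distance. The junction strings and the helper `hamming_norm` are made up for illustration. The sketch assumes all sequences already share the same V gene, J gene and junction length, and it is not how distToNearest is implemented internally.

```r
# Toy junction sequences of equal length (hypothetical data)
junctions <- c(s1 = "TGTGCGAGAGAT", s2 = "TGTGCGAGAGAC",
               s3 = "TGTGCGAAAGAT", s4 = "TGTACCTTTGGG")

# Length-normalized Hamming distance between two equal-length strings
hamming_norm <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  sum(x != y) / length(x)
}

# Fill the full pairwise distance matrix
n <- length(junctions)
d <- matrix(0, n, n, dimnames = list(names(junctions), names(junctions)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- hamming_norm(junctions[i], junctions[j])
  }
}

# Ignore self-comparisons, then take the smallest distance in each row
diag(d) <- NA
nearest <- apply(d, 1, min, na.rm = TRUE)
print(nearest)
```

In this toy set, s1, s2 and s3 sit close to one another while s4 is far from everything. That is the pattern that produces the two modes in the histogram.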
Let's look at how the distance is calculated.
Calculating the nearest neighbor distances requires the following fields (columns) to be present in the table:
- sequence_id
- v_call
- j_call
- junction
- junction_length
# Load the example data and subset it to one sample
library(shazam)
data(ExampleDb, package="alakazam")
ExampleDb <- subset(ExampleDb, sample_id == "-1h")
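Calculating nearest neighbor distances (heavy chain sequences) (the core part)
By default, distToNearest, the function for calculating the distance between every sequence and its nearest neighbor, assumes that it is running in non-single-cell mode and that every input sequence is a heavy chain sequence to be used in the calculation. (In other words, only heavy chain sequences enter the distance calculation. For TCR data, likewise, only the β chain CDR3 is used.) Several parameters adjust how the distance is measured.
If a genotype has been inferred using the methods in the tigger package (another classic BCR analysis package, which we will cover in the next post) and v_call_genotyped has been added to the database, then this column can be used instead of the default v_call column by specifying the vCallColumn argument. This allows the more accurate V calls from tigger to be used for grouping the sequences. Furthermore, for more lenient handling of ambiguous V(D)J segment calls, the first argument can be set to FALSE. Setting first=FALSE groups sequences using the union of all possible genes, rather than only the first gene in the field.
The model argument determines which underlying SHM model is used to calculate the distance. The default model is single nucleotide Hamming distance with gaps considered as a match to any nucleotide (ham). Other options include a human Ig-specific single nucleotide model similar to a transition/transversion model (hh_s1f) and the corresponding 5-mer context model from Yaari et al., 2013 (hh_s5f), an analogous pair of mouse-specific models from Cui et al., 2016 (mk_rs1nf and mk_rs5nf), and amino acid Hamming distance (aa).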
Note: Human and mouse distance measures that are backward compatible with SHazaM v0.1.4 and Change-O v0.3.3 are also provided as hs1f_compat and m1n_compat, respectively.
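For models that are not symmetric (that is, the distance from A to B is not equal to the distance from B to A), there is a symmetry argument that lets the user specify whether the average or the minimum of the two distances is used as the overall distance.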
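As a small illustration of what symmetrizing means, the sketch below takes two made-up directional distances and combines them either way. The helper `symmetrize` and the numbers are hypothetical and only mirror the "average or minimum" choice described above.

```r
# Hypothetical asymmetric distances between sequences A and B
d_ab <- 2.4  # distance from A to B
d_ba <- 3.0  # distance from B to A

# Combine the two directional distances into one symmetric value
symmetrize <- function(x, y, how = c("avg", "min")) {
  how <- match.arg(how)
  if (how == "avg") mean(c(x, y)) else min(x, y)
}

sym_avg <- symmetrize(d_ab, d_ba, "avg")  # average of the two directions
sym_min <- symmetrize(d_ab, d_ba, "min")  # smaller of the two directions
print(c(avg = sym_avg, min = sym_min))
```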
# Use nucleotide Hamming distance and normalize by junction length
dist_ham <- distToNearest(ExampleDb, sequenceColumn="junction",
vCallColumn="v_call_genotyped", jCallColumn="j_call",
model="ham", normalize="len", nproc=1)
# Use genotyped V assignments, a 5-mer model and no normalization
dist_s5f <- distToNearest(ExampleDb, sequenceColumn="junction",
vCallColumn="v_call_genotyped", jCallColumn="j_call",
model="hh_s5f", normalize="none", nproc=1)
Calculating nearest neighbor distances (single-cell paired heavy and light chain sequences)
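The distToNearest function also supports running in single-cell mode, where the input (Example10x here) contains single-cell paired IGH:IGK/IGL, TRB:TRA, or TRD:TRG chain sequences. In this case, by default, cells are first divided into partitions containing the same heavy/long chain (IGH, TRB, TRD) V gene and J gene (and, if specified, junction length) and the same light/short chain (IGK, IGL, TRA, TRG) V gene and J gene (and, if specified, junction length). Then, only the heavy chain sequences are used for calculating the nearest neighbor distances.
In single-cell mode, each row of the input Example10x should represent one sequence/chain. Sequences/chains from the same cell are linked by a cell ID in the cellIdColumn column. Note that a cell should have exactly one IGH sequence (BCR) or TRB/TRD sequence (TCR). The values in the locusColumn column must be one of IGH, IGI, IGK, or IGL (BCR), or TRA, TRB, TRD, or TRG (TCR). To invoke single-cell mode, cellIdColumn must be specified and locusColumn must be correct.
Grouping can be done as either a one-stage or a two-stage process, which is specified via VJthenLen. In the one-stage process (VJthenLen=FALSE), cells are divided into partitions containing the same heavy/long chain V gene, J gene, and junction length (V-J-length combination) and the same light/short chain V-J-length combination. In the two-stage process (VJthenLen=TRUE), cells are first divided by heavy/long chain V gene and J gene (V-J combination) and light/short chain V-J combination, and then by the corresponding junction lengths.
There is also a choice of whether grouping should be done using IGH (BCR) or TRB/TRD (TCR) sequences only, or using both IGH and IGK/IGL (BCR) or TRB/TRD and TRA/TRG (TCR) sequences. This is governed by onlyHeavy. (In my view, using the single chain works a little better here.)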
# Single-cell mode
# Group cells in a one-stage process (VJthenLen=FALSE) and using
# both heavy and light chain sequences (onlyHeavy=FALSE)
data(Example10x, package="alakazam")
dist_sc <- distToNearest(Example10x, cellIdColumn="cell_id", locusColumn="locus",
VJthenLen=FALSE, onlyHeavy=FALSE)
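Regardless of whether grouping uses heavy chain sequences only or both heavy and light chain sequences, only heavy chain sequences are used to calculate the nearest neighbor distances. Hence, in single-cell mode, rows in the returned data.frame that correspond to light chain sequences will have NA in the dist_nearest field. (This detail is not important.)
Using nearest neighbor distances to determine clonal assignment thresholds (choosing the threshold)
The primary use of the distance to nearest calculation in SHazaM is to determine the optimal threshold for clonal assignment with the DefineClones tool in Change-O. Defining a threshold relies on distinguishing clonally related sequences (represented by sequences with close neighbors) from singletons (sequences without close neighbors), which show up as two modes in a nearest neighbor distance histogram.
Thresholds may be determined manually by inspecting the nearest neighbor histogram, or by using one of the automated threshold detection algorithms provided by the findThreshold function. The available methods are density (smoothed density) and gmm (gamma/Gaussian mixture model), and they are chosen via the method argument of findThreshold.
Threshold determination by manual inspection
Manual threshold detection simply involves generating a histogram of the values in the dist_nearest column of the distToNearest output and selecting a suitable value within the valley between the two modes.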
# Generate Hamming distance histogram
library(ggplot2)
p1 <- ggplot(subset(dist_ham, !is.na(dist_nearest)),
aes(x=dist_nearest)) +
theme_bw() +
xlab("Hamming distance") +
ylab("Count") +
scale_x_continuous(breaks=seq(0, 1, 0.1)) +
geom_histogram(color="white", binwidth=0.02) +
geom_vline(xintercept=0.12, color="firebrick", linetype=2)
plot(p1)
By manual inspection, the length normalized ham model distance threshold would be set to a value near 0.12 in the above example.
# Generate HH_S5F distance histogram
p2 <- ggplot(subset(dist_s5f, !is.na(dist_nearest)),
aes(x=dist_nearest)) +
theme_bw() +
xlab("HH_S5F distance") +
ylab("Count") +
scale_x_continuous(breaks=seq(0, 50, 5)) +
geom_histogram(color="white", binwidth=1) +
geom_vline(xintercept=7, color="firebrick", linetype=2)
plot(p2)
In this example, the unnormalized hh_s5f model distance threshold would be set to a value near 7.
Automated threshold detection via smoothed density (automated selection)
The density method will look for the minimum in the valley between two modes of a smoothed distribution based on the input vector (distances), which will generally be the dist_nearest column from the distToNearest output. Below is an example of using the density method for threshold detection.
# Find threshold using density method
output <- findThreshold(dist_ham$dist_nearest, method="density")
threshold <- output@threshold
# Plot distance histogram, density estimate and optimum threshold
plot(output, title="Density Method")
# Print threshold
print(output)
## [1] 0.1738391
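The idea of taking the minimum in the valley between two modes can be sketched in base R with a kernel density estimate. The distances below are simulated, and the mode locations (0.05 and 0.35) are assumed to be known in advance. The actual density method in findThreshold is more elaborate than this, so treat it only as an illustration of the principle.

```r
set.seed(1)

# Simulated bimodal nearest neighbor distances (hypothetical data):
# a tight lower mode (clonally related) and a broader upper mode (unrelated)
x <- c(rnorm(500, mean = 0.05, sd = 0.02),
       rnorm(500, mean = 0.35, sd = 0.05))

# Smooth the distribution with a Gaussian kernel density estimate
dens <- density(x)

# Look only between the two assumed modes and take the lowest point
between <- dens$x > 0.05 & dens$x < 0.35
valley <- dens$x[between][which.min(dens$y[between])]
print(valley)
```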
Automated threshold detection via a mixture model
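The findThreshold function includes approaches for automatically determining a clonal assignment threshold. The "gmm" method (gamma/Gaussian mixture method) of findThreshold (method="gmm") performs a maximum-likelihood fitting procedure over the distance-to-nearest distribution using one of four combinations of univariate density distribution functions: "norm-norm" (two Gaussian distributions), "norm-gamma" (lower Gaussian and upper gamma distribution), "gamma-norm" (lower gamma and upper Gaussian distribution), and "gamma-gamma" (two gamma distributions). By default, the threshold is selected as the distance at which the average of sensitivity and specificity reaches its maximum (cutoff="optimal"). Alternative threshold selection criteria are also provided, including the curve intersection (cutoff="intersect"), user-defined sensitivity (cutoff="user", sen=x), and user-defined specificity (cutoff="user", spc=x).
In the example below, the mixture model method (method="gmm") is used to find the optimal threshold for separating clonally related sequences by fitting two gamma distributions (model="gamma-gamma"). The red dashed line in the resulting figure marks the distance at which the average of sensitivity and specificity reaches its maximum.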
# Find threshold using gmm method
output <- findThreshold(dist_ham$dist_nearest, method="gmm", model="gamma-gamma")
# Plot distance histogram, fitted gamma curves, and optimum threshold
plot(output, binwidth=0.02, title="GMM Method: gamma-gamma")
# Print threshold
print(output)
## [1] 0.1221371
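Note: The shape of the histogram plotted by plotGmmThreshold is governed by the binwidth argument. That is, any change in bin size will change the apparent form of the distribution, whereas the gmm method is completely independent of bin size and works only with the actual input data.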
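The cutoff="optimal" criterion can be illustrated without any fitting, by assuming the two mixture components are already known. In the sketch below the gamma parameters are invented for illustration (they are not fitted to ExampleDb), sensitivity is taken as the fraction of the lower component at or below the cutoff, and specificity as the fraction of the upper component above it. This reproduces the selection rule, not the fitting procedure inside findThreshold.

```r
# Assumed mixture components (hypothetical parameters, not fitted to data)
lower <- list(shape = 4,  rate = 60)  # clonally related, mean about 0.067
upper <- list(shape = 30, rate = 85)  # unrelated, mean about 0.353

# Average of sensitivity and specificity at cutoff t
avg_sen_spc <- function(t) {
  sen <- pgamma(t, shape = lower$shape, rate = lower$rate)
  spc <- 1 - pgamma(t, shape = upper$shape, rate = upper$rate)
  (sen + spc) / 2
}

# Search the range of normalized distances for the maximizing cutoff
opt <- optimize(avg_sen_spc, interval = c(0, 1), maximum = TRUE)
threshold <- opt$maximum
print(threshold)
```

Because the objective is computed from the fitted distributions rather than from histogram counts, the result does not depend on any bin width, which is the point made in the note above.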
Calculating nearest neighbor distances independently for subsets of data (effectively a per-subset analysis)
The fields argument to distToNearest will split the input data.frame into groups based on the values in the specified fields (columns) and will treat them independently. For example, if the input data has multiple samples, then fields="sample_id" would allow each sample to be analyzed separately.
In the previous examples we used a subset of the original example data. In the following example, we will use the two available samples, -1h and +7d, and will set fields="sample_id". This will reproduce the previous results for sample -1h and add results for sample +7d.
# Reload the full example data so that both samples are present
data(ExampleDb, package="alakazam")
dist_fields <- distToNearest(ExampleDb, model="ham", normalize="len",
                             fields="sample_id", nproc=1)
We can plot the nearest neighbor distances for the two samples:
# Generate grouped histograms
p4 <- ggplot(subset(dist_fields, !is.na(dist_nearest)),
aes(x=dist_nearest)) +
theme_bw() +
xlab("Grouped Hamming distance") +
ylab("Count") +
geom_histogram(color="white", binwidth=0.02) +
geom_vline(xintercept=0.12, color="firebrick", linetype=2) +
facet_grid(sample_id ~ ., scales="free_y")
plot(p4)
In this case, the threshold selected for -1h seems to work well for +7d as well.
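Calculating nearest neighbor distances across groups rather than within a group (effectively a joint multi-sample analysis)
Specifying the cross argument to distToNearest forces distance calculations to be performed across groups (multiple samples are considered together), so that the nearest neighbor of each sequence will always be a sequence in a different group. In the example below we set cross="sample_id", which groups the data into the -1h and +7d sample subsets. Thus, nearest neighbor distances for sequences in sample -1h are restricted to the closest sequence in sample +7d, and vice versa.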
dist_cross <- distToNearest(ExampleDb, sequenceColumn="junction",
vCallColumn="v_call_genotyped", jCallColumn="j_call",
model="ham", first=FALSE,
normalize="len", cross="sample_id", nproc=1)
# Generate cross sample histograms
p5 <- ggplot(subset(dist_cross, !is.na(cross_dist_nearest)),
aes(x=cross_dist_nearest)) +
theme_bw() +
xlab("Cross-sample Hamming distance") +
ylab("Count") +
geom_histogram(color="white", binwidth=0.02) +
geom_vline(xintercept=0.12, color="firebrick", linetype=2) +
facet_grid(sample_id ~ ., scales="free_y")
plot(p5)
This can provide a sense of overlap between samples or a way to compare within-sample variation to cross-sample variation.
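A toy version of the cross-group idea: for each sequence, only sequences from a different group are eligible as neighbors. The sequences, group labels and the `ham` helper below are hypothetical, and V/J/length grouping is ignored for brevity.

```r
# Toy sequences and their group labels (hypothetical data)
seqs <- c("AAAA", "AAAT", "AATT", "TTTT")
grp  <- c("g1",   "g1",   "g2",   "g2")

# Length-normalized Hamming distance for equal-length strings
ham <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]]) / nchar(a)
}

# For each sequence, the nearest neighbor is searched only in other groups
cross_nearest <- sapply(seq_along(seqs), function(i) {
  others <- which(grp != grp[i])
  min(sapply(others, function(j) ham(seqs[i], seqs[j])))
})
print(data.frame(seqs, grp, cross_nearest))
```

Note that "AAAA" and "AAAT" differ by a single base, yet neither can be the other's nearest neighbor here because they belong to the same group.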
Handling large data sets: speeding up pairwise-distance-matrix calculations with subsampling
The subsample option in distToNearest makes it possible to speed up calculations and reduce memory usage.
If there are very large groups of sequences that share the same V call, J call and junction length, distToNearest will need a lot of memory and will take a long time to calculate all the distances. Without subsampling, for a large group of n=70,000 sequences distToNearest calculates an n × n distance matrix. With subsampling, e.g. to s=15,000, the distance matrix for the same group has size s × n, and for each sequence in db the distance value is calculated by comparing the sequence to the subsampled sequences from the same V-J-junction length group.
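The saving is easy to quantify with the numbers used above. This is plain arithmetic on matrix entry counts and says nothing about actual memory use per entry.

```r
n <- 70000  # sequences in the large V-J-junction length group
s <- 15000  # subsample size

full_entries <- n * n  # entries in the full n x n matrix
sub_entries  <- s * n  # entries in the subsampled s x n matrix

print(full_entries)                # 4.9e+09
print(sub_entries)                 # 1.05e+09
print(sub_entries / full_entries)  # about 0.214, i.e. s / n
```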
# Explore V-J-junction length groups sizes to use subsample
# Show the size of the largest groups
library(dplyr)
library(alakazam)
top_10_sizes <- ExampleDb %>%
group_by(junction_length) %>% # Group by junction length
do(alakazam::groupGenes(., first=TRUE)) %>% # Group by V and J call
mutate(GROUP_ID=paste(junction_length, vj_group, sep="_")) %>% # Create group ids
ungroup() %>%
group_by(GROUP_ID) %>% # Group by GROUP_ID
distinct(junction) %>% # Count unique junctions per group
summarize(SIZE=n()) %>% # Get the size of the group
arrange(desc(SIZE)) %>% # Sort by decreasing size
select(SIZE) %>%
top_n(10) # Filter to the top 10
# Use 30 to subsample
# NOTE: This is a toy example. Subsampling to 30 sequences with real data is unwise
dist <- distToNearest(ExampleDb, sequenceColumn="junction",
vCallColumn="v_call_genotyped", jCallColumn="j_call",
model="ham",
first=FALSE, normalize="len",
subsample=30)
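Third, the SHM models mentioned above: SHM targeting models (distance measures derived from targeted mutation)
The targeting model is the background likelihood of a particular mutation, based on the surrounding sequence context as well as the mutation itself. The model is inferred from the mutations observed in the data. The model can then be transformed into a distance function for comparing the Ig sequences of a given dataset based on the likelihood of the observed mutations. This is done via the following steps: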
- Infer a substitution model, which is the likelihood of a base mutating to each other base given the microsequence context.
- Infer a mutability model, which is the likelihood of a given base being mutated given the microsequence context and substitution model.
- Visualize the mutability model to identify hot and cold spots.
- Calculate a nucleotide distance matrix based on the underlying SHM models.
Let's look at an example.
A small example AIRR Rearrangement database is included in the alakazam package. Inferring a targeting model requires the following fields (columns) to be present in the table:
- sequence_id
- sequence_alignment
- germline_alignment_d_mask
- v_call
# Load example data
library(shazam)
data(ExampleDb, package="alakazam")
# Subset to IGHG for faster usage demonstration
db <- subset(ExampleDb, c_call == "IGHG")
Infer targeting model (substitution and mutability)
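The function for inferring substitution rates (createSubstitutionMatrix) counts the number of mutations from a given base to all others occurring in the center position of every 5-mer motif in the dataset. The model argument of createSubstitutionMatrix lets the user specify whether to count all mutations or only silent mutations when inferring the model. Column names for the sample sequence, germline sequence, and V call can also be passed in as arguments if they differ from the Change-O defaults. Additionally, the multipleMutation argument determines how multiple mutations within a single 5-mer are handled: independent treats each mutation independently, and ignore entirely disregards 5-mers with multiple mutations.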
# Create substitution model using silent mutations
sub_model <- createSubstitutionMatrix(db, model="s",
sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
vCallColumn="v_call")
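The function for inferring a mutability model (createMutabilityMatrix) counts the number of mutations in all 5-mer motifs of the dataset, and depends on the inferred substitution rates. Arguments similar to those available for inferring the substitution rates can be used to adjust this function.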
# Create mutability model using silent mutations
mut_model <- createMutabilityMatrix(db, sub_model, model="s",
sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
vCallColumn="v_call")
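createMutabilityMatrix creates an object of class MutabilityModel that contains a named numeric vector of 1024 normalized mutability rates. The numbers of silent and replacement mutations used for estimating the 5-mer mutabilities are recorded in the numMutS and numMutR slots, respectively. The source slot contains a named vector indicating whether each 5-mer mutability was inferred or measured. A data.frame with both the mutability values and the derived source can be obtained with as.data.frame.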
# Number of silent mutations used for estimating 5-mer mutabilities
mut_model@numMutS
# Number of replacement mutations used for estimating 5-mer mutabilities
mut_model@numMutR
# Mutability and source as a data.frame
head(as.data.frame(mut_model))
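The inferred substitution and mutability models returned by the functions above only account for unambiguous 5-mers. However, there may be cases in which the user needs the likelihood of a mutation in a 5-mer (motif) with ambiguous characters. Each of the functions above has a corresponding function (extendSubstitutionMatrix and extendMutabilityMatrix) that extends the model to 5-mers containing Ns by averaging over all corresponding unambiguous 5-mers.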
# Extend models to include ambiguous 5-mers
sub_model <- extendSubstitutionMatrix(sub_model)
mut_model <- extendMutabilityMatrix(mut_model)
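Averaging over the corresponding unambiguous 5-mers can be pictured as follows. The mutability values and the helper `expand_n` are hypothetical, and the sketch only handles the idea for a single N-containing motif. It is not the code used by extendMutabilityMatrix.

```r
# Hypothetical mutability values for the four 5-mers matching "AANCT"
mut <- c(AAACT = 0.002, AACCT = 0.001, AAGCT = 0.004, AATCT = 0.001)

# Expand a motif containing N into all unambiguous motifs it represents
expand_n <- function(kmer) {
  chars <- strsplit(kmer, "")[[1]]
  opts <- lapply(chars, function(ch) {
    if (ch == "N") c("A", "C", "G", "T") else ch
  })
  apply(expand.grid(opts, stringsAsFactors = FALSE), 1, paste, collapse = "")
}

# The ambiguous motif gets the mean of its unambiguous counterparts
matches <- expand_n("AANCT")
mut_ambiguous <- mean(mut[matches])
print(matches)
print(mut_ambiguous)
```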
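These extended substitution and mutability models can be used to create an overall SHM targeting matrix (createTargetingMatrix), which is the combined probability of mutability and substitution.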
# Create targeting model matrix from substitution and mutability models
tar_matrix <- createTargetingMatrix(sub_model, mut_model)
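For a single 5-mer, "combined probability of mutability and substitution" can be read as the mutability of the center base multiplied by the probability of each possible substitution. The numbers below are invented, and the simple product is an assumption made for illustration rather than a statement about the internals of createTargetingMatrix.

```r
# Hypothetical values for one 5-mer whose center base is A
mutability   <- 0.002                        # chance the center base mutates
substitution <- c(C = 0.2, G = 0.5, T = 0.3) # given a mutation, which base

# Targeting rate for each specific mutation (assumed to be the product)
targeting <- mutability * substitution
print(targeting)

# The targeting rates add back up to the overall mutability
print(sum(targeting))
```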
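All of the steps above can be combined by using the single function createTargetingModel to infer a TargetingModel object directly from the dataset. Again, the numbers of silent and replacement mutations used for estimating the 5-mer mutabilities are recorded in the numMutS and numMutR slots, respectively.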
Additionally, it is generally appropriate to consider the mutations within a clone only once. Consensus sequences for each clone can be generated using the collapseClones function.
# Collapse sequences into clonal consensus
clone_db <- collapseClones(db, cloneColumn="clone_id",
sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
nproc=1)
# Create targeting model in one step using only silent mutations
# Use consensus sequence input and germline columns
model <- createTargetingModel(clone_db, model="s", sequenceColumn="clonal_sequence",
germlineColumn="clonal_germline", vCallColumn="v_call")
Visualize targeting model
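Visualizing a dataset's underlying SHM mutability model can be used to investigate hot and cold spot motifs. The length of the bars on the mutability plot corresponds to the likelihood of a given base in the given 5-mer being mutated. The plotting function plotMutability has a style argument that specifies either a hedgehog (circular) plot or a bar plot of the 5-mer mutability rates. If the mutability of only specific bases is required, this can be specified via the nucleotides argument.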
# Generate hedgehog plot of mutability model
plotMutability(model, nucleotides="A", style="hedgehog")
plotMutability(model, nucleotides="C", style="hedgehog")
# Generate bar plot of mutability model
plotMutability(model, nucleotides="G", style="bar")
plotMutability(model, nucleotides="T", style="bar")
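Finally, calculate the targeting distance matrix
In the Change-O pipeline, the hs5f cloning method relies on an inferred targeting model. If users wish to use a targeting model inferred from their own data to assign distances between sequences for clonal grouping, then the observed SHM targeting rates must be transformed into distances. The calcTargetingDistance function returns a matrix of distances between each 5-mer and each corresponding mutation of the center base. This matrix can also be generated and written directly to a file using the function writeTargetingDistance.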
# Calculate distance matrix
dist <- calcTargetingDistance(model)
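One common way to turn probabilities into distances is a negative log transform, so that rarer mutations are "farther away". The sketch below applies that idea to made-up targeting rates and rescales by the mean. Whether calcTargetingDistance uses exactly this transform and normalization is an assumption here, so check the function's documentation before relying on it.

```r
# Hypothetical targeting rates for mutations of one 5-mer's center base
targeting <- c(C = 0.0004, G = 0.001, T = 0.0006)

# Assumed transform: less likely mutations receive larger distances
d_raw <- -log10(targeting)

# Assumed normalization: rescale so that the mean distance is 1
d_norm <- d_raw / mean(d_raw)
print(round(d_norm, 3))
```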
That really was a lot of content. Life is good, and it is even better with you.