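Hello everyone, let's continue our TCR data analysis series. There is a lot of material in this topic, so we will work through it gradually. The reference paper is "Quantifiable predictive features define epitope-specific T cell receptor repertoires" (Nature, impact factor 49). Today's task is again to pick up some basic concepts and algorithms.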
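"TCRs from T cells that recognize the same pMHC epitope often share conserved sequence features, suggesting that it may be possible to predictively model epitope specificity." (For background on gene rearrangement, epitopes and related basics, see my earlier post on 10X single-cell (10X spatial transcriptomics) TCR data analysis and the TCR-intrinsic regulatory potential (TiRP) system, original title: 10X单细胞(10X空间转录组)TCR数据分析之TCR 内在调控潜力系统(TiRP).) The point being stressed is that TCRs enriched for the same pMHC tend to carry the same sequence motifs. This has been observed repeatedly in experiments, which suggests that predictive modelling of epitope specificity may be feasible (and that is the ultimate goal of this series).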
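This is where the content of the previous post comes in. To model antigen-enriched TCRs, three things are needed: first, "a distance measure on the space of TCRs that permits clustering and visualization" (the clustering and visualization here are not the same as in single-cell transcriptomics); second, "a robust repertoire diversity metric that accommodates the low number of paired public receptors observed when compared to single-chain analyses" (that is, the metric has to cope with the fact that identical paired receptors are rarely shared between individuals); and third, "a distance-based classifier" (classifiers are very common in machine learning) "that can assign previously unobserved TCRs to characterized repertoires with robust sensitivity and specificity".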
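Of course, a given epitope-specific repertoire "contains a clustered group of receptors that share core sequence similarities, together with a dispersed set of diverse 'outlier' sequences" (this is natural: the similar sequences presumably share the same motif and therefore bind the epitope specifically). By identifying shared motifs in the core sequences, we can highlight the key conserved residues that drive the essential elements of TCR recognition. (So the sequences in question here are amino acid sequences.)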
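For the TCR sequences we obtain from sequencing, the features to summarize and analyse "include length, charge, and hydrophobicity of the CDR3 regions, clonal diversity (within individuals), and amino acid sequence sharing (across individuals) following well-established approaches to repertoire analysis". (We will cover those established approaches later. In short, many metrics deserve a closer look, not just the gene sequences, so our single-cell TCR analysis needs an upgrade.)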
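As a rough illustration of the three per-CDR3 features, here is a minimal sketch. It assumes the Kyte-Doolittle hydropathy scale and a crude net charge (K/R = +1, D/E = -1); the text above does not say which scale or charge model the authors used, so treat both as my assumptions. The function name `cdr3_features` and the example sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale, not necessarily the paper's).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Crude net charge: K/R count as +1, D/E as -1, everything else as 0.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def cdr3_features(cdr3):
    """Return length, net charge and mean hydropathy of one CDR3 amino acid string."""
    length = len(cdr3)
    charge = sum(CHARGE.get(aa, 0) for aa in cdr3)
    hydrophobicity = sum(KD[aa] for aa in cdr3) / length
    return {"length": length, "charge": charge, "hydrophobicity": hydrophobicity}


# Hypothetical CDR3 beta sequence, for illustration only.
print(cdr3_features("CASSIRSSYEQYF"))
```

Averaging these values over all receptors in a repertoire gives the kind of per-epitope means discussed next.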
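"Mean values for CDR3 length, charge, and hydrophobicity tightly clustered for the majority of the epitopes, and all CDR3 features showed substantially overlapping ranges." (So these bulk features alone do not separate the epitopes well, which is why it makes sense to look for the specific motifs at work within the antigen-enriched repertoires.)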
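A quick recap of the authors' findings. (1) They "found negative correlations between CDR3 charge and peptide charge", and likewise between CDR3 length and peptide length, suggesting that charge and length complementarity may play a role in pMHC recognition for some epitopes (background knowledge, good to be aware of). (2) "Whereas substantial levels of sharing or publicity were observed for individual chains" (compared chain by chain, many sequences are identical), low levels of sharing between individuals were observed once paired αβ receptors were considered. This is an interesting point: single chains are widely shared, but identical paired receptors are rare.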
What single-cell TCR sequencing buys us: "By using paired single-cell TCRαβ sequencing, we were able to determine whether V and J segment usage was correlated both within a chain (for example, Vα–Jα, Vβ–Jβ) and across chains (for example, Vα–Vβ, Vα–Jβ)." (In other words, looking for correlations.)
Compared with TCR sequences that were not enriched for an epitope, in the TCRs recognizing viral epitopes the authors "found varying degrees of dominance of single and pairwise gene associations". (This is as expected.)
- Figure legend: V and J gene segment usage and covariation in epitope-specific responses. a, Gene segment usage and gene–gene pairing landscapes are illustrated using four vertical stacks (one for each V and J segment) connected by curved paths whose thickness is proportional to the number of TCR clones with the respective gene pairing (in other words, a Sankey diagram) (each panel is labelled with the four gene segments atop their respective colour stacks and the epitope identifier in the top middle). Genes are coloured by frequency within the repertoire with a fixed colour sequence used throughout the manuscript which begins red (most frequent), green (second most frequent), blue, cyan, magenta, and black. The enrichment of gene segments relative to background frequencies is indicated by up or down arrows with an arrowhead number equal to the log2 of the fold change. b, Jensen–Shannon divergence (for background on JS divergence, see the post "KL divergence, JS divergence, Wasserstein distance") between the observed gene frequency distributions and background frequencies, normalized by the mean Shannon entropy of the two distributions (higher values reflect stronger gene preferences). c, Adjusted mutual information of gene usage correlations between regions (higher values indicate more strongly covarying gene usage). The lower limits of the colour ranges in b and c were chosen to highlight significant changes. A summary of the number of subjects, total number of TCR sequences
- Figure legend: Gene segment usage and gene–gene pairing landscapes are illustrated graphically using four vertical stacks (one for each V and J segment) connected by curved segments with thickness proportional to the number of TCRs with the respective gene pairing (each panel is labelled with the four gene segments atop their respective colour stacks and the epitope identifier in the top middle). Genes are coloured by frequency within the repertoire with a fixed colour sequence used throughout the manuscript which begins red (most frequent), green (second most frequent), blue, cyan, magenta, and black. Clonally expanded TCRs were reduced to a single data point for this analysis. The number of clones is indicated to the left of each panel. The enrichment of gene segments relative to background frequencies is indicated by up or down arrows, with each successive arrowhead corresponding to an additional twofold deviation (for example, one arrowhead = twofold enrichment, two arrowheads = fourfold enrichment). (Same presentation as the previous figure.)
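Each epitope-specific response is characterized by over-representation of individual gene segments together with marked gene-pairing preferences, which gives us a theoretical basis for modelling each epitope separately and looking for motifs. The Jensen–Shannon divergence between each epitope-specific gene frequency distribution and the background distribution is used to quantify the overall magnitude of the gene preference (this takes a bit of algorithmic background). "We quantified the degree of gene usage covariation between pairs of segments using the adjusted mutual information score" (also an important step).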
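To make the two quantities above concrete, here is a small pure-Python sketch. `normalized_jsd` follows the description in the figure legend (JS divergence divided by the mean Shannon entropy of the two distributions). `mutual_information` is the plain, unadjusted mutual information; the paper uses the adjusted score, which additionally corrects for chance agreement, so this function is only a simplified stand-in. The function names and gene names in the usage lines are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(p):
    """Shannon entropy (bits) of a dict mapping gene -> probability."""
    return -sum(x * math.log2(x) for x in p.values() if x > 0)


def normalized_jsd(p, q):
    """JS divergence between p and q, divided by their mean Shannon entropy."""
    genes = set(p) | set(q)
    mixture = {g: 0.5 * (p.get(g, 0.0) + q.get(g, 0.0)) for g in genes}
    mean_h = 0.5 * (shannon_entropy(p) + shannon_entropy(q))
    jsd = shannon_entropy(mixture) - mean_h
    # Undefined when both distributions put all mass on a single gene.
    return jsd / mean_h if mean_h > 0 else float("nan")


def mutual_information(pairs):
    """Plain (unadjusted) mutual information, in bits, of a list of (gene_a, gene_b)."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2(c * n / (left[a] * right[b]))
        for (a, b), c in joint.items()
    )


# Hypothetical frequencies: epitope-specific usage vs background usage.
observed = {"TRBV19": 0.9, "TRBV13": 0.1}
background = {"TRBV19": 0.5, "TRBV13": 0.5}
print(normalized_jsd(observed, background))

# Hypothetical V-J pairs, one per clone.
print(mutual_information([("V1", "J1"), ("V1", "J1"), ("V2", "J2"), ("V2", "J2")]))
```

If scikit-learn is available, its `adjusted_mutual_info_score` can be applied to the two label lists to get the chance-corrected score instead.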
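To find motifs, the TCR distance definition now comes into play. (The concept and how it is computed were covered in the previous post; mismatches in CDR3 are penalized more heavily.)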
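As a reminder of the general idea only, here is a deliberately simplified toy distance. It is not the TCRdist implementation: it counts unit mismatches position by position from the start of each loop, charges a flat gap cost for length differences, and multiplies the CDR3 term by a weight. The weight of 3, the gap cost of 4, the function names and the example sequences are all placeholders of mine; the real tool uses substitution-matrix-based scores and places gaps more carefully.

```python
def region_distance(a, b, gap_penalty=4):
    """Toy distance between two loop sequences: unit mismatches plus a flat gap cost."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches + gap_penalty * abs(len(a) - len(b))


def toy_tcr_distance(tcr1, tcr2, cdr3_weight=3):
    """Sum of per-loop distances for one chain, with CDR3 weighted more heavily."""
    total = 0
    for region in ("cdr1", "cdr2", "cdr3"):
        weight = cdr3_weight if region == "cdr3" else 1
        total += weight * region_distance(tcr1[region], tcr2[region])
    return total


# Hypothetical single-chain receptors, for illustration only.
a = {"cdr1": "SGHNS", "cdr2": "FNNNVP", "cdr3": "CASSIRSSYEQYF"}
b = dict(a, cdr3="CASSIRSAYEQYF")  # one CDR3 mismatch
c = dict(a, cdr1="SGHDS")          # one CDR1 mismatch
print(toy_tcr_distance(a, b), toy_tcr_distance(a, c))
```

The same single substitution costs more in CDR3 than in CDR1, which is the property the text refers to.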
- Figure legend: 2D kernel principal components analysis (PCA) projection of the TCRdist landscape coloured by Vα (left panel) and Vβ (right panel) gene usage. Three groups of receptors that correspond to TCR logos and clusters depicted in c are indicated with dashed ellipses. (Dimensionality reduction of this kind is also very common in single-cell analysis.)
- Figure legend: Epitope-specific TCR landscapes were projected into two dimensions (2D) using kernel PCA analysis applied to the TCRdist distance matrix: TCRs with small TCRdist values tend to project to nearby points in 2D. The same 2D projection is shown in the four panels of each row, coloured by Vα, Jα, Vβ and Jβ gene segment usage (left to right, respectively). The colours are based on gene frequency in the projected repertoire and follow the same sequence used throughout the manuscript: in decreasing order, 1, red; 2, green; 3, blue; 4, cyan; 5, magenta; 6, black; followed by assorted colours for rare frequencies. A summary of number of subjects,
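"To complement these landscape projections", the authors performed TCRdist-based clustering of the epitope-specific receptors and constructed hierarchical distance trees (TCRdist is a very nice analysis tool). "It is important to note that clonal expansions are not reflected in these repertoire landscape analyses, as each unique receptor is included only once" (that is, duplicates are not counted). They also "developed a TCR logo representation that summarizes the gene frequencies, CDR3 amino acid sequences, and inferred rearrangement structures of a set of TCRs as a tool to further annotate these clusters" (worth paying attention to; anyone who has done wet-lab biochemistry will recognize sequence logos). Each repertoire consists mainly of one cluster, and the other sequences have similar structures, which makes motif finding easier. Beyond the core cluster of similar receptors, each repertoire also contains multiple regions of receptors that are clearly distinct from one another.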
- Figure legend: Average-linkage dendrogram of TCRdist receptor clusters coloured by generation probability, with TCR logos for selected receptor subsets (the branches enclosed in dashed boxes labelled with size of the TCR clusters). Each logo depicts the V- (left side) and J- (right side) gene frequencies, CDR3 amino acid sequences (middle), and inferred rearrangement structure (bottom bars coloured by source region, light grey for the V-region, dark grey for J, black for D, and red for N-insertions) of the grouped receptors. (n = 13 mice, 291 TCR clones.)
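For readers who have not met average linkage before, here is a minimal pure-Python sketch of agglomerative clustering on a precomputed distance matrix. It is a naive cubic-time illustration, not what the TCRdist software does internally; the function name `average_linkage` and the toy matrix are hypothetical.

```python
def average_linkage(dist, n_clusters):
    """Merge clusters by smallest mean pairwise distance until n_clusters remain.

    dist is a symmetric matrix (list of lists); returns lists of row indices.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average-linkage distance: mean over all cross-cluster pairs.
                d = sum(dist[a][b] for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy matrix: receptors 0 and 1 are close, 2 and 3 are close, the groups are far apart.
toy = [
    [0, 1, 10, 10],
    [1, 0, 10, 10],
    [10, 10, 0, 1],
    [10, 10, 1, 0],
]
print(average_linkage(toy, n_clusters=2))  # [[0, 1], [2, 3]]
```

For real repertoires, `scipy.cluster.hierarchy.linkage` with `method="average"` on a condensed distance matrix does the same job far more efficiently and also returns the tree needed to draw a dendrogram.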
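Although CDR3 sequence conservation is evident in the TCRdist cluster logos, many of these shared CDR3 residues come directly from the germline sequence of the V and J regions and therefore simply reflect the observed gene usage biases. To find genuine CDR3 motifs, the authors used a recursive search algorithm that "identified sequence patterns that occur significantly more often in the observed receptors than in two V- and J-gene-matched background sets of receptor sequences". (This calls for structural biology knowledge, of which I have too little, I am sorry to say.)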
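The core comparison, observed versus background frequency of a pattern, can be sketched as follows. This only scores one given pattern; it does not implement the recursive search or any significance test, and it does not build V- and J-gene-matched backgrounds. The function name, the pseudocount, the pattern and all sequences are hypothetical.

```python
import re


def motif_enrichment(pattern, observed, background, pseudocount=0.5):
    """Fold enrichment of a regex pattern in observed vs background CDR3 lists.

    A pseudocount keeps the ratio finite when the pattern is absent from one set.
    """
    regex = re.compile(pattern)
    obs_hits = sum(1 for s in observed if regex.search(s))
    bg_hits = sum(1 for s in background if regex.search(s))
    obs_frac = (obs_hits + pseudocount) / (len(observed) + 2 * pseudocount)
    bg_frac = (bg_hits + pseudocount) / (len(background) + 2 * pseudocount)
    return obs_frac / bg_frac


# Made-up CDR3 sequences, for illustration only.
observed = ["CASSGRGAF", "CASGTGNEQF", "CASSLGQGYEQF", "CASSPDRAEQF"]
background = ["CASSLAPGATNEKLFF", "CASSQDRTEAFF", "CASSGRGAF", "CASRDSNQPQHF"]
print(motif_enrichment("G.G", observed, background))
```

In a real analysis the background sequences would be drawn to match the V and J gene composition of the observed set, so that germline-encoded residues do not show up as enriched.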
- Note: Enriched CDR3 sequence motifs define key features of epitope specificity. The top-scoring CDR3α (left TCR logo) and CDR3β (right TCR logo) sequence motifs are shown for each repertoire. The motif sequence logo is shown at full height (top) and scaled (bottom) by per-column relative entropy to background frequencies derived from TCRs with matching gene-segment composition in order to highlight motif positions under selection. For three epitopes with solved ternary TCR–pMHC structures, the enriched motif positions are mapped onto the 3D structure: motif positions shown in green sticks; peptide in magenta; alpha (beta) chain in yellow (blue) cartoons; selected hydrogen bonds shown as dotted green lines.
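The authors "propose that these statistically enriched, non-germline-encoded motifs have a critical role in mediating TCR recognition" (that seems right), and structural analysis of the TCR proteins supports this. So sequence analysis of TCRs can identify the key conserved residues that drive the essential elements of TCR recognition of antigen. This analysis really matters.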
Next, the TCRdist measure is applied to quantitatively assess receptor diversity and density within the epitope-specific repertoires. The authors introduce a "new diversity metric (TCRdiv) that generalizes Simpson's diversity index" (look it up if you have not met it) "by capturing similarity among receptors in addition to exact identity, as Simpson's diversity index is highly sensitive to sampling noise because of the relative rarity of observing identical αβ pairs among individuals".
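To see what "generalizing Simpson's index" can mean, here is a hedged sketch. `simpson_index` is the standard probability that two receptors drawn without replacement are identical. `smoothed_diversity` replaces the 0/1 identity with a Gaussian similarity of the pairwise distance and returns the inverse of its mean; the Gaussian kernel and the `sigma` parameter are my assumptions for illustration, since the exact TCRdiv formula is not given in the text above.

```python
import math
from collections import Counter


def simpson_index(clonotypes):
    """Probability that two receptors drawn without replacement are identical."""
    n = len(clonotypes)
    if n < 2:
        raise ValueError("need at least two receptors")
    counts = Counter(clonotypes)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def smoothed_diversity(dist, sigma):
    """Inverse of the mean Gaussian similarity over all ordered pairs i != j.

    dist is a symmetric distance matrix; sigma (assumed) sets how quickly
    similarity decays with distance.
    """
    n = len(dist)
    total = sum(
        math.exp(-((dist[i][j] / sigma) ** 2))
        for i in range(n)
        for j in range(n)
        if i != j
    )
    return n * (n - 1) / total if total > 0 else float("inf")


# Four receptors: two identical pairs that are very far from each other.
far = 1e6
dist = [
    [0, 0, far, far],
    [0, 0, far, far],
    [far, far, 0, 0],
    [far, far, 0, 0],
]
print(1 / simpson_index(["a", "a", "b", "b"]))
print(smoothed_diversity(dist, sigma=1.0))
```

When receptors are either identical or very far apart, the smoothed value reduces to the inverse Simpson index; with a larger `sigma`, near-identical receptors also count as partial matches, which is what makes the measure less sensitive to the rarity of exact αβ matches.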
"Examination of TCRdiv scores for the analysed repertoires for single chains as well as paired receptors clarified trends seen in the earlier analyses" (for example, "the PB1 repertoire exhibited low diversity in the α-chain and high β-chain diversity").
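As described above, the landscape analysis shows that each repertoire is made up of one or more clusters of receptors sharing similar sequence features, plus a more diverse set of outliers. To account for the contributions of both clustered and divergent TCRs, the authors developed a repertoire-specific nearest-neighbour score (NN-distance) that captures the density of receptors around each receptor (computed as the average TCRdist between a receptor and its nearest-neighbour receptors within the repertoire). "Although variation across repertoires was apparent in the NN-distance distributions", most epitopes showed an approximately bimodal distribution: one peak of receptors with low NN-distance, representing the main, densely sampled clusters of the receptor distribution, and a second peak of receptors with larger NN-distance, reflecting the outlier receptors.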
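The NN-distance idea can be sketched in a few lines: for each receptor, average the distances to its k closest neighbours in the same repertoire. The choice of k, the function name and the toy matrix are assumptions for illustration; the text above does not specify how many neighbours are used or whether they are weighted.

```python
def nn_distance(dist, i, k=2):
    """Mean distance from receptor i to its k nearest neighbours (excluding itself)."""
    others = sorted(d for j, d in enumerate(dist[i]) if j != i)
    nearest = others[:k]
    return sum(nearest) / len(nearest)


# Toy repertoire: receptors 0-2 form a tight cluster, receptor 3 is an outlier.
toy = [
    [0, 2, 2, 30],
    [2, 0, 2, 30],
    [2, 2, 0, 30],
    [30, 30, 30, 0],
]
scores = [nn_distance(toy, i) for i in range(len(toy))]
print(scores)  # [2.0, 2.0, 2.0, 30.0]
```

Plotting a histogram of these scores for a real repertoire is one way to look for the two peaks described above: low scores for the dense core, high scores for the outliers.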
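To confirm the antigen specificity of these non-clustered receptors, receptors from both peaks were selected and tested experimentally for their ability to bind the specific tetramer (that is, to recognize the corresponding antigen). Receptor reactivity was confirmed in every case, indicating that at least some of these diverse outlier receptors are "legitimate, if unconventional, solutions to the problem of epitope specificity", which partly explains the phenomenon.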
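The software also has a classifier, which helps us identify the motifs of particular T cells, for example tumour-infiltrating TCR sequences, and that is very valuable. That is it for today's basics. In the next post we will cover the algorithm and code of the TCRdist software.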
Life is good, and even better with you.