10X single-cell (10X spatial transcriptomics) joint TCR and transcriptome data analysis (8): neighbor graph analysis (CoNGA)

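Today we walk through the code for the joint analysis of 10X single-cell gene expression and TCR data. You should already be comfortable with the background covered in the earlier posts; the code is the final step. Install CoNGA yourself beforehand. The example data is at https://www.dropbox.com/s/r7rpsftbtxl89y5/conga_example_datasets_v1.zip, which you can download to try the analysis yourself.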

1. Loading

%matplotlib inline
import conga
import numpy as np
import scanpy as sc
import pandas as pd
import matplotlib.pyplot as plt

from PIL import Image               # for looking at some of 
from IPython.display import display # the output images.

2. Input data, output paths (these are all output files from the 10X pipeline; you can point them at your own files)

# you might have to change these paths depending on what you want to analyze
gex_datafile = 'conga_example_datasets_v1/vdj_v1_hs_pbmc3_5gex_filtered_gene_bc_matrices_h5.h5'
gex_datatype = '10x_h5' # other possibilities right now: ['10x_mtx', 'h5ad'] (h5ad from scanpy)
tcr_datafile = 'conga_example_datasets_v1/vdj_v1_hs_pbmc3_t_filtered_contig_annotations.csv'
organism = 'human'

clones_file = 'vdj_v1_hs_pbmc3_clones.tsv'
outfile_prefix = 'hs_pbmc3_test1'

3. Setup for conga: make a TCRdist clones_file and compute kernel PCs (this step processes the TCR data)

# this creates the TCRdist 'clones file'
conga.tcrdist.make_10x_clones_file.make_10x_clones_file( tcr_datafile, organism, clones_file )

# this command will create another file with the kernel PCs for subsequent reading by conga
conga.preprocess.make_tcrdist_kernel_pcs_file_from_clones_file( clones_file, organism )

4. Read the data, create a scanpy AnnData object with everything inside (scanpy-based; the functions are already wrapped for us)

adata = conga.preprocess.read_dataset(gex_datafile, gex_datatype, clones_file )

# store the organism info in adata
adata.uns['organism'] = organism

adata

Let's look at what is inside

# CDR3-alpha regions:
adata.obs['cdr3a'].head(3)
AAACCTGAGATCTGAA-1    CAASIGPLGTGTASKLTF
AAACCTGAGGAACTGC-1          CAASDNTDKLIF
AAACCTGAGGAGTCTG-1      CAVEANNAGNNRKLIW

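These are amino acid sequences (note that when analyzing TCRs we also try to work with the translated amino acid sequences wherever possible).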
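
The AnnData object carries both nucleotide columns (`cdr3a_nucseq`, `cdr3b_nucseq`) and amino acid columns (`cdr3a`, `cdr3b`). As a minimal sketch of how a nucleotide CDR3 maps to its amino acid sequence, the snippet below applies the standard genetic code. The helper `translate_cdr3` and the example sequence are hypothetical illustrations and are not part of CoNGA.

```python
# Standard genetic code, codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate_cdr3(nucseq):
    """Translate an in-frame nucleotide sequence into amino acids ('*' = stop)."""
    nucseq = nucseq.upper()
    if len(nucseq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    return "".join(CODON_TABLE[nucseq[i:i + 3]] for i in range(0, len(nucseq), 3))

# Hypothetical example sequence (not taken from the dataset).
example_aa = translate_cdr3("tgtgccagcagc")
print(example_aa)
```

In practice the amino acid columns are already filled in by `read_dataset`, so a helper like this is only useful for checking sequences by hand.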

5. Do some very basic scGEX filtering (this step is optional, but it is usually worth doing; afterwards the following should hold)

adata.X is a dense matrix just containing the highly variable genes, normalized, log1ped, scaled, and ready for running PCA. (This is essentially standard preprocessing.)
The T cell receptor V and J genes have been eliminated from adata.X, so they don't influence the GEX clusters/neighbor graphs. (Or the IG receptor genes, if analyzing BCR data.) (This keeps the transcriptome signal separate from the receptor genes.)
'Antibody' features (ADT/surface protein/pMHC) have been eliminated from adata.X. See the run_conga.py argument --include_protein_features for an approach to including these additional features into the GEX graphs. They can still be present in adata.raw.X; if so they will be included in DEG analyses. (Antibody features must be excluded as well.)
adata.raw.X has been normalized (all rows sum to 10,000) and log1p'ed. (Routine.)
Cells with too few or too many genes, or high mitochondrial expression, have been eliminated. (Routine filtering.)
Doublets have been removed (this requires separate software, which most readers will already be familiar with).

adata = conga.preprocess.filter_and_scale( 
    adata, 
    min_genes_per_cell=200,
    max_genes_per_cell=2500,
    max_percent_mito=0.1,
)
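
To make the checklist above concrete, here is a minimal sketch in plain Python of the cell-level filters followed by normalization to 10,000 counts per cell and log1p. It is not how `filter_and_scale` is implemented internally. The function name, the toy matrix, the `MT-` prefix test for mitochondrial genes, and the reading of `max_percent_mito=0.1` as a fraction are all assumptions made for illustration.

```python
import math

def filter_and_normalize(counts, gene_names, min_genes=200, max_genes=2500,
                         max_mito_fraction=0.1, target_sum=10_000):
    """Filter cells, then normalize each kept cell to target_sum and log1p.

    counts: list of per-cell lists of raw counts (cells x genes).
    Returns (kept_cell_indices, normalized_rows).
    """
    # Assumption: mitochondrial genes are recognized by the 'MT-' prefix.
    mito_cols = [i for i, g in enumerate(gene_names) if g.upper().startswith("MT-")]
    kept, normalized = [], []
    for cell_index, row in enumerate(counts):
        total = sum(row)
        if total == 0:
            continue  # empty cell, nothing to normalize
        n_genes = sum(1 for c in row if c > 0)
        mito_fraction = sum(row[i] for i in mito_cols) / total
        if not (min_genes <= n_genes <= max_genes):
            continue
        if mito_fraction > max_mito_fraction:
            continue
        kept.append(cell_index)
        normalized.append([math.log1p(c * target_sum / total) for c in row])
    return kept, normalized

# Toy data: 3 genes, thresholds lowered so the tiny example passes.
genes = ["CD3E", "CD8A", "MT-CO1"]
raw = [
    [5, 5, 0],   # kept
    [1, 0, 9],   # dropped: 90% mitochondrial
    [0, 0, 0],   # dropped: empty
]
kept, norm = filter_and_normalize(raw, genes, min_genes=1, max_genes=3)
print(kept, norm)
```

Undoing the log1p on a kept row and summing recovers 10,000, which is the property the checklist describes for adata.raw.X.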

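6. Now we reduce to a single cell per TCR clonotype (find the most central cell of each clone)

This is done by computing GEX distances and picking the most representative cell for each clonotype, that is, the cell with the smallest average distance to all the other cells in the clonotype.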

adata = conga.preprocess.reduce_to_single_cell_per_clone(adata)

adata
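
The selection rule described above (smallest average GEX distance to the other cells of the same clonotype) can be sketched as follows. This is an illustration in plain Python with Euclidean distance on made-up coordinates; `pick_representatives` is a hypothetical helper, and the distance CoNGA uses internally may differ.

```python
import math
from collections import defaultdict

def pick_representatives(coords, clone_ids):
    """For each clonotype, return the index of the cell whose average
    Euclidean distance to the other cells of that clonotype is smallest."""
    members = defaultdict(list)
    for cell_index, clone in enumerate(clone_ids):
        members[clone].append(cell_index)

    representatives = {}
    for clone, cells in members.items():
        if len(cells) == 1:
            representatives[clone] = cells[0]
            continue

        def mean_dist(i, cells=cells):
            others = [j for j in cells if j != i]
            return sum(math.dist(coords[i], coords[j]) for j in others) / len(others)

        representatives[clone] = min(cells, key=mean_dist)
    return representatives

# Toy GEX coordinates (for example, two principal components per cell).
coords = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (3.0, 3.0)]
clone_ids = ["cloneA", "cloneA", "cloneA", "cloneB"]
reps = pick_representatives(coords, clone_ids)
print(reps)
```

For `cloneA` the middle cell (index 1) wins because its mean distance to the other two cells is the smallest; a clonotype with a single cell is represented by that cell.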

7. Now that we've reduced to a single cell per clonotype, run clustering and dimensionality reduction for GEX and for TCR (clustering)

adata = conga.preprocess.cluster_and_tsne_and_umap( adata )

These features are stored in adata.obs and in adata.obsm

adata.obs['clusters_gex'] -- GEX clusters
adata.obs['clusters_tcr'] -- TCR clusters
adata.obsm['X_gex_2d'] -- GEX 2D landscape coordinates (from UMAP)
adata.obsm['X_tcr_2d'] -- TCR 2D landscape coordinates (from UMAP)
adata.obsm['X_pca_gex'] -- GEX principal components
adata.obsm['X_pca_tcr'] -- TCRdist kernel principal components(perfect)

adata

AnnData object with n_obs × n_vars = 2896 × 1022
    obs: 'va', 'ja', 'cdr3a', 'cdr3a_nucseq', 'vb', 'jb', 'cdr3b', 'cdr3b_nucseq', 'n_genes', 'percent_mito', 'n_counts', 'clone_sizes', 'gex_variation', 'leiden_gex', 'clusters_gex', 'leiden_tcr', 'clusters_tcr'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'conga_results', 'conga_stats', 'organism', 'log1p', 'hvg', 'raw_matrix_is_logged', 'pca', 'neighbors', 'umap', 'leiden'
    obsm: 'X_pca_tcr', 'X_pca_gex', 'X_umap_gex', 'X_gex_2d', 'X_umap_tcr', 'X_tcr_2d'
    varm: 'PCs'
    layers: 'scaled'
    obsp: 'distances', 'connectivities'

8. Make UMAP landscape plots colored by cluster assignments



plt.figure(figsize=(12,6))
plt.subplot(121)
xy = adata.obsm['X_gex_2d']
clusters = np.array(adata.obs['clusters_gex'])
cmap = plt.get_cmap('tab20')
colors = [ cmap.colors[x % len(cmap.colors)] for x in clusters]  # wrap around if there are more clusters than colors
plt.scatter( xy[:,0], xy[:,1], c=colors)
plt.title('GEX UMAP colored by GEX clusters')

plt.subplot(122)
xy = adata.obsm['X_tcr_2d']
clusters = np.array(adata.obs['clusters_tcr'])
cmap = plt.get_cmap('tab20')
colors = [ cmap.colors[x % len(cmap.colors)] for x in clusters]  # wrap around if there are more clusters than colors
plt.scatter( xy[:,0], xy[:,1], c=colors)
plt.title('TCR UMAP colored by TCR clusters');


9. Compute the GEX and TCR neighbor sets (this is the key step)

The neighbors are stored as GEX,TCR tuples in the all_nbrs dictionary indexed by nbr_frac (neighbor fraction: nbr_frac = 0.1 means K-nearest neighbor graph with K = 0.1*num_clonotypes)
So, to unpack the GEX and TCR neighbor sets for a given value of nbr_frac, do:

nbrs_gex, nbrs_tcr = all_nbrs[nbr_frac]

# these are the nbrhood sizes, as a fraction of the entire dataset:
nbr_fracs = [0.01, 0.1]

# we use this nbrhood size for computing the nndists
nbr_frac_for_nndists = 0.01

all_nbrs, nndists_gex, nndists_tcr = conga.preprocess.calc_nbrs(
    adata, nbr_fracs, also_calc_nndists=True, nbr_frac_for_nndists=nbr_frac_for_nndists)

# stash these in obs array, they are used in a few places...
adata.obs['nndists_gex'] = nndists_gex
adata.obs['nndists_tcr'] = nndists_tcr

conga.preprocess.setup_tcr_cluster_names(adata) #stores in adata.uns
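
To illustrate what a dictionary indexed by nbr_frac looks like, here is a toy sketch that builds K-nearest-neighbor sets with K derived from the neighbor fraction. It uses Euclidean distance on made-up one-dimensional points for both modalities, whereas CoNGA works on GEX principal components and TCRdist. The helper `knn_sets`, the rounding rule `max(1, int(nbr_frac * n))`, and the toy data are assumptions for illustration only.

```python
import math

def knn_sets(points, nbr_frac):
    """Return, for each point, the indices of its K nearest neighbors
    (excluding itself), with K = max(1, int(nbr_frac * number_of_points))."""
    n = len(points)
    k = max(1, int(nbr_frac * n))
    neighbors = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        neighbors.append(others[:k])
    return neighbors

# Toy 1D "GEX" and "TCR" embeddings for 10 clonotypes (made-up numbers).
gex_points = [(float(x),) for x in range(10)]
tcr_points = [(float(x * x),) for x in range(10)]

toy_nbrs = {}
for frac in [0.1, 0.5]:
    toy_nbrs[frac] = (knn_sets(gex_points, frac), knn_sets(tcr_points, frac))

# Unpack the two neighbor sets for one neighbor fraction.
toy_nbrs_gex, toy_nbrs_tcr = toy_nbrs[0.5]
print(toy_nbrs_gex[0], toy_nbrs_tcr[9])
```

With 10 clonotypes, a fraction of 0.1 gives one neighbor per clonotype and 0.5 gives five, which mirrors the K = nbr_frac * num_clonotypes relationship described above.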

10. Run graph-vs-graph analysis (the first analysis)

This will also save the results (a pandas DataFrame) to a tsv file _graph_vs_graph.tsv and store them in the adata.uns['conga_results'] dictionary.

results = conga.correlations.run_graph_vs_graph(
    adata, all_nbrs, outfile_prefix=outfile_prefix)

results.head()


conga results are stored in adata.uns['conga_results']

adata.uns['conga_results'].keys()

dict_keys(['graph_vs_graph', 'graph_vs_graph_help'])

# another way of getting the graph-vs-graph results DataFrame:
# conga.tags.GRAPH_VS_GRAPH = 'graph_vs_graph'
results = adata.uns['conga_results'][conga.tags.GRAPH_VS_GRAPH]
results.head()


Some potentially useful statistics are stored in `adata.uns['conga_stats']`


Make a scatterplot colored by conga score

#put the conga hits on top
conga_scores = adata.obs['conga_scores']
colors = np.sqrt(np.maximum(-1*np.log10(conga_scores),0.0))
reorder = np.argsort(colors)

plt.figure(figsize=(12,6))
plt.subplot(121)
xy = adata.obsm['X_gex_2d']
plt.scatter( xy[reorder,0], xy[reorder,1], c=colors[reorder], vmin=0, vmax=np.sqrt(5))
plt.title('GEX UMAP colored by conga score')

plt.subplot(122)
xy = adata.obsm['X_tcr_2d']
plt.scatter( xy[reorder,0], xy[reorder,1], c=colors[reorder], vmin=0, vmax=np.sqrt(5))
plt.title('TCR UMAP colored by conga score');


Make the CoNGA cluster logos plot (this is the main result)

This plot summarizes the CoNGA clusters of size at least 5 (min_cluster_size). Some useful options:

gex_header_genes: list of gene names for the thumbnail UMAP plots above the logos. Default is conga.plotting.default_gex_header_genes[organism]
gex_header_tcr_score_names: list of tcr scores to show in the thumbnail UMAP plots. Default is ['imhc', 'cdr3len', 'cd8', 'nndists_tcr']
logo_genes: python list of gene names for the 'GEX logo' shown on the right. Should have length = logo_gene_width*3 - 2. Default is conga.plotting.default_logo_genes[organism]
logo_gene_width: width of the 'GEX logo' (default is 6).
Note: These arguments also work for the conga.plotting.make_tcr_clumping_plots function below.

nbrs_gex, nbrs_tcr = all_nbrs[0.1]

min_cluster_size = 5

conga.plotting.make_graph_vs_graph_logos(
    adata,
    outfile_prefix,
    min_cluster_size,
    nbrs_gex,
    nbrs_tcr,
)

# or equivalently:
#
# conga.plotting.make_graph_vs_graph_logos(
#     adata,
#     outfile_prefix,
#     min_cluster_size,
#     nbrs_gex,
#     nbrs_tcr,
#     gex_header_genes = ['clone_sizes','CD4','CD8A','CD8B','SELL','GNLY','GZMA','CCL5','ZNF683','IKZF2','PDCD1','KLRB1'],
#     gex_header_tcr_score_names = ['imhc', 'cdr3len', 'cd8', 'nndists_tcr'],
#     logo_genes = ['CD4','CD8A','CD8B','CCR7','SELL','GNLY','PRF1','GZMA','IL7R','IKZF2','KLRD1','CCL5','ZNF683','KLRB1','NKG7','HLA-DRB1' ],
#     gene_logo_width = 6,
# )


The image filenames are stored in adata.uns['conga_results']. This comes in handy for the html-summary generating code (see the end of this notebook).

from PIL import Image               # for looking at some of 
from IPython.display import display # the output images.

tag = conga.tags.GRAPH_VS_GRAPH_LOGOS
pngfile = adata.uns['conga_results'][tag]
help_message = adata.uns['conga_results'][tag+'_help']

print(help_message)
image = Image.open(pngfile)
display(image)


Saving the image in the usual way is all that is needed.

11. Run graph-vs-feature analysis (the second analysis), comparing the TCR graph to GEX features (mostly gene expression) and the GEX graph to TCR features (CDR3 features and V/J gene segments)

First run the analysis, stashing the results in `adata.uns['conga_results']` and saving tables as `.tsv` files (since we pass in `outfile_prefix`)

conga.correlations.run_graph_vs_features(
    adata, all_nbrs, outfile_prefix=outfile_prefix)

After running the analysis we can make the graph-vs-features plots

conga.plotting.make_graph_vs_features_plots(
    adata, all_nbrs, outfile_prefix)


12. TCR matching to database

Here we match the TCRs in the dataset to a database of paired TCR sequences of known epitope specificity (or to a user-provided database using the db_tcrs_tsvfile argument). The default database is conga/data/new_paired_tcr_db_for_matching_nr.tsv; see the readme file conga/data/new_paired_tcr_db_for_matching_nr_README.txt for information about where these TCR sequences come from. Major contributors are the VDJdb database, the McPAS database, the 10x 200k dataset, and the Dash et al TCRdist dataset. (We only take a brief look at this part.)

match_results = conga.tcr_clumping.match_adata_tcrs_to_db_tcrs(
    adata, num_random_samples_for_bg_freqs=50000)

match_results.head()


TCR clumping

This analysis identifies TCR neighborhoods that are overpopulated relative to a background expectation based on a simple model of V(D)J recombination. These TCR clumps or clusters may represent epitope-specific responses, invariant T cell populations, or other potentially interesting T cell subsets. This analysis does not directly use the GEX information, although we can subsequently look at GEX features of the identified clusters (see the tcr clumping logo plots below)

results = conga.tcr_clumping.assess_tcr_clumping(
    adata,
    outfile_prefix= outfile_prefix, # will save results as .tsv file
)

results.head()


Make the TCR clumping plots

nbrs_gex, nbrs_tcr = all_nbrs[ max(nbr_fracs) ]

# now call plotting function, after results are stashed in adata
conga.plotting.make_tcr_clumping_plots(
    adata,
    nbrs_gex,
    nbrs_tcr,
    outfile_prefix,
    )


13. Graph-vs-features analysis using the Yosef lab's HotSpot algorithm

DeTomaso, D., & Yosef, N. (2021). Hotspot identifies informative gene modules across modalities of single-cell genomics. Cell Systems, 12(5), 446–456.e9.

Here we use the HotSpot algorithm to identify TCR features that correlate with the GEX neighbor graph, or GEX features (expression of individual genes) that correlate with the TCR neighbor graph. The basic idea is that if two clonotypes are neighbors in the graph, then their feature values are correlated. This can be more sensitive than our original graph-vs-features analysis, which looks at the feature score distribution in each clonotype's neighborhood individually, rather than across the entire graph (using the HotSpot H-statistic). On the other hand, features that are elevated in a very small subpopulation may get a more significant score in the original CoNGA graph-vs-features analysis. So it's worth looking at both analyses.

This is a 'beta' version that does not incorporate all the cleverness in the HotSpot algorithm. We find that neighbor graphs with small neighborhoods (e.g., nbr_fracs < 0.02) sometimes give less interpretable/robust results. So here we are using only the 0.1 nbr_frac (ie, the K nearest neighbor graph with K = 0.1 * num_clonotypes).

results = conga.correlations.find_hotspots_wrapper(
    adata, all_nbrs, nbr_fracs = [0.1], outfile_prefix=outfile_prefix)

results.head(10)
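
The idea that neighboring clonotypes should have correlated feature values can be illustrated with a much simplified graph autocorrelation score. The sketch below is not the HotSpot H-statistic and not what `find_hotspots_wrapper` computes: it sums products of standardized feature values over neighbor pairs with equal weights and uses a permutation null, both of which are assumptions chosen to keep the example short. All names and data are hypothetical.

```python
import random
import statistics

def autocorrelation_stat(values, neighbors):
    """Sum of z_i * z_j over all directed neighbor pairs (i, j)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    z = [(v - mean) / sd for v in values]
    return sum(z[i] * z[j] for i, nbrs in enumerate(neighbors) for j in nbrs)

def permutation_pvalue(values, neighbors, num_permutations=999, seed=0):
    """Fraction of shuffled feature vectors scoring at least as high as observed."""
    rng = random.Random(seed)
    observed = autocorrelation_stat(values, neighbors)
    shuffled = list(values)
    hits = 0
    for _ in range(num_permutations):
        rng.shuffle(shuffled)
        if autocorrelation_stat(shuffled, neighbors) >= observed:
            hits += 1
    return (hits + 1) / (num_permutations + 1)

# Ring graph over 12 clonotypes: each node neighbors the previous and next node.
n = 12
ring = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
smooth = [1.0] * 6 + [0.0] * 6     # feature changes slowly along the graph
alternating = [1.0, 0.0] * 6       # feature flips between every pair of neighbors
h_smooth = autocorrelation_stat(smooth, ring)
h_alternating = autocorrelation_stat(alternating, ring)
p_smooth = permutation_pvalue(smooth, ring)
print(h_smooth, h_alternating, p_smooth)
```

A feature that varies smoothly over the graph gets a high score and a small permutation P value, while a feature that alternates between neighbors gets a strongly negative score.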


conga.plotting.make_hotspot_plots(
    adata, all_nbrs, outfile_prefix,
    )


CoNGA can generate a summary .html file using the results in adata.uns['conga_results'] (this produces an HTML report).

html_file = outfile_prefix+'_results_summary.html'
conga.plotting.make_html_summary(adata, html_file)

Display the summary html output here in the notebook

from IPython.display import display, HTML

# read the whole report; the with-block closes the file afterwards
with open(html_file, 'r') as f:
    html_lines = f.read()

display(HTML(html_lines))

Finally, a summary of the result tables and figures

1. graph_vs_graph

Graph vs graph analysis looks for correlation between GEX and TCR space by finding statistically significant overlap between two similarity graphs, one defined by GEX similarity and one by TCR sequence similarity.

Overlap is defined one node (clonotype) at a time by looking for overlap between that node's neighbors in the GEX graph and its neighbors in the TCR graph. The null model is that the two neighbor sets are chosen independently at random.

CoNGA looks at two kinds of graphs: K nearest neighbor (KNN) graphs, where K = neighborhood size is specified as a fraction of the number of clonotypes (defaults for K are 0.01 and 0.1), and cluster graphs, where each clonotype is connected to all the other clonotypes in the same (GEX or TCR) cluster. Overlaps are computed 3 ways (GEX KNN vs TCR KNN, GEX KNN vs TCR cluster, and GEX cluster vs TCR KNN), for each of the K values (called nbr_fracs, short for neighbor fractions). (All of this was covered in the earlier posts.)

Columns (depend slightly on whether hit is KNN v KNN or KNN v cluster):

conga_score = P value for GEX/TCR overlap * number of clonotypes
mait_fraction = fraction of the overlap made up of 'invariant' T cells
num_neighbors* = size of neighborhood (K)
cluster_size = size of cluster (for KNN v cluster graph overlaps)
clone_index = 0-index of clonotype in adata object
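
Under the null model described above, where the two neighbor sets are drawn independently at random, the size of their overlap follows a hypergeometric distribution. The sketch below computes the upper-tail P value for one clonotype and multiplies it by the number of clonotypes, following the conga_score definition in the column list. The function names and the choice to exclude the node itself from the population are assumptions; CoNGA's own implementation may handle details differently.

```python
from math import comb

def overlap_pvalue(overlap, num_gex_nbrs, num_tcr_nbrs, num_clonotypes):
    """P(overlap >= observed) when the two neighbor sets are drawn independently
    at random from the other (num_clonotypes - 1) clonotypes."""
    population = num_clonotypes - 1   # assumption: the node itself is excluded
    max_overlap = min(num_gex_nbrs, num_tcr_nbrs)
    denom = comb(population, num_tcr_nbrs)
    tail = sum(
        comb(num_gex_nbrs, k) * comb(population - num_gex_nbrs, num_tcr_nbrs - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / denom

def toy_conga_score(gex_nbrs, tcr_nbrs, num_clonotypes):
    """Overlap P value for one clonotype, scaled by the number of clonotypes."""
    overlap = len(set(gex_nbrs) & set(tcr_nbrs))
    p = overlap_pvalue(overlap, len(gex_nbrs), len(tcr_nbrs), num_clonotypes)
    return p * num_clonotypes

# Toy example: 101 clonotypes, 10 neighbors in each graph, 5 of them shared.
score_shared = toy_conga_score(range(1, 11), range(6, 16), 101)
score_disjoint = toy_conga_score([1, 2, 3], [4, 5, 6], 11)
print(score_shared, score_disjoint)
```

A clonotype whose two neighbor sets share nothing gets a P value of 1 and therefore a score equal to the number of clonotypes, while a large shared fraction drives the score far below 1.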


2. graph_vs_graph_logos

This figure summarizes the results of a CoNGA analysis that produces scores (CoNGA) and clusters. At the top are six 2D UMAP projections of clonotypes in the dataset based on GEX similarity (top left three panels) and TCR similarity (top right three panels), colored from left to right by GEX cluster assignment; CoNGA score; joint GEX:TCR cluster assignment for clonotypes with significant CoNGA scores, using a bicolored disk whose left half indicates GEX cluster and whose right half indicates TCR cluster; TCR cluster; CoNGA; GEX:TCR cluster assignments for CoNGA hits, as in the third panel.

Below are two rows of GEX landscape plots colored by (first row, left) expression of selected marker genes, (second row, left) Z-score normalized and GEX-neighborhood averaged expression of the same marker genes, and (both rows, right) TCR sequence features (see CoNGA manuscript Table S3 for TCR feature descriptions).

GEX and TCR sequence features of CoNGA hits in clusters with 5 or more hits are summarized by a series of logo-style visualizations, from left to right: differentially expressed genes (DEGs); TCR sequence logos showing the V and J gene usage and CDR3 sequences for the TCR alpha and beta chains; biased TCR sequence scores, with red indicating elevated scores and blue indicating decreased scores relative to the rest of the dataset (see CoNGA manuscript Table S3 for score definitions); GEX 'logos' for each cluster consisting of a panel of marker genes shown with red disks colored by mean expression and sized according to the fraction of cells expressing the gene (gene names are given above).

DEG and TCRseq sequence logos are scaled by the adjusted P value of the associations, with full logo height requiring a top adjusted P value below 10^-6. DEGs with fold-change less than 2 are shown in gray. Each cluster is indicated by a bicolored disk colored according to GEX cluster (left half) and TCR cluster (right half). The two numbers above each disk show the number of hits within the cluster (on the left) and the total number of cells in those clonotypes (on the right). The dendrogram at the left shows similarity relationships among the clusters based on connections in the GEX and TCR neighbor graphs.

The choice of which marker genes to use for the GEX umap panels and for the cluster GEX logos can be configured using run_conga.py command line flags or arguments to the conga.plotting.make_logo_plots function.

3. tcr_clumping

This table stores the results of the TCR "clumping" analysis, which looks for neighborhoods in TCR space with more TCRs than expected by chance under a simple null model of VDJ rearrangement.

For each TCR in the dataset, we count how many TCRs are within a set of fixed TCRdist radii (defaults: 24,48,72,96), and compare that number to the expected number given the size of the dataset using the poisson model. Inspired by the ALICE and TCRnet methods.
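
The comparison of an observed neighbor count with a Poisson expectation can be sketched as below. The distances and the expected counts per radius are made-up numbers; in CoNGA the expectation comes from a background model of V(D)J rearrangement, which is not reproduced here. The helper names and the use of `<=` for "within a radius" are assumptions.

```python
import math

def poisson_sf(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    term = math.exp(-expected)   # P(X = 0)
    cdf = 0.0
    for k in range(observed):    # accumulate P(X <= observed - 1)
        cdf += term
        term *= expected / (k + 1)
    return max(0.0, 1.0 - cdf)

def clumping_pvalues(dists_to_others, expected_by_radius):
    """dists_to_others: TCRdist values from one TCR to every other TCR.
    expected_by_radius: {radius: expected neighbor count under the null}.
    Returns {radius: (num_nbrs, pvalue)}."""
    result = {}
    for radius, expected in expected_by_radius.items():
        num_nbrs = sum(1 for d in dists_to_others if d <= radius)
        result[radius] = (num_nbrs, poisson_sf(num_nbrs, expected))
    return result

# Made-up distances and made-up background expectations, for illustration only.
toy_dists = [12, 20, 30, 50, 80, 120, 150, 200]
toy_expected = {24: 0.1, 48: 0.5, 72: 1.5, 96: 3.0}
toy_result = clumping_pvalues(toy_dists, toy_expected)
for radius, (num_nbrs, pvalue) in toy_result.items():
    print(radius, num_nbrs, pvalue)
```

Two neighbors within radius 24 when only 0.1 are expected gives a small P value, whereas five neighbors within radius 96 against an expectation of 3 is unremarkable.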

Columns: clump_type='global' unless we are optionally looking for TCR clumps within the individual GEX clusters num_nbrs = neighborhood size (number of other TCRs with TCRdist


4. tcr_clumping_logos

This figure summarizes the results of a CoNGA analysis that produces scores (TCR clumping) and clusters. At the top are six 2D UMAP projections of clonotypes in the dataset based on GEX similarity (top left three panels) and TCR similarity (top right three panels), colored from left to right by GEX cluster assignment; TCR clumping score; joint GEX:TCR cluster assignment for clonotypes with significant TCR clumping scores, using a bicolored disk whose left half indicates GEX cluster and whose right half indicates TCR cluster; TCR cluster; TCR clumping; GEX:TCR cluster assignments for TCR clumping hits, as in the third panel.

Below are two rows of GEX landscape plots colored by (first row, left) expression of selected marker genes, (second row, left) Z-score normalized and GEX-neighborhood averaged expression of the same marker genes, and (both rows, right) TCR sequence features (see CoNGA manuscript Table S3 for TCR feature descriptions).

GEX and TCR sequence features of TCR clumping hits in clusters with 3 or more hits are summarized by a series of logo-style visualizations, from left to right: differentially expressed genes (DEGs); TCR sequence logos showing the V and J gene usage and CDR3 sequences for the TCR alpha and beta chains; biased TCR sequence scores, with red indicating elevated scores and blue indicating decreased scores relative to the rest of the dataset (see CoNGA manuscript Table S3 for score definitions); GEX 'logos' for each cluster consisting of a panel of marker genes shown with red disks colored by mean expression and sized according to the fraction of cells expressing the gene (gene names are given above).

DEG and TCRseq sequence logos are scaled by the adjusted P value of the associations, with full logo height requiring a top adjusted P value below 10^-6. DEGs with fold-change less than 2 are shown in gray. Each cluster is indicated by a bicolored disk colored according to GEX cluster (left half) and TCR cluster (right half). The two numbers above each disk show the number of hits within the cluster (on the left) and the total number of cells in those clonotypes (on the right). The dendrogram at the left shows similarity relationships among the clusters based on connections in the GEX and TCR neighbor graphs.

The choice of which marker genes to use for the GEX umap panels and for the cluster GEX logos can be configured using run_conga.py command line flags or arguments to the conga.plotting.make_logo_plots function.

5. tcr_db_match

This table stores significant matches between TCRs in adata and TCRs in the file conga/conga/data/new_paired_tcr_db_for_matching_nr.tsv

P values of matches are assigned by turning the raw TCRdist score into a P value based on a model of the V(D)J rearrangement process, so matches between TCRs that are very far from germline (for example) are assigned a higher significance.

Columns:

tcrdist: TCRdist distance between the two TCRs (adata query and db hit)
pvalue_adj: raw P value of the match * num query TCRs * num db TCRs
fdr_value: Benjamini-Hochberg FDR value for match
clone_index: index within adata of the query TCR clonotype
db_index: index of the hit in the database being matched
va,ja,cdr3a,vb,jb,cdr3b
db_XXX: where XXX is a field in the literature database
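
The two multiple-testing columns above can be illustrated with a short sketch: the adjusted P value multiplies the raw P value by the number of query and database TCRs, and the FDR follows the Benjamini-Hochberg procedure. The function name and toy numbers are hypothetical, and the set of tests over which CoNGA applies Benjamini-Hochberg may differ from the simple list used here.

```python
def adjust_and_fdr(raw_pvalues, num_query_tcrs, num_db_tcrs):
    """Return (pvalue_adj, fdr) lists in the input order."""
    n = len(raw_pvalues)
    pvalue_adj = [p * num_query_tcrs * num_db_tcrs for p in raw_pvalues]

    # Benjamini-Hochberg: rank the P values ascending, scale each by n / rank,
    # then enforce monotonicity walking down from the largest P value.
    order = sorted(range(n), key=lambda i: raw_pvalues[i])
    fdr = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, raw_pvalues[i] * n / rank)
        fdr[i] = running_min
    return pvalue_adj, fdr

# Made-up raw P values for four matches.
toy_p = [0.001, 0.04, 0.03, 0.5]
toy_adj, toy_fdr = adjust_and_fdr(toy_p, num_query_tcrs=10, num_db_tcrs=20)
print(toy_adj)
print(toy_fdr)
```

The multiplied P value is the more conservative of the two, which is why a table can contain matches with a small fdr_value but a pvalue_adj above 1.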


6. tcr_graph_vs_gex_features

This table has results from a graph-vs-features analysis in which we look for genes that are differentially expressed (elevated) in specific neighborhoods of the TCR neighbor graph. Differential expression is assessed by a ttest first, for speed, and then by a mannwhitneyu test for nbrhood/score combinations whose ttest P-value passes an initial threshold (default is 10* the pvalue threshold).

Each row of the table represents a single significant association, in other words a neighborhood (defined by the central clonotype index) and a gene.

The columns are as follows:

ttest_pvalue_adj = ttest_pvalue * number of comparisons
mwu_pvalue_adj = mannwhitney-U P-value * number of comparisons
log2enr = log2 fold change of gene in neighborhood (will be positive)
gex_cluster = the consensus GEX cluster of the clonotypes w/ biased scores
tcr_cluster = the consensus TCR cluster of the clonotypes w/ biased scores
num_fg = the number of clonotypes in the neighborhood (including center)
mean_fg = the mean value of the feature in the neighborhood
mean_bg = the mean value of the feature outside the neighborhood
feature = the name of the gene
mait_fraction = the fraction of the skewed clonotypes that have an invariant TCR
clone_index = the index in the anndata dataset of the clonotype that is the center of the neighborhood.
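
As a rough illustration of the num_fg, mean_fg, mean_bg and log2enr columns, the sketch below summarizes one feature over one neighborhood (the central clonotype plus its neighbors) against all remaining clonotypes. The helper name, the toy values, and in particular the pseudocount in the fold change are assumptions; the exact formula CoNGA uses for log2enr is not shown in this post.

```python
import math

def neighborhood_summary(values, center, neighbors, pseudocount=0.01):
    """Summarize one feature over one neighborhood of the graph.

    values: per-clonotype feature values (for example, expression of one gene).
    center: index of the central clonotype; neighbors: its neighbor indices.
    """
    foreground = {center, *neighbors}
    fg = [values[i] for i in foreground]
    bg = [v for i, v in enumerate(values) if i not in foreground]
    mean_fg = sum(fg) / len(fg)
    mean_bg = sum(bg) / len(bg)
    # Assumption: a small pseudocount avoids division by zero.
    log2enr = math.log2((mean_fg + pseudocount) / (mean_bg + pseudocount))
    return {"num_fg": len(fg), "mean_fg": mean_fg,
            "mean_bg": mean_bg, "log2enr": log2enr}

# Made-up expression values for eight clonotypes.
toy_values = [4.0, 3.0, 5.0, 0.0, 1.0, 0.0, 1.0, 0.0]
toy_summary = neighborhood_summary(toy_values, center=0, neighbors=[1, 2])
print(toy_summary)
```

Here the neighborhood of clonotype 0 has a mean of 4.0 against a background mean of 0.4, so the enrichment is positive, matching the note that log2enr will be positive for reported hits.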


7. tcr_genes_vs_gex_features

This table has results from a graph-vs-features analysis in which we look for genes that are differentially expressed (elevated) in specific neighborhoods of the TCR neighbor graph. Differential expression is assessed by a ttest first, for speed, and then by a mannwhitneyu test for nbrhood/score combinations whose ttest P-value passes an initial threshold (default is 10* the pvalue threshold).

Each row of the table represents a single significant association, in other words a neighborhood (defined by the central clonotype index) and a gene.

The columns are as follows:

ttest_pvalue_adj = ttest_pvalue * number of comparisons
mwu_pvalue_adj = mannwhitney-U P-value * number of comparisons
log2enr = log2 fold change of gene in neighborhood (will be positive)
gex_cluster = the consensus GEX cluster of the clonotypes w/ biased scores
tcr_cluster = the consensus TCR cluster of the clonotypes w/ biased scores
num_fg = the number of clonotypes in the neighborhood (including center)
mean_fg = the mean value of the feature in the neighborhood
mean_bg = the mean value of the feature outside the neighborhood
feature = the name of the gene
mait_fraction = the fraction of the skewed clonotypes that have an invariant TCR
clone_index = the index in the anndata dataset of the clonotype that is the center of the neighborhood.

In this analysis the TCR graph is defined by connecting all clonotypes that have the same VA/JA/VB/JB-gene segment (it's run four times, once with each gene segment type)
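
A graph in which clonotypes are connected whenever they share a gene segment can be sketched as follows. The helper `same_label_neighbors` and the gene assignments are hypothetical; the same construction would also describe the cluster graphs mentioned for graph_vs_graph, with cluster labels in place of gene names.

```python
from collections import defaultdict

def same_label_neighbors(labels):
    """Connect every clonotype to all other clonotypes sharing its label."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    return [[j for j in groups[label] if j != index]
            for index, label in enumerate(labels)]

# Hypothetical V-alpha gene assignments for six clonotypes.
toy_va = ["TRAV1-2", "TRAV12-1", "TRAV1-2", "TRAV21", "TRAV12-1", "TRAV1-2"]
toy_va_graph = same_label_neighbors(toy_va)
print(toy_va_graph)
```

Running this once per segment type (VA, JA, VB, JB) gives four such graphs, which corresponds to the four runs mentioned above; a clonotype with a unique gene has no neighbors in that graph.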


8. gex_graph_vs_tcr_features

This table has results from a graph-vs-features analysis in which we look at the distribution of a set of TCR-defined features over the GEX neighbor graph. We look for neighborhoods in the graph that have biased score distributions, as assessed by a ttest first, for speed, and then by a mannwhitneyu test for nbrhood/score combinations whose ttest P-value passes an initial threshold (default is 10* the pvalue threshold).

Each row of the table represents a single significant association, in other words a neighborhood (defined by the central clonotype index) and a tcr feature.

The columns are as follows:

ttest_pvalue_adj = ttest_pvalue * number of comparisons
ttest_stat = ttest statistic (sign indicates whether the feature is up or down)
mwu_pvalue_adj = mannwhitney-U P-value * number of comparisons
gex_cluster = the consensus GEX cluster of the clonotypes w/ biased scores
tcr_cluster = the consensus TCR cluster of the clonotypes w/ biased scores
num_fg = the number of clonotypes in the neighborhood (including center)
mean_fg = the mean value of the feature in the neighborhood
mean_bg = the mean value of the feature outside the neighborhood
feature = the name of the TCR score
mait_fraction = the fraction of the skewed clonotypes that have an invariant TCR
clone_index = the index in the anndata dataset of the clonotype that is the center of the neighborhood.


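This is genuinely difficult material. It is hard to absorb everything in one pass, so take your time and go through it a few times.

Life is good, and better with you.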
