10X Single-Cell (10X Spatial Transcriptomics) TCR Data Analysis with TCRdist3 (4)

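Earlier posts covered many of the basic concepts and algorithms, but as you may have noticed, those analyses were limited to αβ TCR sequences, which is a real restriction. Today we extend things a little: alongside an optimized TCR distance, tcrdist3 supports the analysis of γδ TCRs and at-scale computation with sparse data representations and parallelized, byte-compiled code. The reference paper is "TCR meta-clonotypes for biomarker discovery with tcrdist3: identification of public, HLA restricted SARS-CoV-2 associated TCR features". It bridges the earlier work and what comes next, and there is a great deal of material in this series.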

Let's look at the analysis framework.

1. Experimental antigen enrichment reveals TCRs with biochemically similar neighbors

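Searching for identical TCRs within a repertoire, arising either from clonal expansion or from convergent nucleotide encoding of amino acids in the CDR3, is a common strategy for identifying functionally important receptors (and it is also the only practical one). However, without an experimental enrichment procedure, it is rare to observe T cells with identical amino-acid TCR sequences in a large sample. For example, among 10,000 β-chain TCRs from an umbilical cord blood sample, fewer than 1% of TCR amino-acid sequences were observed more than once, and that figure includes possible clonal expansions. (Disease does drive the specific expansion of TCRs, which is the core of this line of research.)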
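To make the "observed more than once" statistic concrete, here is a minimal sketch that counts how many unique amino-acid sequences appear at least twice. The function name and the toy sequences are our own illustration and are not taken from the paper or from tcrdist3.

```python
from collections import Counter

def fraction_seen_more_than_once(cdr3_aa_seqs):
    """Fraction of unique amino-acid sequences observed at least twice."""
    counts = Counter(cdr3_aa_seqs)
    if not counts:
        return 0.0
    repeated = sum(1 for n in counts.values() if n > 1)
    return repeated / len(counts)

# Toy, made-up CDR3 beta sequences (illustration only, not real data).
toy_repertoire = [
    "CASSLGQAYEQYF",
    "CASSIRSSYEQYF",
    "CASSLGQAYEQYF",   # exact repeat of the first sequence
    "CASSPGTGGYEQYF",
    "CASRDRGNTEAFF",
]

print(fraction_seen_more_than_once(toy_repertoire))
```

In a real unenriched repertoire this fraction is expected to be very small, which is why exact matching alone finds so little.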

Figure: TCR repertoire subsets obtained by single-cell sorting with peptide-MHC tetramers.

2. TCR biochemical neighborhood density is heterogeneous in antigen-enriched repertoires

We next investigated the proportion of unique TCRs with at least one biochemically similar neighbor among TCRs with the same putative antigen specificity. We and others have shown that a single peptide-MHC epitope is often recognized by many distinct TCRs with closely related amino acid sequences. (The TCRs that recognize one antigen are diverse, a many-to-one relationship, which complicates matters.) This is why we have to look at the similarity between sequences (the TCR instances mentioned in earlier posts) to find what they have in common. We observed the highest density neighborhoods within repertoires that were sorted based on peptide-MHC tetramer binding (the effect of enrichment is evidently strong). These observations suggest that biochemical neighborhood density is highly heterogeneous among TCRs and that it may depend on mechanisms of antigen recognition as well as receptor V(D)J recombination biases. (Given this heterogeneity, it is a hard thing to study.)
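The "proportion of unique TCRs with at least one neighbor" can be computed from any square pairwise distance matrix. The sketch below uses a small made-up matrix; the function name and the radius values are assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

def fraction_with_neighbor(pw, radius):
    """Fraction of TCRs with at least one OTHER TCR at distance <= radius."""
    pw = np.asarray(pw)
    within = pw <= radius             # new boolean array; pw is untouched
    np.fill_diagonal(within, False)   # a TCR is not its own neighbor
    return within.any(axis=1).mean()

# Toy symmetric distance matrix for four TCRs (made-up numbers).
toy_pw = np.array([
    [0,   12,  90,  150],
    [12,  0,   96,  140],
    [90,  96,  0,   130],
    [150, 140, 130, 0],
])

print(fraction_with_neighbor(toy_pw, radius=24))
print(fraction_with_neighbor(toy_pw, radius=100))
```

With a real repertoire, the same function could be applied to a chain-level matrix such as the ones tcrdist3 produces, and comparing the result across repertoires is one way to see the heterogeneity described above.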

3. Meta-clonotype radius can be tuned to balance a biomarker's sensitivity and specificity

The utility of a TCR-based biomarker depends on the antigen specificity of the TCRs. A key constraint on distance-based clustering is the presence of similar TCR sequences that may lack the ability to recognize the target antigen (put plainly, we need to define a similarity radius). To be useful, a meta-clonotype definition should be broad enough to capture multiple biochemically similar TCRs with shared antigen recognition, but not so broad as to include a high proportion of non-specific TCRs, which might be found in unenriched background repertoires that are largely antigen-naïve (the radius has to be the right size). The difficulty is that the density of similar TCR "neighbors" is heterogeneous.

An ideal radius-defined meta-clonotype would include a high density of TCRs in antigen-experienced individuals, indicative of shared antigen specificity, yet a low density of TCRs among an antigen-naïve background. The next step is to look for antigen-specific TCR sequences. Let's look at the analysis code (TCRdist3).
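The trade-off behind the radius can be sketched by counting, for one centroid TCR, how many TCRs fall within each candidate radius in an antigen-enriched set versus a background set. Everything below (function name, distances, radii) is a made-up illustration and is not the procedure tcrdist3 uses to choose a radius.

```python
import numpy as np

def hits_by_radius(dist_to_enriched, dist_to_background, radii):
    """For one centroid, count neighbors within each radius in both sets."""
    enriched = np.asarray(dist_to_enriched)
    background = np.asarray(dist_to_background)
    rows = []
    for r in radii:
        n_enriched = int((enriched <= r).sum())
        n_background = int((background <= r).sum())
        rows.append((r, n_enriched, n_background))
    return rows

# Toy distances from a single centroid TCR (made-up numbers).
# The 0 in the enriched list is the centroid itself.
toy_enriched = [0, 6, 12, 18, 30, 60]
toy_background = [24, 36, 48, 48, 72, 96, 120, 150]

table = hits_by_radius(toy_enriched, toy_background, radii=[12, 24, 48])
for radius, n_enriched, n_background in table:
    print(radius, n_enriched, n_background)
```

In this toy case a small radius captures several enriched TCRs with no background hits, while a large radius gains little on the enriched side and picks up many background TCRs. That is the pattern a well-chosen radius tries to avoid.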

Part 1 of the code: TCRdist

First, the input data format, which closely resembles the output of our 10X analysis.

[Figure: input data format]

Now let's walk through the code.

Default parameters

"""
If you just want 'tcrdistances' using the pre-set default settings.

    You can access distance matrices:
        tr.pw_alpha     - alpha chain pairwise distance matrix
        tr.pw_beta      - beta chain pairwise distance matrix
        tr.pw_cdr3_a_aa - cdr3 alpha chain distance matrix
        tr.pw_cdr3_b_aa - cdr3 beta chain distance matrix
"""
import pandas as pd
from tcrdist.repertoire import TCRrep

df = pd.read_csv("dash.csv")
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv')

tr.pw_alpha
tr.pw_beta
tr.pw_cdr3_a_aa
tr.pw_cdr3_b_aa

Changing one default parameter

"""  
If you want 'tcrdistances' with changes over some parameters.
For instance you want to change the gap penalty on CDR3s to 5. 
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep

df = pd.read_csv("dash.csv")
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            compute_distances = False,
            db_file = 'alphabeta_gammadelta_db.tsv')

tr.kargs_a['cdr3_a_aa']['gap_penalty'] = 5 
tr.kargs_b['cdr3_b_aa']['gap_penalty'] = 5 

tr.compute_distances()

tr.pw_alpha
tr.pw_beta

Taking full control of the distance computation (this demands more coding skill)

"""
If you want 'tcrdistances' AND you want control over EVERY parameter.
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep

df = pd.read_csv("dash.csv")
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            compute_distances = False,
            db_file = 'alphabeta_gammadelta_db.tsv')

metrics_a = {
    "cdr3_a_aa" : pw.metrics.nb_vector_tcrdist,
    "pmhc_a_aa" : pw.metrics.nb_vector_tcrdist,
    "cdr2_a_aa" : pw.metrics.nb_vector_tcrdist,
    "cdr1_a_aa" : pw.metrics.nb_vector_tcrdist}

metrics_b = {
    "cdr3_b_aa" : pw.metrics.nb_vector_tcrdist,
    "pmhc_b_aa" : pw.metrics.nb_vector_tcrdist,
    "cdr2_b_aa" : pw.metrics.nb_vector_tcrdist,
    "cdr1_b_aa" : pw.metrics.nb_vector_tcrdist }

weights_a= { 
    "cdr3_a_aa" : 3,
    "pmhc_a_aa" : 1,
    "cdr2_a_aa" : 1,
    "cdr1_a_aa" : 1}

weights_b = { 
    "cdr3_b_aa" : 3,
    "pmhc_b_aa" : 1,
    "cdr2_b_aa" : 1,
    "cdr1_b_aa" : 1}

kargs_a = {  
    'cdr3_a_aa' : 
        {'use_numba': True, 
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 
        'dist_weight': 1, 
        'gap_penalty':4, 
        'ntrim':3, 
        'ctrim':2, 
        'fixed_gappos': False},
    'pmhc_a_aa' : {
        'use_numba': True,
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
        'dist_weight':1,
        'gap_penalty':4,
        'ntrim':0,
        'ctrim':0,
        'fixed_gappos':True},
    'cdr2_a_aa' : {
        'use_numba': True,
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
        'dist_weight': 1,
        'gap_penalty':4,
        'ntrim':0,
        'ctrim':0,
        'fixed_gappos':True},
    'cdr1_a_aa' : {
        'use_numba': True,
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
        'dist_weight':1,
        'gap_penalty':4,
        'ntrim':0,
        'ctrim':0,
        'fixed_gappos':True}
    }
kargs_b= {  
    'cdr3_b_aa' : 
        {'use_numba': True, 
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 
        'dist_weight': 1, 
        'gap_penalty':4, 
        'ntrim':3, 
        'ctrim':2, 
        'fixed_gappos': False},
    'pmhc_b_aa' : {
        'use_numba': True,
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
        'dist_weight': 1,
        'gap_penalty':4,
        'ntrim':0,
        'ctrim':0,
        'fixed_gappos':True},
    'cdr2_b_aa' : {
        'use_numba': True,
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
        'dist_weight':1,
        'gap_penalty':4,
        'ntrim':0,
        'ctrim':0,
        'fixed_gappos':True},
    'cdr1_b_aa' : {
        'use_numba': True,
        'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
        'dist_weight':1,
        'gap_penalty':4,
        'ntrim':0,
        'ctrim':0,
        'fixed_gappos':True}
    }   

tr.metrics_a = metrics_a
tr.metrics_b = metrics_b

tr.weights_a = weights_a
tr.weights_b = weights_b

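tr.kargs_a = kargs_a 
tr.kargs_b = kargs_b

# Distances were deferred above (compute_distances = False), so compute them now.
tr.compute_distances()

tr.pw_alpha
tr.pw_beta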
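The weights dictionaries are easier to read with a picture of how they might be applied. The sketch below assumes that a chain-level distance is the weighted sum of the per-CDR distance matrices. That is our reading of the configuration above, not something this post verifies against the tcrdist3 source, and the function name and toy matrices are made up.

```python
import numpy as np

def combine_weighted(per_cdr, weights):
    """Weighted sum of per-CDR pairwise distance matrices (assumed scheme)."""
    total = None
    for name, mat in per_cdr.items():
        term = weights[name] * np.asarray(mat)
        total = term if total is None else total + term
    return total

# Toy 2x2 per-CDR matrices for two TCRs (made-up numbers).
toy_per_cdr = {
    "cdr3_b_aa": np.array([[0, 4], [4, 0]]),
    "pmhc_b_aa": np.array([[0, 2], [2, 0]]),
    "cdr2_b_aa": np.array([[0, 0], [0, 0]]),
    "cdr1_b_aa": np.array([[0, 1], [1, 0]]),
}
toy_weights = {"cdr3_b_aa": 3, "pmhc_b_aa": 1, "cdr2_b_aa": 1, "cdr1_b_aa": 1}

combined = combine_weighted(toy_per_cdr, toy_weights)
print(combined)
```

Under this assumption, raising the CDR3 weight from 1 to 3 makes CDR3 differences dominate the chain-level distance, which matches the intuition that CDR3 matters most for antigen recognition.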

Counting mismatches only

"""
If you want "tcrdistances" using a different metric.

Here we illustrate the use of a metric that uses the 
Needleman-Wunsch algorithm to align sequences and then 
calculate the number of mismatching positions (pw.metrics.nw_hamming_metric)

This method doesn't rely on Numba so it can run faster using multiple cpus.
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep
import multiprocessing

df = pd.read_csv("dash.csv")
df = df.head(100) # for faster testing
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            use_defaults=False,
            compute_distances = False,
            cpus = 1,
            db_file = 'alphabeta_gammadelta_db.tsv')

metrics_a = {
    "cdr3_a_aa" : pw.metrics.nw_hamming_metric ,
    "pmhc_a_aa" : pw.metrics.nw_hamming_metric ,
    "cdr2_a_aa" : pw.metrics.nw_hamming_metric ,
    "cdr1_a_aa" : pw.metrics.nw_hamming_metric }

metrics_b = {
    "cdr3_b_aa" : pw.metrics.nw_hamming_metric ,
    "pmhc_b_aa" : pw.metrics.nw_hamming_metric ,
    "cdr2_b_aa" : pw.metrics.nw_hamming_metric ,
    "cdr1_b_aa" : pw.metrics.nw_hamming_metric  }

weights_a = { 
    "cdr3_a_aa" : 1,
    "pmhc_a_aa" : 1,
    "cdr2_a_aa" : 1,
    "cdr1_a_aa" : 1}

weights_b = { 
    "cdr3_b_aa" : 1,
    "pmhc_b_aa" : 1,
    "cdr2_b_aa" : 1,
    "cdr1_b_aa" : 1}

kargs_a = {  
    'cdr3_a_aa' : 
        {'use_numba': False},
    'pmhc_a_aa' : {
        'use_numba': False},
    'cdr2_a_aa' : {
        'use_numba': False},
    'cdr1_a_aa' : {
        'use_numba': False}
    }
kargs_b = {  
    'cdr3_b_aa' : 
        {'use_numba': False},
    'pmhc_b_aa' : {
        'use_numba': False},
    'cdr2_b_aa' : {
        'use_numba': False},
    'cdr1_b_aa' : {
        'use_numba': False}
    }

tr.metrics_a = metrics_a
tr.metrics_b = metrics_b

tr.weights_a = weights_a
tr.weights_b = weights_b

tr.kargs_a = kargs_a 
tr.kargs_b = kargs_b

tr.compute_distances()

tr.pw_cdr3_b_aa
tr.pw_beta
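For readers who want to see the idea behind "align, then count mismatches", here is a minimal pure-Python sketch. It is not the pwseqdist implementation: the scoring scheme (match = 1, mismatch = -1, gap = -1), the choice to count a gap as a mismatch, and both function names are assumptions made for illustration.

```python
def nw_align(s1, s2, match=1, mismatch=-1, gap=-1):
    """Global alignment (Needleman-Wunsch); returns the two gapped strings."""
    n, m = len(s1), len(s2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return "".join(reversed(a1)), "".join(reversed(a2))

def nw_mismatch_count(s1, s2):
    """Number of aligned positions that differ (a gap counts as a mismatch)."""
    a1, a2 = nw_align(s1, s2)
    return sum(c1 != c2 for c1, c2 in zip(a1, a2))

print(nw_align("CASSL", "CASL"))
print(nw_mismatch_count("CASSL", "CASTL"))
```

Because it takes two strings and returns a number, a function like `nw_mismatch_count` has the same shape as the metrics plugged into the dictionaries above.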

Custom distance metric

"""
If you want a tcrdistance, but you want to use your own metric. 
(A valid metric takes two strings and returns a numerical distance).  
"""
import Levenshtein  # third-party package, assumed to be installed

def my_own_metric(s1,s2):   
    return Levenshtein.distance(s1,s2)

import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep
import multiprocessing

df = pd.read_csv("dash.csv")
df = df.head(100) # for faster testing
tr = TCRrep(cell_df = df, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            use_defaults=False,
            compute_distances = False,
            cpus = 1,
            db_file = 'alphabeta_gammadelta_db.tsv')

metrics_a = {
    "cdr3_a_aa" : my_own_metric ,
    "pmhc_a_aa" : my_own_metric ,
    "cdr2_a_aa" : my_own_metric ,
    "cdr1_a_aa" : my_own_metric }

metrics_b = {
    "cdr3_b_aa" : my_own_metric ,
    "pmhc_b_aa" : my_own_metric ,
    "cdr2_b_aa" : my_own_metric,
    "cdr1_b_aa" : my_own_metric }

weights_a = { 
    "cdr3_a_aa" : 1,
    "pmhc_a_aa" : 1,
    "cdr2_a_aa" : 1,
    "cdr1_a_aa" : 1}

weights_b = { 
    "cdr3_b_aa" : 1,
    "pmhc_b_aa" : 1,
    "cdr2_b_aa" : 1,
    "cdr1_b_aa" : 1}

kargs_a = {  
    'cdr3_a_aa' : 
        {'use_numba': False},
    'pmhc_a_aa' : {
        'use_numba': False},
    'cdr2_a_aa' : {
        'use_numba': False},
    'cdr1_a_aa' : {
        'use_numba': False}
    }
kargs_b = {  
    'cdr3_b_aa' : 
        {'use_numba': False},
    'pmhc_b_aa' : {
        'use_numba': False},
    'cdr2_b_aa' : {
        'use_numba': False},
    'cdr1_b_aa' : {
        'use_numba': False}
    }

tr.metrics_a = metrics_a
tr.metrics_b = metrics_b

tr.weights_a = weights_a
tr.weights_b = weights_b

tr.kargs_a = kargs_a 
tr.kargs_b = kargs_b

tr.compute_distances()

tr.pw_cdr3_b_aa
tr.pw_beta
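If you would rather not depend on an extra package, a custom metric can be written in a few lines. The sketch below is a standard dynamic-programming edit distance. The function name is our own, and it is offered only as an example of "takes two strings, returns a number", not as the metric tcrdist3 recommends.

```python
def levenshtein(s1, s2):
    """Edit distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(s2) + 1))   # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        current = [i]                     # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,              # delete c1
                current[j - 1] + 1,           # insert c2
                previous[j - 1] + (c1 != c2)  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein("CASSLGQ", "CASSIGQ"))
print(levenshtein("kitten", "sitting"))
```

A pure-Python metric like this is much slower than a compiled one, so it is best kept for small test sets such as the 100-row subset used above.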

I want tcrdistances, but I hate OOP

"""
If you don't want to use OOP, but you still want multi-CDR 
tcrdistances on a single chain, using your own metric.
"""
import Levenshtein  # third-party package, assumed to be installed

def my_own_metric(s1,s2):   
    return Levenshtein.distance(s1,s2)    

import multiprocessing
import pandas as pd
from tcrdist.rep_funcs import _pws, _pw

df = pd.read_csv("dash2.csv")

metrics_b = {
    "cdr3_b_aa" : my_own_metric ,
    "pmhc_b_aa" : my_own_metric ,
    "cdr2_b_aa" : my_own_metric ,
    "cdr1_b_aa" : my_own_metric }

weights_b = { 
    "cdr3_b_aa" : 1,
    "pmhc_b_aa" : 1,
    "cdr2_b_aa" : 1,
    "cdr1_b_aa" : 1}

kargs_b = {  
    'cdr3_b_aa' : 
        {'use_numba': False},
    'pmhc_b_aa' : {
        'use_numba': False},
    'cdr2_b_aa' : {
        'use_numba': False},
    'cdr1_b_aa' : {
        'use_numba': False}
    }

dmats =  _pws(df = df , 
            metrics = metrics_b, 
            weights = weights_b, 
            kargs   = kargs_b , 
            cpu     = 1, 
            uniquify= True, 
            store   = True)

print(dmats.keys())

CDR3 only

"""
If you hate object oriented programming, just show me the functions. 
No problem. 

Maybe you only care about the CDR3 on the beta chain.
"""  
import Levenshtein  # third-party package, assumed to be installed

def my_own_metric(s1,s2):   
    return Levenshtein.distance(s1,s2)

import multiprocessing
import pandas as pd
from tcrdist.rep_funcs import _pws, _pw

df = pd.read_csv("dash2.csv")

# Pairwise distances on the beta-chain CDR3 only
dmat = _pw( metric = my_own_metric,
            seqs1 = df['cdr3_b_aa'].values,
            ncpus=2,
            uniqify=True,
            use_numba=False)
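The `uniqify` flag suggests that distances are computed once per unique sequence and then expanded back to the original rows. The sketch below shows that idea in plain NumPy. It is our interpretation of the flag's name, not the library's code, and the helper name and the toy metric are hypothetical. It also assumes the metric is symmetric and returns 0 for identical sequences.

```python
import numpy as np

def pairwise_unique(seqs, metric):
    """Square distance matrix, evaluating the metric on unique sequences only."""
    unique, inverse = np.unique(np.asarray(seqs), return_inverse=True)
    k = len(unique)
    small = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            # Assumes a symmetric metric with metric(x, x) == 0.
            small[i, j] = small[j, i] = metric(str(unique[i]), str(unique[j]))
    # Expand back so rows and columns line up with the original input order.
    return small[np.ix_(inverse, inverse)]

def length_difference(s1, s2):
    """Hypothetical toy metric: absolute difference in sequence length."""
    return abs(len(s1) - len(s2))

toy_seqs = ["CASSL", "CASSLG", "CASSL"]   # made-up, with one duplicate
dmat_toy = pairwise_unique(toy_seqs, length_difference)
print(dmat_toy)
```

With heavily expanded clones, the saving can be large, because the metric is called once per pair of unique sequences rather than once per pair of cells.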

I want tcrdistances but I want to keep my variable names

"""
You want a 'tcrdistance' but you don't want to bother with the tcrdist3 framework. 

Note that the columns names are completely arbitrary under this 
framework, so one can directly compute a tcrdist on a 
AIRR, MIXCR, VDJTools, or other formatted file without any
reformatting.
""" 
import multiprocessing
import pandas as pd
import pwseqdist as pw
from tcrdist.rep_funcs import _pws, _pw  

df_airr = pd.read_csv("dash_beta_airr.csv")

# Choose the metrics you want to apply to each CDR
metrics = { 'cdr3_aa' : pw.metrics.nb_vector_tcrdist,
            'cdr2_aa' : pw.metrics.nb_vector_tcrdist,
            'cdr1_aa' : pw.metrics.nb_vector_tcrdist}

# Choose the weights that are right for you.
weights = { 'cdr3_aa' : 3,
            'cdr2_aa' : 1,
            'cdr1_aa' : 1}

# Provide arguments for the distance metrics 
kargs = {   'cdr3_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 'dist_weight': 1, 'gap_penalty':4, 'ntrim':3, 'ctrim':2, 'fixed_gappos':False},
            'cdr2_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 'dist_weight': 1, 'gap_penalty':4, 'ntrim':0, 'ctrim':0, 'fixed_gappos':True},
            'cdr1_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix, 'dist_weight': 1, 'gap_penalty':4, 'ntrim':0, 'ctrim':0, 'fixed_gappos':True}}
            
# Here are your distance matrices (_pws is already imported above)

dmats = _pws(df = df_airr,
         metrics = metrics, 
         weights= weights, 
         kargs=kargs, 
         cpu = 1, 
         store = True)

dmats['tcrdist']

I want to use TCRrep but I want to keep my variable names

"""
If you already have a clones file and want 
to compute 'tcrdistances' on a DataFrame with 
custom columns names.

Steps:
1. assign TCRrep.clone_df
2. set infer_cdrs = False
3. set compute_distances = False
4. set deduplicate = False
5. customize the keys for metrics, weights, and kargs with the lambda
    customize = lambda d : {new_cols[k]:v for k,v in d.items()} 
6. call .compute_distances()
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep

new_cols = {'cdr3_a_aa':'c3a', 'pmhc_a_aa':'pa', 'cdr2_a_aa':'c2a','cdr1_a_aa':'c1a',
            'cdr3_b_aa':'c3b', 'pmhc_b_aa':'pb', 'cdr2_b_aa':'c2b','cdr1_b_aa':'c1b'}

df = pd.read_csv("dash2.csv").rename(columns = new_cols) 

tr = TCRrep(
        cell_df = df,
        clone_df = df,              #(1)
        organism = 'mouse', 
        chains = ['alpha','beta'],
        infer_all_genes = True, 
        infer_cdrs = False,         #(2)
        compute_distances = False,  #(3)
        deduplicate=False,          #(4)
        db_file = 'alphabeta_gammadelta_db.tsv')

customize = lambda d : {new_cols[k]:v for k,v in d.items()} #(5)
tr.metrics_a = customize(tr.metrics_a)
tr.metrics_b = customize(tr.metrics_b)
tr.weights_a = customize(tr.weights_a)
tr.weights_b = customize(tr.weights_b)
tr.kargs_a = customize(tr.kargs_a)
tr.kargs_b = customize(tr.kargs_b)

tr.compute_distances() #(6)

# Notice that pairwise results now have custom names 
tr.pw_c3b
tr.pw_c3a
tr.pw_alpha
tr.pw_beta
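The re-keying step (5) is plain Python and can be tried without tcrdist3. The sketch below applies the same dictionary comprehension to a toy weights dictionary. The dictionary contents mirror the names used above, and nothing else is assumed.

```python
# Mapping from the default column names to the custom ones (beta chain only).
new_cols = {"cdr3_b_aa": "c3b", "pmhc_b_aa": "pb",
            "cdr2_b_aa": "c2b", "cdr1_b_aa": "c1b"}

# A toy weights dictionary keyed by the default names.
weights_b = {"cdr3_b_aa": 3, "pmhc_b_aa": 1, "cdr2_b_aa": 1, "cdr1_b_aa": 1}

def customize(d):
    """Return a copy of d with every key renamed through new_cols."""
    return {new_cols[k]: v for k, v in d.items()}

renamed = customize(weights_b)
print(renamed)
```

Note that the comprehension raises a `KeyError` if a dictionary contains a key that is missing from `new_cols`, so the mapping has to cover every CDR column in use.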

I want distances from 1 TCR to many TCRs

"""
If you just want 'tcrdistances' of some target seqs against another set.

(1) cell_df is assigned the first 10 cells in dash.csv
(2) compute tcrdistances with default settings.
(3) compute rectangular distance between clone_df and df2.
(4) compute rectangular distance between clone_df and any 
arbitrary df3, which need not be associated with the TCRrep object.
(5) compute rectangular distance with only a subset of the TCRrep.clone_df
"""
import pandas as pd
from tcrdist.repertoire import TCRrep

df = pd.read_csv("dash.csv")
df2 = pd.read_csv("dash2.csv")
df = df.head(10)                        #(1)
tr = TCRrep(cell_df = df,               #(2)
            df2 = df2, 
            organism = 'mouse', 
            chains = ['alpha','beta'], 
            db_file = 'alphabeta_gammadelta_db.tsv')

assert tr.pw_alpha.shape == (10,10) 
assert tr.pw_beta.shape  == (10,10)

tr.compute_rect_distances()             # (3) 
assert tr.rw_alpha.shape == (10,1924) 
assert tr.rw_beta.shape  == (10,1924)

df3 = df2.head(100)

tr.compute_rect_distances(df = tr.clone_df, df2 = df3)  # (4) 
assert tr.rw_alpha.shape == (10,100) 
assert tr.rw_beta.shape  == (10,100)

tr.compute_rect_distances(  df = tr.clone_df.iloc[0:2,], # (5)
                            df2 = df3)  
assert tr.rw_alpha.shape == (2,100) 
assert tr.rw_beta.shape  == (2,100)
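A "rectangular" matrix simply has one row per target TCR and one column per reference TCR. The sketch below builds one with a toy metric and looks up the nearest reference for each target. The function names, the metric, and the sequences are all made up for illustration and are not part of tcrdist3.

```python
import numpy as np

def toy_metric(s1, s2):
    """Hypothetical metric: differing positions plus the length difference."""
    return sum(a != b for a, b in zip(s1, s2)) + abs(len(s1) - len(s2))

def rect_distances(seqs1, seqs2, metric):
    """Distance from every sequence in seqs1 (rows) to every one in seqs2 (columns)."""
    out = np.zeros((len(seqs1), len(seqs2)))
    for i, a in enumerate(seqs1):
        for j, b in enumerate(seqs2):
            out[i, j] = metric(a, b)
    return out

# Made-up CDR3 fragments: two targets against three reference sequences.
targets = ["CASSLGQ", "CASRDRG"]
reference = ["CASSLGQ", "CASSIGQ", "CASRDRA"]

rw = rect_distances(targets, reference, toy_metric)
nearest = rw.argmin(axis=1)   # column index of the closest reference per target
print(rw.shape)
print(nearest)
```

This is the shape logic behind the assertions above: 10 target clones against 100 reference rows gives a (10, 100) matrix, and slicing the targets to 2 rows gives (2, 100).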

The degree of customization is really high, and it is admittedly hard.

Life is good, and better with you. In the next post we continue with the TCRdist3 analysis code.
