理解BCR与TCR

BCR重排

B细胞从干细胞发育成成熟的B细胞之前，是要经历重链的VDJ基因重排和轻链的VJ基因重排的；成熟的B细胞，迁移到外周淋巴器官之后，在外界抗原的刺激下，则可能发生体细胞高频突变（SHM）。基因重排和体细胞高频突变，一个发生在B细胞成熟前，一个发生在B细胞成熟后。参考

TCR alpha链和BCR轻链由V-J-C构成
TCR beta链和BCR重链由V-D-J-C构成（D-diversity)

TCR和BCR的多样性

CDR3是V(D)J基因编码的核心区域，通常会包含V基因的一部分，然后D基因，还有J基因的一部分，因此是BCR或者TCR上最具有代表性的，最具有辨识度的一段区域，相当于一个人的脸。在绝大多数免疫研究中，会把CDR3序列作为定义和识别某一个特定BCR或者TCR的唯一依据。

1. 下载示例数据集

首先从10x genomics 官网上下载示例数据集
(1)10x genomics数据集链接
(2)我选择的是BALB/c小鼠的PBMC数据集
①表达数据

curl -O https://s3-us-west-2.amazonaws.com/10x.files/samples/cell-vdj/3.0.0/vdj_v1_mm_balbc_pbmc_5gex/vdj_v1_mm_balbc_pbmc_5gex_fastqs.tar

②BCR数据

curl -O https://cf.10xgenomics.com/samples/cell-vdj/3.0.0/vdj_v1_mm_balbc_pbmc_b/vdj_v1_mm_balbc_pbmc_b_fastqs.tar

③TCR数据

curl -O https://cf.10xgenomics.com/samples/cell-vdj/3.0.0/vdj_v1_mm_balbc_pbmc_t/vdj_v1_mm_balbc_pbmc_t_fastqs.tar

数据比较大，建议在linux终端用screen，如果curl不行，就用wget

下载结束后：

解压.tar文件
tar xvf /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_b_fastqs.tar
解压大全

解压后：

是R1 R2 I1文件

复习fastq命名原则

[Sample Name]S1_L00[Lane Number][Read Type]_001.fastq.gz
Where Read Type is one of:
I1: Sample index read (optional)
R1: Read 1
R2: Read 2

2. Fastqc & MultiQC

[myh@bogon fastqc]$ cd /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_5gex_fastqs

[myh@bogon vdj_v1_mm_balbc_pbmc_5gex_fastqs]$ find /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_5gex_fastqs -name "*R1*.gz" > 5gex_id_1.txt

[myh@bogon vdj_v1_mm_balbc_pbmc_5gex_fastqs]$ find /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_5gex_fastqs -name "*R2*.gz" > 5gex_id_2.txt

[myh@bogon vdj_v1_mm_balbc_pbmc_5gex_fastqs]$ cat 5gex_id_1.txt 5gex_id_2.txt > 5gex_id_all.txt

[myh@bogon vdj_v1_mm_balbc_pbmc_5gex_fastqs]$ cat 5gex_id_all.txt | xargs fastqc -o /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_5gex_fastqs/fastqc_reports

使用multiQC整合

multiqc /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_t_fastqs/fastqc_reports -o /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_t_fastqs/fastqc_reports/multiqc_outs

QC结果：以TCR为例

Sequence Quality Histograms

Per Sequence Quality Scores

Adapter Content

说明下载下来的数据是clean data，已经去除了接头序列。

3.cellranger

官方教程

下载小鼠VDJ ref data

wget https://cf.10xgenomics.com/supp/cell-vdj/refdata-cellranger-vdj-GRCm38-alts-ensembl-5.0.0.tar.gz
tar -xf /home/user/myh/ref_data/refdata-cellranger-vdj-GRCm38-alts-ensembl-5.0.0.tar.gz

下载小鼠转录组 ref data

Create a multi config CSV

利用nano创建csv

nano multi_config.csv

Copy and paste this text into the newly created file, and customize the code

[gene-expression]
reference,/home/user/myh/ref_data/refdata-gex-mm10-2020-A
expect-cells,1000
[vdj]
reference,/home/user/myh/ref_data/refdata-cellranger-vdj-GRCm38-alts-ensembl-5.0.0
[libraries]
fastq_id,fastqs,lanes,feature_types,subsample_rate
vdj_v1_mm_balbc_pbmc_5gex,/home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_5gex_fastqs,1|2,gene expression,
vdj_v1_mm_balbc_pbmc_b,/home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_b_fastqs,1|2,VDJ-B,
vdj_v1_mm_balbc_pbmc_t,/home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_t_fastqs,1|2,VDJ-T,

Use your text editor's save command to save the file. In nano, save by typing CTRL+X → y → ENTER.

cell ranger multi

From within the working-directory/runs/ directory, run cellranger multi

cellranger multi --id=BALBC_multi --csv=/home/user/myh/raw_data/BALBC/multi_analysis/multi_config.csv

使用cellranger multi而非cellranger vdj可以同时分析转录组和TCR和BCR，更方便

完成：

解读outs文件：

官网解读

outs的结构

关于outs的解释
per_sample_outs/: folder containing filtered data, i.e., only cell-associated barcodes in this sample.
这里面的 sample_feature_bc_matrix 与cell ranger count得到的filtered_feature_bc_matrix类似

4.将count文件中的bam文件转变为loom(为了方便scVelo计算RNA速率）

因为cellranger multi的结果文件格式和cellranger count的结果文件格式不全相同，所以更推荐使用velocyto run而非velocyto run10x

velocyto run -b /home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/count/sample_feature_bc_matrix/barcodes.tsv.gz \
-o /home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/count/loom \
-m /home/user/myh/ref_data/mm10_allTracks.gtf \
/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/count/sample_alignments.bam \
/home/user/myh/ref_data/refdata-gex-mm10-2020-A/genes/genes.gtf

5.VDJ结果解读

（1）clonotypes.csv

clonotype_id：The ID of the clonotype to which this consensus sequence was assigned. 是这个样本内部依次排下去的序号，不同样本之间不通用！相同clonotype_id在不同样本之间不代表同一克隆型!
frequency: The observed number of cell barcodes with this clonotype.侧面部分反映了这个BCR/TCR的丰度。为什么说是部分反映呢，因为受到测序深度等方面的限制，可能会miss一些高频克隆型，因此这个参数有意义，同时也有局限性。
proportion: The observed fraction of cell barcodes with this clonotype.
cdr3s_aa: A semicolon-delimited list of chain:sequence pairs, where chain is for example TRA, TRB, IGK, IGL, or IGH and sequence is the CDR3 amino acid sequence for that chain. 反映了CDR3区的氨基酸序列
cdr3s_nt：CDR3区域的核苷酸组合

(2) filtered_contig_annotations.csv

contig的定义：Contiguous sequence of bases produced by assembly.

barcode： Cell-barcode for this contig.
is_cell： True or False value indicating whether the barcode was called as a cell.
contig_id：Unique identifier for this contig.
high_confidence： True or False value indicating whether the contig was called as high-confidence (unlikely to be a chimeric sequence or some other artifact).
length： The contig sequence length in nucleotides.
chain：The chain associated with this contig; for example, TRA, TRB, IGK, IGL, or IGH. A value of "Multi" indicates that segments from multiple chains were present.
xx_gene：The highest-scoring xx segment。这部分提供了V基因和J基因的具体片段，D区通常较短又突变较多，受技术限制，常常捕捉不到。
full-length: A contig is full-length if it matches the initial part of a V gene, continues on, and ultimately matches the terminal part of a J gene.
productive: 定义

BCR结果示例

其余：nt:碱基序列，fwr：framework region

raw_clonotype_id：The ID of the clonotype to which this cell barcode was assigned.
raw_consensus_id： The ID of the consensus sequence to which this contig was assigned.
exact_subclonotype_id： The ID of the exact subclontype to which this cell barcode was assigned.

结果文件夹中的fasta序列文件，存储每个clonotype的contig序列或者consensus序列。consensus序列可以理解成这个样本里，这个克隆型的所有细胞的序列的统一。就是假如这个克隆型在这个样本里表达了十个细胞，和reference比起来，有九个细胞的某个位点都由A突变成了T，而剩下的一个是由A变成G，那么consensus序列的这个位点就是T。它不是reference序列，而是样本内部的一种统一，这样做能有效排除个别细胞中低频SNP的干扰。

关于clonotype的一些思考

单细胞免疫组学分析练习-1:cellranger multi