单细胞免疫组学分析练习-1:cellranger multi

理解BCR与TCR


BCR重排

B细胞从干细胞发育成成熟的B细胞之前,是要经历重链的VDJ基因重排和轻链的VJ基因重排的;成熟的B细胞,迁移到外周淋巴器官之后,在外界抗原的刺激下,则可能发生体细胞高频突变(SHM)。基因重排和体细胞高频突变,一个发生在B细胞成熟前,一个发生在B细胞成熟后。参考

TCR alpha链和BCR轻链由V-J-C构成
TCR beta链和BCR重链由V-D-J-C构成(D-diversity)


TCR和BCR的多样性

CDR3是V(D)J基因编码的核心区域,通常会包含V基因的一部分,然后D基因,还有J基因的一部分,因此是BCR或者TCR上最具有代表性的,最具有辨识度的一段区域,相当于一个人的脸。在绝大多数免疫研究中,会把CDR3序列作为定义和识别某一个特定BCR或者TCR的唯一依据。

1. 下载示例数据集

首先从10x genomics 官网上下载示例数据集
(1)10x genomics数据集链接
(2)我选择的是BALB/c小鼠的PBMC数据集
①表达数据

curl -O https://s3-us-west-2.amazonaws.com/10x.files/samples/cell-vdj/3.0.0/vdj_v1_mm_balbc_pbmc_5gex/vdj_v1_mm_balbc_pbmc_5gex_fastqs.tar

②BCR数据

curl -O https://cf.10xgenomics.com/samples/cell-vdj/3.0.0/vdj_v1_mm_balbc_pbmc_b/vdj_v1_mm_balbc_pbmc_b_fastqs.tar

③TCR数据

curl -O https://cf.10xgenomics.com/samples/cell-vdj/3.0.0/vdj_v1_mm_balbc_pbmc_t/vdj_v1_mm_balbc_pbmc_t_fastqs.tar

数据比较大,建议在linux终端用screen,如果curl不行,就用wget

下载结束后:


解压.tar文件
tar xvf /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_b_fastqs.tar
解压大全

解压后:


是R1 R2 I1文件

复习fastq命名原则

[Sample Name]S1_L00[Lane Number][Read Type]_001.fastq.gz
Where Read Type is one of:
I1: Sample index read (optional)
R1: Read 1
R2: Read 2

2. Fastqc & MultiQC

[myh@bogon fastqc]$ cd /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_5gex_fastqs

[myh@bogon vdj_v1_mm_balbc_pbmc_5gex_fastqs]$ find /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_5gex_fastqs -name "*R1*.gz" > 5gex_id_1.txt

[myh@bogon vdj_v1_mm_balbc_pbmc_5gex_fastqs]$ find /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_5gex_fastqs -name "*R2*.gz" > 5gex_id_2.txt

[myh@bogon vdj_v1_mm_balbc_pbmc_5gex_fastqs]$ cat 5gex_id_1.txt 5gex_id_2.txt > 5gex_id_all.txt

[myh@bogon vdj_v1_mm_balbc_pbmc_5gex_fastqs]$ cat 5gex_id_all.txt | xargs fastqc -o /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_5gex_fastqs/fastqc_reports

使用multiQC整合

multiqc /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_t_fastqs/fastqc_reports -o /home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_t_fastqs/fastqc_reports/multiqc_outs

QC结果:以TCR为例


Sequence Quality Histograms
Per Sequence Quality Scores
Adapter Content

说明下载下来的数据是clean data,已经去除了接头序列。

3.cellranger

官方教程

下载小鼠VDJ ref data

wget https://cf.10xgenomics.com/supp/cell-vdj/refdata-cellranger-vdj-GRCm38-alts-ensembl-5.0.0.tar.gz
tar -xf /home/user/myh/ref_data/refdata-cellranger-vdj-GRCm38-alts-ensembl-5.0.0.tar.gz

下载小鼠转录组 ref data

Create a multi config CSV

利用nano创建csv

nano multi_config.csv

Copy and paste this text into the newly created file, and customize the code

[gene-expression]
reference,/home/user/myh/ref_data/refdata-gex-mm10-2020-A
expect-cells,1000
[vdj]
reference,/home/user/myh/ref_data/refdata-cellranger-vdj-GRCm38-alts-ensembl-5.0.0
[libraries]
fastq_id,fastqs,lanes,feature_types,subsample_rate
vdj_v1_mm_balbc_pbmc_5gex,/home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_5gex_fastqs,1|2,gene expression,
vdj_v1_mm_balbc_pbmc_b,/home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_b_fastqs,1|2,VDJ-B,
vdj_v1_mm_balbc_pbmc_t,/home/user/myh/raw_data/BALBC/vdj_v1_mm_balbc_pbmc_t_fastqs,1|2,VDJ-T,

Use your text editor's save command to save the file. In nano, save by typing CTRL+X → y → ENTER.

cell ranger multi

From within the working-directory/runs/ directory, run cellranger multi

cellranger multi --id=BALBC_multi --csv=/home/user/myh/raw_data/BALBC/multi_analysis/multi_config.csv

使用cellranger multi而非cellranger vdj可以同时分析转录组和TCR和BCR,更方便

完成:


解读outs文件:


官网解读

outs的结构

outs的结构


关于outs的解释
per_sample_outs/: folder containing filtered data, i.e., only cell-associated barcodes in this sample.
这里面的 sample_feature_bc_matrix 与cell ranger count得到的filtered_feature_bc_matrix类似

4.将count文件中的bam文件转变为loom(为了方便scVelo计算RNA速率)

因为cellranger multi的结果文件格式和cellranger count的结果文件格式不全相同,所以更推荐使用velocyto run而非velocyto run10x

velocyto run -b /home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/count/sample_feature_bc_matrix/barcodes.tsv.gz \
-o /home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/count/loom \
-m /home/user/myh/ref_data/mm10_allTracks.gtf \
/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/count/sample_alignments.bam \
/home/user/myh/ref_data/refdata-gex-mm10-2020-A/genes/genes.gtf

5.VDJ结果解读

(1)clonotypes.csv

  • clonotype_id:The ID of the clonotype to which this consensus sequence was assigned. 是这个样本内部依次排下去的序号,不同样本之间不通用!相同clonotype_id在不同样本之间不代表同一克隆型!
  • frequency: The observed number of cell barcodes with this clonotype.侧面部分反映了这个BCR/TCR的丰度。为什么说是部分反映呢,因为受到测序深度等方面的限制,可能会miss一些高频克隆型,因此这个参数有意义,同时也有局限性。
  • proportion: The observed fraction of cell barcodes with this clonotype.
  • cdr3s_aa: A semicolon-delimited list of chain:sequence pairs, where chain is for example TRA, TRB, IGK, IGL, or IGH and sequence is the CDR3 amino acid sequence for that chain. 反映了CDR3区的氨基酸序列


  • cdr3s_nt:CDR3区域的核苷酸组合


(2) filtered_contig_annotations.csv

contig的定义:Contiguous sequence of bases produced by assembly.

  • barcode: Cell-barcode for this contig.
  • is_cell: True or False value indicating whether the barcode was called as a cell.
  • contig_id:Unique identifier for this contig.
  • high_confidence: True or False value indicating whether the contig was called as high-confidence (unlikely to be a chimeric sequence or some other artifact).
  • length: The contig sequence length in nucleotides.
  • chain:The chain associated with this contig; for example, TRA, TRB, IGK, IGL, or IGH. A value of "Multi" indicates that segments from multiple chains were present.


  • xx_gene:The highest-scoring xx segment。这部分提供了V基因和J基因的具体片段,D区通常较短又突变较多,受技术限制,常常捕捉不到。
  • full-length: A contig is full-length if it matches the initial part of a V gene, continues on, and ultimately matches the terminal part of a J gene.
  • productive: 定义
BCR结果示例

其余:nt:碱基序列,fwr:framework region


  • raw_clonotype_id:The ID of the clonotype to which this cell barcode was assigned.
  • raw_consensus_id: The ID of the consensus sequence to which this contig was assigned.
  • exact_subclonotype_id: The ID of the exact subclontype to which this cell barcode was assigned.

结果文件夹中的fasta序列文件,存储每个clonotype的contig序列或者consensus序列。consensus序列可以理解成这个样本里,这个克隆型的所有细胞的序列的统一。就是假如这个克隆型在这个样本里表达了十个细胞,和reference比起来,有九个细胞的某个位点都由A突变成了T,而剩下的一个是由A变成G,那么consensus序列的这个位点就是T。它不是reference序列,而是样本内部的一种统一,这样做能有效排除个别细胞中低频SNP的干扰。

关于clonotype的一些思考

你可能感兴趣的:(单细胞免疫组学分析练习-1:cellranger multi)