2020.09.09 本教程介绍了Scanpy包自带的用于整合样本,并处理批次效应的BBKNN算法和用于对比的ingest基础算法。本文主要从函数的理解、软件包的使用和结果的解释入手,在PBMC和Pancreas两个数据集上实现,偏重于应用,基本不涉及批次效应的理论。批次效应的理论原理将在未来通过2篇论文来解释。
1. 依赖:
(1)数据集(全部需要挂VPN):
- PBMC:pbmc3k_processed()(需要下载);pbmc68k_reduced()(scanpy自带)
- Pancreas(需要下载)
(2)Python包:Scanpy、BBKNN
2. PBMC数据集
导入所需的包
import scanpy as sc
import pandas as pd
import seaborn as sns
sc.settings.verbosity = 1 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()
sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(3, 3), facecolor='white')
读取已注释好的参考数据集和待嵌入数据集
# 参考数据集(已预处理、降维、聚类、注释)
adata_ref = sc.datasets.pbmc3k_processed()
# 待嵌入数据集(已预处理、降维、聚类、注释)
adata = sc.datasets.pbmc68k_reduced()
查看adata和adata_ref,记住n_var也就是基因数。
adata_ref
Out[50]:
AnnData object with n_obs × n_vars = 2638 × 1838
obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
var: 'n_cells'
uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
varm: 'PCs'
obsp: 'distances', 'connectivities'
adata
Out[51]:
AnnData object with n_obs × n_vars = 700 × 765
obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain'
var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'distances', 'connectivities'
可以看到它们都经过了降维和louvain聚类,adata经过了umap嵌入,adata_ref还经过tsne嵌入。接下来查看变量。
adata_ref.obs
Out[52]:
n_genes percent_mito n_counts louvain
index
AAACATACAACCAC-1 781 0.030178 2419.0 CD4 T cells
AAACATTGAGCTAC-1 1352 0.037936 4903.0 B cells
AAACATTGATCAGC-1 1131 0.008897 3147.0 CD4 T cells
AAACCGTGCTTCCG-1 960 0.017431 2639.0 CD14+ Monocytes
AAACCGTGTATGCG-1 522 0.012245 980.0 NK cells
... ... ... ...
TTTCGAACTCTCAT-1 1155 0.021104 3459.0 CD14+ Monocytes
TTTCTACTGAGGCA-1 1227 0.009294 3443.0 B cells
TTTCTACTTCCTCG-1 622 0.021971 1684.0 B cells
TTTGCATGAGAGGC-1 454 0.020548 1022.0 B cells
TTTGCATGCCTCAC-1 724 0.008065 1984.0 CD4 T cells
[2638 rows x 4 columns]
adata.obs
Out[53]:
bulk_labels n_genes ... phase louvain
index ...
AAAGCCTGGCTAAC-1 CD14+ Monocyte 1003 ... G1 1
AAATTCGATGCACA-1 Dendritic 1080 ... S 1
AACACGTGGTCTTT-1 CD56+ NK 1228 ... G1 3
AAGTGCACGTGCTA-1 CD4+/CD25 T Reg 1007 ... G2M 9
ACACGAACGGAGTG-1 Dendritic 1178 ... G1 2
... ... ... ... ...
TGGCACCTCCAACA-8 Dendritic 1166 ... G1 2
TGTGAGTGCTTTAC-8 Dendritic 1014 ... G1 1
TGTTACTGGCGATT-8 CD4+/CD25 T Reg 1079 ... S 0
TTCAGTACCGGGAA-8 CD19+ B 1030 ... S 4
TTGAGGTGGAGAGC-8 Dendritic 1552 ... G1 2
[700 rows x 8 columns]
可以看到,两个数据集都分别做了注释,列名分别为louvain和bulk_labels。然后我们分别绘制umap可视化图。
sc.pl.umap(adata_ref, color='louvain')
sc.pl.umap(adata, color='bulk_labels')
要使用sc.tl.ingest,需要在相同的var上定义数据集,即两个数据集的var_names(即基因名)取交集,保留交集的var_names。
var_names = adata_ref.var_names.intersection(adata.var_names)
adata_ref = adata_ref[:, var_names]
adata = adata[:, var_names]
查看adata和adata_ref
adata_ref
Out[73]:
View of AnnData object with n_obs × n_vars = 2638 × 208
obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
var: 'n_cells'
uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
varm: 'PCs'
obsp: 'distances', 'connectivities'
adata
Out[74]:
View of AnnData object with n_obs × n_vars = 700 × 208
obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain'
var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'distances', 'connectivities'
可以看到两个数据集的基因数都只剩了208个。
接下来,对取交集基因的参考注释的数据集adata_ref重新降维聚类可视化。
sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref)
sc.tl.umap(adata_ref)
sc.pl.umap(adata_ref, color='louvain')
接下来,使用ingest函数将参考数据集adata_ref的注释(louvain),映射到待注释数据集adata上。首先看一下ingest函数的参数。
scanpy.tl.ingest(adata, adata_ref, obs=None, embedding_method='umap', 'pca', labeling_method='knn', neighbors_key=None, inplace=True, **kwargs)
第一个参数是待映射数据集的andata,第二个参数是参考数据集的andata。obs是指参考数据集的注释变量名。剩余的参数表明,ingest通过投影到拟合adata_ref交集基因的PCA上,将adata的嵌入(embeddings)和注释与参考数据集adata_ref集成在一起。
sc.tl.ingest(adata, adata_ref, obs='louvain')
adata.uns['louvain_colors'] = adata_ref.uns['louvain_colors']
sc.pl.umap(adata, color=['louvain', 'bulk_labels'], wspace=0.5)
通过对比图片adata2所展示的原注释与映射注释的对比,可以看到数据基本映射正确,但树突状细胞Dendritic cells的注释有很大差别。接下来,我们再展示一下整合两个数据集的整合数据集的可视化。
adata_concat = adata_ref.concatenate(adata, batch_categories=['ref', 'new'])
adata_concat.obs.louvain = adata_concat.obs.louvain.astype('category')
adata_concat.obs.louvain.cat.reorder_categories(adata_ref.obs.louvain.cat.categories, inplace=True) # fix category ordering
adata_concat.uns['louvain_colors'] = adata_ref.uns['louvain_colors'] # fix category colors
sc.pl.umap(adata_concat, color=['batch', 'louvain'])
接下来使用anndata的concatenate函数将ref(adata_ref)和new(adata)数据集合并。可以看到,新数据集adata_concat的细胞数是原来两个数据集之和。
adata_concat = adata_ref.concatenate(adata, batch_categories=['ref', 'new'])
adata_concat
Out[84]:
AnnData object with n_obs × n_vars = 3338 × 208
obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain', 'bulk_labels', 'S_score', 'G2M_score', 'phase', 'batch'
var: 'n_counts-new', 'means-new', 'dispersions-new', 'dispersions_norm-new', 'highly_variable-new', 'n_cells-ref'
obsm: 'X_pca', 'X_umap'
展示整合的数据集中ref和new的情况。
# 这3行好像没啥用
adata_concat.obs.louvain = adata_concat.obs.louvain.astype('category')
adata_concat.obs.louvain.cat.reorder_categories(adata_ref.obs.louvain.cat.categories, inplace=True) # fix category ordering
adata_concat.uns['louvain_colors'] = adata_ref.uns['louvain_colors'] # fix category colors
sc.pl.umap(adata_concat, color=['batch', 'louvain'])
从左图看起来,用ingest整合的数据集还是存在比较严重的批次效应的。尤其是单核细胞。因此我们还是要尝试使用其他整合方法,比如本文接下来介绍的BBKNN。对ingest整合好的数据集进行PCA。
## BBKNN
sc.tl.pca(adata_concat)
sc.external.pp.bbknn(adata_concat, batch_key='batch')
sc.tl.umap(adata_concat)
sc.pl.umap(adata_concat, color=['batch', 'louvain'])
可以看到BBKNN似乎可以更均匀地混合两个数据集。但是单核细胞仍存在批次效应。那么,没有经过ingest直接整合取交集基因的两个数据集,其作图会是什么样子的呢?执行以下代码:
import scanpy as sc
import pandas as pd
import seaborn as sns
sc.settings.verbosity = 1 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()
sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(3, 3), facecolor='white')
adata_ref = sc.datasets.pbmc3k_processed()
adata = sc.datasets.pbmc68k_reduced()
var_names = adata_ref.var_names.intersection(adata.var_names)
adata_ref = adata_ref[:, var_names]
adata = adata[:, var_names]
sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref)
sc.tl.umap(adata_ref)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
adata_concat = adata_ref.concatenate(adata, batch_categories=['ref', 'new'])
sc.pl.umap(adata_concat, color=['batch'])
图7与图5相比,可以发现ingest除了映射注释关系外,确实也有一定的矫正批次效应的作用。
3 Pancreas数据集
下面我们通过胰腺数据集对BBKNN算法加深理解,步骤和上面的PBMC基本相同,直接上代码。
import scanpy as sc
import pandas as pd
import seaborn as sns
sc.settings.verbosity = 1 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()
sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(3, 3), facecolor='white')
"""
The following data has been used in the scGen paper [Lotfollahi19], has been used here,
was curated here and can be downloaded from here (the BBKNN paper).
It contains data for human pancreas from 4 different studies (Segerstolpe16, Baron16, Wang16, Muraro16),
which have been used in the seminal papers on single-cell dataset integration (Butler18, Haghverdi18) and many times ever since.
"""
adata_all = sc.read('data/pancreas.h5ad')
#! adata_all.shape
观察数据结构:
adata_all
Out[4]:
AnnData object with n_obs × n_vars = 14693 × 2448
obs: 'celltype', 'sample', 'n_genes', 'batch', 'n_counts', 'louvain'
var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
uns: 'celltype_colors', 'louvain', 'neighbors', 'pca', 'sample_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'distances', 'connectivities'
adata_all.obs
Out[5]:
celltype sample ... n_counts louvain
index ...
human1_lib1.final_cell_0001-0 acinar Baron ... 2.241100e+04 2
human1_lib1.final_cell_0002-0 acinar Baron ... 2.794900e+04 2
human1_lib1.final_cell_0003-0 acinar Baron ... 1.689200e+04 2
human1_lib1.final_cell_0004-0 acinar Baron ... 1.929900e+04 2
human1_lib1.final_cell_0005-0 acinar Baron ... 1.506700e+04 2
... ... ... ... ...
reads.29499-3 ductal Wang ... 1.056558e+06 10
reads.29500-3 ductal Wang ... 9.926309e+05 10
reads.29501-3 beta Wang ... 1.751338e+06 10
reads.29502-3 dropped Wang ... 2.163764e+06 10
reads.29503-3 beta Wang ... 2.038979e+06 10
[14693 rows x 6 columns]
adata_all.var
Out[6]:
n_cells-0 n_cells-1 n_cells-2 n_cells-3
index
A2M 273.0 120.0 262.0 635.0
ABAT 841.0 988.0 1052.0 635.0
ABCA1 418.0 623.0 468.0 635.0
ABCA17P 63.0 NaN 104.0 635.0
ABCA7 410.0 74.0 1096.0 635.0
... ... ... ...
MIR663A NaN NaN 492.0 NaN
LOC100379224 NaN 65.0 125.0 635.0
LOC100130093 NaN 228.0 764.0 635.0
LOC101928303 NaN NaN 533.0 NaN
COPG NaN NaN NaN 635.0
[2448 rows x 4 columns]
adata_all.to_df()
Out[8]:
index A2M ABAT ... LOC101928303 COPG
index ...
human1_lib1.final_cell_0001-0 -0.185482 1.263687 ... -0.122359 -0.117050
human1_lib1.final_cell_0002-0 -0.185482 -0.413217 ... -0.122359 -0.117050
human1_lib1.final_cell_0003-0 -0.185482 -0.413217 ... -0.122359 -0.117050
human1_lib1.final_cell_0004-0 -0.185482 -0.413217 ... -0.122359 -0.117050
human1_lib1.final_cell_0005-0 -0.185482 -0.413217 ... -0.122359 -0.117050
... ... ... ... ...
reads.29499-3 -0.179833 -0.347129 ... -0.122359 2.529835
reads.29500-3 -0.163211 -0.408504 ... -0.122359 -0.105572
reads.29501-3 -0.165839 2.958294 ... -0.122359 -0.019103
reads.29502-3 -0.183122 -0.194750 ... -0.122359 1.252131
reads.29503-3 -0.047958 -0.410738 ... -0.122359 0.058300
[14693 rows x 2448 columns]
根据上述结果,推测该数据集是四个数据集的原始表达矩阵(未标准化)直接组合,然后经过了标准化、归一化、PCA、聚类、注释和umap而成的Anndata。从这里我们也可以知道,怎样去制作用于输入BBKNN的数据集格式。
下面是官网教程的几个步骤。排除了少于5个细胞的细胞亚群,然后展示数据集的批次效应情况。
## Inspect the cell types observed in these studies.
counts = adata_all.obs.celltype.value_counts()
#! count
## To simplify visualization, let’s remove the 5 minority classes
# get the minority classes
minority_classes = counts.index[-5:].tolist()
# actually subset
adata_all = adata_all[~adata_all.obs.celltype.isin(minority_classes)]
# reorder according to abundance
adata_all.obs.celltype.cat.reorder_categories(counts.index[:-5].tolist(), inplace=True)
## Seeing the batch effect
sc.pp.pca(adata_all)
sc.pp.neighbors(adata_all)
sc.tl.umap(adata_all)
sc.pl.umap(adata_all, color=['sample'], palette=sc.pl.palettes.vega_20_scanpy)
sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(7, 3), facecolor='white')
sc.pl.umap(adata_all, color=['celltype'], palette=sc.pl.palettes.vega_20_scanpy)
可以看到,批次效应很严重。接下来使用BBKNN校正。
## BBKNN
sc.external.pp.bbknn(adata_all, batch_key='batch')
sc.tl.umap(adata_all)
sc.pl.umap(adata_all, color=['batch', 'celltype'])
从下图右的细胞注释结果的分布上看,不同样本的同种细胞被成功的嵌入到了一起,证明了BBKNN具有良好的去除批次效应的效果。