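Official scanpy tutorial: https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html
The official tutorials are split into several parts, and each part uses a different practice dataset. The data for this part can be downloaded from: http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
First, download and unpack the data: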
$ wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
$ tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
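Then install scanpy:
$ pip install scanpy
Start Python and import the package. If the import raises no error, the installation worked:
>>> import scanpy as sc
If the import fails with an error about a missing module such as tqdm.auto, reinstall tqdm:
# Exit Python, then run the following:
$ pip uninstall tqdm # uninstall first
$ pip install tqdm # then reinstall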
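Before reinstalling anything, it can help to check which versions are actually installed. The snippet below is a minimal sketch that uses only the Python standard library; the helper name installed_versions and the package list are my own choices for illustration and are not part of scanpy.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The package list is an assumption: adjust it to whatever the error message mentions.
for name, version in installed_versions(["scanpy", "anndata", "tqdm"]).items():
    print(f"{name}: {version or 'not installed'}")
```

A package reported as "not installed", or a very old tqdm version, would be a reasonable first suspect for the import error above.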
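Preparing the data
# The example data are about 3,000 PBMCs (peripheral blood mononuclear cells) from a healthy donor, generated on the 10x Genomics platform
# Import the packages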
>>> import numpy as np
>>> import pandas as pd
>>> import scanpy as sc
# Set the path of the .h5ad file that will store the analysis results
>>> results_file = './write/pbmc3k.h5ad'
# Read the data
>>> adata = sc.read_10x_mtx(
...     './filtered_gene_bc_matrices/hg19/',
...     var_names='gene_symbols',
...     cache=True)
>>> adata.var_names_make_unique() # not needed if you used var_names='gene_ids' in the previous step
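Gene symbols are not guaranteed to be unique, which is why the call above is needed when var_names='gene_symbols'. The following is a simplified pure-Python sketch of the idea (keep the first occurrence, append a numeric suffix to later duplicates). The function make_names_unique is hypothetical and is not anndata's implementation, although names such as PNRC2-1 and SRSF10-1 in the adata.var output later in this post follow the same pattern.

```python
from collections import Counter


def make_names_unique(names, join="-"):
    """Keep the first occurrence of each name; suffix later duplicates with -1, -2, ..."""
    taken = set(names)          # every name already in use, including new suffixed ones
    first_seen = set()          # names whose first occurrence has been passed
    counter = Counter()
    unique = []
    for name in names:
        if name not in first_seen:
            first_seen.add(name)
            unique.append(name)
            continue
        # Duplicate: try name-1, name-2, ... until a free candidate is found.
        while True:
            counter[name] += 1
            candidate = f"{name}{join}{counter[name]}"
            if candidate not in taken:
                taken.add(candidate)
                unique.append(candidate)
                break
    return unique


genes = ["PNRC2", "SRSF10", "PNRC2", "SRSF10", "CST3"]
print(make_names_unique(genes))
# → ['PNRC2', 'SRSF10', 'PNRC2-1', 'SRSF10-1', 'CST3']
```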
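# Take a look at adata
>>> adata
AnnData object with n_obs × n_vars = 2700 × 32738
var: 'gene_ids'
The data contain 2,700 cells and 32,738 genes.
Data preprocessing
# Show the 20 genes that account for the largest fraction of counts per cell
>>> sc.pl.highest_expr_genes(adata, n_top=20)
# Filter cells and genes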
>>> sc.pp.filter_cells(adata, min_genes=200)
>>> sc.pp.filter_genes(adata, min_cells=3)
>>> adata
AnnData object with n_obs × n_vars = 2700 × 13714
obs: 'n_genes'
var: 'gene_ids', 'n_cells'
# Far fewer genes remain after filtering (32,738 down to 13,714), while all 2,700 cells pass
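To make the two thresholds concrete, here is a small NumPy sketch of the same kind of filtering on a made-up count matrix: a cell is kept if it has at least min_genes detected genes, and a gene is kept if it is detected in at least min_cells of the remaining cells. The function name and the toy values are assumptions for illustration only.

```python
import numpy as np

# Toy count matrix: rows are cells, columns are genes (values are made up).
counts = np.array([
    [5, 0, 1, 0],
    [0, 0, 0, 0],
    [2, 3, 0, 0],
    [1, 1, 4, 0],
])


def filter_cells_and_genes(matrix, min_genes, min_cells):
    """Drop cells with too few detected genes, then genes seen in too few cells."""
    genes_per_cell = (matrix > 0).sum(axis=1)
    keep_cells = genes_per_cell >= min_genes
    matrix = matrix[keep_cells]
    cells_per_gene = (matrix > 0).sum(axis=0)
    keep_genes = cells_per_gene >= min_cells
    return matrix[:, keep_genes], keep_cells, keep_genes


filtered, keep_cells, keep_genes = filter_cells_and_genes(counts, min_genes=2, min_cells=2)
print(filtered.shape)   # → (3, 3)
```

The empty second cell and the never-detected fourth gene are removed, which mirrors on a tiny scale what happened to the 19,024 genes dropped above.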
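# Identify mitochondrial genes and compute the fraction of each cell's counts that they contribute
>>> mito_genes = adata.var_names.str.startswith('MT-')
>>> adata.obs['percent_mito'] = np.sum(
...     adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
# Compute the total number of counts per cell
>>> adata.obs['n_counts'] = adata.X.sum(axis=1).A1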
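The same calculation on a dense toy matrix may make it easier to see what percent_mito holds. Note that despite its name it is a fraction between 0 and 1, so the threshold of 0.05 used later in this post corresponds to 5%. The gene names and counts below are made up for illustration.

```python
import numpy as np

gene_names = np.array(["MT-CO1", "CST3", "MT-ND1", "NKG7"])
# Rows are cells, columns follow gene_names (values are made up).
counts = np.array([
    [1, 5, 1, 3],
    [0, 8, 0, 2],
    [6, 2, 4, 8],
])

is_mito = np.char.startswith(gene_names, "MT-")   # boolean mask over genes
n_counts = counts.sum(axis=1)                     # total counts per cell
percent_mito = counts[:, is_mito].sum(axis=1) / n_counts

print(percent_mito)   # fractions 0.2, 0.0 and 0.5 for the three cells
```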
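# Visualize the quality metrics as violin plots
>>> sc.pl.violin(adata, ['n_genes', 'n_counts', 'percent_mito'],
...     jitter=0.4, multi_panel=True)
The same metrics can also be viewed as scatter plots:
>>> sc.pl.scatter(adata, x='n_counts', y='percent_mito')
>>> sc.pl.scatter(adata, x='n_counts', y='n_genes')
The first scatter plot shows the mitochondrial fraction against total counts. Overall it looks fine, with no extreme outliers:
The second scatter plot shows the number of detected genes against total counts per cell. The relationship is usually roughly linear, and a steeper slope is better because it means more genes are detected with fewer counts:
Next, filter further: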
>>> adata = adata[adata.obs.n_genes < 2500, :]
>>> adata = adata[adata.obs.percent_mito < 0.05, :]
>>> adata
View of AnnData object with n_obs × n_vars = 2638 × 13714
obs: 'n_genes', 'percent_mito', 'n_counts'
var: 'gene_ids', 'n_cells'
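The AnnData object above is somewhat like the objects used by the three major R packages. Its structure looks like this:
Let's look at what the object contains in detail:
>>> adata.obs # for each barcode: number of detected genes, mitochondrial fraction, total counts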
n_genes percent_mito n_counts
AAACATACAACCAC-1 781 0.030178 2419.0
AAACATTGAGCTAC-1 1352 0.037936 4903.0
AAACATTGATCAGC-1 1131 0.008897 3147.0
AAACCGTGCTTCCG-1 960 0.017431 2639.0
AAACCGTGTATGCG-1 522 0.012245 980.0
... ... ... ...
TTTCGAACTCTCAT-1 1155 0.021104 3459.0
TTTCTACTGAGGCA-1 1227 0.009294 3443.0
TTTCTACTTCCTCG-1 622 0.021971 1684.0
TTTGCATGAGAGGC-1 454 0.020548 1022.0
TTTGCATGCCTCAC-1 724 0.008065 1984.0
[2638 rows x 3 columns]
>>> adata.var # for each gene: the number of cells in which it is detected
gene_ids n_cells
AL627309.1 ENSG00000237683 9
AP006222.2 ENSG00000228463 3
RP11-206L10.2 ENSG00000228327 5
RP11-206L10.9 ENSG00000237491 3
LINC00115 ENSG00000225880 18
... ... ...
AC145212.1 ENSG00000215750 16
AL592183.1 ENSG00000220023 323
AL354822.1 ENSG00000215615 8
PNRC2-1 ENSG00000215700 110
SRSF10-1 ENSG00000215699 69
[13714 rows x 2 columns]
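For more details on the AnnData object, see: Single-cell transcriptome data analysis || scanpy tutorial: preprocessing and clustering
Normalization:
# Normalize each cell to a total of 10,000 counts
>>> sc.pp.normalize_total(adata, target_sum=1e4)
# Log-transform the normalized values, log(1 + x)
>>> sc.pp.log1p(adata)
# Store the normalized, log-transformed data in the .raw attribute for later use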
>>> adata.raw = adata
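As a rough illustration of what the two normalization calls above do, the sketch below scales each cell to the same total and then applies log(1 + x) with NumPy. It is a simplified stand-in under the assumption that no cell has zero total counts; the function name normalize_and_log is mine.

```python
import numpy as np


def normalize_and_log(counts, target_sum=1e4):
    """Scale every cell (row) to target_sum total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)   # assumes no all-zero cells
    scaled = counts / totals * target_sum
    return np.log1p(scaled)


# Two cells with the same composition but a 10-fold difference in depth.
counts = np.array([[1, 3], [10, 30]])
normalized = normalize_and_log(counts)
print(normalized)
```

Both rows end up identical, which is the point of the step: differences in sequencing depth between cells no longer dominate the expression values.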
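Identifying highly variable genes:
# Identify highly variable genes
>>> sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
# Plot the highly variable genes
>>> sc.pl.highly_variable_genes(adata)
In the plot above, black dots are highly variable genes and gray dots are all other genes.
# Keep only the highly variable genes for downstream analysis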
>>> adata = adata[:, adata.var.highly_variable]
>>> adata
View of AnnData object with n_obs × n_vars = 2638 × 1838
obs: 'n_genes', 'percent_mito', 'n_counts'
var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p'
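# Regress out the effects of total counts per cell and of the mitochondrial fraction
>>> sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
# Scale each gene to unit variance and clip values that exceed 10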
>>> sc.pp.scale(adata, max_value=10)
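Conceptually, the two calls above fit a linear model of each gene on the covariates, keep the residuals, and then standardize each gene. The NumPy sketch below shows that idea with ordinary least squares on random toy data. It is an assumption-laden illustration, not scanpy's implementation: the function name is mine, and it clips only on the upper side, so check the documentation of your scanpy version for the exact clipping behavior.

```python
import numpy as np


def regress_out_and_scale(expr, covariates, max_value=10.0):
    """Remove linear covariate effects per gene, then scale to unit variance and clip."""
    expr = np.asarray(expr, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    # Design matrix with an intercept column followed by the covariates.
    design = np.column_stack([np.ones(len(covariates)), covariates])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    residuals = expr - design @ coef
    std = residuals.std(axis=0, ddof=1)
    std[std == 0] = 1.0                      # avoid division by zero for constant genes
    scaled = (residuals - residuals.mean(axis=0)) / std
    return np.clip(scaled, None, max_value)  # upper clip only (an assumption of this sketch)


rng = np.random.default_rng(0)
covariates = rng.normal(size=(50, 2))        # stand-ins for n_counts and percent_mito
expr = covariates @ rng.normal(size=(2, 3)) + rng.normal(size=(50, 3))
scaled = regress_out_and_scale(expr, covariates)
print(scaled.shape)
```

After this step the values for each gene have mean 0 and unit variance, and they are no longer linearly correlated with the covariates.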
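Dimensionality reduction
PCA:
# Run PCA
>>> sc.tl.pca(adata, svd_solver='arpack')
# Plot the PCA result, colored by CST3 expression
>>> sc.pl.pca(adata, color='CST3')
A scree plot (the variance ratio of each PC) can help decide how many PCs to use when computing cell neighborhoods:
>>> sc.pl.pca_variance_ratio(adata, log=True)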
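The quantity shown in such a scree plot can be computed by hand, which may help when choosing the number of PCs. This is a minimal NumPy sketch on random toy data, assuming a plain SVD of the centered matrix; the function name pca_variance_ratio is hypothetical and the numbers have nothing to do with the PBMC data.

```python
import numpy as np


def pca_variance_ratio(data, n_comps):
    """Fraction of total variance explained by each of the first n_comps PCs."""
    centered = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained_variance = singular_values**2 / (data.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    return ratio[:n_comps]


rng = np.random.default_rng(0)
# Five toy features with very different spreads, so the first PCs dominate.
toy = rng.normal(size=(30, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
ratio = pca_variance_ratio(toy, n_comps=3)
print(ratio, np.cumsum(ratio))
```

The "elbow" is the point after which additional PCs add little to the cumulative sum.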
Save the results:
>>> adata.write(results_file)
>>> adata
AnnData object with n_obs × n_vars = 2638 × 1838
obs: 'n_genes', 'percent_mito', 'n_counts'
var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p', 'pca'
obsm: 'X_pca'
varm: 'PCs'
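Clustering
Before clustering, compute the neighborhood graph. Start with the values used in the official tutorial (n_neighbors=10 and n_pcs=40 are passed explicitly here rather than left at scanpy's defaults):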
>>> sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
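The core idea behind the neighborhood graph is a k-nearest-neighbor search among cells in PCA space. The sketch below shows an exact brute-force version with NumPy on toy points. It is only an illustration of the concept: scanpy's own search, its edge weighting, and whether a cell counts as its own neighbor are not modeled here, and the function name knn_indices is mine.

```python
import numpy as np


def knn_indices(points, n_neighbors):
    """Indices of the n_neighbors closest other points for each point (Euclidean)."""
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs**2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbor here
    return np.argsort(dist, axis=1)[:, :n_neighbors]


# Four toy "cells" on a line; with real data each row would hold n_pcs PCA coordinates.
points = np.array([[0.0], [1.0], [3.0], [7.0]])
neighbors = knn_indices(points, n_neighbors=1)
print(neighbors.ravel())   # → [1 0 1 2]
```

Using fewer PCs changes the coordinates each cell is compared on, which is why the embedding and the clusters can change with n_pcs.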
Embed the neighborhood graph in two dimensions with UMAP:
>>> sc.tl.umap(adata)
>>> sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'])
If I change n_pcs=40 above to n_pcs=8 (a value read off the elbow of the variance-ratio plot above), the embedding looks different:
# Cluster the neighborhood graph
>>> sc.tl.leiden(adata)
>>> sc.pl.umap(adata, color=['leiden', 'CST3', 'NKG7'])
>>> adata.write("umap.h5ad")
>>> adata
AnnData object with n_obs × n_vars = 2638 × 1838
obs: 'n_genes', 'percent_mito', 'n_counts', 'leiden'
var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
Finding marker genes
Rank the top differentially expressed genes for each cluster:
(1) You can use the t-test:
>>> sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
>>> sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False,fontsize=5)
>>> sc.settings.verbosity = 2 # reduce the verbosity
(2) Alternatively, use the Wilcoxon rank-sum test (recommended):
>>> sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
>>> sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False, fontsize=5)
Then inspect the differentially expressed genes of any one cluster:
>>> sc.get.rank_genes_groups_df(adata, group="0")
scores names logfoldchanges pvals pvals_adj
0 32.783016 S100A9 7.394584 1.028132e-235 8.518724e-232
1 32.777248 LYZ 6.225223 1.242340e-235 8.518724e-232
2 32.260483 S100A8 7.533852 2.508019e-228 1.146499e-224
3 30.617512 TYROBP 5.391633 7.157181e-206 2.453839e-202
4 29.973402 FCN1 5.507009 2.180708e-197 5.981246e-194
.. ... ... ... ... ...
95 12.487227 LY86 2.596598 8.765488e-36 7.285449e-34
96 12.318965 C20orf24 1.944652 7.160747e-35 5.676444e-33
97 12.233775 CNPY3 2.075463 2.051759e-34 1.617116e-32
98 12.183536 TCEB2 1.488625 3.804261e-34 2.981236e-32
99 12.136163 ASGR1 5.176166 6.793834e-34 5.293786e-32
[100 rows x 5 columns]
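The pvals_adj column holds p-values corrected for multiple testing. The values in the table above are consistent with a Benjamini-Hochberg correction over 13,714 genes: for the second row, 1.242340e-235 × 13714 / 2 ≈ 8.5187e-232. The sketch below implements that correction with NumPy on made-up p-values; the function name benjamini_hochberg is mine and this is not scanpy's code.

```python
import numpy as np


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    order = np.argsort(pvals)
    ranked = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value and moving down.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]      # made-up p-values
print(benjamini_hochberg(pvals))       # adjusted values 0.02, 0.04, 0.04, 0.02
```

The monotonicity step is why the first two rows of the table share the same adjusted value.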
Show the top 10 genes of every cluster:
>>> pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(10)
0 1 2 3 4 5 ... 7 8 9 10 11 12
0 S100A9 CD74 LDHB RPL32 RPS12 CCL5 ... LST1 IL32 NKG7 NKG7 HLA-DPA1 PF4
1 LYZ CD79A LTB RPS12 RPS6 GZMK ... FCER1G CD3D GNLY CCL5 HLA-DPB1 GNG11
2 S100A8 HLA-DRA CD3D RPS6 RPL13 NKG7 ... COTL1 LDHB GZMB GZMH HLA-DRB1 SDPR
3 TYROBP CD79B IL32 RPS27 RPS3A IL32 ... AIF1 LTB CTSW CST7 HLA-DRA PPBP
4 FCN1 HLA-DPB1 TMEM66 RPS3 RPL3 GZMA ... FTH1 CD3E PRF1 B2M CD74 NRGN
5 FTL MS4A1 IL7R RPS14 RPS3 CTSW ... IFITM3 B2M CST7 HLA-C CST3 SPARC
6 CST3 HLA-DQA1 JUNB RPL13 RPL32 B2M ... SAT1 RPS3 GZMA HLA-A HLA-DRB5 GPX1
7 LGALS2 HLA-DRB1 TPT1 RPL21 RPS14 CST7 ... FTL RPS27 HLA-C GZMA HLA-DQA1 TPM4
8 S100A6 HLA-DQB1 GIMAP7 RPS25 RPS18 HLA-C ... PSAP RPS25 FGFBP2 CD3D HLA-DQB1 RGS18
9 FTH1 CD37 CD3E RPL31 EEF1A1 LYAR ... CTSS HLA-A B2M FGFBP2 FCER1A CALM3
Show the top 10 genes together with their p-values:
>>> result = adata.uns['rank_genes_groups']
>>> groups = result['names'].dtype.names
>>> pd.DataFrame(
... {group + '_' + key[:1]: result[key][group]
... for group in groups for key in ['names', 'pvals']}).head(10)
0_n 0_p 1_n 1_p ... 11_n 11_p 12_n 12_p
0 S100A9 1.028132e-235 CD74 3.599677e-184 ... HLA-DPA1 1.097649e-19 PF4 4.722886e-10
1 LYZ 1.242340e-235 CD79A 4.507668e-171 ... HLA-DPB1 7.563026e-19 GNG11 4.733899e-10
2 S100A8 2.508019e-228 HLA-DRA 5.186291e-168 ... HLA-DRB1 2.189338e-18 SDPR 4.733899e-10
3 TYROBP 7.157181e-206 CD79B 5.747138e-156 ... HLA-DRA 6.262321e-18 PPBP 4.744938e-10
4 FCN1 2.180708e-197 HLA-DPB1 1.257935e-147 ... CD74 2.286726e-17 NRGN 4.800511e-10
5 FTL 7.485109e-192 MS4A1 1.334052e-140 ... CST3 2.883852e-17 SPARC 4.947990e-10
6 CST3 5.932796e-191 HLA-DQA1 3.534036e-140 ... HLA-DRB5 6.818970e-17 GPX1 4.947990e-10
7 LGALS2 9.551453e-189 HLA-DRB1 1.510260e-130 ... HLA-DQA1 1.022928e-16 TPM4 5.159513e-10
8 S100A6 3.253730e-188 HLA-DQB1 6.020741e-130 ... HLA-DQB1 5.566163e-16 RGS18 5.195614e-10
9 FTH1 4.782225e-180 CD37 6.463741e-130 ... FCER1A 1.482060e-15 CALM3 6.197000e-10
[10 rows x 26 columns]
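The dictionary comprehension above works because each entry of the result is a NumPy structured array with one field per cluster, so dtype.names lists the cluster labels. The following self-contained sketch rebuilds that layout for two clusters. The gene names are taken from the tables above, while the p-values are rounded placeholders for illustration only.

```python
import numpy as np

# One field per cluster ("0" and "1"), one row per rank.
names = np.array([("S100A9", "CD74"), ("LYZ", "CD79A")],
                 dtype=[("0", "U10"), ("1", "U10")])
pvals = np.array([(1e-235, 4e-184), (1e-235, 5e-171)],   # placeholder values
                 dtype=[("0", "f8"), ("1", "f8")])
result = {"names": names, "pvals": pvals}

groups = result["names"].dtype.names          # ('0', '1')
columns = {
    f"{group}_{key[:1]}": result[key][group].tolist()
    for group in groups
    for key in ("names", "pvals")
}
print(list(columns))   # → ['0_n', '0_p', '1_n', '1_p']
```

Passing a dictionary like columns to pd.DataFrame gives the side-by-side table shown above, with the key suffix n for names and p for p-values.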
You can also compare just two clusters with each other:
# For example, cluster 0 against cluster 1:
>>> sc.tl.rank_genes_groups(adata, 'leiden', groups=['0'], reference='1', method='wilcoxon')
>>> sc.pl.rank_genes_groups(adata, groups=['0'], n_genes=20)
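Now look at the violin plots:
>>> sc.pl.rank_genes_groups_violin(adata, groups=['0'], n_genes=20)
To see cluster 0 compared with all other clusters again, reload the saved results. This only works if adata.write(results_file) was run after the all-clusters Wilcoxon ranking. In this walkthrough the file was last written right after PCA, so otherwise rerun sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon') instead of reloading:
>>> adata = sc.read("./write/pbmc3k.h5ad")
>>> sc.pl.rank_genes_groups_violin(adata, groups='0', n_genes=20)
Compare specific genes across all clusters: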
>>> sc.pl.violin(adata, ['CST3', 'NKG7', 'PPBP'], groupby='leiden')
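NOTE: The next step stores the cell type names in a list and uses it to annotate the clusters. The official tutorial ends up with a different number of clusters than my run (8 there, 13 here), so I skip this step for now.
# You must supply exactly as many names as there are clusters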
>>> new_cluster_names = [
'CD4 T', 'CD14 Monocytes',
'B', 'CD8 T',
'NK', 'FCGR3A Monocytes',
'Dendritic', 'Megakaryocytes']
>>> adata.rename_categories('leiden', new_cluster_names)
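When the number of clusters differs from the tutorial, as it does here, one option is to label only the clusters you are reasonably sure about and leave the rest numbered. The sketch below builds such a mapping in plain Python. The helper build_rename_map is hypothetical, and the three labels are illustrative guesses based on the top genes listed earlier (for example S100A9/LYZ in cluster 0 and PF4/PPBP in cluster 12), not a validated annotation.

```python
def build_rename_map(categories, labels_by_cluster):
    """Map every cluster id to a label, keeping the id itself when no label is given."""
    unknown = set(labels_by_cluster) - set(categories)
    if unknown:
        raise ValueError(f"labels given for clusters that do not exist: {sorted(unknown)}")
    return {cat: labels_by_cluster.get(cat, cat) for cat in categories}


# 13 clusters, "0" to "12", as in the tables above.
categories = [str(i) for i in range(13)]
# Illustrative guesses only; confirm with marker genes before using them.
partial_labels = {"0": "CD14 Monocytes", "1": "B", "12": "Megakaryocytes"}

rename_map = build_rename_map(categories, partial_labels)
new_cluster_names = [rename_map[cat] for cat in categories]
print(new_cluster_names)
```

The resulting list has one entry per cluster, in category order, which is the shape the rename call above expects.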
>>> sc.pl.umap(adata, color='leiden', legend_loc='on data', title='', frameon=False, save='.pdf')
Visualize the marker genes of each cluster:
>>> marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']
>>> ax = sc.pl.dotplot(adata, marker_genes, groupby='leiden')
The above is a dot plot. Now draw a stacked violin plot as well:
>>> ax = sc.pl.stacked_violin(adata, marker_genes, groupby='leiden', rotation=90)
That wraps up this part, which is mainly a brief introduction to data preprocessing and clustering with scanpy. Let's review what the adata object now contains:
>>> adata
AnnData object with n_obs × n_vars = 2638 × 1838
obs: 'n_genes', 'percent_mito', 'n_counts', 'leiden'
var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
To save only parts of the adata object as CSV files:
# Export single fields of the annotation of observations
>>> adata.obs[['n_counts', 'leiden']].to_csv(
...     './write/pbmc3k_corrected_leiden_groups.csv')
# Export single columns of the multidimensional annotation
>>> adata.obsm.to_df()[['X_pca1', 'X_pca2']].to_csv(
...     './write/pbmc3k_corrected_X_pca.csv')
# Or export everything except the data using `.write_csvs`.
# Set `skip_data=False` if you also want to export the data.
>>> adata.write_csvs(results_file[:-5])