利用scanpy进行单细胞测序分析(一)预处理和聚类

scanpy软件官网:https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html

这个软件的官网分了几个部分进行介绍,每一个部分的练习数据都不一样,这一部分的练习数据下载地址:http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz

首先下载数据:

$ wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
$ tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz 

然后安装scanpy:

$ pip install scanpy

进入python调用,调用不出错就是安装好了:

>>> import scanpy as sc

如果调用的时候报错,告诉你缺少什么tqdm.auto之类的,你可以这样:

#退出python,输入下面的代码:
$ pip uninstall tqdm  #先卸载
$ pip install tqdm #再安装

准备数据

#示例数据来自健康人的3千个PBMC细胞,测序平台是10x Genomics
#调用软件
>>> import numpy as np 
>>> import pandas as pd 
>>> import scanpy as sc
#先设置一个h5ad的文件,这个文件用来保存我们一会儿分析的结果
>>> results_file = './write/pbmc3k.h5ad'
#读取数据
>>> adata = sc.read_10x_mtx(
    './filtered_gene_bc_matrices/hg19/',  
    var_names='gene_symbols',                  
    cache=True)
>>> adata.var_names_make_unique() #如果你上一步用的是`var_names='gene_ids',你就不用做这一步
#看一下adata
>>> adata
AnnData object with n_obs × n_vars = 2700 × 32738 
    var: 'gene_ids'

说明数据里有2700个细胞,32738个基因。

数据预处理

#看一下top20基因的表达情况
>>> sc.pl.highest_expr_genes(adata,n_top=20)
占所有的count里比例最高的20个基因
#过滤基因和细胞
>>> sc.pp.filter_cells(adata, min_genes=200)
>>> sc.pp.filter_genes(adata, min_cells=3)
>>> adata
AnnData object with n_obs × n_vars = 2700 × 13714 
    obs: 'n_genes'
    var: 'gene_ids', 'n_cells'
#过滤完基因少了很多
#找出线粒体基因
>>> mito_genes = adata.var_names.str.startswith('MT-')
>>> adata.obs['percent_mito'] = np.sum(
    adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
#计算每个细胞的总count数
>>> adata.obs['n_counts'] = adata.X.sum(axis=1).A1
#用小提琴图将质量信息可视化
>>> sc.pl.violin(adata, ['n_genes', 'n_counts', 'percent_mito'],
             jitter=0.4, multi_panel=True)

还可以换一种方式可视化:

>>> sc.pl.scatter(adata, x='n_counts', y='percent_mito')
>>> sc.pl.scatter(adata, x='n_counts', y='n_genes')

下面这张图就是线粒体的含量,整体还是不错的,没有特别大的异常值:

下图是细胞和基因的关系,一般是线性关系,斜率越大越好,说明我们可以用较少的细胞测到较多的基因:

下面进行进一步的过滤:

>>> adata = adata[adata.obs.n_genes < 2500, :]
>>> adata = adata[adata.obs.percent_mito < 0.05, :]
>>> adata
View of AnnData object with n_obs × n_vars = 2638 × 13714 
    obs: 'n_genes', 'percent_mito', 'n_counts'
    var: 'gene_ids', 'n_cells'

上面的AnnData对象有点像三大R包的对象,长这个样子:

具体的看一下对象里都有什么:

>>> adata.obs #每一个barcode对应的基因数,线粒体基因比例,检测到的count数
                  n_genes  percent_mito  n_counts
AAACATACAACCAC-1      781      0.030178    2419.0
AAACATTGAGCTAC-1     1352      0.037936    4903.0
AAACATTGATCAGC-1     1131      0.008897    3147.0
AAACCGTGCTTCCG-1      960      0.017431    2639.0
AAACCGTGTATGCG-1      522      0.012245     980.0
...                   ...           ...       ...
TTTCGAACTCTCAT-1     1155      0.021104    3459.0
TTTCTACTGAGGCA-1     1227      0.009294    3443.0
TTTCTACTTCCTCG-1      622      0.021971    1684.0
TTTGCATGAGAGGC-1      454      0.020548    1022.0
TTTGCATGCCTCAC-1      724      0.008065    1984.0

[2638 rows x 3 columns]
>>> adata.var  #每一个基因在多少个细胞里表达
                      gene_ids  n_cells
AL627309.1     ENSG00000237683        9
AP006222.2     ENSG00000228463        3
RP11-206L10.2  ENSG00000228327        5
RP11-206L10.9  ENSG00000237491        3
LINC00115      ENSG00000225880       18
...                        ...      ...
AC145212.1     ENSG00000215750       16
AL592183.1     ENSG00000220023      323
AL354822.1     ENSG00000215615        8
PNRC2-1        ENSG00000215700      110
SRSF10-1       ENSG00000215699       69

[13714 rows x 2 columns]

关于AnnData对象的更多具体细节请看:单细胞转录组数据分析|| scanpy教程:预处理与聚类

数据标准化:

#标准化
>>> sc.pp.normalize_total(adata, target_sum=1e4)
#log标准化后的值
>>> sc.pp.log1p(adata)
#将标准化后的数值存为.raw属性,方便后续分析
>>> adata.raw = adata

鉴定高变基因:

#鉴定高变基因:
>>> sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
#可视化高变基因
>>> sc.pl.highly_variable_genes(adata)

上图黑色的点是高变基因,其他基因是灰色点。

#只留下高变基因进行后续分析
>>> adata = adata[:, adata.var.highly_variable]
>>> adata
View of AnnData object with n_obs × n_vars = 2638 × 1838 
    obs: 'n_genes', 'percent_mito', 'n_counts'
    var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p'
#回归每个细胞的总计数和线粒体基因表达百分比的影响,将数据缩放到单位方差。
>>> sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
#scale数据
>>> sc.pp.scale(adata, max_value=10)

降维

PCA降维:

#PCA降维
>>> sc.tl.pca(adata, svd_solver='arpack')
#PCA可视化
>>> sc.pl.pca(adata, color='CST3')

也可以用碎石图来决定用多少个PC来进行临近细胞的计算:

sc.pl.pca_variance_ratio(adata, log=True)

保存结果:

>>> adata.write(results_file)
>>> adata
AnnData object with n_obs × n_vars = 2638 × 1838 
    obs: 'n_genes', 'percent_mito', 'n_counts'
    var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'pca'
    obsm: 'X_pca'
    varm: 'PCs'

聚类

聚类前先计算neighborhood graph,先用默认值来计算一下:

>>> sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)

将neighborhood graph嵌入:

>>> sc.tl.umap(adata)
>>> sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'])

如果我将上面的n_pcs=40换成n_pcs=8,(这个值根据上面的碎石图来,选拐点的数字)图就不一样了:

#neighborhood graph聚类
>>> sc.tl.leiden(adata)
>>> sc.pl.umap(adata, color=['leiden', 'CST3', 'NKG7'])
>>> adata.write("umap.h5ad")
>>> adata
AnnData object with n_obs × n_vars = 2638 × 1838 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'leiden'
    var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'

寻找Marker基因

给每一个cluster计算top差异基因:
(1)你可以使用t-test方法计算:

>>> sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
>>> sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False,fontsize=5)
>>> sc.settings.verbosity = 2  # reduce the verbosity

(2)你也可以用另一种方法来计算(推荐):

>>> sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
>>> sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False,fontsize=5)

然后可以看一下随便哪个cluster里的差异基因:

>>> sc.get.rank_genes_groups_df(adata, group="0")
       scores     names  logfoldchanges          pvals      pvals_adj
0   32.783016    S100A9        7.394584  1.028132e-235  8.518724e-232
1   32.777248       LYZ        6.225223  1.242340e-235  8.518724e-232
2   32.260483    S100A8        7.533852  2.508019e-228  1.146499e-224
3   30.617512    TYROBP        5.391633  7.157181e-206  2.453839e-202
4   29.973402      FCN1        5.507009  2.180708e-197  5.981246e-194
..        ...       ...             ...            ...            ...
95  12.487227      LY86        2.596598   8.765488e-36   7.285449e-34
96  12.318965  C20orf24        1.944652   7.160747e-35   5.676444e-33
97  12.233775     CNPY3        2.075463   2.051759e-34   1.617116e-32
98  12.183536     TCEB2        1.488625   3.804261e-34   2.981236e-32
99  12.136163     ASGR1        5.176166   6.793834e-34   5.293786e-32

[100 rows x 5 columns]

查看所有cluster里的top10基因:

>>> pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(10)
        0         1       2      3       4      5  ...       7      8       9      10        11     12
0  S100A9      CD74    LDHB  RPL32   RPS12   CCL5  ...    LST1   IL32    NKG7    NKG7  HLA-DPA1    PF4
1     LYZ     CD79A     LTB  RPS12    RPS6   GZMK  ...  FCER1G   CD3D    GNLY    CCL5  HLA-DPB1  GNG11
2  S100A8   HLA-DRA    CD3D   RPS6   RPL13   NKG7  ...   COTL1   LDHB    GZMB    GZMH  HLA-DRB1   SDPR
3  TYROBP     CD79B    IL32  RPS27   RPS3A   IL32  ...    AIF1    LTB    CTSW    CST7   HLA-DRA   PPBP
4    FCN1  HLA-DPB1  TMEM66   RPS3    RPL3   GZMA  ...    FTH1   CD3E    PRF1     B2M      CD74   NRGN
5     FTL     MS4A1    IL7R  RPS14    RPS3   CTSW  ...  IFITM3    B2M    CST7   HLA-C      CST3  SPARC
6    CST3  HLA-DQA1    JUNB  RPL13   RPL32    B2M  ...    SAT1   RPS3    GZMA   HLA-A  HLA-DRB5   GPX1
7  LGALS2  HLA-DRB1    TPT1  RPL21   RPS14   CST7  ...     FTL  RPS27   HLA-C    GZMA  HLA-DQA1   TPM4
8  S100A6  HLA-DQB1  GIMAP7  RPS25   RPS18  HLA-C  ...    PSAP  RPS25  FGFBP2    CD3D  HLA-DQB1  RGS18
9    FTH1      CD37    CD3E  RPL31  EEF1A1   LYAR  ...    CTSS  HLA-A     B2M  FGFBP2    FCER1A  CALM3

将top10基因和它们的pval同时展示:

>>> result = adata.uns['rank_genes_groups']
>>> groups = result['names'].dtype.names
>>> pd.DataFrame(
...     {group + '_' + key[:1]: result[key][group]
...     for group in groups for key in ['names', 'pvals']}).head(10)
      0_n            0_p       1_n            1_p  ...      11_n          11_p   12_n          12_p
0  S100A9  1.028132e-235      CD74  3.599677e-184  ...  HLA-DPA1  1.097649e-19    PF4  4.722886e-10
1     LYZ  1.242340e-235     CD79A  4.507668e-171  ...  HLA-DPB1  7.563026e-19  GNG11  4.733899e-10
2  S100A8  2.508019e-228   HLA-DRA  5.186291e-168  ...  HLA-DRB1  2.189338e-18   SDPR  4.733899e-10
3  TYROBP  7.157181e-206     CD79B  5.747138e-156  ...   HLA-DRA  6.262321e-18   PPBP  4.744938e-10
4    FCN1  2.180708e-197  HLA-DPB1  1.257935e-147  ...      CD74  2.286726e-17   NRGN  4.800511e-10
5     FTL  7.485109e-192     MS4A1  1.334052e-140  ...      CST3  2.883852e-17  SPARC  4.947990e-10
6    CST3  5.932796e-191  HLA-DQA1  3.534036e-140  ...  HLA-DRB5  6.818970e-17   GPX1  4.947990e-10
7  LGALS2  9.551453e-189  HLA-DRB1  1.510260e-130  ...  HLA-DQA1  1.022928e-16   TPM4  5.159513e-10
8  S100A6  3.253730e-188  HLA-DQB1  6.020741e-130  ...  HLA-DQB1  5.566163e-16  RGS18  5.195614e-10
9    FTH1  4.782225e-180      CD37  6.463741e-130  ...    FCER1A  1.482060e-15  CALM3  6.197000e-10

[10 rows x 26 columns]

你还可以单独比较某两个cluster的差异基因:

#比如cluster 0和1:
>>> sc.tl.rank_genes_groups(adata, 'leiden', groups=['0'], reference='1', method='wilcoxon')
>>> sc.pl.rank_genes_groups(adata, groups=['0'], n_genes=20)

再看一下小提琴图:

>>> sc.pl.rank_genes_groups_violin(adata, groups=['0'], n_genes=20)

重新load数据,可以看cluster 0和其他的cluster的比较:

>>> adata=sc.read("./write/pbmc3k.h5ad")
>>> sc.pl.rank_genes_groups_violin(adata, groups='0', n_genes=20)

比较所有cluster里某些特定的基因:

>>> sc.pl.violin(adata, ['CST3', 'NKG7', 'PPBP'], groupby='leiden')

NOTE:下一步将细胞名称先存成一个列表(这一步官网的cluster数量和我的不一样,所以就先不做),把聚类的细胞注释:

#你有多少个cluster就要输入多少个名称
>>> new_cluster_names = [
    'CD4 T', 'CD14 Monocytes',
    'B', 'CD8 T',
    'NK', 'FCGR3A Monocytes',
    'Dendritic', 'Megakaryocytes']
>>> adata.rename_categories('leiden', new_cluster_names)
>>> sc.pl.umap(adata, color='leiden', legend_loc='on data', title='', frameon=False, save='.pdf')

可视化每一个cluster的marker基因:

>>> marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
                'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
                'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']
>>> ax = sc.pl.dotplot(adata, marker_genes, groupby='leiden')

上面是点图,再画个小提琴图看一下:

>>> ax = sc.pl.stacked_violin(adata, marker_genes, groupby='leiden', rotation=90)

这一部分就告一段落,主要是简单的介绍一下scanpy的数据预处理和聚类。我们可以回顾一下adata这个对象里现在都存了哪些内容:

>>> adata
AnnData object with n_obs × n_vars = 2638 × 1838 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'leiden'
    var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'

如果想保存adata对象里某一些内容,存成csv格式:

# Export single fields of the annotation of observations
>>> adata.obs[['n_counts', 'louvain_groups']].to_csv(
     './write/pbmc3k_corrected_louvain_groups.csv')
# Export single columns of the multidimensional annotation
>>> adata.obsm.to_df()[['X_pca1', 'X_pca2']].to_csv(
     './write/pbmc3k_corrected_X_pca.csv')
# Or export everything except the data using `.write_csvs`.
# Set `skip_data=False` if you also want to export the data.
>>> adata.write_csvs(results_file[:-5], )

你可能感兴趣的:(利用scanpy进行单细胞测序分析(一)预处理和聚类)