数据背景接上一篇:Python图文复现2022||01-文献阅读:致命COVID-19分子单细胞肺图谱
数据获取
有三种途径:
- 处理后的数据:single-cell portal: https://singlecell.broadinstitute.org/single_cell/study/SCP1219.
- 处理后的数据还可以在GEO获取:GSE171524
- 原始数据:the Broad Data Use and Oversight System: https://duos.broadinstitute.org (study ID DUOS-000130),但是需要注册,注册要求好像还有丢丢特殊
这次就从GEO下载吧,下载完后:3个文件,一个处理后的csv表达数据,一个metadata,一个原始count数据压缩包tar
原文代码:https://github.com/IzarLab/CUIMC-NYP_COVID_autopsy_lung
但是本次我们使用一个利用这个数据讲python学习的资源,
- 视频如下:https://www.youtube.com/watch?v=uvyG9yLuNSE
视频相关代码如下:
- 代码:https://github.com/mousepixels/sanbomics
- 主代码:https://github.com/mousepixels/sanbomics/blob/main/single_cell_analysis_complete_class.ipynb
数据下载后,开工!
环境准备:
# 使用conda安装好相关软件
conda activate scanpy
# 导入包,注意dir路径改成自己的
import scanpy as sc
dir = '/path/data/GSE171524/GSE171524_RAW/'
数据读取
GSM5226574_C51样本是个肺对照样本,总共包含6099个细胞,34546个基因。
# 读取数据
adata = sc.read_csv(dir + 'GSM5226574_C51ctr_raw_counts.csv').T
adata
#AnnData object with n_obs × n_vars = 6099 × 34546
# 查看数据维度
adata.X.shape
#(6099, 34546)
Doublet过滤
# pip install scvi-tools
import scvi
# 过滤低表达基因以及高变基因选择
sc.pp.filter_genes(adata, min_cells = 10)
sc.pp.highly_variable_genes(adata, n_top_genes = 2000, subset = True, flavor = 'seurat_v3')
# 训练模型
scvi.model.SCVI.setup_anndata(adata)
vae = scvi.model.SCVI(adata)
vae.train()
#GPU available: False, used: False
#TPU available: False, using: 0 TPU cores
#IPU available: False, using: 0 IPUs
#HPU available: False, using: 0 HPUs
#Epoch 400/400: 100%|████████████████████████████████████████████████████████| 400/400 [23:10<00:00, 3.21s/it, loss=320, v_num=1]`Trainer.fit` stopped: `max_epochs=400` reached.
#Epoch 400/400: 100%|████████████████████████████████████████████████████████| 400/400 [23:10<00:00, 3.48s/it, loss=320, v_num=1]
solo = scvi.external.SOLO.from_scvi_model(vae)
solo.train()
df = solo.predict()
df['prediction'] = solo.predict(soft = False)
df.index = df.index.map(lambda x: x[:-2])
df
结果:doublet这一列的值越高,表明这个细胞约可能是双包体;
预测结果统计:
有1245个细胞被预测为双包体,占总细胞的20%左右,这对于10X来说,双包率有点太高了
df.groupby('prediction').count()
doublet singlet
prediction
doublet 1245 1245
singlet 4854 4854
因此,这里计算一个df值:
df['dif'] = df.doublet - df.singlet
df
新增一列df值:
给这个df绘制一个分布图:
import seaborn as sns
import matplotlib.pyplot as plt
sns.displot(df[df.prediction == 'doublet'], x = 'dif')
plt.savefig(dir+"01-df-displot.png")
结果图:
因此,将df大于1的预测为双包体:
doublets = df[(df.prediction == 'doublet') & (df.dif > 1)]
doublets
adata = sc.read_csv(dir+'GSM5226574_C51ctr_raw_counts.csv').T
adata.obs['doublet'] = adata.obs.index.isin(doublets.index)
adata.obs
预测结果:
去除:还剩余5618个细胞
adata = adata[~adata.obs.doublet]
adata
View of AnnData object with n_obs × n_vars = 5618 × 34546
obs: 'doublet'
预处理
# 计算线粒体基因比例
adata.var['mt'] = adata.var.index.str.startswith('MT-')
adata.var
# 核糖体RNA基因
import pandas as pd
ribo_url = "http://software.broadinstitute.org/gsea/msigdb/download_geneset.jsp?geneSetName=KEGG_RIBOSOME&fileType=txt"
ribo_genes = pd.read_table(ribo_url, skiprows=2, header = None)
ribo_genes
# 0
# 0 FAU
# 1 MRPL13
# 2 RPL10
# 3 RPL10A
# 4 RPL10L
# .. ...
# 83 RPS9
# 84 RPSA
# 85 RSL24D1
# 86 RSL24D1P11
# 87 UBA52
# [88 rows x 1 columns]
adata.var['ribo'] = adata.var_names.isin(ribo_genes[0].values)
接着计算qc指标
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt', 'ribo'], percent_top=None, log1p=False, inplace=True)
adata.var.sort_values('n_cells_by_counts')
结果如下:
低表达过滤:
sc.pp.filter_genes(adata, min_cells=3)
adata.var.sort_values('n_cells_by_counts')
adata.obs.sort_values('n_genes_by_counts')
# 绘制小提琴图
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt', 'pct_counts_ribo'], jitter=0.4, multi_panel=True)
plt.savefig(dir+"02-qc_violin.png")
结果图:
按照分位数来过滤细胞:
import numpy as np
upper_lim = np.quantile(adata.obs.n_genes_by_counts.values, .98)
#upper_lim = 3000
upper_lim
adata = adata[adata.obs.n_genes_by_counts < upper_lim]
adata.obs
过滤之后:
线粒体比例与核糖体比例:
adata = adata[adata.obs.pct_counts_mt < 20]
adata = adata[adata.obs.pct_counts_ribo < 2]
adata
过滤后:
View of AnnData object with n_obs × n_vars = 5489 × 24080
obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'
标准化Normalization
标准化前后的区别可看:adata.X.sum(axis = 1)值的变化
adata.X.sum(axis = 1)
#normalize every cell to 10,000 UMI
sc.pp.normalize_total(adata, target_sum=1e4)
adata.X.sum(axis = 1)
#change to log counts
sc.pp.log1p(adata)
adata.X.sum(axis = 1)
adata.raw = adata
聚类Clustering
# 高变基因选择以及可视化
sc.pp.highly_variable_genes(adata, n_top_genes = 2000)
sc.pl.highly_variable_genes(adata)
plt.savefig(dir+"03-highly_variable_genes.png")
结果图:
选择主成分:
adata = adata[:, adata.var.highly_variable]
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt', 'pct_counts_ribo'])
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver='arpack')
sc.pl.pca_variance_ratio(adata, log=True, n_pcs = 50)
plt.savefig(dir+"04-pca_variance.png")
结果图:
这里选择30个PCs,然后聚类:
sc.pp.neighbors(adata, n_pcs = 30)
sc.tl.umap(adata)
sc.pl.umap(adata)
plt.savefig(dir+"05-umap.png")
结果图:
使用leiden在低维空间可视化:
sc.tl.leiden(adata, resolution = 0.5)
sc.pl.umap(adata, color=['leiden'])
plt.savefig(dir+"05-umap_leiden.png")
结果图:聚成了11类
单个样本的分析演示到这里,下期进行所有样本整合分析~