上篇处理了最简单的单细胞公共数据,即1个样本对应1个矩阵文件(tsv),读取统计就行了,但是GEO公共数据库中还存在如下的情况:例如GSE178318,15个样本的信息全部混在一块了,针对这种情况,我们怎么统计指定样本的线粒体基因分布,中值基因,中值UMI这些指标,逻辑就是拆分样本,如何拆分,下面会娓娓道来~~~
打开GEO界面,红框处6个样本的数据需要统计:
15个样本的数据见下矩阵:
拆分样本
1、下载3个矩阵文件
wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE178nnn/GSE178318/suppl/GSE178318_barcodes.tsv.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE178nnn/GSE178318/suppl/GSE178318_genes.tsv.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE178nnn/GSE178318/suppl/GSE178318_matrix.mtx.gz
barcodes.tsv.gz features.tsv.gz matrix.mtx.gz #将下载下来的3个文件重新命名,命令为mv 原名 新名,方便直接用Read10X函数读取
2、在 Seurat 中,对于 Cellranger 数据的导入,它提供了两个函数 Read10X 和 Read10X_h5。前者用于读取 Cellranger 生成的 3 个文件:barcodes.tsv,features.tsv,matrix.mtx,后者则用于读取生成的 xxxxxx.h5文件,这里我们应用前面一个函数就行
R #linux系统调用R
library(Seurat) #加载seurat软件
a <- Read10X("~/SingleronTest/wuxuan/leo") #读取3个文件
a <- CreateSeuratObject(a) #创建Seurat对象
head([email protected]) #查看读取的数据前6行,结果如下展示
orig.ident nCount_RNA nFeature_RNA
AAACCTGAGAAACCTA_COL07_CRC SeuratProject 5565 1241
AAACCTGAGACAATAC_COL07_CRC SeuratProject 5892 1845
AAACCTGAGACGCAAC_COL07_CRC SeuratProject 3372 933
AAACCTGAGCAGATCG_COL07_CRC SeuratProject 945 548
AAACCTGAGCTATGCT_COL07_CRC SeuratProject 1565 571
AAACCTGAGGACATTA_COL07_CRC SeuratProject 1865 816
head([email protected], 8) #查看前8行的命令
3、正式拆分,使用tidyverse函数
library(tidyverse) #加载tidyverse
4、拆分脚本
rownames([email protected]) -> [email protected]$rowLeo #反向赋值,增加了1列
head([email protected]) #展示如下
orig.ident nCount_RNA nFeature_RNA
AAACCTGAGAAACCTA_COL07_CRC SeuratProject 5565 1241
AAACCTGAGACAATAC_COL07_CRC SeuratProject 5892 1845
AAACCTGAGACGCAAC_COL07_CRC SeuratProject 3372 933
AAACCTGAGCAGATCG_COL07_CRC SeuratProject 945 548
AAACCTGAGCTATGCT_COL07_CRC SeuratProject 1565 571
AAACCTGAGGACATTA_COL07_CRC SeuratProject 1865 816
rowLeo
AAACCTGAGAAACCTA_COL07_CRC AAACCTGAGAAACCTA_COL07_CRC
AAACCTGAGACAATAC_COL07_CRC AAACCTGAGACAATAC_COL07_CRC
AAACCTGAGACGCAAC_COL07_CRC AAACCTGAGACGCAAC_COL07_CRC
AAACCTGAGCAGATCG_COL07_CRC AAACCTGAGCAGATCG_COL07_CRC
AAACCTGAGCTATGCT_COL07_CRC AAACCTGAGCTATGCT_COL07_CRC
AAACCTGAGGACATTA_COL07_CRC AAACCTGAGGACATTA_COL07_CRC
str_split(a$rowLeo,"_") #拆的结果如下展示
[[99997]]
[1] "CCGTACTAGGGTATCG" "COL17" "LM"
[[99998]]
[1] "CCGTACTCAAGCCGCT" "COL17" "LM"
[[99999]]
[1] "CCGTACTCAAGGGTCA" "COL17" "LM"
unlist(str_split(a$rowLeo,"_")) #拆的结果如下展示
[99988] "AACTCAGCAAGGTTTC" "COL12" "CRC"
[99991] "AACTCAGCACTCAGGC" "COL12" "CRC"
[99994] "AACTCAGGTATAGTAG" "COL12" "CRC"
[99997] "AACTCAGGTCATATCG" "COL12" "CRC"
[ reached getOption("max.print") -- omitted 320844 entries ]
head(str_split(a$rowLeo,"_",simplify=T)) #展示前6行
[,1] [,2] [,3]
[1,] "AAACCTGAGAAACCTA" "COL07" "CRC"
[2,] "AAACCTGAGACAATAC" "COL07" "CRC"
[3,] "AAACCTGAGACGCAAC" "COL07" "CRC"
[4,] "AAACCTGAGCAGATCG" "COL07" "CRC"
[5,] "AAACCTGAGCTATGCT" "COL07" "CRC"
[6,] "AAACCTGAGGACATTA" "COL07" "CRC"
head(str_split(a$rowLeo,"_",simplify=T) [,2]) #展示[,2]的前6行
[1] "COL07" "COL07" "COL07" "COL07" "COL07" "COL07"
str_split(a$rowLeo,"_",simplify=T) [,2] -> [email protected]$Sample #反向赋值
str_split(a$rowLeo,"_",simplify=T) [,3] -> [email protected]$Tissue #反向赋值
head([email protected])
orig.ident nCount_RNA nFeature_RNA
AAACCTGAGAAACCTA_COL07_CRC SeuratProject 5565 1241
AAACCTGAGACAATAC_COL07_CRC SeuratProject 5892 1845
AAACCTGAGACGCAAC_COL07_CRC SeuratProject 3372 933
AAACCTGAGCAGATCG_COL07_CRC SeuratProject 945 548
AAACCTGAGCTATGCT_COL07_CRC SeuratProject 1565 571
AAACCTGAGGACATTA_COL07_CRC SeuratProject 1865 816
rowLeo Sample Tissue
AAACCTGAGAAACCTA_COL07_CRC AAACCTGAGAAACCTA_COL07_CRC COL07 CRC
AAACCTGAGACAATAC_COL07_CRC AAACCTGAGACAATAC_COL07_CRC COL07 CRC
AAACCTGAGACGCAAC_COL07_CRC AAACCTGAGACGCAAC_COL07_CRC COL07 CRC
AAACCTGAGCAGATCG_COL07_CRC AAACCTGAGCAGATCG_COL07_CRC COL07 CRC
AAACCTGAGCTATGCT_COL07_CRC AAACCTGAGCTATGCT_COL07_CRC COL07 CRC
AAACCTGAGGACATTA_COL07_CRC AAACCTGAGGACATTA_COL07_CRC COL07 CRC
tail([email protected])
orig.ident nCount_RNA nFeature_RNA
TTTGTCAGTTCCACGG_COL18_PBMC SeuratProject 5578 1346
TTTGTCAGTTGGGACA_COL18_PBMC SeuratProject 3485 1192
TTTGTCATCATGTGGT_COL18_PBMC SeuratProject 4894 1445
TTTGTCATCCGCATCT_COL18_PBMC SeuratProject 4051 1111
TTTGTCATCGGAATCT_COL18_PBMC SeuratProject 2617 1018
TTTGTCATCGGTTAAC_COL18_PBMC SeuratProject 2976 1235
rowLeo Sample Tissue
TTTGTCAGTTCCACGG_COL18_PBMC TTTGTCAGTTCCACGG_COL18_PBMC COL18 PBMC
TTTGTCAGTTGGGACA_COL18_PBMC TTTGTCAGTTGGGACA_COL18_PBMC COL18 PBMC
TTTGTCATCATGTGGT_COL18_PBMC TTTGTCATCATGTGGT_COL18_PBMC COL18 PBMC
TTTGTCATCCGCATCT_COL18_PBMC TTTGTCATCCGCATCT_COL18_PBMC COL18 PBMC
TTTGTCATCGGAATCT_COL18_PBMC TTTGTCATCGGAATCT_COL18_PBMC COL18 PBMC
TTTGTCATCGGTTAAC_COL18_PBMC TTTGTCATCGGTTAAC_COL18_PBMC COL18 PBMC
5、按样本(Sample)提取
Leo <- SplitObject(a,split.by='Sample')
Leo
$COL07
An object of class Seurat
33694 features across 33100 samples within 1 assay
Active assay: RNA (33694 features, 0 variable features)
$COL12
An object of class Seurat
33694 features across 26310 samples within 1 assay
Active assay: RNA (33694 features, 0 variable features)
$COL15
An object of class Seurat
33694 features across 13950 samples within 1 assay
Active assay: RNA (33694 features, 0 variable features)
$COL16
An object of class Seurat
33694 features across 14200 samples within 1 assay
Active assay: RNA (33694 features, 0 variable features)
$COL17
An object of class Seurat
33694 features across 25735 samples within 1 assay
Active assay: RNA (33694 features, 0 variable features)
$COL18
An object of class Seurat
33694 features across 26986 samples within 1 assay
Active assay: RNA (33694 features, 0 variable features)
6、按组织(Tissue)提取
Xuan <- SplitObject(a,split.by='Tissue')
Xuan
$CRC
An object of class Seurat
33694 features across 55735 samples within 1 assay
Active assay: RNA (33694 features, 0 variable features)
$LM
An object of class Seurat
33694 features across 57596 samples within 1 assay
Active assay: RNA (33694 features, 0 variable features)
$PBMC
An object of class Seurat
33694 features across 26950 samples within 1 assay
Active assay: RNA (33694 features, 0 variable features)
6、因为只要属于CRC的6个样本,展示下文本结果便于理解
head([email protected])
orig.ident nCount_RNA nFeature_RNA
AAACCTGAGAAACCTA_COL07_CRC SeuratProject 5565 1241
AAACCTGAGACAATAC_COL07_CRC SeuratProject 5892 1845
AAACCTGAGACGCAAC_COL07_CRC SeuratProject 3372 933
AAACCTGAGCAGATCG_COL07_CRC SeuratProject 945 548
AAACCTGAGCTATGCT_COL07_CRC SeuratProject 1565 571
AAACCTGAGGACATTA_COL07_CRC SeuratProject 1865 816
rowLeo Sample Tissue
AAACCTGAGAAACCTA_COL07_CRC AAACCTGAGAAACCTA_COL07_CRC COL07 CRC
AAACCTGAGACAATAC_COL07_CRC AAACCTGAGACAATAC_COL07_CRC COL07 CRC
AAACCTGAGACGCAAC_COL07_CRC AAACCTGAGACGCAAC_COL07_CRC COL07 CRC
AAACCTGAGCAGATCG_COL07_CRC AAACCTGAGCAGATCG_COL07_CRC COL07 CRC
AAACCTGAGCTATGCT_COL07_CRC AAACCTGAGCTATGCT_COL07_CRC COL07 CRC
AAACCTGAGGACATTA_COL07_CRC AAACCTGAGGACATTA_COL07_CRC COL07 CRC
head([email protected]$Sample,100000) #查看CRC里面的Sample
[55693] "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18"
[55702] "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18"
[55711] "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18"
[55720] "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18"
[55729] "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18"
table([email protected]$Sample)
COL07 COL12 COL15 COL16 COL17 COL18
17000 10750 4250 6000 9035 8700
head([email protected]$Sample,17001)
[16966] "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07"
[16975] "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07"
[16984] "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07"
[16993] "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL12"
7、第二次拆分
Xuan2 <- SplitObject(Xuan$CRC,split.by='Sample')
Xuan2
$COL07
An object of class Seurat
33694 features across 17000 samples within 1 assay
Active assay: RNA (33694 features, 0 variable features)
$COL12
An object of class Seurat
33694 features across 10750 samples within 1 assay
Active assay: RNA (33694 features, 0 variable features)
$COL15
An object of class Seurat
33694 features across 4250 samples within 1 assay
Active assay: RNA (33694 features, 0 variable features)
$COL16
An object of class Seurat
33694 features across 6000 samples within 1 assay
Active assay: RNA (33694 features, 0 variable features)
$COL17
An object of class Seurat
33694 features across 9035 samples within 1 assay
Active assay: RNA (33694 features, 0 variable features)
$COL18
An object of class Seurat
33694 features across 8700 samples within 1 assay
Active assay: RNA (33694 features, 0 variable features)
统计数据
至此拆分结束,接下来就到了统计线粒体基因分布,中值基因,中值UMI这些指标的时候了
Xuan2$COL12 -> COL12 #反向赋值
median(COL12 @meta.data$nFeature_RNA)
[1] 804
median([email protected]$nCount_RNA)
[1] 2180
COL12[["percent.mt"]] <- PercentageFeatureSet( COL12, pattern = "^MT-")
median([email protected]$percent.mt)
[1] 3.362562
MT5 <- subset(COL12,percent.mt < 5)
MT5
An object of class Seurat
33694 features across 8608 samples within 1 assay
Active assay: RNA (33694 features, 0 variable features)
OK,大工告成,你学废了吗?