生信小白开始接活——文本处理(二)

上篇处理了最简单的单细胞公共数据,即1个样本对应1个矩阵文件(tsv),读取统计就行了,但是GEO公共数据库中还存在如下的情况:例如GSE178318,15个样本的信息全部混在一块了,针对这种情况,我们怎么统计指定样本线粒体基因分布,中值基因,中值UMI这些指标,逻辑就是拆分样本,如何拆分,下面会娓娓道来~~~

1.png

打开GEO界面,红框处6个样本的数据需要统计:
2.png

15个样本的数据见下矩阵:
3.png

拆分样本

1、下载3个矩阵文件

wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE178nnn/GSE178318/suppl/GSE178318_barcodes.tsv.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE178nnn/GSE178318/suppl/GSE178318_genes.tsv.gz
wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE178nnn/GSE178318/suppl/GSE178318_matrix.mtx.gz
barcodes.tsv.gz   features.tsv.gz  matrix.mtx.gz #将下载下来的3个文件重新命名,命令为mv 原名 新名,方便直接用Read10X函数读取

2、在 Seurat 中,对于 Cellranger 数据的导入,它提供了两个函数 Read10X 和 Read10X_h5。前者用于读取 Cellranger 生成的 3 个文件:barcodes.tsv,features.tsv,matrix.mtx,后者则用于读取生成的 xxxxxx.h5文件,这里我们应用前面一个函数就行

R #linux系统调用R
library(Seurat) #加载seurat软件
a <- Read10X("~/SingleronTest/wuxuan/leo") #读取3个文件
a <- CreateSeuratObject(a) #创建Seurat对象
head([email protected]) #查看读取的数据前6行,结果如下展示
 orig.ident nCount_RNA nFeature_RNA
AAACCTGAGAAACCTA_COL07_CRC SeuratProject       5565         1241
AAACCTGAGACAATAC_COL07_CRC SeuratProject       5892         1845
AAACCTGAGACGCAAC_COL07_CRC SeuratProject       3372          933
AAACCTGAGCAGATCG_COL07_CRC SeuratProject        945          548
AAACCTGAGCTATGCT_COL07_CRC SeuratProject       1565          571
AAACCTGAGGACATTA_COL07_CRC SeuratProject       1865          816
head([email protected], 8) #查看前8行的命令

3、正式拆分,使用tidyverse函数

library(tidyverse) #加载tidyverse

4、拆分脚本

rownames([email protected]) -> [email protected]$rowLeo #反向赋值,增加了1列
head([email protected]) #展示如下
                              orig.ident nCount_RNA nFeature_RNA
AAACCTGAGAAACCTA_COL07_CRC SeuratProject       5565         1241
AAACCTGAGACAATAC_COL07_CRC SeuratProject       5892         1845
AAACCTGAGACGCAAC_COL07_CRC SeuratProject       3372          933
AAACCTGAGCAGATCG_COL07_CRC SeuratProject        945          548
AAACCTGAGCTATGCT_COL07_CRC SeuratProject       1565          571
AAACCTGAGGACATTA_COL07_CRC SeuratProject       1865          816
                                               rowLeo
AAACCTGAGAAACCTA_COL07_CRC AAACCTGAGAAACCTA_COL07_CRC
AAACCTGAGACAATAC_COL07_CRC AAACCTGAGACAATAC_COL07_CRC
AAACCTGAGACGCAAC_COL07_CRC AAACCTGAGACGCAAC_COL07_CRC
AAACCTGAGCAGATCG_COL07_CRC AAACCTGAGCAGATCG_COL07_CRC
AAACCTGAGCTATGCT_COL07_CRC AAACCTGAGCTATGCT_COL07_CRC
AAACCTGAGGACATTA_COL07_CRC AAACCTGAGGACATTA_COL07_CRC
str_split(a$rowLeo,"_")  #拆的结果如下展示
[[99997]]
[1] "CCGTACTAGGGTATCG" "COL17"            "LM"              

[[99998]]
[1] "CCGTACTCAAGCCGCT" "COL17"            "LM"              

[[99999]]
[1] "CCGTACTCAAGGGTCA" "COL17"            "LM"   

unlist(str_split(a$rowLeo,"_")) #拆的结果如下展示 
[99988] "AACTCAGCAAGGTTTC" "COL12"            "CRC"             
[99991] "AACTCAGCACTCAGGC" "COL12"            "CRC"             
[99994] "AACTCAGGTATAGTAG" "COL12"            "CRC"             
[99997] "AACTCAGGTCATATCG" "COL12"            "CRC"             
 [ reached getOption("max.print") -- omitted 320844 entries ]
head(str_split(a$rowLeo,"_",simplify=T)) #展示前6行
     [,1]               [,2]    [,3] 
[1,] "AAACCTGAGAAACCTA" "COL07" "CRC"
[2,] "AAACCTGAGACAATAC" "COL07" "CRC"
[3,] "AAACCTGAGACGCAAC" "COL07" "CRC"
[4,] "AAACCTGAGCAGATCG" "COL07" "CRC"
[5,] "AAACCTGAGCTATGCT" "COL07" "CRC"
[6,] "AAACCTGAGGACATTA" "COL07" "CRC"
head(str_split(a$rowLeo,"_",simplify=T) [,2]) #展示[,2]的前6行
[1] "COL07" "COL07" "COL07" "COL07" "COL07" "COL07"
str_split(a$rowLeo,"_",simplify=T) [,2] -> [email protected]$Sample #反向赋值
str_split(a$rowLeo,"_",simplify=T) [,3] -> [email protected]$Tissue #反向赋值
head([email protected])
                              orig.ident nCount_RNA nFeature_RNA
AAACCTGAGAAACCTA_COL07_CRC SeuratProject       5565         1241
AAACCTGAGACAATAC_COL07_CRC SeuratProject       5892         1845
AAACCTGAGACGCAAC_COL07_CRC SeuratProject       3372          933
AAACCTGAGCAGATCG_COL07_CRC SeuratProject        945          548
AAACCTGAGCTATGCT_COL07_CRC SeuratProject       1565          571
AAACCTGAGGACATTA_COL07_CRC SeuratProject       1865          816
                                               rowLeo Sample Tissue
AAACCTGAGAAACCTA_COL07_CRC AAACCTGAGAAACCTA_COL07_CRC  COL07    CRC
AAACCTGAGACAATAC_COL07_CRC AAACCTGAGACAATAC_COL07_CRC  COL07    CRC
AAACCTGAGACGCAAC_COL07_CRC AAACCTGAGACGCAAC_COL07_CRC  COL07    CRC
AAACCTGAGCAGATCG_COL07_CRC AAACCTGAGCAGATCG_COL07_CRC  COL07    CRC
AAACCTGAGCTATGCT_COL07_CRC AAACCTGAGCTATGCT_COL07_CRC  COL07    CRC
AAACCTGAGGACATTA_COL07_CRC AAACCTGAGGACATTA_COL07_CRC  COL07    CRC
tail([email protected]) 
                               orig.ident nCount_RNA nFeature_RNA
TTTGTCAGTTCCACGG_COL18_PBMC SeuratProject       5578         1346
TTTGTCAGTTGGGACA_COL18_PBMC SeuratProject       3485         1192
TTTGTCATCATGTGGT_COL18_PBMC SeuratProject       4894         1445
TTTGTCATCCGCATCT_COL18_PBMC SeuratProject       4051         1111
TTTGTCATCGGAATCT_COL18_PBMC SeuratProject       2617         1018
TTTGTCATCGGTTAAC_COL18_PBMC SeuratProject       2976         1235
                                                 rowLeo Sample Tissue
TTTGTCAGTTCCACGG_COL18_PBMC TTTGTCAGTTCCACGG_COL18_PBMC  COL18   PBMC
TTTGTCAGTTGGGACA_COL18_PBMC TTTGTCAGTTGGGACA_COL18_PBMC  COL18   PBMC
TTTGTCATCATGTGGT_COL18_PBMC TTTGTCATCATGTGGT_COL18_PBMC  COL18   PBMC
TTTGTCATCCGCATCT_COL18_PBMC TTTGTCATCCGCATCT_COL18_PBMC  COL18   PBMC
TTTGTCATCGGAATCT_COL18_PBMC TTTGTCATCGGAATCT_COL18_PBMC  COL18   PBMC
TTTGTCATCGGTTAAC_COL18_PBMC TTTGTCATCGGTTAAC_COL18_PBMC  COL18   PBMC

5、按样本(Sample)提取

Leo <- SplitObject(a,split.by='Sample') 
Leo
$COL07
An object of class Seurat 
33694 features across 33100 samples within 1 assay 
Active assay: RNA (33694 features, 0 variable features)

$COL12
An object of class Seurat 
33694 features across 26310 samples within 1 assay 
Active assay: RNA (33694 features, 0 variable features)

$COL15
An object of class Seurat 
33694 features across 13950 samples within 1 assay 
Active assay: RNA (33694 features, 0 variable features)

$COL16
An object of class Seurat 
33694 features across 14200 samples within 1 assay 
Active assay: RNA (33694 features, 0 variable features)

$COL17
An object of class Seurat 
33694 features across 25735 samples within 1 assay 
Active assay: RNA (33694 features, 0 variable features)

$COL18
An object of class Seurat 
33694 features across 26986 samples within 1 assay 
Active assay: RNA (33694 features, 0 variable features)

6、按组织(Tissue)提取

Xuan <- SplitObject(a,split.by='Tissue') 
Xuan
$CRC
An object of class Seurat 
33694 features across 55735 samples within 1 assay 
Active assay: RNA (33694 features, 0 variable features)

$LM
An object of class Seurat 
33694 features across 57596 samples within 1 assay 
Active assay: RNA (33694 features, 0 variable features)

$PBMC
An object of class Seurat 
33694 features across 26950 samples within 1 assay 
Active assay: RNA (33694 features, 0 variable features)

6、因为只要属于CRC的6个样本,展示下文本结果便于理解

head([email protected])
                              orig.ident nCount_RNA nFeature_RNA
AAACCTGAGAAACCTA_COL07_CRC SeuratProject       5565         1241
AAACCTGAGACAATAC_COL07_CRC SeuratProject       5892         1845
AAACCTGAGACGCAAC_COL07_CRC SeuratProject       3372          933
AAACCTGAGCAGATCG_COL07_CRC SeuratProject        945          548
AAACCTGAGCTATGCT_COL07_CRC SeuratProject       1565          571
AAACCTGAGGACATTA_COL07_CRC SeuratProject       1865          816
                                               rowLeo Sample Tissue
AAACCTGAGAAACCTA_COL07_CRC AAACCTGAGAAACCTA_COL07_CRC  COL07    CRC
AAACCTGAGACAATAC_COL07_CRC AAACCTGAGACAATAC_COL07_CRC  COL07    CRC
AAACCTGAGACGCAAC_COL07_CRC AAACCTGAGACGCAAC_COL07_CRC  COL07    CRC
AAACCTGAGCAGATCG_COL07_CRC AAACCTGAGCAGATCG_COL07_CRC  COL07    CRC
AAACCTGAGCTATGCT_COL07_CRC AAACCTGAGCTATGCT_COL07_CRC  COL07    CRC
AAACCTGAGGACATTA_COL07_CRC AAACCTGAGGACATTA_COL07_CRC  COL07    CRC
head([email protected]$Sample,100000)  #查看CRC里面的Sample
[55693] "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18"
[55702] "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18"
[55711] "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18"
[55720] "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18"
[55729] "COL18" "COL18" "COL18" "COL18" "COL18" "COL18" "COL18"
table([email protected]$Sample)
COL07 COL12 COL15 COL16 COL17 COL18 
17000 10750  4250  6000  9035  8700 
head([email protected]$Sample,17001)
[16966] "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07"
[16975] "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07"
[16984] "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07"
[16993] "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL07" "COL12"

7、第二次拆分

Xuan2 <- SplitObject(Xuan$CRC,split.by='Sample') 
Xuan2
$COL07
An object of class Seurat 
33694 features across 17000 samples within 1 assay 
Active assay: RNA (33694 features, 0 variable features)

$COL12
An object of class Seurat 
33694 features across 10750 samples within 1 assay 
Active assay: RNA (33694 features, 0 variable features)

$COL15
An object of class Seurat 
33694 features across 4250 samples within 1 assay 
Active assay: RNA (33694 features, 0 variable features)

$COL16
An object of class Seurat 
33694 features across 6000 samples within 1 assay 
Active assay: RNA (33694 features, 0 variable features)

$COL17
An object of class Seurat 
33694 features across 9035 samples within 1 assay 
Active assay: RNA (33694 features, 0 variable features)

$COL18
An object of class Seurat 
33694 features across 8700 samples within 1 assay 
Active assay: RNA (33694 features, 0 variable features)

统计数据

至此拆分结束,接下来就到了统计线粒体基因分布,中值基因,中值UMI这些指标的时候了

Xuan2$COL12 -> COL12 #反向赋值
median(COL12 @meta.data$nFeature_RNA)
[1] 804
median([email protected]$nCount_RNA)
[1] 2180
COL12[["percent.mt"]] <- PercentageFeatureSet( COL12, pattern = "^MT-")
median([email protected]$percent.mt)
[1] 3.362562
MT5 <- subset(COL12,percent.mt < 5)
MT5
An object of class Seurat 
33694 features across 8608 samples within 1 assay 
Active assay: RNA (33694 features, 0 variable features)

OK,大工告成,你学废了吗?

你可能感兴趣的:(生信小白开始接活——文本处理(二))