先了解背景知识:http://www.bio-info-trainee.com/3263.html 并且下载数据
wget http://genemed.uchicago.edu/~pgeeleher/cgpPrediction/paper.zip
## (161M)
unzip paper.zip
首先了解数据并且读取到R里面查看
代码是:
setwd("pcaAnalysis/")
gdsc_mutDat <- read.csv("../Data/GdscPdata/gdsc_manova_input_w2.csv", as.is=TRUE)
gdsc_mutDat[1:6,1:6]
rownames(gdsc_mutDat) <- gdsc_mutDat[,1]
load(file="../Data/GdscProcData/gdsc_brainarray_syms.RData")
dim(gdsc_brainarray_syms)
gdsc_brainarray_syms[1:6,1:3]
pData <- read.delim("../Data/GdscPdata/E-MTAB-783.sdrf.txt", as.is=TRUE)
pData[1:6,1:4]
记录着 715 个细胞系的一些基因的点突变及拷贝数变异信息,以及若干表型信息,还有大量的药物的IC50值,简单查看如下:
> gdsc_mutDat[1:6,1:6]
Cell.Line Cosmic_ID Tissue Cancer.Type MS.HL AKT2
1 NA NA
2 MC-CAR 683665 Myeloma blood 0 na::0
药物IC50值查看如下:
colsIc50 <- which(regexpr("_IC_50$", colnames(gdsc_mutDat)) > 0)
allIc50 <- gdsc_mutDat[-1, colsIc50]
Ic50FullSet <- data.matrix(allIc50[apply(allIc50, 1, function(r)return(sum(r == ""))) == 0, ])
dim(Ic50FullSet)
Ic50FullSet[1:4,1:4]
只有117个细胞系有着全部的 138个药物处理信息:
> Ic50FullSet[1:4,1:4]
Erlotinib_IC_50 Rapamycin_IC_50 Sunitinib_IC_50 PHA.665752_IC_50
ES3 5.4575 2.712700 3.6244 5.7420
ES5 6.1779 2.713300 3.2082 2.6719
ES7 5.3345 2.920100 5.1787 4.0530
EW-11 3.9545 0.083814 3.5402 5.1912
上面的不是芯片表达矩阵,而是各个药物在各个细胞系的IC50值矩阵。
还有789个细胞系的表型记录(18项)信息:
> pData[1:6,1:4]
Source.Name Material.Type Characteristics.Organism. Characteristics.CellLine.
1 380 cell Homo sapiens 380
2 697 cell Homo sapiens 697
3 5637 cell Homo sapiens 5637
4 22RV1 cell Homo sapiens 22RV1
5 23132-87 cell Homo sapiens 23132-87
6 639-V cell Homo sapiens 639-V
colnames(pData)
[1] "Source.Name" "Material.Type" "Characteristics.Organism."
[4] "Characteristics.CellLine." "Characteristics.DiseaseState." "Protocol.REF"
[7] "Protocol.REF.1" "Extract.Name" "Protocol.REF.2"
[10] "Labeled.Extract.Name" "Label" "Protocol.REF.3"
[13] "Term.Source.REF" "Hybridization.Name" "Array.Design.REF"
[16] "Array.Data.File" "Comment..ArrayExpress.FTP.file." "Factor.Value.CELL_LINE."
以及789个细胞系的芯片表达矩阵:
> dim(gdsc_brainarray_syms)
[1] 12092 789
> gdsc_brainarray_syms[1:6,1:3]
5500024030401071707289.A01.CEL 5500024030401071707289.A02.CEL 5500024030401071707289.A03.CEL
NAT2 4.012023 4.144823 4.445793
ADA 6.153445 7.890772 7.160563
CDH2 5.270275 7.594583 5.327693
AKT3 5.766602 5.632153 4.687145
MED6 7.750406 7.395117 7.311555
NR2E3 4.692599 4.326632 4.214958
>
直接根据全表达矩阵做主成分分析
总共的666个细胞系
pcsAll <- prcomp(t(trainDataOrd))$x
colSup1 <- unclass(factor(cancerTypesOrd))
pchSup1 <- unclass(factor(cancerTypesOrd))
#pdf("suppFig1.pdf", width=12, height=10)
par(mar=c(5.1, 4.1, 4.1, 5.7), xpd=TRUE)
plot(pcsAll, pch=pchSup1, col=colSup1, xlab="Principle component 1", ylab="Principle component 2")
legend("topleft", sapply(as.character(levels(factor(cancerTypesOrd))), makeCapital), col=seq(1:16), pch=seq(1:16), cex=.6, inset=c(1,0))
#dev.off()
抽取血液相关细胞系做主成分分析
总共是88个细胞系,列表如下:
haemTissue Freq
1 AML 15
2 B cell leukemia 7
3 B cell lymphoma 9
4 Burkitt lymphoma 9
5 CML 5
6 Hodgkin lymphoma 4
7 Myeloma 7
8 haematopoietic_neoplasm other 1
9 hairy_cell_leukaemia 1
10 leukemia 2
11 lymphoblastic T cell leukaemia 9
12 lymphoblastic leukemia 10
13 lymphoid_neoplasm other 9
绘图代码是:
pcsHaem <- prcomp(t(trainDataOrd[, which(cancerTypesOrd == "blood")]))$x
haemIndex <- which(cancerTypesOrd == "blood")
haemTissue <- gdsc_mutDat[commonCellLines[haemIndex], "Tissue"]
mycol <- unclass(factor(haemTissue))
mypch <- unclass(factor(haemTissue))
#pdf("suppFig2.pdf", width=12, height=10)
par(mar=c(5.1, 4.1, 4.1, 7.4), xpd=TRUE)
plot(pcsHaem[,c(1,2)], col=mycol, pch=mypch, xlab="Principle component 1", ylab="Principle component 2")
legend("topleft", sapply(as.character(levels(factor(haemTissue))), makeCapital), col=seq(1:13), pch=seq(1:13), cex=.5, inset=c(1,0))
#dev.off()
还可以根据mut信息及CNV信息来抽取样本进行PCA分析并且着色看看该分类是否与主成分分析结果对应。