文件格式
可以是最基本的甲基化调用文件也可以是bismark下游的BAM/SAM文件。
在这里使用的是经过处理得到的文本文件。
典型的文件应该是这样的
## chrBase chr base strand coverage freqC freqT
## 1 chr21.9764539 chr21 9764539 R 12 25.00 75.00
## 2 chr21.9764513 chr21 9764513 R 12 0.00 100.00
## 3 chr21.9820622 chr21 9820622 F 13 0.00 100.00
## 4 chr21.9837545 chr21 9837545 F 11 0.00 100.00
## 5 chr21.9849022 chr21 9849022 F 124 72.58 27.42
导入文件
library(methylKit)
file.list <- list(
"brain_1.txt",
"brain_2.txt",
"adrenal_1.txt",
"adrenal_2.txt")
myobj <- methRead(
file.list,
sample.id=list(
"brain_1",
"brain_2",
"adrenal_1",
"adrenal_2"
),
assembly="hg19",
sep = " ",
treatment=c(1,1,0,0),
context="CpG",
mincov = 10
)
1.样本描述性统计
1.1甲基化百分比的分布直方图
一般呈现两边高中间低的特点。要么高甲基化,要么低甲基化。
1.2甲基化覆盖率的直方图
如果数据受到了PCR扩增的影响,那么右边还会出现一个峰。
1.3根据覆盖率对位点进行筛选,去除过高或者过低的位点
2比较分析
2.1对样本进行merge
meth=unite(tile, destrand=FALSE)
head(tile)
得到在所有样本中都有cover的甲基化位点or甲基化区域(这里是甲基化区域)。
methylBase object with 6 rows
--------------
chr start end strand coverage1 numCs1 numTs1 coverage2 numCs2 numTs2 coverage3 numCs3 numTs3 coverage4 numCs4
1 chr1 714501 715000 * 218 3 215 204 6 198 176 5 171 168 3
2 chr1 799001 799500 * 84 78 6 245 217 28 98 68 30 144 106
3 chr1 805001 805500 * 573 26 547 1005 22 983 823 16 807 1219 28
4 chr1 809001 809500 * 76 76 0 56 54 2 46 40 6 77 68
5 chr1 839001 839500 * 132 34 98 161 52 109 157 103 54 254 170
6 chr1 839501 840000 * 281 43 238 374 56 318 327 39 288 508 73
numTs4
1 165
2 38
3 1191
4 9
5 84
6 435
--------------
sample.ids: brain_1 brain_2 adrenal_1 adrenal_2
destranded FALSE
assembly: hg19
context: CpG
treament: 1 1 0 0
resolution: region
2.2样本相关性分析
得到相关系数图
2.3样本聚类信息
2.4 PCA分析(是什么?)
We can also do a PCA analysis on our samples. The following function will plot a scree plot for importance of components.
We can also plot PC1 and PC2 axis and a scatter plot of our samples on those axis which will reveal how they cluster.
2.5找到差异甲基化区域
The calculateDiffMeth() function is the main function to calculate differential methylation. Depending on the sample size per each set it will either use Fisher’s exact or logistic regression to calculate P-values. P-values will be adjusted to Q-values using SLIM method (Wang, Tuominen, and Tsai 2011). If you have replicates, the function will automatically use logistic regression.
calculateDiffMeth()函数是计算差异甲基化的主要函数。根据每个组的样本多少,它将使用Fisher精确检验或逻辑回归检验来计算p值。使用SLIM方法将p值调整为q值(Wang, Tuominen, and Tsai 2011)。如果有重复样本,函数将自动使用逻辑回归检验。
这个原理我也不是很懂。
myDiff=calculateDiffMeth(meth)
# get hyper methylated bases
myDiff25p.hyper=getMethylDiff(myDiff,difference=25,qvalue=0.01,type="hyper")
# get hypo methylated bases
myDiff25p.hypo=getMethylDiff(myDiff,difference=25,qvalue=0.01,type="hypo")
# get all differentially methylated bases
myDiff25p=getMethylDiff(myDiff,difference=25,qvalue=0.01)
可视化每条染色体的低甲基化/高甲基化碱基/区域的分布:
3.注释差异甲基化位点或者区域
得到差异甲基化的基因组分布
Summary of target set annotation with genic parts
Rows in target set: 12066
-----------------------
percentage of target features overlapping with annotation:
promoter exon intron intergenic
14.63 26.44 60.27 31.86
percentage of target features overlapping with annotation:
(with promoter > exon > intron precedence):
promoter exon intron intergenic
14.63 16.91 36.61 31.86
percentage of annotation boundaries with feature overlap:
promoter exon intron
3.91 1.36 3.16
summary of distances to the nearest TSS:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 2890 11562 36476 36108 1718029
得到差异甲基化位点的CpG岛分布:
summary of target set annotation with feature annotation:
Rows in target set: 12066
----------------------------
percentage of target elements overlapping with features:
CpGi shores other
66.34 7.09 29.15
percentage of feature elements overlapping with target:
CpGi shores
15.25 1.96
得到距离TSS最近的基因名称
target.row dist.to.feature feature.name feature.strand
57 1 -7315 uc001abu.1 +
92 2 3201 uc031pkn.1 +
64 3 -2155 uc031pko.1 +
64.1 4 -1655 uc031pko.1 +
106 5 1126 uc001ace.3 +
106.1 6 1626 uc001ace.3 +
读取CpG岛(CpGi)和CpG海岸注释时报错了,不过好像没有什么关系,帮助文档里也这么用。
a GenomicRangesList contatining one GRanges object for flanks and one for GRanges object for the main feature.
NOTE: This can not return a GRangesList at the moment because flanking regions do not have to have the same column name as the feature. GRangesList elements should resemble each other in the column content. We can not satisfy that criteria for the flanks
与内含子/外显子/启动子/CpG岛等重叠的差异甲基化区域的百分比/数目
读入甲基化调用文件之后会返回一个methylRawList对象。
但是当样本很多而且样本的规格很大的时候,可以返回不一样的对象
methylRawListDB, methylRawDB, methylBaseDB and methylDiffDB统称为methylDB objects。
Sometimes, when dealing with multiple samples and increased sample sizes coming from genome wide bisulfite sequencing experiments, the memory of your computer might not be sufficient enough.
Therefore methylKit offers a new group of classes, that are basically pendants to the original methylKit classes with one important difference: The methylation information, which normally is internally stored as data.frame, is stored in an external bgzipped file and is indexed by tabix (Li 2011), to enable fast retrieval of records or regions. This group contains methylRawListDB, methylRawDB, methylBaseDB and methylDiffDB, let us call them methylDB objects.
We can now create a methylRawListDB object, which stores the same content as myobj from above. But the single methylRaw objects retrieve their data from the tabix-file linked under dbpath.