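KEGG is a database resource for understanding the high-level functions and utilities of biological systems (such as cells, organisms, and ecosystems) from molecular-level information, especially the large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies. It was established in 1995 by the Kanehisa Laboratory at the Bioinformatics Center of Kyoto University, Japan. It is one of the most widely used bioinformatics databases in the world, known as a resource for "understanding the high-level functions and utilities of biological systems".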
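Exercise: how do you get the names of all genes in the KEGG pathway hsa04650, Natural killer cell mediated cytotoxicity? (The prefix hsa stands for Homo sapiens.)
There are two approaches: the first is to search on Google and read the answer off the web page; the second is to use R packages and code.
Approach 1: browsing the web page
1. Search Google for: hsa04650
2. Open this URL ( https://www.genome.jp/dbget-bin/www_bget?hsa04650)
3. Scroll down to the Gene entry, where the answer is listed.
Approach 2: using R packages and code
Idea: the web page shows that the goal is to turn the Gene entry into a matrix and extract its second column, which holds the gene symbols.
Reference article: http://www.bio-info-trainee.com/3533.html
The article contains this code: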
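library(clusterProfiler) # load this package; what is it for?
# https://www.kegg.jp/dbget-bin/www_bget?pathway+hsa05169
# library(KEGG.db) library(KEGGREST) # what are these two packages for?
kg=download_KEGG('hsa') # the data is extracted directly; the article does not say which command produces the answer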
head(kg[[1]])
head(kg[[2]])
ps=c('hsa04660','hsa04659',
'hsa04658','hsa04657','hsa04662',
'hsa04650')
- clusterProfiler: This package implements methods to analyze and visualize functional profiles (GO and KEGG) of gene and gene clusters.
- KEGGREST: A package that provides a client interface to the KEGG REST server.
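For orientation, the list returned by download_KEGG('hsa') can be filtered by pathway ID. The sketch below uses a small hand-made stand-in (kg_mock, a hypothetical object) so that it runs without network access. The element and column names (KEGGPATHID2EXTID, KEGGPATHID2NAME, from, to) are assumptions about the shape of the real return value and should be checked against head(kg[[1]]) and head(kg[[2]]).

```r
# Hypothetical stand-in for kg <- download_KEGG('hsa'); the rows are illustrative only.
# Assumed layout: element 1 maps pathway ID -> gene ID, element 2 maps pathway ID -> name.
kg_mock <- list(
  KEGGPATHID2EXTID = data.frame(
    from = c("hsa04650", "hsa04650", "hsa04660"),
    to   = c("3105", "3106", "915"),
    stringsAsFactors = FALSE
  ),
  KEGGPATHID2NAME = data.frame(
    from = c("hsa04650", "hsa04660"),
    to   = c("Natural killer cell mediated cytotoxicity",
             "T cell receptor signaling pathway"),
    stringsAsFactors = FALSE
  )
)

# Keep only the rows that belong to the pathway of interest.
pathway_to_gene <- kg_mock[[1]]
ids <- pathway_to_gene$to[pathway_to_gene$from == "hsa04650"]
print(ids)
```

Under these assumptions the result holds gene IDs, not gene symbols, so a further ID-to-symbol mapping step would still be needed. That is one reason the rest of this post queries the pathway entry through KEGGREST instead.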
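With the direction settled, install the package first.
The usual three steps for installing a package from Bioconductor (the transcript below was run with Bioconductor 3.7 on R 3.5.2):
1. source("http://bioconductor.org/biocLite.R")
Installs BiocInstaller.
2. options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")
Switches to a mirror.
3. BiocInstaller::biocLite('KEGGREST')
Installs the package from Bioconductor (KEGGREST is a Bioconductor package).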
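Bioconductor releases from 3.8 onward replaced biocLite with the BiocManager package. A minimal sketch of the equivalent installation follows. The wrapper name install_keggrest and its mirror argument are hypothetical, introduced only so that nothing is downloaded until the function is called.

```r
# Sketch only: install KEGGREST through BiocManager instead of biocLite.
# install_keggrest() is a hypothetical helper, not part of any package.
install_keggrest <- function(mirror = NULL) {
  # Optionally switch to a Bioconductor mirror, as in step 2 above.
  if (!is.null(mirror)) {
    options(BioC_mirror = mirror)
  }
  # BiocManager itself is installed from CRAN.
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install("KEGGREST")
}

# Not run here, because it needs network access:
# install_keggrest(mirror = "http://mirrors.ustc.edu.cn/bioc/")
```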
> source("http://bioconductor.org/biocLite.R")
Bioconductor version 3.7 (BiocInstaller 1.30.0), ?biocLite for help
A newer version of Bioconductor is available for this version of R, ?BiocUpgrade for
help
> options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")
> BiocInstaller::biocLite('KEGGREST')
BioC_mirror: http://mirrors.ustc.edu.cn/bioc/
Using Bioconductor 3.7 (BiocInstaller 1.30.0), R 3.5.2 (2018-12-20).
Installing package(s) ‘KEGGREST’
also installing the dependency ‘png’
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/png_0.1-7.zip'
Content type 'application/zip' length 292639 bytes (285 KB)
downloaded 285 KB
trying URL 'http://mirrors.ustc.edu.cn/bioc//packages/3.7/bioc/bin/windows/contrib/3.5/KEGGREST_1.20.2.zip'
Content type 'application/zip' length 124626 bytes (121 KB)
downloaded 121 KB
package ‘png’ successfully unpacked and MD5 sums checked
package ‘KEGGREST’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\300S\AppData\Local\Temp\Rtmp4wKPRV\downloaded_packages
Old packages: 'gplots', 'purrr'
Update all/some/none? [a/s/n]:
a
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/gplots_3.0.1.1.zip'
Content type 'application/zip' length 657011 bytes (641 KB)
downloaded 641 KB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/purrr_0.3.0.zip'
Content type 'application/zip' length 413820 bytes (404 KB)
downloaded 404 KB
package ‘gplots’ successfully unpacked and MD5 sums checked
package ‘purrr’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\300S\AppData\Local\Temp\Rtmp4wKPRV\downloaded_packages
Learning how to use the package:
Commands:
> ?KEGGREST
No documentation for ‘KEGGREST’ in specified packages and libraries:
you could try ‘??KEGGREST’
> ??KEGGREST
Click through to the documentation to learn the basic commands:
- KEGG exposes a number of databases. To get an idea of what is available, run listDatabases(). This shows the databases that can be queried through KEGGREST.
- You can obtain the list of organisms available in KEGG with the keggList() function.
> gs<-keggGet('hsa04650')
> View(gs)
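[Screenshot: part of the KEGG web page]
The entries are the same as on the web page, but gs is clearly not a matrix yet (keggGet returns a list). The Gene entry needs to be turned into a matrix before the symbols can be extracted.
In the RStudio viewer, hovering next to an entry reveals an icon. Clicking it places a line of code in the console, and pressing Enter runs it and returns the content of that entry. The transcript below refers to the content of the Gene entry as a.
The result matches the web page.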
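To make the structure concrete without calling the KEGG server, here is a small stand-in for gs. The object gs_mock is hypothetical and heavily truncated. Its layout (a list with one element per queried entry, whose GENE field alternates gene IDs with "SYMBOL; description" strings) is an assumption inferred from the transcript further down, not a guarantee about every KEGGREST version.

```r
# Hypothetical, truncated stand-in for gs <- keggGet('hsa04650').
gs_mock <- list(
  list(
    ENTRY = "hsa04650",
    NAME  = "Natural killer cell mediated cytotoxicity",
    GENE  = c(
      "3105", "HLA-A; major histocompatibility complex, class I, A [KO:K06751]",
      "3106", "HLA-B; placeholder description"
    )
  )
)

# The code RStudio generates when you click the icon has this general form:
a <- gs_mock[[1]][["GENE"]]

# a is a plain character vector, not a matrix.
print(class(a))
print(length(a))
```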
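- strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
x is the character vector to split.
split is the separator to split on.
When fixed is TRUE, split is matched literally.
When perl is TRUE, split is treated as a Perl-compatible regular expression.
When both fixed and perl are FALSE, split is treated as a POSIX 1003.2 extended regular expression.
When useBytes is TRUE, matching is done byte by byte.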
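A short illustration of these arguments, using the one gene string that appears in the transcript below. The variable names are chosen here for illustration.

```r
x <- "HLA-A; major histocompatibility complex, class I, A [KO:K06751]"

# strsplit always returns a list, with one element per input string.
parts <- strsplit(x, ";", fixed = TRUE)
symbol <- parts[[1]][1]               # text before the first ';'
description <- trimws(parts[[1]][2])  # text after it, without the leading space

# fixed = TRUE matters when the separator is a regex metacharacter.
dotted <- "a.b.c"
literal_split <- strsplit(dotted, ".", fixed = TRUE)[[1]]  # splits on the dot
regex_split   <- strsplit(dotted, ".")[[1]]                # "." matches every character

print(symbol)
print(literal_split)
print(regex_split)
```

Because ";" has no special meaning in a regular expression, the call strsplit(x, ';') used in this post behaves the same with or without fixed = TRUE.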
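- lapply(X, FUN, ...)
lapply returns a list of the same length as X, in which each element is the result of applying FUN to the corresponding element of X. X is a list or a vector; other objects are coerced to a list by R with as.list(). unlist(x) flattens a list into a single vector containing all elements of x. Here it turns the list produced by lapply into one character vector in which gene IDs and gene symbols alternate.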
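The combination of lapply and unlist can be tried on a tiny input first. The vector a_small below is a hypothetical four-element stand-in for the Gene entry, and vapply is shown as an alternative (not used in the original post) that returns a character vector directly.

```r
# Hypothetical stand-in for the Gene entry: IDs alternate with "SYMBOL; description".
a_small <- c(
  "3105", "HLA-A; major histocompatibility complex, class I, A [KO:K06751]",
  "3106", "HLA-B; placeholder description"
)

# Keep only the text before the first ';' in each element.
first_field <- function(x) strsplit(x, ";", fixed = TRUE)[[1]][1]

as_list <- lapply(a_small, first_field)  # a list with one element per input
b_small <- unlist(as_list)               # flattened to a character vector

# vapply does both steps at once and checks that every result is one string.
b_small2 <- vapply(a_small, first_field, character(1), USE.NAMES = FALSE)

print(b_small)
```

Elements without a ';' (the gene IDs) pass through unchanged, which is why IDs and symbols still alternate in the result.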
> lapply(a,function(x) strsplit(x,';'))
[[1]]
[[1]][[1]]
[1] "3105"
[[2]]
[[2]][[1]]
[1] "HLA-A"
[2] " major histocompatibility complex, class I, A [KO:K06751]"
...
> unlist(lapply(a,function(x) strsplit(x,';')[[1]][1]))
[1] "3105" "HLA-A" "3106" "HLA-B" "3107" "HLA-C"
[7] "3135" "HLA-G" "3133" "HLA-E" "3812" "KIR3DL2"
[13] "3811" "KIR3DL1" "3803" "KIR2DL2" "3802" "KIR2DL1"
> b<- unlist(lapply(a,function(x) strsplit(x,';')[[1]][1]))
> b[1:length(b)%%2 ==0] # 1:length(b) gives the positions in b; the elements at even positions are the gene symbols
[1] "HLA-A" "HLA-B" "HLA-C" "HLA-G" "HLA-E" "KIR3DL2"
[7] "KIR3DL1" "KIR2DL2" "KIR2DL1" "KIR2DL3" "KIR2DL4" "KIR2DL5A"
[13] "KLRC1" "KLRC2" "KLRC3" "KLRD1" "PTPN6" "PTPN11"
[19] "ICAM1" "ICAM2" "ITGAL" "ITGB2" "PTK2B" "VAV3"
[25] "VAV1" "VAV2" "RAC1" "RAC2" "RAC3" "PAK1"
[31] "MAP2K1" "MAP2K2" "MAPK1" "MAPK3" "TNF" "CSF2"
[37] "IFNG" "KIR2DS1" "KIR2DS3" "KIR2DS4" "KIR2DS5" "KIR2DS2"
[43] "NCR2" "TYROBP" "LCK" "IGH" "FCGR3A" "FCGR3B"
[49] "NCR1" "NCR3" "FCER1G" "CD247" "ZAP70" "SYK"
[55] "LCP2" "LAT" "PLCG1" "PLCG2" "SH3BP2" "PIK3CA"
[61] "PIK3CD" "PIK3CB" "PIK3R1" "PIK3R2" "PIK3R3" "FYN"
[67] "SHC1" "SHC2" "SHC3" "SHC4" "GRB2" "SOS1"
[73] "SOS2" "HRAS" "KRAS" "NRAS" "ARAF" "BRAF"
[79] "RAF1" "MICB" "MICA" "ULBP1" "ULBP2" "ULBP3"
[85] "RAET1G" "RAET1L" "RAET1E" "KLRK1" "KLRC4-KLRK1" "HCST"
[91] "CD48" "CD244" "PPP3CA" "PPP3CB" "PPP3CC" "PPP3R1"
[97] "PPP3R2" "NFATC1" "NFATC2" "PRKCA" "PRKCB" "PRKCG"
[103] "SH2D1B" "SH2D1A" "IFNGR1" "IFNGR2" "IFNA1" "IFNA2"
[109] "IFNA4" "IFNA5" "IFNA6" "IFNA7" "IFNA8" "IFNA10"
[115] "IFNA13" "IFNA14" "IFNA16" "IFNA17" "IFNA21" "IFNB1"
[121] "IFNAR1" "IFNAR2" "TNFSF10" "TNFRSF10A" "TNFRSF10B" "FASLG"
[127] "FAS" "GZMB" "PRF1" "CASP3" "BID"
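The matrix idea from the start of this approach can also be written out directly. The sketch below assumes the Gene entry strictly alternates ID and "SYMBOL; description"; a_small is a hypothetical six-element stand-in and the column names id and entry are chosen here for illustration.

```r
# Hypothetical stand-in for the Gene entry.
a_small <- c(
  "3105", "HLA-A; major histocompatibility complex, class I, A [KO:K06751]",
  "3106", "HLA-B; placeholder description",
  "3107", "HLA-C; placeholder description"
)

# Fill a two-column matrix row by row: column 1 = gene ID, column 2 = symbol entry.
gene_mat <- matrix(a_small, ncol = 2, byrow = TRUE,
                   dimnames = list(NULL, c("id", "entry")))

# Take the second column and drop everything from the first ';' onward.
symbols <- sub(";.*$", "", gene_mat[, "entry"])

print(gene_mat[, "id"])
print(symbols)
```

This gives the same symbols as the even-position indexing above, and keeps each gene ID next to its symbol in case both are needed later.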