Journal:
Nucleic Acids Research (19.160/Q1)
POSTAR2: deciphering the post-transcriptional regulatory logics
POSTAR2:破译转录后调控逻辑
ABSTRACT
Post-transcriptional regulation of RNAs is critical to a diverse range of cellular processes. The volume of functional genomic data focusing on post-transcriptional regulatory logics has continued to grow in recent years. In the current database version, POSTAR2 (http://lulab.life.tsinghua.edu.cn/postar), we included the following new features and data: updated ∼500 CLIP-seq datasets (∼1200 CLIP-seq datasets in total) from six species, including human, mouse, fly, worm, Arabidopsis and yeast; added a new module ‘Translatome’, which is derived from Ribo-seq datasets and contains ∼36 million open reading frames (ORFs) in the genomes from the six species; updated and unified post-transcriptional regulation and variation data. Finally, we improved web interfaces for searching and visualizing protein–RNA interactions with multi-layer information. Meanwhile, we also merged our CLIPdb database into POSTAR2. POSTAR2 will help researchers investigate the post-transcriptional regulatory logics coordinated by RNA-binding proteins and the translational landscape of cellular RNAs.
RNAs的转录后调控对多种多样的细胞过程至关重要。近年来,专注于转录后调控逻辑的功能基因组数据量持续增长。在目前的数据库版本中,POSTAR2(http://lulab.life.tsinghua.edu.cn/postar),我们包括以下新的功能和数据。更新了来自人类、小鼠、苍蝇、蠕虫、拟南芥和酵母等6个物种的500多个CLIP-seq数据集(共1200多个CLIP-seq数据集);增加了一个新的模块 "Translatome",它来自Ribo-seq数据集,包含6个物种基因组中的3600万个开放阅读框(ORF);更新并统一了转录后调控和变异数据。最后,我们改进了网络界面,用于搜索和可视化具有多层信息的蛋白质-RNA相互作用。同时,我们还将我们的CLIPdb数据库合并到POSTAR2中。POSTAR2将帮助研究人员研究由RNA结合蛋白协调的转录后调控逻辑和细胞RNA的翻译景观。
INTRODUCTION
RNA-binding proteins (RBPs) control every aspect of post-transcriptional regulatory logics, including maturation, localization, degradation, modification, editing and translation of cellular RNAs (1–3). Several high-throughput sequencing technologies exist for determining RBP-binding sites and translational dynamics in vivo, most notably ultraviolet crosslinking followed by immunoprecipitation and sequencing (CLIP-seq) (4,5) and ribosome profiling (Ribo-seq) (6). In recent years, CLIP-seq and Ribo-seq have been widely used to decipher the post-transcriptional regulatory logics coordinated by RBPs and the translational landscape of cellular RNAs in various species.
RNA结合蛋白(RBPs)控制转录后调控逻辑的各个方面,包括细胞RNA的成熟、定位、降解、修饰、编辑和翻译(1-3)。有几种高通量测序技术可用于确定RBP结合位点和体内翻译动态,最主要的是紫外线交联后免疫沉淀和测序(CLIP-seq)(4,5)和核糖体分析(Riboseq)(6)。近年来,CLIP-seq和Ribo-seq已被广泛用于破译各种物种中由RBPs协调的转录后调控逻辑和细胞RNA的翻译景观。
CLIP-seq studies have identified RBP-binding sites in a broad set of cell and tissue types from various species (7,8). In addition, large numbers of gene expression profiles, RNA modification sites, RNA editing sites and disease-associated variants have been identified thanks to large-scale genomics studies and the development of bioinformatics algorithms. The regulatory mechanisms of RBP-binding sites that underlie diseases and phenotypes can be revealed by combining information from RBP binding, other post-transcriptional regulatory events and genomic variation. Ribo-seq is a powerful technology for measuring translation efficiency by mapping ribosome-binding positions across the transcriptome at sub-codon resolution (6,9). Previous studies have shown that translation efficiency and translational dynamics can be regulated by RBP binding (2,10,11). However, integrating these large-scale datasets to explore the coupling between post-transcriptional and translational regulation remains a great challenge.
CLIP-seq研究已经确定了来自不同物种的广泛的细胞和组织类型的RBP结合点(7,8)。此外,由于大规模基因组学研究和生物信息学算法的发展,大量的基因表达谱、RNA修饰位点、RNA编辑位点以及与疾病相关的变体已被确定。通过结合RBP结合、其他转录后调控事件和基因组变异的信息,可以发现疾病和表型背后的RBP结合点的调节机制。Ribo-seq是一项强大的技术,通过在亚密码子分辨率下绘制整个转录组的核糖体结合位置来测量翻译效率(6,9)。以前的研究表明,翻译效率和翻译动态可以由RBP结合来调节(2,10,11)。然而,整合这些大规模的数据集以探索转录后和翻译调节之间的耦合仍然是一个巨大的挑战。
Here, we developed POSTAR2 by systematically identifying RBP-binding sites derived from more CLIP-seq datasets, and by predicting open reading frames (ORFs) using larger-scale Ribo-seq datasets from six species, including human, mouse, fly, worm, Arabidopsis and yeast. POSTAR2 provides an updated interactive user interface for searching and visualizing RNA–protein interactions and ORFs from various tissue types, cell lines, developmental stages and conditions. Moreover, by integrating microRNA (miRNA)-binding sites, RNA modification sites, RNA editing sites, single nucleotide polymorphisms (SNPs), genome-wide association study (GWAS) variants and cancer somatic mutations, POSTAR2 can be used to explore the potential associations between RBP-binding sites and these data. POSTAR2 significantly improves data collection by covering more species, and could be useful for investigating the post-transcriptional regulatory logics coordinated by RBPs, as well as the translational landscape of cellular RNAs.
在这里,我们通过系统地识别来自更多CLIP-seq数据集的RBP结合点来开发POSTAR2,并使用来自六个物种(包括人类、小鼠、苍蝇、蠕虫、拟南芥和酵母)的更大规模的Ribo-seq数据集预测开放阅读框架(ORFs)。POSTAR2提供了一个更新的交互式用户界面,用于搜索和可视化来自不同组织类型、细胞系、发育阶段和条件的RNA-蛋白质相互作用和ORFs。此外,通过整合微RNA(miRNA)结合位点、RNA修饰位点、RNA编辑位点、单核苷酸多态性(SNPs)、全基因组关联研究(GWAS)变体和癌症体细胞突变,POSTAR2可用于探索RBP结合位点与这些数据之间的潜在关联。POSTAR2在更多物种的数据收集方面取得了重大改进,可用于调查由RBPs协调的转录后调控逻辑,以及细胞RNA的翻译景观。
DATA COLLECTION AND PROCESSING
Collection of CLIP-seq datasets
数据收集和处理
收集CLIP-seq数据集
POSTAR was developed to house and distribute RBP-binding sites from human and mouse (12). To expand and update our database, we manually collected newly published CLIP-seq data from the Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) databases (13). At present, POSTAR2 contains a large set of RBP-binding sites derived from CLIP-seq datasets and covers six species, including human, mouse, worm, fly, Arabidopsis and yeast (Figure 1 and Table 1). We first obtained the processed datasets in human and mouse from POSTAR (12), and the processed datasets in worm and yeast from CLIPdb (7). In addition, we collected 298 new datasets of the six species from recent publications. We also updated 332 eCLIP-seq datasets released by the ENCODE consortium (14,15). In total, POSTAR2 contains 1160 CLIP-seq datasets, which cover 284 RBPs from six species (Figure 2A). To our knowledge, this is the largest collection of RBP-binding sites identified from various CLIP-seq technologies, including HITS-CLIP, PAR-CLIP, iCLIP, eCLIP and PIP-seq (Supplementary File S1 and Supplementary File S2).
POSTAR是为了容纳和分配人类和小鼠的RBP结合点而开发的(12)。为了扩大和更新我们的数据库,我们从基因表达总库(GEO)和序列阅读档案(SRA)数据库中手动收集新发表的CLIP-seq数据(13)。目前,POSTAR2包含了一大批来自CLIP-seq数据集的RBP结合位点,涵盖了六个物种,包括人类、小鼠、蠕虫、苍蝇、拟南芥和酵母(图1和表1)。我们首先从POSTAR(12)获得了人类和小鼠的处理数据集,从CLIPdb(7)获得了蠕虫和酵母的处理数据集。此外,我们从最近的出版物中收集了六个物种的298个新数据集。我们还更新了ENCODE联盟发布的332个eCLIP-seq数据集(14,15)。总的来说,POSTAR2包含1160个CLIP-seq数据集,其中涵盖了六个物种的284个RBPs(图2A)。据我们所知,这是从各种CLIP-seq技术中发现的最大的RBP结合点集合,包括HITSCLIP、PAR-CLIP、iCLIP、eCLIP和PIP-seq(补充文件S1和补充文件S2)。
Figure 1. Framework to construct POSTAR2 database. (A) POSTAR2 covers six species including human, mouse, fly, worm, Arabidopsis and yeast. (B) POSTAR2 provides three modules: (i) ‘RBP’ module, which provides annotations and functions of RBPs, as well as RBP-binding sites; (ii) ‘RNA’ module, consisting of several sub-modules including ‘Binding sites’, ‘Crosstalk’, ‘Variation’ and ‘Disease’, which annotates the RBP-binding sites using various regulatory events and genomic variants; (iii) ‘Translatome’ module, which aims for exploring the translation landscape of genes across different tissues and cell lines. (C) POSTAR2 provides a user-friendly interface for searching and visualization such as table views, network views, histograms and heatmaps.
图1. 构建POSTAR2数据库的框架。(A) POSTAR2涵盖六个物种,包括人类、小鼠、苍蝇、蠕虫、拟南芥和酵母。(B)POSTAR2提供三个模块:(i)"RBP "模块,提供RBP的注释和功能,以及RBP结合位点;(ii)"RNA "模块,由几个子模块组成,包括 "结合位点"、"串扰"、"变异 "和 "疾病",利用各种调节事件和基因组变异来注释RBP结合位点;(iii)"翻译组 "模块,旨在探索不同组织和细胞系中基因的翻译状况。(C) POSTAR2为搜索和可视化提供了一个友好的界面,如表视图、网络视图、直方图和热图。
Identification of RBP-binding sites
RBP结合点的鉴定
For the newly collected CLIP-seq datasets, we used the uniform preprocessing pipeline from CLIPdb (7) to preprocess the raw data. Briefly, we first trimmed the adaptor sequences from the raw reads using the FASTX-Toolkit package (http://hannonlab.cshl.edu/fastx_toolkit). We retained only reads with a quality score above 20 in at least 80% of their nucleotides. Reads shorter than 13 nt after adaptor trimming were discarded. Finally, we collapsed identical reads to minimize polymerase chain reaction duplicates.
对于新收集的CLIP-seq数据集,我们使用CLIPdb(7)的统一预处理管道来预处理原始数据。简而言之,我们首先使用FASTX-Toolkit软件包(http://hannonlab.cshl.edu/fastx toolkit)从原始读数中修剪出适应体序列。我们只保留了80%的核苷酸质量得分在20分以上的读数。修剪适应体后短于13 nt的读数被丢弃。最后,我们折叠了相同的读数,以减少聚合酶链式反应的重复。
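The filtering steps described above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the CLIPdb pipeline itself: the function names, the in-memory read representation and the treatment of "above 20" as "at least 20" are assumptions made for the example, and adaptor trimming is assumed to have been done already.

```python
from collections import Counter


def passes_quality(quals, min_q=20, min_percent=80):
    """Return True if at least min_percent of the bases have quality >= min_q.

    Assumption: 'above 20' is treated as '>= 20' for this sketch.
    """
    if not quals:
        return False
    good = sum(1 for q in quals if q >= min_q)
    # Integer comparison avoids floating-point issues at the 80% boundary.
    return good * 100 >= min_percent * len(quals)


def preprocess(reads, min_len=13):
    """Filter adaptor-trimmed reads and collapse identical sequences.

    reads: iterable of (sequence, list_of_phred_scores) tuples (hypothetical format).
    Returns a Counter mapping each retained unique sequence to its copy number.
    """
    collapsed = Counter()
    for seq, quals in reads:
        if len(seq) < min_len:          # discard reads shorter than 13 nt
            continue
        if not passes_quality(quals):   # discard low-quality reads
            continue
        collapsed[seq] += 1             # identical reads collapse to one key
    return collapsed
```

Keeping the copy number in the Counter, rather than discarding it, is a design choice of the sketch; downstream peak callers would see each unique sequence once.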
After preprocessing, the retained reads were aligned to their respective genomes using Bowtie (16) and NovoAlign (http://www.novocraft.com). Notably, to make the genomic coordinates of the binding sites consistent between the newly collected data and available data in POSTAR, we used the same genome versions in POSTAR for read alignment, i.e. human (hg19) and mouse (mm10), together with the genomes for four additional species, i.e. worm (ws235), yeast (R64-1-1), fly (dmel-r6.18) and Arabidopsis (TAIR10). We then used both CLIP technology-specific and nonspecific tools to identify binding sites for each dataset, respectively. Briefly, we used Piranha (17) to identify binding sites for HITS-CLIP, PAR-CLIP and iCLIP datasets with parameter -b 20 -d ZeroTruncatedNegativeBinomial -p 0.01. We also applied CLIP technology-specific tools for binding site identification with default parameters: using PARalyzer (18) for PAR-CLIP datasets, using CIMS (19) for HITS-CLIP datasets and using CITS (a module in CIMS software) (19,20) for iCLIP datasets. The binding site coordinates from HITS-CLIP, PAR-CLIP, iCLIP and PIP-seq, which are human genome hg19-based, were converted to hg38 using the UCSC liftOver tool. As for eCLIP, the hg38-based binding sites were directly downloaded from the ENCODE data portal (https://www.encodeproject.org/, NOV 2017). Finally, we identified millions of RBP-binding sites, and visualized the RBP–RNA interaction network in human (Figure 2B).
预处理后,使用Bowtie(16)和NovoAlign(http://www.novocraft.com)将保留的读数与它们各自的基因组进行比对。值得注意的是,为了使新收集的数据和POSTAR中的可用数据之间的结合点的基因组坐标一致,我们使用POSTAR中相同的基因组版本进行读数比对,即人类(hg19)和小鼠(mm10),以及另外四个物种的基因组,即蠕虫(ws235),酵母(R64-1-1),苍蝇(dmel-r6.18)和拟南芥(TAIR10)。然后,我们分别使用CLIP技术的特异性和非特异性工具来确定每个数据集的结合点。简单地说,我们使用Piranha(17)来识别HITS-CLIP、PAR-CLIP和iCLIP数据集的结合位点,参数为-b 20 -d ZeroTruncatedNegativeBinomial -p 0.01。我们还应用CLIP技术的特定工具进行默认参数的结合点鉴定:对PAR-CLIP数据集使用PARalyzer(18),对HITS-CLIP数据集使用CIMS(19),对iCLIP数据集使用CITS(CIMS软件的一个模块)(19,20)。来自HITS-CLIP、PAR-CLIP、iCLIP和PIP-seq的结合点坐标是基于人类基因组hg19的,使用UCSC liftOver工具转换为hg38。至于eCLIP,基于hg38的结合点是直接从ENCODE数据门户下载的(https://www.encodeproject.org/,2017年11月)。最后,我们确定了数百万个RBP结合位点,并将人类的RBP-RNA相互作用网络可视化(图2B)。
Annotation of RBPs and RBP-binding sites
RBPs和RBP结合点的注释
For each RBP, we obtained information on RNA-binding domains from the Pfam database (21). We also collected GO term annotations of RBPs from AmiGO (22). We annotated RBP-binding sites using their respective genome annotations (human, Gencode V27; mouse, Gencode VM7; fly, Flybase dmel-r6.18; worm, WormBase ws235; Arabidopsis, TAIR10; yeast, SGD R64-1-1) (23–27). To enable systematic annotation of RBP-binding sites in long noncoding RNAs (lncRNAs), we used lncRNA annotations from Gencode (23) for human and mouse, and lncRNA annotations from NONCODE 2016 (28) for fly, worm, Arabidopsis and yeast. The distribution of RBP-binding sites across genomic elements differed between species (Figure 2C). We found that human and mouse exhibited similar patterns of genomic elements, suggesting the conservation of functional RBP binding between mammals.
对于每个RBP,我们从Pfam数据库(21)中获得RNA结合域的信息。我们还从AmiGO(22)收集了RBPs的GO术语注释。我们使用各自的基因组注释对RBP结合位点进行注释(人类,Gencode V27;小鼠,Gencode VM7;苍蝇,Flybase dmel-r6.18;蠕虫,WormBase ws235;拟南芥,TAIR10;酵母,SGD R64-1-1)(23-27)。为了能够系统地注释长非编码RNA(lncRNA)中的RBP结合位点,我们使用了Gencode(23)的人类和小鼠的lncRNA注释,以及NONCODE 2016(28)的苍蝇、蠕虫、拟南芥和酵母的lncRNA注释。RBP结合位点的基因组元素的分布显示了物种间的差异(图2C)。我们发现,人类和小鼠表现出类似的基因组元素模式,表明哺乳动物之间功能性RBP结合的保存。
We collected RNA-seq datasets from the 12 human cell/tissue types and 10 mouse cell/tissue types that are used in the CLIP experiments (Supplementary File S3), and mapped the reads using TopHat (29), followed by estimating the expression level of the genes using Cufflinks (30). For the 30 developmental stages from fly, 35 developmental stages from worm, 4 tissue types from Arabidopsis and 3 conditions (wild-type, glucose starvation and nitrogen starvation) for yeast, we obtained the gene expression data from the Expression Atlas (31) and our previous paper (32). We prepared and intersected miRNA-binding sites, RNA modification sites, RNA editing sites, SNPs and disease-associated variants with RBP-binding sites according to the same computational pipeline used in POSTAR (12). The coordinates of these genomic regions for human build hg19 were also converted to hg38 using the UCSC liftOver tool.
我们从CLIP实验中使用的12种人类细胞/组织类型和10种小鼠细胞/组织类型中收集RNA-seq数据集(补充文件S3),并使用TopHat(29)对读数进行映射,随后使用Cufflinks(30)估计基因的表达水平。对于苍蝇的30个发育阶段、蠕虫的35个发育阶段、拟南芥的4种组织类型和酵母的3种条件(野生型、葡萄糖饥饿和氮饥饿),我们从表达图谱(31)和我们以前的论文(32)中获得基因表达数据。我们根据POSTAR(12)中使用的相同的计算管道,准备了miRNA结合位点、RNA修饰位点、RNA编辑位点、SNPs和疾病相关变体与RBP结合位点的交叉。使用UCSC liftOver工具将人类构建hg19的这些基因组区域的坐标也转换为hg38。
We used the same strategy as in POSTAR (12) to predict sequence motifs and structural preferences of RBP-binding sites. Briefly, the binding sites from each CLIP-seq sample were separated into independent training and testing sets. Then, we used MEME (33) and HOMER (34) to identify and report up to five sequence motifs in the training set. Next, we calculated the enrichment for the initially detected motifs in the testing set using FIMO (35) and selected the three most enriched sequence motifs. The sequence motifs were visualized using WebLogo (36). To predict structural preferences of RBP-binding sites, the binding sites from each CLIP-seq sample were extended to at least 60 nt in length. We then used RNAcontext (37) to detect local structural motifs. The structural annotation used in RNAcontext included paired (P), hairpin loop (L), bulge/internal/multi-loop (M) and unstructured (U). In addition, we used RNApromo (38) to predict structural elements that are enriched within the RBP-binding sites (P-value < 0.05).
我们使用POSTAR(12)的相同策略来预测RBP结合点的序列主题和结构偏好。简而言之,来自每个CLIP-seq样本的结合位点被分离成独立的训练和测试集。然后,我们使用MEME(33)和HOMER(34)来识别和报告训练集中多达五个序列主题。接下来,我们用FIMO(35)计算了测试组中最初检测到的主题的富集度,并选择了三个富集度最高的序列主题。使用WebLogo(36)对序列图案进行了可视化。为了预测RBP结合位点的结构偏好,每个CLIP-seq样本的结合位点被扩展到至少60 nt的长度。然后我们用RNAcontext(37)来检测局部结构图案。RNAcontext中使用的结构注释包括成对的(P)、发夹环(L)、隆起/内部/多环(M)和非结构化(U)。此外,我们使用RNApromo(38)来预测在RBP结合位点内富集的结构元素(P值<0.05)。
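The step of extending binding sites to at least 60 nt before structure prediction can be illustrated as follows. The text does not say how the extension is distributed around the site, so the symmetric padding, the 0-based half-open coordinates and the function name used here are assumptions for the sketch.

```python
def extend_site(start, end, min_len=60, chrom_len=None):
    """Extend a 0-based, half-open interval [start, end) to at least min_len nt.

    Assumption: padding is split evenly between both flanks; if the interval
    runs off either end of the sequence it is shifted back inside.
    """
    length = end - start
    if length >= min_len:
        return start, end               # already long enough, leave unchanged
    pad = min_len - length
    left = pad // 2
    right = pad - left
    new_start, new_end = start - left, end + right
    if new_start < 0:                   # ran past the 5' end: shift right
        new_end -= new_start
        new_start = 0
    if chrom_len is not None and new_end > chrom_len:
        shift = new_end - chrom_len     # ran past the 3' end: shift left
        new_start = max(0, new_start - shift)
        new_end = chrom_len
    return new_start, new_end
```

For example, a 20-nt site at positions 100 to 120 would become the 60-nt window 80 to 140 under these assumptions.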
Ribo-seq datasets collection and ORF identification
Ribo-seq数据集收集和ORF鉴定
We collected 171 Ribo-seq datasets as well as matched RNA-seq datasets from the six species from the GEO and SRA databases (13) for translation efficiency (TE) calculation (Figure 2D; Supplementary File S4 and Supplementary File S5). For each Ribo-seq dataset, we took the reads overlapping annotated start codons and calculated the distance from the 5' end of each read to the first nucleotide of the annotated start codon, in order to infer the position of the peptidyl-site (P-site) for each read length. We then applied this offset to assign P-site positions to all reads of the same length, and generated a P-site signal track for all transcripts based on the inferred P-site positions of the mapped reads.
我们从GEO和SRA数据库(13)收集了六个物种的171个Ribo-seq数据集以及相匹配的RNA-seq数据集,用于计算翻译效率(TE)(图2D;补充文件S4和补充文件S5)。对于每个Ribo-seq数据集,我们与被注释的起始密码子重叠,并计算其与被注释的起始密码子的第一个核苷酸的5'距离,以推断出每个读长的肽基位点(P-site)的位置。此后,我们用这个偏移量来表示所有相同长度的读数的P-位点位置,并根据映射读数推断的P-位点位置为所有的转录本生成一个P-位点信号轨道。
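The P-site inference described above can be sketched as follows. This is a simplified illustration rather than the authors' code: the tuple layout of the reads, the function names, and the choice of the most frequent offset per read length are assumptions.

```python
from collections import Counter, defaultdict


def infer_psite_offsets(reads, start_codons):
    """Infer one P-site offset per read length from reads covering start codons.

    reads: iterable of (transcript_id, five_prime_pos, read_length), with
           0-based transcript coordinates (hypothetical format).
    start_codons: dict mapping transcript_id to the position of the first
                  nucleotide of the annotated start codon.
    Assumption: the most frequent offset for each read length is used.
    """
    by_length = defaultdict(Counter)
    for tx, five_prime, length in reads:
        start = start_codons.get(tx)
        if start is None:
            continue
        offset = start - five_prime
        # Keep only reads that fully cover the start codon.
        if offset >= 0 and offset + 3 <= length:
            by_length[length][offset] += 1
    return {length: c.most_common(1)[0][0] for length, c in by_length.items()}


def psite_track(reads, offsets, tx_lengths):
    """Build a per-transcript P-site signal track from the inferred offsets."""
    track = {tx: [0] * n for tx, n in tx_lengths.items()}
    for tx, five_prime, length in reads:
        if length not in offsets or tx not in track:
            continue                    # no offset for this length: skip the read
        pos = five_prime + offsets[length]
        if 0 <= pos < len(track[tx]):
            track[tx][pos] += 1
    return track
```

Reads whose length never overlaps a start codon receive no offset and are left out of the track in this sketch.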
Figure 2. Statistics of POSTAR2 database. (A) Number of RBPs in the human, mouse, worm, fly, Arabidopsis and yeast. (B) The distribution of human RBP-binding sites on chromosomes. HNRNPC, HNRNPA1 and U2AF2 have the largest number of binding sites among 171 human RBPs. (C) Genomic distribution of RBP-binding sites in six species identified using Piranha. (D) Summary of CLIP-seq and Ribo-seq datasets. (E) Diagram for different ORF categories. (i) Annotated ORFs (aORFs): ORFs that are annotated by reference annotation, which are colored with black in the diagram. (ii and iii) Truncated and extended ORFs: ORFs that contain the same stop codon with aORFs, but have different translation initiation sites. (iv) Internal ORFs: ORFs that are located in or have partial overlap with aORFs. (v and vi) Upstream and downstream ORFs: ORFs that are located upstream or downstream of aORFs. (vii) Unannotated ORFs: ORFs that are defined from transcripts without any reference annotation. (F) Number of ORFs for each category across six species.
图2. POSTAR2数据库的统计。(A) 人类、小鼠、蠕虫、苍蝇、拟南芥和酵母中的RBPs数量。(B) 人类RBP结合点在染色体上的分布。在171个人类RBPs中,HNRNPC、HNRNPA1和U2AF2的结合位点数量最多。(C)使用Piranha确定的六个物种中RBP结合位点的基因组分布。(D)CLIP-seq和Ribo-seq数据集的摘要。(E) 不同的ORF类别图。(i) 注释的ORFs(aORFs)。通过参考注释的ORFs,在图中用黑色表示。(ii和iii) 截断的和扩展的ORFs。包含与aORFs相同的终止密码子的ORFs,但有不同的翻译起始位点。(iv) 内部ORFs。位于或与aORFs部分重叠的ORFs。 (v和vi) 上游和下游ORFs。位于aORFs上游或下游的ORFs。 (vii) 未标明的ORFs。从没有任何参考注释的转录本中定义的ORFs。(F) 六个物种中每个类别的ORF的数量。
For each species, ORFs were predicted by scanning the transcript sequences: we defined any possible AUG start codon paired with the nearest in-frame stop codon (UAA, UAG or UGA) as an ORF. ORFs shorter than 300 nt were defined as small ORFs (sORFs). All predicted ORFs were further categorized into subtypes according to their position relative to the annotated ORFs (aORFs) (Figure 2E). In total, we identified ∼36 million ORFs among the six species, and the number of ORFs in each category differed among the six species (Figure 2F). To identify translated ORFs across different tissue types, cell lines, developmental stages and conditions, we used several computational tools, including RiboWave (39), RiboTaper (40), ORFscore (41) and RibORF (42), to detect the pattern of 3-nt periodicity within each ORF, as well as the uneven distribution of reads among the different reading frames during translation. Default parameters were used for these tools.
对于每个物种,通过扫描转录本序列来预测ORFs,其中我们将任何可能的AUG起始密码子与最近的框内终止密码子(UAA、UAG和UGA)配对定义为一个ORF。短于300 nt的ORF被定义为小ORF(sORF)。所有预测的ORFs根据其与aORFs的相对位置被进一步分为不同的亚型(图2E)。在这六个物种中,我们总共确定了3600万个ORFs,ORFs的数量显示了六个物种中不同类别之间的差异(图2F)。为了识别不同组织类型、细胞系、发育阶段和条件下的翻译ORF,我们使用了几种计算工具,包括RiboWave(39)、RiboTaper(40)、ORFscore(41)和RibORF(42),以检测每个ORF内的3-nt周期性模式,以及翻译时不同阅读框架的不均匀分布。这些工具都使用了默认参数。
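The ORF definition above (every AUG paired with its nearest in-frame stop codon) can be written as a short scan. The sketch below is an illustration of that rule only; the function names are hypothetical, and counting the stop codon in the ORF length when applying the 300-nt sORF cutoff is an assumption.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_orfs(transcript):
    """Return (start, end) for every AUG paired with its nearest in-frame stop.

    Coordinates are 0-based and half-open; end includes the stop codon.
    DNA input is accepted and converted to the RNA alphabet.
    """
    rna = transcript.upper().replace("T", "U")
    orfs = []
    for start in range(len(rna) - 2):
        if rna[start:start + 3] != "AUG":
            continue
        # Walk downstream codon by codon in the frame set by this AUG.
        for pos in range(start + 3, len(rna) - 2, 3):
            if rna[pos:pos + 3] in STOP_CODONS:
                orfs.append((start, pos + 3))
                break                   # nearest stop only
    return orfs


def is_small_orf(orf, max_len=300):
    """True if the ORF is shorter than max_len nt (assumed to include the stop)."""
    start, end = orf
    return end - start < max_len
```

An AUG with no downstream in-frame stop codon yields no ORF, which matches the pairing rule stated in the text.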
Translation efficiency and translation density calculation
翻译效率和翻译密度的计算
Translation efficiency (TE) measures the rate at which messenger RNA is translated into protein, and can be estimated as the ratio between the RPKM values of Ribo-seq and RNA-seq (6). We calculated TE under different tissue types, cell lines, developmental stages and conditions. We used either the original Ribo-seq signal (raw data) or the denoised periodic footprint from RiboWave (39) (denoised data) as the estimate of Ribo-seq signal strength.
翻译效率(TE)衡量信使RNA翻译成蛋白质的速度,它可以被估计为Ribo-seq和RNA-seq的RPKM值之间的比率(6)。我们在不同的组织类型、细胞系、发育阶段和条件下计算TE。我们使用Ribo-seq的原始信号(原始数据)或由RiboWave(39)去噪的周期性足迹(去噪数据)作为核糖核酸信号强度的估计。
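As a worked illustration of the ratio described above, the sketch below computes RPKM with its standard definition (reads per kilobase of feature per million mapped reads) and divides the Ribo-seq value by the RNA-seq value. The function names and the handling of zero RNA-seq expression are assumptions for the example.

```python
def rpkm(read_count, feature_length_nt, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count * 1e9 / (feature_length_nt * total_mapped_reads)


def translation_efficiency(ribo_rpkm, rna_rpkm):
    """TE as the ratio of Ribo-seq RPKM to RNA-seq RPKM.

    Assumption: TE is left undefined (None) when the gene has no RNA-seq signal.
    """
    if rna_rpkm <= 0:
        return None
    return ribo_rpkm / rna_rpkm
```

For example, 100 footprints on a 1000-nt ORF in a library of one million mapped reads gives an RPKM of 100; if the matched RNA-seq RPKM is 25, the TE is 4.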
Translation density is determined by normalizing the abundance of Ribo-seq reads along the studied ORF by the length of the ORF, which estimates the translation intensity of the ORF. We calculated translation density using both raw data (original Ribo-seq signal) and denoised data (RiboWave-derived footprint) as input, and present the results of both methods.
翻译密度是通过将沿所研究的ORF的Ribo-seq读数的丰度与ORF的长度归一来估计ORF的强度来确定的。我们使用原始数据(原始ribo-seq信号)和去噪数据(RiboWave衍生的足迹)作为输入计算翻译密度,并在两种方法中展示结果。
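Translation density as defined above reduces to a length-normalized sum over the ORF. The minimal sketch below assumes the signal is available as a per-nucleotide list of read counts along the transcript (raw or denoised) and uses 0-based half-open ORF coordinates; both are assumptions of the example, as is the function name.

```python
def translation_density(signal_track, orf_start, orf_end):
    """Sum of Ribo-seq signal within [orf_start, orf_end) divided by ORF length."""
    length = orf_end - orf_start
    if length <= 0:
        raise ValueError("ORF must have positive length")
    return sum(signal_track[orf_start:orf_end]) / length
```

The same function can be applied to either the raw or the denoised track, which mirrors the two inputs mentioned in the text.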
Database architecture
数据库架构
All data in POSTAR2 were processed and stored in a MySQL database (version 5.6.39). The client-side user interface was implemented with HTML5 and JavaScript libraries, including jQuery (http://jquery.com) and Bootstrap (http://getbootstrap.com). The server side was implemented with PHP scripts (version 5.6.39) and JavaScript. Plots of query results in POSTAR2 were generated with the plotly.js library (https://plot.ly) and Highcharts (https://www.highcharts.com). Tables of query results were produced with the DataTables JavaScript library (https://www.datatables.net), which allows users to search and sort results. Visualization was implemented using the UCSC Genome Browser. We have tested the website in several popular browsers, including Google Chrome, Safari, Internet Explorer and Firefox.
POSTAR2中的所有数据都被处理并存储到MySQL数据库(5.6.39版)。客户端的用户界面由HTML5和JavaScript库实现,包括jQuery(http://jquery.com)和Bootstrap(http://getbootstrap.com)。服务器端使用的是PHP脚本(5.6.39版)和JavaScript。POSTAR2中的查询结果图是由plotly.js库(https://plot.ly)和Highcharts(https://www.highcharts.com)生成。查询结果的表格由DataTables JavaScript库(https://www.datatables.net)生成,允许用户对结果进行搜索和排序。可视化是使用UCSC基因组浏览器实现的。我们已经在几个流行的浏览器中测试了网络,包括谷歌浏览器、Safari、Internet Explorer和Firefox。
DATABASE FEATURES AND APPLICATIONS
数据库功能和应用
Web interface
网络界面
POSTAR2 provides a user-friendly interface for searching and visualizing protein–RNA interactions with multi-layer information on post-transcriptional regulation, disease-associated variation, as well as the translation landscape of RNAs. POSTAR2 contains three modules (Figure 1B): (i) ‘RBP’ module; (ii) ‘RNA’ module, consisting of several sub-modules including ‘Binding sites’, ‘Crosstalk’, ‘Variation’ and ‘Disease’ and (iii) ‘Translatome’ module. We briefly introduce each module below.
POSTAR2提供了一个用户友好的界面,用于搜索和可视化蛋白质-RNA的相互作用,包括转录后调控、疾病相关变异以及RNA的翻译情况等多层信息。POSTAR2包含三个模块(图1B):(i)"RBP "模块;(ii)"RNA "模块,由几个子模块组成,包括 "结合点"、"串扰"、"变异 "和 "疾病";(iii)"翻译组 "模块。在此,我们简要地介绍一下每个模块。
The ‘RBP’ module provides various annotations for the RBPs, including RNA recognition domains, RBP ontology, sequence motifs and structural preferences, as well as all the binding sites for the query RBP and enriched GO terms for the target genes (Figure 1C, lower-left panel).
RBP "模块提供了RBP的各种注释,包括RNA识别域、RBP本体、序列图案和结构偏好,以及查询RBP的所有结合位点和富集的目标基因的GO术语(图1C,左下角面板)。
As for the ‘RNA’ module (Figure 1C, upper panel), the ‘Binding sites’ sub-module provides all of the RBP-binding sites of the target gene, regardless of the CLIP-seq technology or peak calling method used. Furthermore, table and network views present the interactions between RBPs and target genes. We also collected multiple annotations for the target gene, including genomic location, associated diseases, as well as expression patterns across different cell lines, tissue types, developmental stages or conditions. In addition, we defined ‘RBP-binding hotspots’ as the number of binding proteins in each 20-nt bin along the RNA precursor, which gives users an overview of the regions of each RNA precursor that are most densely bound by RBPs. The ‘Crosstalk’ sub-module provides the interactions between RBP-binding sites and post-transcriptional regulatory events, including miRNA targets, RNA modification and RNA editing (Figure 1B). RBPs participate in various steps and play vital roles in most post-transcriptional regulation processes, so users can investigate potential crosstalk between these regulatory events in this module. To understand how various genomic variants affect RBP binding and cooperate to orchestrate post-transcriptional regulation, the ‘Variation’ sub-module and the ‘Disease’ sub-module integrate SNVs and disease-associated SNVs to provide insights into the causal SNVs underlying regulatory mechanisms and human diseases (Figure 1B).
至于 "RNA "模块(图1C,上图),"结合点 "子模块提供了目标基因的所有RBP结合点,无论不同的CLIP-seq技术或不同的峰值调用方法。此外,表格和网络视图呈现了RBPs和目标基因的相互作用。我们还收集了目标基因的多个注释,包括基因组位置、相关疾病以及不同细胞系、组织类型、发育阶段或条件下的表达模式。此外,我们定义了 "RBP结合热点",以解码沿RNA前体的每个20nt仓的结合蛋白数量,从而向用户提供每个RNA前体的RBP结合热点区域的概况。Crosstalk "子模块提供了RBP结合点和转录后调控的相互作用,包括miRNA靶点、RNA修饰和RNA编辑(图1B)。RBPs参与各种步骤,在大多数转录后调控过程中发挥重要作用,因此用户可以在该模块中研究这些调控事件的潜在串扰。为了了解各种基因组变异如何影响RBP的结合并合作协调转录后调控,"变异 "子模块和 "疾病 "子模块整合了SNVs和疾病相关的SNVs,以提供对调控机制和人类疾病的因果SNVs的洞察力(图1B)。
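The ‘RBP-binding hotspots’ idea, counting how many distinct RBPs bind in each 20-nt bin along a precursor RNA, can be sketched as below. The input format, the function name and the rule that a site contributes to every bin it overlaps are assumptions made for this illustration; the RBP names in the usage note are taken from the text only as labels.

```python
def rbp_hotspots(sites, gene_start, gene_end, bin_size=20):
    """Count distinct RBPs per bin along a gene.

    sites: iterable of (rbp_name, start, end), 0-based half-open, in the same
           coordinate system as gene_start/gene_end (hypothetical format).
    Returns a list with one count per bin, ordered from gene_start.
    """
    n_bins = -(-(gene_end - gene_start) // bin_size)   # ceiling division
    rbps_per_bin = [set() for _ in range(n_bins)]
    for rbp, start, end in sites:
        s, e = max(start, gene_start), min(end, gene_end)
        if s >= e:
            continue                    # site lies outside the gene
        first = (s - gene_start) // bin_size
        last = (e - 1 - gene_start) // bin_size
        for b in range(first, last + 1):
            rbps_per_bin[b].add(rbp)    # sets avoid double-counting an RBP
    return [len(names) for names in rbps_per_bin]
```

With a 100-nt gene and sites for ELAVL1 at 5 to 25 and 0 to 10, FUS at 10 to 15 and TARDBP at 90 to 120, the five bins would hold 2, 1, 0, 0 and 1 distinct RBPs.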
In addition to the above two modules, we also built a new module, ‘Translatome’, for characterizing the translation landscape of RNAs (Figure 1C, lower-right panel). Users can choose a species (e.g. human, mouse, fly, worm, Arabidopsis or yeast) and input a gene name to search for. POSTAR2 returns a summary frame and three tables. The summary frame contains a histogram that shows the number of ORFs in different categories and a heat map that shows the density of each ORF across various samples. The three tables present aORFs, extended/truncated ORFs and other ORFs, respectively, and each ORF is labeled with the transcript ID, the relative reading frame of the ORF, and the translation start and termination sites. Users can also sort ORFs by length in these tables to pick out sORFs, which are shorter than 300 nt. Moreover, each ORF ID links to more details about the translation pattern of that ORF, including translation efficiency, translation density and the identified translated region of the ORF. A column diagram allows the translation state of the ORF to be compared across different tissue types, cell lines, developmental stages or conditions. In addition, users can select the conditions they are interested in to simultaneously visualize the signal tracks of each ORF along the transcript on which it is located.
除了上述两个模块,我们还建立了一个新的模块 "Translatome",用于描述RNA的翻译景观(图1C,右下角面板)。用户可以选择一个物种(如人类、小鼠、苍蝇、蠕虫、拟南芥或酵母),并输入一个基因名称进行搜索。POSTAR2返回一个摘要框和三个表格,摘要框包含一个直方图,显示不同类别的ORF数量,热图提供了每个ORF在不同样本中的密度。这三个表格分别显示了aORFs、扩展/截断的ORFs和其他ORFs,每个ORF都根据转录本ID、ORF的相对阅读框架、翻译起始点和终止点进行了标注。用户还可以在这些表格中按长度对ORF进行排序,以筛选出短于300 nt的sORF。此外,每个ORF ID提供了一个链接,以了解有关该ORF翻译模式的更多细节,包括翻译效率、翻译密度和确定的ORF的翻译区域。柱状图提供可视化,以比较不同组织类型、细胞系、发育阶段或条件的ORF的翻译状态。此外,用户可以选择他们感兴趣的条件,沿其定位的转录本同时可视化每个ORF的信号轨迹。
Example applications
应用实例
We designed a user-friendly interface, which provides a platform to connect protein–RNA interactions with multi-layer information on post-transcriptional regulation and disease-associated variants, as well as the translation landscape of RNAs. Here, we use ADAM17 as an example application to demonstrate how to explore the potential regulatory mechanisms underlying human diseases.
我们设计了一个用户友好的界面,它提供了一个平台,将蛋白质-RNA相互作用与转录后调控和疾病相关变体的多层信息,以及RNA的翻译景观联系起来。在此,我们以ADAM17为例说明了如何探索人类疾病背后的潜在调控机制。
ADAM17 encodes a membrane-bound protease, and a previous study demonstrated its role in tumorigenesis and invasiveness, especially in breast cancer (43). Using TCGA expression data (44), we observed overexpression of ADAM17 in most tumor samples compared with normal tissues. However, ADAM17 expression at the protein level and the potential regulatory mechanism remain unexplored. When we queried ‘ADAM17’ in the ‘Translatome’ module, POSTAR2 returned a histogram showing the numbers of categorized ORFs of ADAM17. Users can click on the ORF IDs for more details. The estimated translation efficiency and the signal tracks reveal up-regulation at the translational level in tumor samples compared with normal samples. For instance, both raw data and denoised data showed up-regulated translation efficiency in tumor tissue compared with paired normal tissue of brain and kidney (Figure 3A). To understand the potential mechanisms that contribute to overexpression of ADAM17 at the transcriptional and translational levels, POSTAR2 can shed light on the role of RBPs in the regulatory mechanism. In the ‘RNA’ module, the many RBP-binding sites identified by different CLIP-seq technologies, the interaction network and the RBP-binding hotspots show the number of RBPs involved in the regulation of ADAM17 (Figure 3B). Among these RBPs, some, such as EIF3B, EIF3G and EIF4A3, are components of eukaryotic translation factor complexes, which suggests that interactions with these RBPs may participate in the translational regulation of ADAM17. In addition, RBPs such as FUS, TARDBP and ELAVL1 may contribute to RNA stability, which could result in aberrant expression levels of RNAs or proteins. Finally, the output of the ‘Disease’ sub-module shows that many cancer mutations are located in the RBP-binding regions of ADAM17, especially in kidney and brain tumors.
ADAM17编码为一种膜结合蛋白酶,以前的研究表明它在肿瘤发生和侵袭中的作用,特别是乳腺癌(43)。我们利用TCGA的表达数据观察到,与正常组织相比,大多数肿瘤样本的ADAM17都过度表达(44)。然而,ADAM17在蛋白水平上的表达和潜在的调节机制仍未被探索。我们在 "Translatome "模块中查询了 "ADAM17",POSTAR2返回了一个直方图,显示了ADAM17的分类ORF的数量。用户可以点击ORF ID,了解更多细节。对翻译效率和信号轨迹的估计显示,与正常人相比,肿瘤样本在翻译水平上有上调。例如,原始数据和去噪数据都显示,与配对的正常脑和肾组织相比,肿瘤组织的翻译效率上升了(图3A)。为了了解导致ADAM17在转录和翻译水平过度表达的潜在机制,POSTAR2揭示了RBP在调节机制中的作用。在 "RNA "模块中,由不同的CLIP-seq、相互作用网络和RBP结合热点确定的大量RBP结合点代表了参与ADAM17调控的RBP数量(图3B)。在这些RBPs中,一些RBPs如EIF3B、EIF3G和EIF4A3是真核翻译因子复合物的组成部分,这表明这些RBPs的相互作用可能参与ADAM17的翻译调节。此外,像FUS、TARDBP和ELAVL1这样的RBPs可能有助于RNAs的稳定性,从而导致RNAs或蛋白质的表达水平失常。此外,"疾病 "子模块的输出显示,许多癌症突变位于ADAM17的RBP结合区,特别是在肾脏肿瘤和脑瘤。
Figure 3. Integrative viewing of the translation activity of a target gene (ADAM17) and its post-transcriptional regulation events. (A) In the ‘Translatome’ module, all ORFs in ADAM17 are summarized based on their categories (i). Users can investigate each ORF by clicking on the name of the ORF (ii). For example, in ADAM17, the estimated translation efficiency (iii) and the signal track (iv) reveal potential translational up-regulation in tumor samples compared with normal samples. (B) In the RBP module, a search for ADAM17 provides the interaction network of the ADAM17 gene and various RBPs (v). The number of RBPs binding along the transcript (vi) and the genomic context of the binding sites (vii) can be visualized and searched. Finally, the impact of SNVs in RBP-binding sites in both the TCGA (viii) and COSMIC (ix) datasets further supports the association between ADAM17 and tumorigenesis.
图3. 对一个目标基因(ADAM17)的翻译活动及其转录后调控事件的综合观察。(A) 在 "Translatome "模块中,ADAM17的所有ORFs根据其类别被总结出来(i)。用户可以通过点击ORF的名称来调查每个ORF(ii)。例如,在ADAM17中,对翻译效率(iii)和信号轨迹(iv)的估计揭示了与正常人相比,肿瘤样本中翻译上调的潜力。(B)在RBP模块中,对ADAM17的搜索提供了ADAM17基因和各种RBPs的相互作用网络(v)。沿着转录本结合的RBPs数量(vi)和结合点的基因组背景(vii)可以被可视化和搜索到。最后,在TCGA(viii)和COSMIC(ix)数据集中,RBP结合位点的SNVs的影响进一步支持了ADAM17和肿瘤发生之间的关联。
DISCUSSION AND FUTURE DIRECTIONS
讨论和未来方向
POSTAR2 aims to decipher the post-transcriptional regulatory logics by integrating large-scale high-throughput sequencing datasets and other public resources. To our knowledge, POSTAR2 hosts the largest collection (∼40 million) of RBP-binding sites identified from CLIP-seq experiments, and enables the exploration of RNA–protein interactions together with other post-transcriptional regulatory events and genomic variations. Moreover, Ribo-seq data were incorporated and analyzed to reveal the translational dynamics of RNAs. POSTAR2 enables integrated navigation of RBP-binding sites with multi-layer information on post-transcriptional regulation, phenotypes and diseases, as well as the translational landscapes of RNAs.
POSTAR2旨在通过整合大规模高通量测序数据集和其他公共资源来破译转录后调控逻辑。据我们所知,POSTAR2拥有从CLIP-seq实验中发现的最大的RBP结合位点集合(∼4千万),并且能够探索RNA-蛋白与其他转录后调控事件和基因组变异的相互作用。此外,Ribo-seq数据被纳入并分析,以揭示RNA的翻译动态。POSTAR2使RBP结合点的综合导航与转录后调控、表型、疾病以及RNA的翻译景观的多层信息相结合。
In comparison with our previous version of POSTAR, POSTAR2 has the following novel features and improvements: (i) POSTAR2 integrates more CLIP-seq datasets from human and mouse. (ii) POSTAR2 includes CLIP-seq datasets from more species, including fly, worm, Arabidopsis and yeast. In total, we added and updated ∼500 CLIP-seq datasets in POSTAR2. (iii) POSTAR2 has a new module ‘Translatome’, which provides ∼36 million ORFs in the genomes from the six species. (iv) POSTAR2 annotates the RBP-binding sites with updated functional data resources. For example, we updated ∼1 million RNA modification sites and RNA editing sites curated from other databases and publications (45–47); updated and added ∼20 million SNPs from the genomes of the six species (48), as well as the latest mutation-calling results for TCGA samples (49). Finally, POSTAR2 provides an updated interactive interface to facilitate the investigation and exploration of RNA–protein interactions and the translational landscape.
与我们以前的POSTAR版本相比,POSTAR2有以下新特点和改进。(i) POSTAR2整合了更多来自人类和小鼠的CLIP-seq数据集。 (ii) POSTAR2包括更多物种的CLIP-seq数据集,包括苍蝇、蠕虫、拟南芥和酵母。我们总共在POSTAR2中增加和更新了500个CLIPseq数据集。(iii) POSTAR2有一个新的模块 "Translatome",它提供了六个物种的基因组中的3600万个ORFs。(iv) POSTAR2通过更新功能数据资源对RBP结合点进行注释。例如,我们更新了从其他数据库和出版物中策划的100万个RNA修饰位点和RNA编辑位点(45-47);更新并增加了6个物种基因组中的2000万个SNPs(48),以及TCGA样本的最新突变计数的结果(49)。最后,POSTAR2提供了一个更新的互动界面,以促进对RNA-蛋白质相互作用和翻译景观的调查和探索。
With advances in high-throughput sequencing technologies, CLIP-seq and Ribo-seq will be applied to more cell and tissue types in more species, and more functional genomics datasets will be generated. We will continue to integrate new incoming data and improve the web interface for navigation and visualization. We will maintain and keep updating POSTAR2 to ensure it remains a valuable resource for the research community.
随着高通量测序技术的进步,CLIP-seq和Ribo-seq技术将被应用于更多物种的更多细胞和组织类型,并将产生更多的功能基因组学数据集。我们将继续整合新进入的数据,并改进网络界面的导航和可视化。我们将维护并不断更新POSTAR2,以确保它继续成为研究界的宝贵资源——(现在进不去。。。)。
DATA AVAILABILITY
POSTAR2 is freely available at http://lulab.life.tsinghua.edu.cn/postar. The datasets in POSTAR2 can be downloaded and used in accordance with the GNU Public License and the licenses of their primary data sources.
POSTAR2可在http://lulab.life.tsinghua.edu.cn/postar 上免费获取。POSTAR2中的数据集可以按照GNU公共许可证和其主要数据源的许可证下载和使用。
ACKNOWLEDGEMENTS
We thank the reviewers and editor for their comments and suggestions, which significantly improved the manuscript. We thank the ENCODE Project Consortium for sharing the eCLIP-seq data publicly.
我们感谢审稿人和编辑的意见和建议,这些意见和建议大大改进了稿件。我们感谢ENCODE项目联盟公开分享eCLIP-seq数据。
FUNDING
National Key Research and Development Plan of China [2016YFA0500803]; National Natural Science Foundation of China [31522030, 31771461]; Fok Ying-Tong Education Foundation; Beijing Advanced Innovation Center for Structural Biology; Bio-Computing Platform of Tsinghua University Branch of China National Center for Protein Sciences (Beijing). Funding for open access charge: National Key Research and Development Plan of China [2016YFA0500803]. Conflict of Interest Statement. None declared.
国家重点研发计划[2016YFA0500803];国家自然科学基金[31522030, 31771461];霍英东教育基金会;北京结构生物学高级创新中心;中国国家蛋白质科学中心(北京)清华大学分中心生物计算平台。开放访问收费的资金。中国国家重点研发计划[2016YFA0500803]。利益冲突声明。无人申报。