How We Use Apache Druid's Real-Time Analytics to Power Kidtech at SuperAwesome

Co-authored with Saydul Bashar

Here at SuperAwesome, our mission is to make the internet safer for kids; to help accomplish this goal, our products power over 12 billion kid-safe digital transactions every month.

Digital transactions come in many forms and could be:

  • An ad served to a kid’s device through AwesomeAds

  • A video view on our new kids’ video gaming platform, Rukkaz

  • A like, comment, post or re-jam on PopJam

Every digital transaction is processed to be instantly available for real-time analytics.

In kidtech, kid-safety and privacy protection are paramount, and a traditional approach to analytics and data engineering wouldn’t necessarily be COPPA and GDPR-K compliant.

What makes a traditional digital transaction kid-safe is the absolute absence of personally identifiable information (PII), which is the foundation of our zero-data approach. This is the main characteristic that makes our real-time analytics kid-safe.

Our kid-safe real-time analytics allow us to make the best and quickest decisions for our products and services, as well as our customers, and it enables us to work and iterate in a data-driven way.

Aside from helping us make product and customer decisions, real-time analytics is also used to power some of our products. This is the case for AwesomeAds, where we use this data to drive real-time decision making.

When it comes to collecting, processing and storing this mammoth number of transactions with efficiency and durability, we found Apache Druid to be the perfect database for the job.

What is Apache Druid?

Apache Druid is an open source distributed data store. Druid’s core design combines ideas from data warehouses, time series databases, and search systems to create a unified system for real-time analytics for a broad range of use cases. Druid merges key characteristics of each of the 3 systems into its ingestion layer, storage format, querying layer, and core architecture. — druid.io

In essence, Apache Druid is a highly performant, scalable and resilient database with low latency and a high ingestion rate.

Druid achieves this high performance for a number of reasons (a small query sketch follows the list below):

  • It uses column-oriented storage, therefore it only needs to load the exact columns needed for a particular query

  • The high scalability of the database keeps query times in the range of sub-second to a couple of seconds

  • Druid can perform queries in parallel across a cluster, meaning a single query could be processed on many nodes

  • Druid creates indexes that power extremely fast filtering and searching across multiple columns

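To make the column-oriented point concrete, here is a minimal sketch of a Druid SQL query issued over HTTP against a broker or router. The endpoint address, the `transactions` datasource and its `country`/`impressions` columns are hypothetical placeholders rather than our real schema; the point is that only the columns referenced in the query are read from the segments, and the `__time` filter lets Druid skip whole segments.

```python
import requests

# Hypothetical router/broker address and datasource; adjust for your cluster.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

# Only the referenced columns are loaded from the column-oriented segments,
# and the __time filter lets Druid prune segments outside the interval.
sql = """
SELECT country, SUM(impressions) AS impressions
FROM transactions
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY country
ORDER BY impressions DESC
LIMIT 10
"""

response = requests.post(DRUID_SQL_URL, json={"query": sql}, timeout=30)
response.raise_for_status()
for row in response.json():
    print(row)
```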

How have we set up our Druid infrastructure?

Infrastructure-as-Code

At SuperAwesome, we’re big believers in Infrastructure-as-Code. We use Terraform to set up our infrastructure on AWS, and we use Helm to set up and manage our infrastructure inside our Kubernetes clusters.

Some key advantages of following Infrastructure-as-code practices are:

  • Everything is in code! We have a single source of truth with the definition of all our infrastructure

  • We have immutable infrastructure. We don’t make changes through the AWS console or the Kubernetes dashboard; we just define what our infrastructure should look like in a declarative language and let the tooling apply it for us

  • It helps us maintain environment parity. We can then have the same template with different values for different environments

Druid, Kubernetes and Hadoop

Our different Druid processes run in StatefulSets in order for us to provide them with persistent volumes. Although our Druid setup is stateless, having data available on disk enables our processes to recover much faster when they restart.

These Druid processes communicate with each other through domain names. Kubernetes pods change IP addresses when they are recreated, so by using domain names we allow the overlord to see that the recreated middle manager / historical nodes are the same. This avoids any need for re-networking between the processes and allows for smooth re-creations.

What does our Druid architecture look like?

[Diagram: Druid architecture]

Data Ingestion

In order to collect the best analytics from our products, our digital transactions are sent to Druid using Apache Kafka events. We found that Apache Kafka is an ideal system to integrate with Druid, particularly for its properties that make exactly-once ingestion possible.

Kafka events are pulled into Druid by the middle manager nodes as opposed to being pushed in by brokers. This allows the middle managers to manage their own rate of ingestion, and as each Kafka event is tagged with metadata about its partition and offset, the middle managers can verify that they received what they expected and that no messages were dropped or re-sent.

These data ingestion tasks are distributed to the middle manager nodes using the overlords — once consumed, the Druid segments are stored in deep storage (S3). As long as Druid processes can see this storage infrastructure and access these segments, there will be no data loss in the event of losing any Druid nodes.

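As a rough sketch of how such an ingestion pipeline is wired up, the snippet below submits a Kafka supervisor spec to the overlord’s HTTP API. The overlord address, topic, datasource name, dimensions and metrics are hypothetical placeholders, and the exact spec layout can differ between Druid versions:

```python
import requests

# Hypothetical overlord address, topic and schema; not our production values.
OVERLORD_URL = "http://localhost:8090/druid/indexer/v1/supervisor"

supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "transactions",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["country", "campaignId"]},
            "metricsSpec": [
                {"type": "count", "name": "count"},
                {"type": "longSum", "name": "impressions", "fieldName": "impressions"},
            ],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "HOUR",  # hourly segments, rolled up later
                "queryGranularity": "HOUR",
                "rollup": True,
            },
        },
        "ioConfig": {
            "type": "kafka",
            "topic": "digital-transactions",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "kafka:9092"},
            "taskCount": 1,
            "replicas": 1,
            "taskDuration": "PT1H",
            "useEarliestOffset": False,
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# The overlord keeps this supervisor running; the middle managers pull from
# Kafka under its control, tracking partitions and offsets for exactly-once ingestion.
response = requests.post(OVERLORD_URL, json=supervisor_spec, timeout=30)
response.raise_for_status()
print(response.json())  # e.g. {"id": "transactions"}
```

Once the supervisor is accepted, the overlord hands ingestion tasks to the middle managers, which consume the Kafka partitions and verify the offsets as described above.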

Data Loading

The coordinator process is responsible for assigning newly added segments to the historical nodes for local caching — this is done with the help of Zookeeper.

When a historical node notices a new segment has been added to deep storage, it will check in its local cache for whether it has information on this segment. If that information doesn’t exist, the historical node will download metadata about this new segment from Zookeeper. This metadata would include details about where the segment is located in deep storage, as well as how to decompress and process the segment.

Once the historical node has finished processing the segment, Zookeeper is then made aware of the processed segment, and the segment becomes available for querying.

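A simple way to observe this from the outside is the coordinator’s load-status endpoint, which reports how much of each datasource’s available segments the historical nodes have cached and can serve. A minimal sketch, assuming a coordinator at its default port:

```python
import requests

# Hypothetical coordinator address; the default coordinator port is 8081.
COORDINATOR_URL = "http://localhost:8081/druid/coordinator/v1/loadstatus"

# Returns, per datasource, the percentage of available segments that the
# historical nodes have downloaded from deep storage and can serve.
response = requests.get(COORDINATOR_URL, timeout=30)
response.raise_for_status()
for datasource, pct_loaded in response.json().items():
    print(f"{datasource}: {pct_loaded}% of segments loaded")
```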

Querying

When it comes to relaying these events back to the client, queries are processed by the broker nodes, which evaluate the metadata published to Zookeeper about which segments exist on each set of nodes (either middle managers or historicals), and then route the queries to the correct nodes to retrieve those particular segments.

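For illustration, this is roughly what a native query against a broker looks like; the broker address, datasource and metric are hypothetical placeholders. The broker works out which nodes serve segments for the requested interval, sends the query to them in parallel and merges the partial results:

```python
import requests

# Hypothetical broker address; the default broker port is 8082.
BROKER_URL = "http://localhost:8082/druid/v2"

# A native timeseries query: the broker routes it to whichever nodes serve
# segments for this datasource and interval, then merges their partial results.
query = {
    "queryType": "timeseries",
    "dataSource": "transactions",
    "granularity": "hour",
    "intervals": ["2020-06-01T00:00:00Z/2020-06-02T00:00:00Z"],
    "aggregations": [
        {"type": "longSum", "name": "impressions", "fieldName": "impressions"}
    ],
}

response = requests.post(BROKER_URL, json=query, timeout=30)
response.raise_for_status()
for bucket in response.json():
    print(bucket["timestamp"], bucket["result"])
```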

Re-indexing

We regularly re-index segments using Apache Hadoop, which has been set up on AWS EMR using Terraform. Re-indexing (also known as re-building segments) reduces the number of rows in the database, therefore helping to ensure our queries won’t get slower over time.

When we ingest real-time data, the granularity we maintain is hourly. At the end of each day, we have Kubernetes cron jobs which deploy re-indexing jobs on Hadoop to roll-up these hourly segments into daily segments. Finally, at a certain point during the month, we re-index these daily segments into monthly segments. In essence, the idea behind re-indexing is to roll-up our metrics into larger time frames — this, of course, means we aren’t able to query the smaller granularities, but it greatly reduces our storage requirements costs and query times.

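Our re-indexing jobs actually run as Hadoop tasks on EMR, but the same roll-up idea can be sketched with Druid’s native batch re-indexing, which reads a day of hourly data back out of the datasource and rewrites it as daily segments. The overlord address, datasource, columns and interval below are hypothetical placeholders, and the spec fields can vary between Druid versions:

```python
import requests

# Hypothetical overlord address; tasks are submitted to /druid/indexer/v1/task.
OVERLORD_TASK_URL = "http://localhost:8090/druid/indexer/v1/task"

reindex_task = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "transactions",
            "timestampSpec": {"column": "__time", "format": "millis"},
            "dimensionsSpec": {"dimensions": ["country", "campaignId"]},
            "metricsSpec": [
                {"type": "longSum", "name": "impressions", "fieldName": "impressions"}
            ],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "DAY",  # roll hourly segments up into daily ones
                "queryGranularity": "DAY",
                "rollup": True,
                "intervals": ["2020-06-01/2020-06-02"],
            },
        },
        "ioConfig": {
            "type": "index_parallel",
            # Read the existing hourly data for that day back out of Druid itself.
            "inputSource": {
                "type": "druid",
                "dataSource": "transactions",
                "interval": "2020-06-01/2020-06-02",
            },
            "appendToExisting": False,
        },
        "tuningConfig": {"type": "index_parallel"},
    },
}

response = requests.post(OVERLORD_TASK_URL, json=reindex_task, timeout=30)
response.raise_for_status()
print(response.json())  # e.g. {"task": "index_parallel_transactions_..."}
```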

Final Thoughts

Thanks to our approach to real-time analytics, we’re able to relay key insights and data points back to our stakeholders and customers, as well as use this data to power our products themselves. This all happens in a kid-safe way and enables us to deliver the best level of service to billions of U16s every month.

Our approach to kid-safe analytics and data engineering keeps evolving and we are always looking for bright minds and driven individuals to take it to the next level. Do you want to be part of it? Check out our job openings here.

Translated from: https://medium.com/superawesome-engineering/how-we-use-apache-druids-real-time-analytics-to-power-kidtech-at-superawesome-8da6a0fb28b1
