What to consider for painless Apache Kafka integration

Apache Kafka’s real-world adoption is exploding, and it claims to dominate the world of stream data. It has a huge developer community all over the world that keeps on growing. But it can be painful too. So, before jumping in head first and fully integrating with Apache Kafka, let’s test the waters and plan ahead for a painless integration.

What is it?

Apache Kafka is an open-source framework for asynchronous messaging and a distributed streaming platform. It is TCP-based. Messages are persisted in topics. Message producers are called publishers and message consumers are called subscribers.

Consumers can subscribe to one or more topics and consume all the messages in those topics. Messages are written into a topic’s partitions.

Topics are always multi-subscriber; a topic can have zero, one, or many consumers that subscribe to the data written to it. For each topic, Kafka maintains a partitioned log. Metadata for the partitions’ logs and topics is usually managed by Zookeeper.

If you would like to learn more about Kafka message delivery semantics — at most once, at least once, and exactly once — read here.

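To make the publisher side concrete, here is a minimal sketch, assuming a local broker and a hypothetical "events" topic, of a Java producer publishing a single message:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class MinimalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one message to the (hypothetical) "events" topic;
            // with a null key, Kafka picks the partition for us.
            producer.send(new ProducerRecord<>("events", "hello kafka"));
        }
    }
}
```

The subscriber side mirrors this with a KafkaConsumer, shown in the consumer-group sketch further down.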

Many tech companies have already integrated Apache Kafka into production as a message broker, user-activity tracking pipeline, metrics gatherer, log-aggregation mechanism, stream-processing device, and much more. Apache Kafka is written in Scala and Java.

Why Kafka?

  • Kafka provides highly available, fault-tolerant message logs. Kafka clusters retain all published records. It is persistent by default: if you don’t set a limit for Kafka, it will keep records until it runs out of disk space. When data loss means awful failure for the product, this is essential for recovery.

  • Multiple Topic Consumers — when configuring the consumers under multiple consumer groups, it helps to reduce the old bottleneck of sending the data to multiple applications for processing. Kafka is distributed, hence it can send information to consumers from various physical machines/service instances. Replicating topics to a secondary cluster is also relatively easy using Apache Kafka’s mirroring feature, MirrorMaker — see an example of mirroring data between two HDInsight clusters. Just remember, if multiple consumers are defined as part of the same group (defined by the group.id), the data will be balanced over all the consumers within the group (see the consumer sketch after this list).

  • Kafka is polyglot — there are clients in C#, Java, C, Python, and more. The ecosystem also provides a REST proxy, which allows easy integration via HTTP and JSON.

  • Real-Time Handling — Kafka can handle real-time data pipelines for real-time messaging in applications.

  • Scalable — thanks to its distributed architecture, Kafka can scale out without incurring any downtime.

  • and more…

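As a companion to the multiple-consumers point above, here is a hedged sketch of a Java consumer joining a group; the broker address, group.id, and topic name are illustrative placeholders. Running several copies of this program with the same group.id spreads the topic’s partitions across them, while a different group.id receives its own full copy of the stream:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Consumers sharing this group.id split the topic's partitions between them.
        props.put("group.id", "analytics-app");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```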

Let’s make integration with Kafka painless

Here are 7 things to know before integrating:

1 — Apache Zookeeper can become a pain point with a Kafka cluster

In the past (versions < 0.8.1), Kafka used Zookeeper to maintain the offsets of each topic and partition. Zookeeper used to take part in the read path, where too-frequent commits and too many consumers led to severe performance and stability issues.

On top of that, with the old Zookeeper-based consumers it is better to commit offsets manually, since careless auto-commits could lead to data loss.

The newer versions of Kafka offer their own offset management, where the consumer can use Kafka itself to store offsets. This means that a specific internal topic (__consumer_offsets) manages the read offsets instead of Zookeeper.

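As a hedged illustration of the manual-commit advice above (broker address, group, and topic names are placeholders), this consumer disables auto-commit and calls commitSync() only after a batch is fully processed, so a crash replays records instead of silently losing them:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "safe-consumer");
        props.put("enable.auto.commit", "false"); // no timer-based commits
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("processing " + record.value()); // stand-in for real work
                }
                // Commit only after the whole batch succeeded; a crash before this
                // line means the batch is re-read, not lost.
                consumer.commitSync();
            }
        }
    }
}
```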

Yet Kafka still needs a cluster with Zookeeper, even in later versions (2.x+). Zookeeper is used to store Kafka configs (reassigning partitions when needed) and to back the Kafka topics API: create topic, add partition, and so on.

The load on Kafka is closely related to the number of consumers, brokers, and partitions, and to how frequently consumers commit offsets.

2 — You shouldn’t send large messages or payloads through Kafka

According to Apache Kafka, for better throughput the max message size should be 10KB. If the messages are larger than this, it is better to consider the alternatives, or to find a way to chop the message into smaller parts before writing to Kafka. The best practice is to use a message key, to make sure all the chopped parts are written to the same partition.

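Here is one possible sketch of that chopping pattern, assuming a fixed-size split and a shared per-message key; the topic name, chunk size, and payload are illustrative, and reassembly on the consumer side is omitted:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ChunkingProducer {
    static final int CHUNK_SIZE = 10 * 1024; // stay near the 10KB guideline

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());

        byte[] payload = "some very large payload ".repeat(5_000)
                .getBytes(StandardCharsets.UTF_8); // stand-in for a big message
        String messageKey = UUID.randomUUID().toString();

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            for (int start = 0; start < payload.length; start += CHUNK_SIZE) {
                byte[] chunk = Arrays.copyOfRange(payload, start,
                        Math.min(start + CHUNK_SIZE, payload.length));
                // Every chunk shares messageKey, so Kafka hashes them all to the
                // same partition and the consumer sees them in order.
                producer.send(new ProducerRecord<>("large-messages", messageKey, chunk));
            }
        }
    }
}
```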

3 — Apache Kafka can’t transform data

Many developers mistakenly think that they can create Kafka parsers or do data transformation inside Kafka. However, Kafka does not transform data. If you are using Azure services, there is a great list of data-factory services you can use to transform the data, like Azure Databricks, HDInsight Spark, and others that connect to Kafka.

Another solution is using Kafka Streams. This is a newer API built on top of Kafka’s producer and consumer clients. It’s significantly more powerful, and also more expressive, than the plain Kafka consumer client.

The KafkaStreams client allows us to perform continuous computation on input coming from one or more input topics and send output to zero, one, or more output topics. Internally, a KafkaStreams instance contains a normal KafkaProducer and KafkaConsumer instance used for reading input and writing output.

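A minimal sketch of such a continuous computation, with placeholder topic names and a trivial uppercase transform standing in for real parsing logic:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseTransform {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read from an input topic, transform each value, write to an output topic.
        KStream<String, String> input = builder.stream("raw-events");
        input.mapValues(value -> value.toUpperCase())
             .to("clean-events");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The mapValues step is where a real application would parse or enrich each record before writing it onward.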

Another option is using Flink; check it out here.

4 — Apache Kafka supports a binary protocol over TCP

The Apache Kafka communication protocol is TCP-based. It doesn’t support MQTT, JMS, or other messaging protocols out of the box. However, many users have written adaptors to read data from those protocols and write to Apache Kafka. For example, kafka-jms-client.

5 — Apache Kafka management / support and the steep learning curve

As of today, there are only a few free UI-based management systems for Apache Kafka, and most of the DevOps engineers I have worked with use scripting tools. However, it can be tedious for a beginner to jump into Apache Kafka’s scripting tools without taking the time for training. The learning curve is steep, and it takes some time to get moving and to integrate into big running systems.

For experienced DevOps engineers/developers, it might take a few months (2+) to fully understand how to integrate, support, and work with Apache Kafka. It is important to learn how Kafka works in order to use its configuration in the way that best suits the system’s needs.

Here’s a list of management tools that you can use almost for free (some are restricted to personal/community use):

  • KafkaTool — GUI application for managing and using Apache Kafka clusters.

  • Confluent Platform — a full enterprise streaming-platform solution.

  • KafDrop — a tool for displaying information such as brokers, topics, and partitions, and it even lets you view messages. It is a lightweight application that runs on Spring Boot and requires very little configuration.

  • Yahoo Kafka Manager — another tool for monitoring Kafka, yet it offers much less than the rest.

Supporting managed Kafka in the cloud

Today almost all clouds support Kafka in some form, from fully managed offerings, to Confluent integrations in the cloud marketplace, down to simply purchasing Kafka machines:

  • Confluent Cloud — Kafka as a Service

  • Azure Event Hubs — fully managed, with a Kafka-compatible endpoint

  • Managed Kafka on HDInsight — Azure

  • Kafka machines on Google Cloud

  • Kafka on AWS using the Confluent solution

  • ... many more

6 — Kafka is not magic — there is still a possibility of data loss

Apache Kafka is probably the most popular tool for distributed asynchronous messaging. This is mainly due to its high throughput, low latency, scalability, and centralized, real-time abilities. Much of this comes from how Kafka distributes data: topics are split into partitions, and partitions are replicated across brokers.

However, with misconfiguration there is a high chance of data loss when machines/processes fail (and they will fail). Therefore, it’s important to understand how Kafka works and what the product/system requirements are.

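A common starting point for reducing that risk is the producer’s acknowledgement and retry settings, paired with topic-level replication. The following is a hedged sketch under those assumptions (topic name and values are illustrative), not a definitive recipe:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Wait until all in-sync replicas acknowledge each write...
        props.put("acks", "all");
        // ...and retry transient failures instead of dropping the record.
        props.put("retries", Integer.toString(Integer.MAX_VALUE));
        props.put("enable.idempotence", "true"); // avoid duplicates from retries

        // Pair this with topic settings such as replication.factor=3 and
        // min.insync.replicas=2, so "all" actually means more than one copy.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("critical-events", "must-not-be-lost"));
        }
    }
}
```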

7 — Kafka’s built-in failure-testing framework, Trogdor

To assist you in finding the right configuration, the Kafka team created Trogdor. Trogdor is a failure testing framework.

How it works

  • Configure Kafka the way you would in production

  • Create a producer that generates messages with sequence 1…X million.

  • Run the producer

  • Run the consumer

  • Create failure by crashing and/or hanging the broker.

  • Test and check that every event produced was consumed (a hand-rolled sketch of this check follows the list).

  • … if that’s not the case, it is better to go back and update the configuration accordingly!

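Trogdor drives real fault injection through its own task specifications; purely as a hand-rolled illustration of the sequence check in the steps above (topic name and message count are placeholders), a verification consumer could look roughly like this:

```java
import java.time.Duration;
import java.util.BitSet;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SequenceChecker {
    static final int EXPECTED = 1_000_000; // producer sent values "1".."1000000"

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "sequence-checker");
        props.put("auto.offset.reset", "earliest"); // read the topic from the start
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        BitSet seen = new BitSet(EXPECTED + 1);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("sequence-test"));
            while (seen.cardinality() < EXPECTED) {
                for (ConsumerRecord<String, String> record :
                        consumer.poll(Duration.ofSeconds(1))) {
                    seen.set(Integer.parseInt(record.value())); // mark sequence number as seen
                }
            }
        }
        // Every value 1..EXPECTED arrived at least once; gaps would mean data loss.
        System.out.println("all " + EXPECTED + " events consumed");
    }
}
```

A real test would also bound the wait with a timeout, since a gap in the sequence means the loop above never finishes.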

On top of that, it is important to remember that Apache Kafka …

  • Is not RPC — Apache Kafka is a messaging system. For RPC, service X needs to be aware of service Y and of the call signature. In Kafka, for example, sending a message doesn’t mean that someone will ever consume it. In RPC there is always a consumer, since the service itself is aware of the consumer Y and creates a call to its signature/function.

  • It is not a Database — it’s not a good place to save messages, since you can’t jump between them or run a search without an expensive full scan.

Just a word about KSQL

An interesting library brought to us by the Confluent community is KSQL. It is built on top of Kafka Streams. KSQL is a completely interactive SQL interface, and you can use it without writing any code. KSQL is available under the Confluent Community License.

TL;DR

Apache Kafka has many benefits, yet before adding it to production one should be aware that:

  • It has a steep learning curve — make time to learn the bits and bytes of Kafka

  • You must manage cluster resources — be aware of dependencies like Zookeeper

  • You can still lose data with Apache Kafka

  • Most clouds provide managed Apache Kafka

  • It won’t transform data

  • It’s not a Database

  • It supports a binary protocol over TCP

  • At the moment, you can’t send large messages using Kafka

  • You should use Trogdor for fault testing of your system

All that being said, Apache Kafka is probably the best tool for messaging and streaming tasks.

Thank you Gwen Shapira for your input and guidance along the way.

If you enjoyed this story, please click the 👏 button. Feel free to leave a comment below.

Follow me here, or here, for more posts about Scala, Kotlin, big data, clean code, and software-engineering nonsense. Cheers!
