Getting Started with Kafka Streams (Part 1)

Recently at work I have frequently needed to aggregate large volumes of heartbeat data and then, based on those aggregations, derive various kinds of intermediate data sets for later processing.
The original approach was to write a number of jobs myself and coordinate their progress with ZooKeeper and similar tools. The problem with that model is that it takes a lot of code to guarantee data is not consumed twice,
and I still suspect that some data can be lost when the program hits an exception.

So I started looking for a mainstream stream-processing solution from the industry, one that also supports batch operations over historical data.

I compared Spark, Storm, and Kafka Streams. To be clear, I have no hands-on big-data experience; my impression is that the first two are much more mature, while Kafka Streams is newer and has relatively little material available.
However, the first two are full frameworks. Taking Spark as an example, you generally have to deploy a dedicated Spark cluster (unless your organization already has one you can use), which we do not, and the hardware requirements for setting one up are high.
Kafka Streams, by contrast, is just a library whose only dependency is Kafka, so its hardware footprint is small. I decided to study it first and, if it proves workable, put it into production.
Below is the simplest getting-started example I found.

I will flesh this out in later posts.

package io.confluent.examples.streams;

import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;

import java.util.Arrays;
import java.util.Properties;
import java.util.regex.Pattern;

/**
 * Demonstrates, using the high-level KStream DSL, how to implement the WordCount program that
 * computes a simple word occurrence histogram from an input text. This example uses lambda
 * expressions and thus works with Java 8+ only.
 *
 * In this example, the input stream reads from a topic named "TextLinesTopic", where the values of
 * messages represent lines of text; and the histogram output is written to topic
 * "WordsWithCountsTopic", where each record is an updated count of a single word,
 * i.e. {@code word (String) -> currentCount (Long)}.
 *
 * Note: Before running this example you must 1) create the source topic (e.g. via {@code kafka-topics --create ...}),
 * then 2) start this example and 3) write some data to the source topic (e.g. via {@code kafka-console-producer}).
 * Otherwise you won't see any data arriving in the output topic.
 *
 * HOW TO RUN THIS EXAMPLE
 *
 * 1) Start Zookeeper and Kafka. Please refer to the QuickStart.
 *
 * 2) Create the input and output topics used by this example.
 * {@code
 * $ bin/kafka-topics --create --topic TextLinesTopic \
 *                    --zookeeper localhost:2181 --partitions 1 --replication-factor 1
 * $ bin/kafka-topics --create --topic WordsWithCountsTopic \
 *                    --zookeeper localhost:2181 --partitions 1 --replication-factor 1
 * }
 * Note: The above commands are for the Confluent Platform. For Apache Kafka it should be {@code bin/kafka-topics.sh ...}.
 *
 * 3) Start this example application either in your IDE or on the command line.
 *
 * If via the command line please refer to Packaging. Once packaged you can then run:
 * {@code
 * $ java -cp target/streams-examples-3.3.0-standalone.jar io.confluent.examples.streams.WordCountLambdaExample
 * }
 * 4) Write some input data to the source topic "TextLinesTopic" (e.g. via {@code kafka-console-producer}).
 * The already running example application (step 3) will automatically process this input data and write the
 * results to the output topic "WordsWithCountsTopic".
 * {@code
 * # Start the console producer. You can then enter input data by writing some line of text, followed by ENTER:
 * #
 * #   hello kafka streams
 * #   all streams lead to kafka
 * #   join kafka summit
 * #
 * # Every line you enter will become the value of a single Kafka message.
 * $ bin/kafka-console-producer --broker-list localhost:9092 --topic TextLinesTopic
 * }
 * 5) Inspect the resulting data in the output topic, e.g. via {@code kafka-console-consumer}.
 * {@code
 * $ bin/kafka-console-consumer --topic WordsWithCountsTopic --from-beginning \
 *                              --new-consumer --bootstrap-server localhost:9092 \
 *                              --property print.key=true \
 *                              --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer
 * }
 * You should see output data similar to below. Please note that the exact output
 * sequence will depend on how fast you type the above sentences. If you type them
 * slowly, you are likely to get each count update, e.g., kafka 1, kafka 2, kafka 3.
 * If you type them quickly, you are likely to get fewer count updates, e.g., just kafka 3.
 * This is because the commit interval is set to 10 seconds. Anything typed within
 * that interval will be compacted in memory.
 * {@code
 * hello    1
 * kafka    1
 * streams  1
 * all      1
 * streams  2
 * lead     1
 * to       1
 * join     1
 * kafka    3
 * summit   1
 * }
 * 6) Once you're done with your experiments, you can stop this example via {@code Ctrl-C}. If needed,
 * also stop the Kafka broker ({@code Ctrl-C}), and only then stop the ZooKeeper instance ({@code Ctrl-C}).
 */
public class WordCountLambdaExample {

  public static void main(final String[] args) throws Exception {
    final String bootstrapServers = args.length > 0 ? args[0] : "localhost:9092";
    final Properties streamsConfiguration = new Properties();
    // Give the Streams application a unique name. The name must be unique in the Kafka cluster
    // against which the application is run.
    streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-lambda-example");
    streamsConfiguration.put(StreamsConfig.CLIENT_ID_CONFIG, "wordcount-lambda-example-client");
    // Where to find Kafka broker(s).
    streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    // Specify default (de)serializers for record keys and for record values.
    streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    // Records should be flushed every 10 seconds. This is less than the default
    // in order to keep this example interactive.
    streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10 * 1000);
    // For illustrative purposes we disable record caches.
    streamsConfiguration.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);

    // Set up serializers and deserializers, which we will use for overriding the default serdes
    // specified above.
    final Serde<String> stringSerde = Serdes.String();
    final Serde<Long> longSerde = Serdes.Long();

    // In the subsequent lines we define the processing topology of the Streams application.
    final KStreamBuilder builder = new KStreamBuilder();

    // Construct a `KStream` from the input topic "TextLinesTopic", where message values
    // represent lines of text (for the sake of this example, we ignore whatever may be stored
    // in the message keys).
    //
    // Note: We could also just call `builder.stream("TextLinesTopic")` if we wanted to leverage
    // the default serdes specified in the Streams configuration above, because these defaults
    // match what's in the actual topic. However we explicitly set the deserializers in the
    // call to `stream()` below in order to show how that's done, too.
    final KStream<String, String> textLines = builder.stream(stringSerde, stringSerde, "TextLinesTopic");

    final Pattern pattern = Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);

    final KTable<String, Long> wordCounts = textLines
        // Split each text line, by whitespace, into words. The text lines are the record
        // values, i.e. we can ignore whatever data is in the record keys and thus invoke
        // `flatMapValues()` instead of the more generic `flatMap()`.
        .flatMapValues(value -> Arrays.asList(pattern.split(value.toLowerCase())))
        // Count the occurrences of each word (record key).
        //
        // This will change the stream type from `KStream<String, String>` to `KTable<String, Long>`
        // (word -> count). In the `count` operation we must provide a name for the resulting KTable,
        // which will be used to name e.g. its associated state store and changelog topic.
        //
        // Note: no need to specify explicit serdes because the resulting key and value types
        // match our default serde settings.
        .groupBy((key, word) -> word)
        .count("Counts");

    // Write the `KTable<String, Long>` to the output topic.
    wordCounts.to(stringSerde, longSerde, "WordsWithCountsTopic");

    // Now that we have finished the definition of the processing topology we can actually run
    // it via `start()`. The Streams application as a whole can be launched just like any
    // normal Java application that has a `main()` method.
    final KafkaStreams streams = new KafkaStreams(builder, streamsConfiguration);

    // Always (and unconditionally) clean local state prior to starting the processing topology.
    // We opt for this unconditional call here because this will make it easier for you to play around with the example
    // when resetting the application for doing a re-run (via the Application Reset Tool,
    // http://docs.confluent.io/current/streams/developer-guide.html#application-reset-tool).
    //
    // The drawback of cleaning up local state prior is that your app must rebuild its local state from scratch, which
    // will take time and will require reading all the state-relevant data from the Kafka cluster over the network.
    // Thus in a production scenario you typically do not want to clean up always as we do here but rather only when it
    // is truly needed, i.e., only under certain conditions (e.g., the presence of a command line flag for your app).
    // See `ApplicationResetExample.java` for a production-like example.
    streams.cleanUp();
    streams.start();

    // Add shutdown hook to respond to SIGTERM and gracefully close Kafka Streams.
    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
  }

}

The logic here is essentially to keep reading the lines of text that users write into the Kafka input topic TextLinesTopic as a stream:

 final KStream<String, String> textLines = builder.stream(stringSerde, stringSerde, "TextLinesTopic");

Each input line is then split into individual words with a regular expression and flatMapped into a stream of words:

flatMapValues(value -> Arrays.asList(pattern.split(value.toLowerCase())))
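
As a standalone illustration (not part of the example itself), here is roughly how that split step behaves on one of the sample lines from the Javadoc above: "\W+" breaks the line on any run of non-word characters, so one input line becomes one list of lower-cased words.

import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitIllustration {
  public static void main(String[] args) {
    // Same pattern as in the example: split on runs of non-word characters.
    final Pattern pattern = Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);
    System.out.println(Arrays.asList(pattern.split("all streams lead to kafka".toLowerCase())));
    // prints: [all, streams, lead, to, kafka]
  }
}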

Next, the records are grouped by word:

groupBy((key, word) -> word)

Finally, count materializes the grouped KStream into a KTable, which can later be queried, e.g. via KSQL.
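
As a side note not covered by the original example: because count("Counts") names its state store "Counts", the word totals can also be read directly from the running application through Kafka Streams interactive queries, without KSQL. A minimal sketch, assuming it is placed after streams.start() in the example above and that the store has finished initializing (real code would retry on InvalidStateStoreException while the store warms up):

import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

    // Look up the live count for a single word from the "Counts" state store.
    // `streams` is the already-started KafkaStreams instance from the example.
    final ReadOnlyKeyValueStore<String, Long> counts =
        streams.store("Counts", QueryableStoreTypes.<String, Long>keyValueStore());
    final Long kafkaCount = counts.get("kafka"); // null if "kafka" has not been seen yet
    System.out.println("kafka -> " + kafkaCount);
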
Remember: the streams instance only starts doing work once it has been started:

streams.start();
