A Basic PyFlink Application with Kafka

Runtime Environment
PyFlink requires a specific Python version (3.5, 3.6, or 3.7). Run the following command to make sure your Python version meets the requirement.
$ python -V

PyFlink has been published to PyPI and can be installed directly:
$ python -m pip install apache-flink
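
If you want the Python package to match the Flink 1.11.0 installation used in this article, you can pin the version (an assumption on my part; adjust it to whatever your cluster runs):
$ python -m pip install apache-flink==1.11.0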

Copy these three jar files into FLINK_HOME/lib, as sketched in the commands after the list:
flink-connector-kafka_2.11-1.11.0.jar
flink-sql-connector-kafka_2.11-1.11.0.jar
kafka-clients-2.4.1.jar
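
A minimal sketch of the copy step, assuming the jars have already been downloaded to the current directory and the FLINK_HOME environment variable is set:
$ cp flink-connector-kafka_2.11-1.11.0.jar $FLINK_HOME/lib/
$ cp flink-sql-connector-kafka_2.11-1.11.0.jar $FLINK_HOME/lib/
$ cp kafka-clients-2.4.1.jar $FLINK_HOME/lib/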

The author's environment is Python 3.7.6, Flink 1.11.0, and Kafka 2.11 (broker 1.0.0). The most common problems encountered during this investigation were missing jars and jar conflicts; read the execution log output carefully and add whichever jars the error messages point to.
References:
https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/table/python/installation.html
https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/table/connectors/kafka.html

Running Kafka
Create a topic
bin/kafka-topics.sh --zookeeper 192.168.113.11:2181/kafka --create --replication-factor 1 --partitions 1 --topic flink_test2
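You can confirm the topic exists with the describe command, using the same ZooKeeper address:
bin/kafka-topics.sh --zookeeper 192.168.113.11:2181/kafka --describe --topic flink_test2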
Start a producer to send test data
bin/kafka-console-producer.sh --broker-list 192.168.113.11:9092 --topic flink_test2
The test data format is:
{"id":1,"name":"query Kafka and store to a CSV file"}
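
For example, you might paste lines like these into the producer console (illustrative values; the field names must match the DDL in the code below):
{"id":1,"name":"first record"}
{"id":2,"name":"second record"}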

Start a consumer to check that data is being received
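A console consumer is enough for this check, assuming the same broker and topic as above:
bin/kafka-console-consumer.sh --bootstrap-server 192.168.113.11:9092 --topic flink_test2 --from-beginning
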
Example Code
This application is written with PyFlink using the SQL API.

#!/usr/bin/python3.7
# -*- coding: UTF-8 -*-

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment, TableConfig, DataTypes, CsvTableSink, WriteMode

s_env = StreamExecutionEnvironment.get_execution_environment()
s_env.set_parallelism(1)
# Checkpointing must be enabled (interval in milliseconds); without it no data is written out
s_env.enable_checkpointing(3000)

st_env = StreamTableEnvironment.create(s_env, TableConfig())
st_env.use_catalog("default_catalog")
st_env.use_database("default_database")

# DDL for the Kafka source table; the JSON fields map to the columns by name
sourceKafkaDdl = """
create table sourceKafka(
    id int comment 'sequence number',
    name varchar comment 'name'
) comment 'continuously reads data from kafka'
with(
    'connector' = 'kafka',
    'topic' = 'flink_test2',
    'properties.bootstrap.servers' = '192.168.113.11:9092',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'json'
)
"""
st_env.execute_sql(sourceKafkaDdl)

# Register a CSV sink with the same schema as the source table
fieldNames = ["id", "name"]
fieldTypes = [DataTypes.INT(), DataTypes.STRING()]
csvSink = CsvTableSink(fieldNames, fieldTypes, "/root/tiamaes/result.csv", ",", 1, WriteMode.OVERWRITE)
st_env.register_table_sink("csvTableSink", csvSink)

# Query the Kafka source and write the result into the CSV sink
resultQuery = st_env.sql_query("select * from sourceKafka")
resultQuery.insert_into("csvTableSink")

st_env.execute("pyflink-kafka-v2")

Save the file as pyflink_kafka.py.

Running the Code
Run it in local single-node deployment mode:
python pyflink_kafka.py
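
Alternatively, the job can be submitted through the flink CLI (a sketch, assuming FLINK_HOME is set and a local cluster is running):
$FLINK_HOME/bin/flink run -py pyflink_kafka.py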
Continuously check the contents of result.csv:
tail -f result.csv

When the job runs without errors in the log, you will continuously see the data sent by the Kafka producer appearing in result.csv.
