Flink Python: reading from CSV and writing to CSV

Environment

  • CentOS 6.5
  • Python 2.7
  • CDH 5.15
  • Flink 1.9

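Getting the pyflink library

The pyflink library ships with Flink, under opt/python in the Flink installation directory. Copy the two archives to a working directory and unpack them there: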

$ cd /usr/local/flink/opt/python
$ cp pyflink.zip py4j-0.10.8.1-src.zip /opt/test
$ cd /opt/test
$ unzip pyflink.zip
$ unzip py4j-0.10.8.1-src.zip
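
The same unpacking step can be sketched in Python with the standard zipfile module, which may help when the unzip command is not installed. The helper name extract_archives is hypothetical, and the paths in the usage comment simply mirror the shell commands above.

```python
import zipfile


def extract_archives(archive_paths, dest_dir):
    """Extract every zip archive into dest_dir.

    Returns the sorted top-level names found in the archives, so the
    caller can see which packages were unpacked.
    """
    top_level = set()
    for path in archive_paths:
        with zipfile.ZipFile(path) as archive:
            archive.extractall(dest_dir)
            for name in archive.namelist():
                top_level.add(name.split("/")[0])
    return sorted(top_level)


# Usage (paths taken from the shell commands above):
# extract_archives(["/opt/test/pyflink.zip",
#                   "/opt/test/py4j-0.10.8.1-src.zip"], "/opt/test")
```

The returned list is only a convenience for checking what ended up in the working directory.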

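Program structure

  • Create a TableEnvironment, choosing the planner and the mode (batch or streaming)
  • Register the input table
  • Register the output table
  • Create a table from a Table API or SQL query
  • Emit the query result to a TableSink
  • Execute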

Flink streaming query

Prepare the source data

$ vi streaming.csv
1, 'hi', 'hello'
2, 'hi', 'hello'

Streaming query

$ vi test.py
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment, EnvironmentSettings, DataTypes, CsvTableSource, CsvTableSink

# Create a TableEnvironment
flink_stream_env = StreamExecutionEnvironment.get_execution_environment()
flink_stream_settings = EnvironmentSettings.new_instance().use_old_planner().in_streaming_mode().build()
flink_stream_table_env = StreamTableEnvironment.create(flink_stream_env, environment_settings=flink_stream_settings)

# define the field names and types
field_names = ["a", "b", "c"]
field_types = [DataTypes.INT(), DataTypes.STRING(), DataTypes.STRING()]

# create a TableSource
csv_source = CsvTableSource("streaming.csv", field_names, field_types)

# Register a TableSource
flink_stream_table_env.register_table_source("csvTable", csv_source)

# create a TableSink
csv_sink = CsvTableSink(field_names, field_types, "result.csv", "|", 1, "WriteMode.OVERWRITE")

# register the TableSink
flink_stream_table_env.register_table_sink("CsvSinkTable", csv_sink)

# query the source table: add 1 to column a, pass b and c through unchanged
result = flink_stream_table_env.sql_query("select a+1,b,c from csvTable")

# emit the result Table to the registered TableSink
result.insert_into("CsvSinkTable")

# execute
flink_stream_table_env.execute("python_job")

Run the Python script

$ python test.py

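Running it raises an error

Exception: Unsupported write_mode: WriteMode.OVERWRITE

test.py passes the string "WriteMode.OVERWRITE", but CsvTableSink compares write_mode against the WriteMode constants, so neither branch matches. The workaround used here is to patch the library source so that it compares against the strings instead.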

$ vi pyflink/table/sinks.py
# Locate class CsvTableSink(TableSink):
        if write_mode == WriteMode.NO_OVERWRITE:
            j_write_mode = gateway.jvm.org.apache.flink.core.fs.FileSystem.WriteMode.NO_OVERWRITE
        elif write_mode == WriteMode.OVERWRITE:
            j_write_mode = gateway.jvm.org.apache.flink.core.fs.FileSystem.WriteMode.OVERWRITE

# Wrap WriteMode.NO_OVERWRITE and WriteMode.OVERWRITE in double quotes
        if write_mode == "WriteMode.NO_OVERWRITE":
            j_write_mode = gateway.jvm.org.apache.flink.core.fs.FileSystem.WriteMode.NO_OVERWRITE
        elif write_mode == "WriteMode.OVERWRITE":
            j_write_mode = gateway.jvm.org.apache.flink.core.fs.FileSystem.WriteMode.OVERWRITE
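
The mismatch behind the error can be reproduced without Flink. The sketch below uses a stand-in WriteMode class whose integer values are an assumption (they are not taken from this article), and a hypothetical resolve_write_mode function that mimics the if/elif check quoted above.

```python
class WriteMode(object):
    # Stand-in for pyflink's WriteMode; the integer values are assumed.
    NO_OVERWRITE = 0
    OVERWRITE = 1


def resolve_write_mode(write_mode):
    """Hypothetical helper mimicking the unpatched check in CsvTableSink."""
    if write_mode == WriteMode.NO_OVERWRITE:
        return "NO_OVERWRITE"
    elif write_mode == WriteMode.OVERWRITE:
        return "OVERWRITE"
    raise Exception("Unsupported write_mode: %s" % write_mode)


# Passing the quoted name fails, because a string never equals the constant.
try:
    resolve_write_mode("WriteMode.OVERWRITE")
except Exception as exc:
    print(exc)  # Unsupported write_mode: WriteMode.OVERWRITE

# Passing the constant itself matches the second branch.
print(resolve_write_mode(WriteMode.OVERWRITE))  # OVERWRITE
```

If the installed pyflink exposes WriteMode for import (worth verifying against your copy of pyflink/table/sinks.py), passing WriteMode.OVERWRITE without quotes in test.py is probably a cleaner fix than editing the library, since a patched sinks.py is lost on every upgrade.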

Check the result

$ less result.csv
2| 'hi'| 'hello'
3| 'hi'| 'hello'
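
The leading spaces and single quotes in the output come straight from streaming.csv: the fields are split on commas only, so everything after each comma is kept verbatim. As a cross-check that does not involve Flink, the following sketch applies the same transformation (a+1, b, c, joined with "|") in plain Python. The function name expected_rows is hypothetical.

```python
def expected_rows(lines, delimiter="|"):
    """Apply 'select a+1, b, c' to raw CSV lines and join with delimiter."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        a, b, c = line.split(",")
        # Only the first column is numeric; b and c keep their spaces and quotes.
        rows.append(delimiter.join([str(int(a) + 1), b, c]))
    return rows


source = ["1, 'hi', 'hello'", "2, 'hi', 'hello'"]
for row in expected_rows(source):
    print(row)
# 2| 'hi'| 'hello'
# 3| 'hi'| 'hello'
```

Removing the spaces and quotes from streaming.csv would give cleaner values in result.csv.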

Reference:
Concepts & Common API
