seatunnel is an easy-to-use, high-performance product for massive data processing that supports both real-time streaming and offline batch modes. Built on top of Apache Spark and Apache Flink, it stably and efficiently synchronizes tens of billions of records per day and is already running in production at nearly one hundred companies.
Ready-to-run release packages: https://github.com/InterestingLab/seatunnel/releases
Quick start: https://interestinglab.github.io/seatunnel-docs/#/zh-cn/v1/quick-start
Detailed seatunnel documentation
seatunnel does its best to solve the problems you may run into when synchronizing massive amounts of data:

Input[data source] -> Filter[data processing] -> Output[result sink]

Multiple Filters chained together form a data-processing Pipeline that can satisfy a wide range of processing needs. If you are familiar with SQL, you can also build the Pipeline directly in SQL, which is simple and efficient. The list of Filters seatunnel supports is still growing, and you can develop your own data-processing plugins as well; the whole system is easy to extend.
Input plugins: Fake, File, Hdfs, Kafka, S3, Socket, plus self-developed Input plugins
Filter plugins: Add, Checksum, Convert, Date, Drop, Grok, Json, Kv, Lowercase, Remove, Rename, Repartition, Replace, Sample, Split, Sql, Table, Truncate, Uppercase, Uuid, plus self-developed Filter plugins
Output plugins: Elasticsearch, File, Hdfs, Jdbc, Kafka, Mysql, S3, Stdout, plus self-developed Output plugins
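As a sketch of how these plugins compose into a pipeline, the following minimal config wires a Fake input through the Sql filter to a Stdout output. The parameter names and column names here are illustrative assumptions, not verified against each plugin's documentation; check the plugin docs for the exact options:

```
# hypothetical minimal pipeline: fake data -> SQL transform -> console output
# (parameter and column names below are assumptions for illustration)
input {
  fake {
    result_table_name = "user_source"
  }
}
filter {
  sql {
    sql = "select name, age from user_source where age > 18"
  }
}
output {
  stdout {}
}
```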
If your data volume is small, or you only want to verify functionality, you can start in local mode without any cluster environment; seatunnel supports single-machine operation. Note: seatunnel 2.0 runs on both Spark and Flink.
wget https://github.com/InterestingLab/seatunnel/releases/download/v<version>/seatunnel-<version>.zip -O seatunnel-<version>.zip
unzip seatunnel-<version>.zip
ln -s seatunnel-<version> seatunnel
spark {
  spark.sql.catalogImplementation = "hive"
  spark.app.name = "hive2clickhouse"
  spark.executor.instances = 30
  spark.executor.cores = 1
  spark.executor.memory = "2g"
  spark.ui.port = 13000
}

input {
  hive {
    pre_sql = "select id,name,create_time from table"
    table_name = "table_tmp"
  }
}

filter {
  convert {
    source_field = "data_source"
    new_type = "UInt8"
  }
  org.interestinglab.waterdrop.filter.Slice {
    source_table_name = "table_tmp"
    source_field = "id"
    slice_num = 2
    slice_code = 0
    result_table_name = "table_8123"
  }
  org.interestinglab.waterdrop.filter.Slice {
    source_table_name = "table_tmp"
    source_field = "id"
    slice_num = 2
    slice_code = 1
    result_table_name = "table_8124"
  }
}

output {
  clickhouse {
    source_table_name = "table_8123"
    host = "ip1:8123"
    database = "db_name"
    username = "username"
    password = "pwd"
    table = "model_score_local"
    fields = ["id","name","create_time"]
    clickhouse.socket_timeout = 50000
    retry_codes = [209, 210]
    retry = 3
    bulk_size = 500000
  }
  clickhouse {
    source_table_name = "table_8124"
    host = "ip2:8123"
    database = "db_name"
    username = "username"
    password = "pwd"
    table = "model_score_local"
    fields = ["id","name","create_time"]
    clickhouse.socket_timeout = 50000
    retry_codes = [209, 210]
    retry = 3
    bulk_size = 500000
  }
}
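The two Slice filters above appear to shard rows on the `id` field into slice_num = 2 disjoint buckets, with slice_code selecting the bucket, so each ClickHouse node receives its own half of the data. Scaling out to a third node would follow the same pattern; as a sketch reusing the parameter names above (assuming this is how Slice partitions, and with the matching third clickhouse output block omitted):

```
# hypothetical third shard: raise slice_num to 3 on every Slice filter,
# add this filter, and add a matching clickhouse output for "table_8125"
org.interestinglab.waterdrop.filter.Slice {
  source_table_name = "table_tmp"
  source_field = "id"
  slice_num = 3
  slice_code = 2
  result_table_name = "table_8125"
}
```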
../bin/start-waterdrop.sh --master local[4] --deploy-mode client --config ./config.conf
More case studies: https://interestinglab.github.io/seatunnel-docs/#/zh-cn/v1/case_study/