SeaTunnel is a distributed, high-performance, and extensible data synchronization tool that supports syncing data between many data sources, including Hive and StarRocks. You can use SeaTunnel's Hive source connector to read data from Hive, and then use the StarRocks sink connector to write that data into StarRocks.
StarRocks itself can also be read as an external data source: internally, the StarRocks source connector obtains a query plan from the frontend (FE), passes the query plan as a parameter to the BE nodes, and then fetches the result data from the BE nodes.
| Component | Version |
| --- | --- |
| StarRocks | 2.4.2 |
| SeaTunnel | 2.3.1 |
| Spark | 3.2.1 |
| Flink | 1.16.1 |
export version="2.3.1"
wget "https://archive.apache.org/dist/incubator/seatunnel/${version}/apache-seatunnel-incubating-${version}-bin.tar.gz"
tar -xzvf "apache-seatunnel-incubating-${version}-bin.tar.gz"
Link: https://pan.baidu.com/s/1nT0BgUutW66cyiu2C_jqIg
Extraction code: acdy
/bin/bash /app/apache-seatunnel-incubating-2.3.1/bin/install-plugin.sh 2.3.1
vim /app/apache-seatunnel-incubating-2.3.1/config/seatunnel-env.sh
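The key settings in seatunnel-env.sh are the engine home directories. A minimal sketch, assuming the default SPARK_HOME/FLINK_HOME variables shipped with the 2.3.1 package; the paths are examples, adjust them to your environment:
# seatunnel-env.sh: point SeaTunnel at the local engine installations
SPARK_HOME=${SPARK_HOME:-/opt/spark}   # example path, adjust as needed
FLINK_HOME=${FLINK_HOME:-/opt/flink}   # only needed when running on the Flink engine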
# Local mode
/app/apache-seatunnel-incubating-2.3.1/bin/start-seatunnel-spark-3-connector-v2.sh \
-m local[*] \
-e client \
-c /app/apache-seatunnel-incubating-2.3.1/config/seatunnel.streaming.conf.template
# Cluster mode (YARN)
/app/apache-seatunnel-incubating-2.3.1/bin/start-seatunnel-spark-3-connector-v2.sh \
-m yarn \
-e client \
-c /app/apache-seatunnel-incubating-2.3.1/config/seatunnel.streaming.conf.template
source {
Hive {
#parallelism = 6
table_name = "mid.ads_test_hive_starrocks_ds"
metastore_uri = "thrift://192.168.10.200:9083"
result_table_name = "hive_starrocks_ds"
}
}
transform {
Filter {
source_table_name = "fake"
fields = [name]
result_table_name = "fake_name"
}
Filter {
source_table_name = "fake"
fields = [age]
result_table_name = "fake_age"
}
}
sink {
starrocks {
nodeUrls = ["192.168.10.10:8030","192.168.10.11:8030","192.168.10.12:8030"]
base-url = "jdbc:mysql://192.168.10.10:9030/"
username = root
password = "xxxxxxxxx"
database = "example_db"
table = "ads_test_hive_starrocks_ds"
batch_max_rows = 500000
batch_max_bytes = 104857600
batch_interval_ms = 30000
starrocks.config = {
format = "CSV"
column_separator = "\\x01"
row_delimiter = "\\x02"
}
}
}
cat /app/apache-seatunnel-incubating-2.3.1/config/hive_to_sr2.conf
env {
spark.app.name = "apache-seatunnel-2.3.1_hive_to_sr"
spark.yarn.queue = "root.default"
spark.executor.instances = 2
spark.executor.cores = 4
spark.driver.memory = "3g"
spark.executor.memory = "4g"
spark.ui.port = 1300
spark.sql.catalogImplementation = "hive"
spark.hadoop.hive.exec.dynamic.partition = "true"
spark.hadoop.hive.exec.dynamic.partition.mode = "nonstrict"
spark.network.timeout = "1200s"
spark.sql.sources.partitionOverwriteMode = "dynamic"
spark.yarn.executor.memoryOverhead = 800m
spark.kryoserializer.buffer.max = 512m
spark.task.maxFailures = 0
spark.executor.extraJavaOptions = "-Dfile.encoding=UTF-8"
spark.driver.extraJavaOptions = "-Dfile.encoding=UTF-8"
job.name = "apache-seatunnel-2.3.1_hive_to_sr"
}
source {
Hive {
#parallelism = 6
table_name = "mid.ads_test_hive_starrocks_ds"
metastore_uri = "thrift://192.168.10.200:9083"
result_table_name = "hive_starrocks_ds_t1"
}
}
transform {
sql {
query ="select xxx,xxx,xxx,xxx from hive_starrocks_ds_t1 where period_sdate >= '2022-10-31'"
source_table_name = "hive_starrocks_ds_t1"
result_table_name = "hive_starrocks_ds_t2"
}
}
sink {
starrocks {
nodeUrls = ["192.168.10.10:8030","192.168.10.11:8030","192.168.10.12:8030"]
base-url = "jdbc:mysql://192.168.10.10:9030/"
username = root
password = "xxxxxxxxx"
database = "example_db"
table = "ads_test_hive_starrocks_ds"
batch_max_rows = 500000
batch_max_bytes = 104857600
batch_interval_ms = 30000
starrocks.config = {
format = "CSV"
column_separator = "\\x01"
row_delimiter = "\\x02"
}
}
}
sudo -u hive /app/apache-seatunnel-incubating-2.3.1/bin/start-seatunnel-spark-3-connector-v2.sh \
-m yarn \
-e client \
-c /app/apache-seatunnel-incubating-2.3.1/config/hive_to_sr2.conf
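If the cluster runs Spark 2 rather than Spark 3, the same config can be submitted with the Spark 2 launcher: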
sudo -u hive /app/apache-seatunnel-incubating-2.3.1/bin/start-seatunnel-spark-2-connector-v2.sh \
-m yarn \
-e client \
-c /app/apache-seatunnel-incubating-2.3.1/config/hive_to_sr2.conf
Set JVM options to fix garbled Chinese characters: add the following parameters to the env block. They set the JVM's file encoding to UTF-8 so that Chinese characters are handled correctly.
spark.executor.extraJavaOptions = "-Dfile.encoding=UTF-8"
spark.driver.extraJavaOptions = "-Dfile.encoding=UTF-8"
When the Spark job ran on the YARN cluster, an executor was lost because it exceeded its memory limit. Consider increasing spark.yarn.executor.memoryOverhead, which specifies the amount of off-heap memory reserved per executor for internal metadata, user data structures, and other off-heap needs. This value is added to the executor memory to form the full per-executor memory request to YARN. Avoid setting it too high, since that can lead to excessive garbage-collection overhead and reduced performance.
spark.yarn.executor.memoryOverhead = 800M
The error message indicates a problem while flushing data to StarRocks: the number of running transactions on db 2153532 reached 100, exceeding the limit of 100. You can reduce the number of concurrent transactions to ease the load on the cluster (see the sink sketch after the SQL below), or adjust the related FE parameter.
-- Raise the per-database limit on concurrent transactions
ADMIN SET FRONTEND CONFIG ("max_running_txn_num_per_db" = "300");
-- Check whether the parameter has taken effect
ADMIN SHOW FRONTEND CONFIG LIKE '%max_running_txn_num_per_db%';
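On the SeaTunnel side, the sink's batch thresholds (already used in the configs above) control how often a load transaction is opened, so enlarging them reduces the number of concurrent transactions. A sketch with illustrative values only:
sink {
  starrocks {
    # fewer, larger flushes => fewer concurrent load transactions per database
    batch_max_rows    = 1000000      # flush after this many rows
    batch_max_bytes   = 209715200    # or after ~200 MB of data
    batch_interval_ms = 60000        # or every 60 s, whichever comes first
  }
}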
At first this problem was puzzling. I suspected that a retry configured in SeaTunnel was inflating the row count on the StarRocks side; the official docs describe a retry parameter, max_retries, so I set it to 0, but the problem remained after rerunning, even though the application submitted to YARN finished successfully.
Only after seeing 3 Failed Tasks in the Spark UI did it become clear: a few Spark tasks had failed due to insufficient memory and were retried, so the sink loaded more rows than expected.
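For reference, the retry option mentioned above is a setting of the StarRocks sink block; a sketch of where it goes, assuming the max_retries option name from the connector docs:
sink {
  starrocks {
    # disable the sink connector's own flush retries;
    # in this case it did not help, because the duplicates came from Spark task retries
    max_retries = 0
  }
}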
Solutions:
1. Use a StarRocks Primary Key table, whose deduplication guarantees row uniqueness (as of the latest official release, 2.3.1, the StarRocks sink does not yet support exactly-once semantics to make writes idempotent); see the DDL sketch after this list.
2. Set Spark's task retry parameter to 0; with this setting, when a task fails, Spark does not attempt to re-execute it.
spark.task.maxFailures = 0
3. To confirm that spark.task.maxFailures = 0 actually takes effect, check the Spark job's log files. When a Spark task fails, you should find a line similar to the following:
23/05/12 15:17:55 ERROR scheduler.TaskSetManager: Task 165 in stage 1.0 failed 0 times; aborting job
The task failed once but was not retried, as the error message reads Task 165 in stage 1.0 failed 0 times. This is because spark.task.maxFailures is set to 0, which means a failed task is not re-attempted.
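For solution 1, create the target table with StarRocks' Primary Key model so that rows loaded more than once overwrite each other instead of piling up. A minimal DDL sketch; the column names, key choice, and bucket count below are illustrative, not the actual schema of ads_test_hive_starrocks_ds:
-- illustrative Primary Key table: repeated loads of the same id upsert rather than append
CREATE TABLE example_db.ads_test_hive_starrocks_ds_pk (
    id            BIGINT        NOT NULL COMMENT "unique business key",
    period_sdate  DATE          NOT NULL,
    metric_value  DECIMAL(18, 2)
)
PRIMARY KEY (id)
DISTRIBUTED BY HASH(id) BUCKETS 8
PROPERTIES ("replication_num" = "3");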