Code:[DBUtilErrorCode-06], Description:[Failed to execute the database SQL. Please check your column/table/where/querySql configuration, or ask your DBA for help.].
SQL executed: select id,...,create_time from mytabs where (create_time >= '2022-10-31 07:05:00')
Error details: java.sql.SQLException: java.io.IOException: Reached end of input stream after reading 42 of 104 bytes
JDBC connection string: jdbc:clickhouse://xxxx:8123/mydb?socket_timeout=300000
select * from clusterAllReplicas(default,system,query_log) where query like '%create_time from mytabs where (create_time >= \'2022-10-31 07:05:00\')%' order by event_time desc limit 5;
select _shard_num, * from clusterAllReplicas(default, system, query_log) where exception_code > 0 and exception_code not in (13009, 341) and http_user_agent='ClickHouse Java Client' order by event_time desc limit 5 format Vertical;
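We first suspected that the error was related to the socket timeout: the JDBC connection string supplied by the business team, jdbc:clickhouse://xxxx:8123/mydb?socket_timeout=300000, explicitly sets socket_timeout=300000.
For background on socket_timeout, see: JDBC 中 socketTimeout 的作用 - 范兵 - 博客园 (a cnblogs post on the role of socketTimeout in JDBC).
However, the failing statement is a plain select of rows rather than an aggregate or analytical query, and it returns data quickly when run manually, so it does not quite match the situation described there.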
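After talking to the developers, we learned that this is a data synchronization job: a sync tool (a JDBC program) reads data from ClickHouse (ch) and writes it to PostgreSQL (pg), updating rows by primary key.
The select returns close to 30 million rows in total, so JDBC cannot fetch them all at once, and every fetched batch must then be applied to the target database as updates by primary key. We therefore suspected that the target side was processing too slowly, which triggered the socket timeout.
This points to an optimization: change the ClickHouse query to group by the primary key and sync only the latest value of each key to pg. That cuts the synced volume from 30 million rows to about 600,000 and reduces both the sync time and the load on every side.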
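As a rough sketch of this optimization, the helper below builds a ClickHouse query that keeps only the newest row per key using the argMax aggregate function. The helper name build_latest_per_key_query and the columns col_a and col_b are hypothetical placeholders (the real column list is elided in the failing statement). Treating id as the primary key and create_time as the column that defines "latest" are assumptions, not facts from the job's schema.

```python
def build_latest_per_key_query(table, key, version_col, value_cols, since):
    """Build a SELECT that returns one row per key: the values from the row
    with the largest version_col. Illustration only: identifiers are
    interpolated as-is, so do not pass untrusted input."""
    items = [key]
    # argMax(col, version_col) yields col from the row where version_col is largest.
    # Aliases get a latest_ prefix so they cannot shadow the real columns in WHERE.
    items += [f"argMax({c}, {version_col}) AS latest_{c}" for c in value_cols]
    items.append(f"max({version_col}) AS latest_{version_col}")
    select_list = ",\n    ".join(items)
    return (
        f"SELECT\n    {select_list}\n"
        f"FROM {table}\n"
        f"WHERE {version_col} >= '{since}'\n"
        f"GROUP BY {key}"
    )


# col_a and col_b are hypothetical stand-ins for the elided columns.
query = build_latest_per_key_query(
    table="mytabs",
    key="id",
    version_col="create_time",
    value_cols=["col_a", "col_b"],
    since="2022-10-31 07:05:00",
)
print(query)
```

With one output row per key, the sync tool issues one update per key on the pg side instead of one per source row.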
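Between 13:44:23 and 14:30:51 the developers ran the statement again and hit the same error. The debug log retrieved from the Alibaba Cloud backend is as follows: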
2022.11.04 13:44:23.746012 [ 6247 ] {be8a2988-a624-4adb-b220-2794adce016c} executeQuery: (from xxxx:44810, user: xxx) select id,...,create_time from mytabs where (create_time >= '2022-10-31 07:05:00')
2022.11.04 13:44:23.747047 [ 6247 ] {be8a2988-a624-4adb-b220-2794adce016c} mytabs_local (SelectExecutor): Key condition: (column 0 in [1667171100, +inf))
2022.11.04 13:44:23.747061 [ 6247 ] {be8a2988-a624-4adb-b220-2794adce016c} mytabs_local (SelectExecutor): MinMax index condition: unknown
2022.11.04 13:44:23.747388 [ 6247 ] {be8a2988-a624-4adb-b220-2794adce016c} mytabs_local (SelectExecutor): Selected 12 parts by date, 10 parts by key, 776 marks by primary key, 776 marks to read from 10 ranges
2022.11.04 13:44:23.747808 [ 6247 ] {be8a2988-a624-4adb-b220-2794adce016c} mytabs_local (SelectExecutor): Reading approx. 5848758 rows with 15 streams
2022.11.04 14:30:51.968534 [ 6247 ] {be8a2988-a624-4adb-b220-2794adce016c} executeQuery: Code: 210, e.displayText() = DB::NetException: Connection reset by peer, while reading from socket (xxx:3003): while receiving packet from xxx:3003: While executing Remote (version (from xxx:44810) (in query: select id,...,create_time from mytabs where (create_time >= '2022-10-31 07:05:00')), Stack trace (when copying this message, always include the lines below):
2022.11.04 14:30:51.974662 [ 6247 ] {be8a2988-a624-4adb-b220-2794adce016c} DynamicQueryHandler: Code: 210, e.displayText() = DB::NetException: Connection reset by peer, while reading from socket (xxx:3003): while receiving packet from xxx:3003: While executing Remote, Stack trace (when copying this message, always include the lines below):
2022.11.04 14:30:51.974884 [ 6247 ] {be8a2988-a624-4adb-b220-2794adce016c} MemoryTracker: Peak memory usage (for query): 313.65 MiB.
2022.11.04 13:44:23.750101 [ 6454 ] {40fe2280-7c75-48f7-ae20-d9f573179550} executeQuery: (from xxx:62298, initial_query_id: be8a2988-a624-4adb-b220-2794adce016c) SELECT id, xxx, create_time FROM mytabs_local WHERE create_time >= '2022-10-31 07:05:00'
2022.11.04 13:44:23.750946 [ 6454 ] {40fe2280-7c75-48f7-ae20-d9f573179550} mytabs_local (SelectExecutor): Key condition: (column 0 in [1667171100, +inf))
2022.11.04 13:44:23.750961 [ 6454 ] {40fe2280-7c75-48f7-ae20-d9f573179550} mytabs_local (SelectExecutor): MinMax index condition: unknown
2022.11.04 13:44:23.751277 [ 6454 ] {40fe2280-7c75-48f7-ae20-d9f573179550} mytabs_local (SelectExecutor): Selected 15 parts by date, 12 parts by key, 786 marks by primary key, 786 marks to read from 12 ranges
2022.11.04 13:44:23.759369 [ 6454 ] {40fe2280-7c75-48f7-ae20-d9f573179550} mytabs_local (SelectExecutor): Reading approx. 5871717 rows with 15 streams
2022.11.04 13:56:30.605183 [ 6454 ] {40fe2280-7c75-48f7-ae20-d9f573179550} executeQuery: Code: 209, e.displayText() = DB::NetException: Timeout exceeded while writing to socket (xxx:62298) (version (from xxx:62298) (in query: SELECT id, xxx, create_time FROM mytabs_local WHERE create_time >= '2022-10-31 07:05:00'), Stack trace (when copying this message, always include the lines below):
2022.11.04 13:56:56.520048 [ 6454 ] {40fe2280-7c75-48f7-ae20-d9f573179550} MemoryTracker: Peak memory usage (for query): 323.68 MiB.
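To make the order of events in this log easier to see, the sketch below computes how long each query ran before it failed. The timestamps are copied from the log lines above; the helper name elapsed_seconds is made up for this illustration. Reading be8a2988-... as the root (initiator) query and 40fe2280-... as the child (shard) query follows from the initial_query_id field in the log.

```python
from datetime import datetime

# Timestamp layout used by the server log lines above.
LOG_TS_FORMAT = "%Y.%m.%d %H:%M:%S.%f"


def elapsed_seconds(start, end):
    """Seconds between two timestamps written in the server log format."""
    t0 = datetime.strptime(start, LOG_TS_FORMAT)
    t1 = datetime.strptime(end, LOG_TS_FORMAT)
    return (t1 - t0).total_seconds()


# Child query 40fe2280-...: start -> Code: 209 (timeout while writing to socket)
child_elapsed = elapsed_seconds(
    "2022.11.04 13:44:23.750101", "2022.11.04 13:56:30.605183"
)
# Root query be8a2988-...: start -> Code: 210 (connection reset by peer)
root_elapsed = elapsed_seconds(
    "2022.11.04 13:44:23.746012", "2022.11.04 14:30:51.968534"
)

print(f"child failed after {child_elapsed:.1f} s")  # 726.9 s, about 12 minutes
print(f"root failed after {root_elapsed:.1f} s")    # 2788.2 s, about 46 minutes
```

The child query hit its write timeout about 12 minutes in, roughly 34 minutes before the root query reported the connection reset.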
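Recommendation: change the ClickHouse query to group by the primary key and sync only the latest value of each key to pg. This cuts the synced volume from 30 million rows to about 600,000 and reduces both the sync time and the load on every side.
The error in this case did not first occur on the JDBC side: in the log, the earliest failure is Code: 209 (Timeout exceeded while writing to socket) at 13:56:30 on the mytabs_local query, while the query issued by the JDBC client only fails at 14:30:51 with Code: 210. Adjusting the JDBC socket_timeout therefore does not solve the problem.
ClickHouse's send_timeout and receive_timeout apply to any client-server connection, not only to the connection between a distributed table and its local tables, so the same problem can also occur in a single-replica deployment.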
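The mechanism behind a write timeout can be reproduced with plain sockets. The sketch below is a generic illustration and not ClickHouse code: a sender with a send timeout writes to a peer that never reads, which stands in for a consumer that is too busy to drain its connection. The function name, payload size, and 0.2 s timeout are arbitrary choices for the demonstration.

```python
import socket


def send_to_stalled_reader(payload_size=16 * 1024 * 1024, send_timeout=0.2):
    """Return True if the sender hits its timeout because the reader never reads."""
    sender, reader = socket.socketpair()
    try:
        # Plays the role of a server-side send timeout.
        sender.settimeout(send_timeout)
        try:
            # The reader never calls recv(), so the kernel buffers fill up
            # and sendall() blocks until the timeout expires.
            sender.sendall(b"x" * payload_size)
        except socket.timeout:
            return True
        return False
    finally:
        sender.close()
        reader.close()


timed_out = send_to_stalled_reader()
print("sender timed out:", timed_out)
```

The sender is healthy here; it fails only because the other end stopped reading. Raising a timeout on one hop leaves that unchanged while the downstream consumer stays slow.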
Both ends of the connection, the root server and the child server in the figure above, can close the TCP connection.
ClickHouse 源码阅读 —— SQL的前世今生-阿里云开发者社区
ClickHouse源码阅读(0000 1001) —— CK Server对SQL的处理 | 码农家园
ClickHouse源码阅读(0000 1100) —— TCPHandler.cpp中的in和out分析_B_e_a_u_tiful1205的博客-CSDN博客
ClickHouse源码阅读(0000 1011) —— ClickHouse Client端如何接收Server端发送回来的数据_B_e_a_u_tiful1205的博客-CSDN博客