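In some of our 12580 projects I introduced haproxy as a software load balancer, mainly for non-critical services. The architecture is simple: two web front ends go through haproxy to reach two back-end MySQL servers (slaves). The databases are used mainly for search; no data is written to them.
Ever since it went live, developers have occasionally reported that opening a web page to query data sometimes fails with the following error: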
HTTP Status 500 - Request processing failed; nested exception is org.springframework.transaction.CannotCreateTransactionException: Could not open JDBC Connection for transaction; nested exception is com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:
type Exception report
message Request processing failed; nested exception is org.springframework.transaction.CannotCreateTransactionException: Could not open JDBC Connection for transaction; nested exception is com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:
description The server encountered an internal error that prevented it from fulfilling this request.
exception
org.springframework.web.util.NestedServletException: Request processing failed; nested exception is org.springframework.transaction.CannotCreateTransactionException: Could not open JDBC Connection for transaction; nested exception is com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:
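At first I suspected the Spring components on the front-end web servers. Then I bypassed haproxy and connected to MySQL directly, and there was no problem. So the error only occurs when going through haproxy.
Below is the haproxy configuration, which is really very simple: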
frontend acdb_ms3306
    bind 192.100.2.247:3306
    mode tcp
    option tcplog
    maxconn 20000
    default_backend backacdb_ms3306

backend backacdb_ms3306
    mode tcp
    balance roundrobin
    server web01 172.200.2.3:3306 check inter 2000 fall 3
    server web02 172.200.2.5:3306 check inter 2000 fall 3
I was baffled, so all I could do was look at the logs. The entries all looked roughly like this:
192.100.2.43:50597 [30/Jun/2015:21:57:40.094] acdb_ms3306 backacdb_ms3306/web01 1/1/83378 1350324 cD 2/2/2/1/0 0/0
192.100.2.41:56074 [30/Jun/2015:21:57:40.118] acdb_ms3306 backacdb_ms3306/web02 1/1/69986 10387 cD 3/3/3/1/0 0/0
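Nothing stood out. I then googled around a lot and found an article saying that after changing the timeout client parameter, a flag in the haproxy log changed. Looking back at my own logs, they matched his: the entries above carry a cD flag, but I did not know what it meant. I searched the haproxy configuration documentation, which explains it as follows: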
cD The client did not send nor acknowledge any data for as long as the
"timeout client" delay. This is often caused by network failures on
the client side, or the client simply leaving the net uncleanly.
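To make the cD flag easier to spot, here is a minimal Python sketch that splits one of the log lines above into its fields. It assumes the line has the shape shown in this post (the "option tcplog" format with no syslog prefix in front). The names TCPLOG_RE, CAUSES, PHASES and parse_tcplog are hypothetical helpers, and the two small lookup tables only cover the codes that appear in this post; the haproxy documentation lists many more.

```python
import re

# Field layout of an "option tcplog" line as shown in this post:
# client [accept_date] frontend backend/server Tw/Tc/Tt bytes state conns queues
TCPLOG_RE = re.compile(
    r"(?P<client>\S+) \[(?P<accept_date>[^\]]+)\] "
    r"(?P<frontend>\S+) (?P<backend>[^/\s]+)/(?P<server>\S+) "
    r"(?P<tw>-?\d+)/(?P<tc>-?\d+)/(?P<tt>\+?\d+) "
    r"(?P<bytes_read>\+?\d+) (?P<state>\S\S) "
    r"(?P<conns>\d+/\d+/\d+/\d+/\+?\d+) (?P<queues>\d+/\d+)"
)

# Assumption: only the codes seen in this post are listed here.
CAUSES = {"c": "client-side timeout expired", "s": "server-side timeout expired"}
PHASES = {"D": "during the data phase"}


def parse_tcplog(line):
    """Return the fields of one tcplog line as a dict, or None if it does not match."""
    m = TCPLOG_RE.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["tt_ms"] = int(rec["tt"])  # total session duration in milliseconds
    state = rec["state"]
    if state == "--":
        rec["meaning"] = "session ended normally"
    else:
        cause = CAUSES.get(state[0], "unknown cause")
        phase = PHASES.get(state[1], "in an unknown phase")
        rec["meaning"] = cause + " " + phase
    return rec


if __name__ == "__main__":
    sample = ("192.100.2.43:50597 [30/Jun/2015:21:57:40.094] acdb_ms3306 "
              "backacdb_ms3306/web01 1/1/83378 1350324 cD 2/2/2/1/0 0/0")
    rec = parse_tcplog(sample)
    print(rec["server"], rec["state"], rec["tt_ms"], rec["meaning"])
```

The termination state is the two-character field right after the byte count: the first character says what ended the session, the second says which phase the session was in at that moment.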
So let's set it a bit longer. The official explanation is as follows:
timeout client <timeout>
Set the maximum inactivity time on the client side.
May be used in sections : defaults | frontend | listen | backend
                             yes   |    yes   |   yes  |   no
Arguments :
<timeout> is the timeout value specified in milliseconds by default, but
can be in any other unit if the number is suffixed by the unit,
as explained at the top of this document.
Note that the default unit is milliseconds.
I changed it in the defaults section:
timeout client 90s
Note:
The default configuration has timeout client 1m; the m suffix means minutes, so that is 1 minute.
The official explanation of the units:
- us : microseconds. 1 microsecond = 1/1000000 second
- ms : milliseconds. 1 millisecond = 1/1000 second. This is the default.
- s : seconds. 1s = 1000ms
- m : minutes. 1m = 60s = 60000ms
- h : hours. 1h = 60m = 3600s = 3600000ms
- d : days. 1d = 24h = 1440m = 86400s = 86400000ms
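As a quick sanity check on the units, the sketch below converts a time value written with one of these suffixes into milliseconds, using exactly the factors from the table above. It is an illustration in Python, not haproxy's own parser; the function name to_milliseconds is hypothetical, and it only accepts a plain non-negative integer followed by an optional suffix.

```python
import re

# Microseconds per unit, taken from the unit table in the haproxy documentation.
_US_PER_UNIT = {
    "us": 1,
    "ms": 1_000,
    "s": 1_000_000,
    "m": 60 * 1_000_000,
    "h": 3_600 * 1_000_000,
    "d": 86_400 * 1_000_000,
}

_VALUE_RE = re.compile(r"(\d+)(us|ms|s|m|h|d)?")


def to_milliseconds(value):
    """Convert a value such as '90s', '1m' or '2000' to milliseconds."""
    m = _VALUE_RE.fullmatch(value.strip())
    if m is None:
        raise ValueError("unsupported time value: %r" % value)
    number, unit = m.groups()
    unit = unit or "ms"  # no suffix means milliseconds, the documented default
    return int(number) * _US_PER_UNIT[unit] / 1000


if __name__ == "__main__":
    for v in ("1m", "90s", "2000"):
        print(v, "=", to_milliseconds(v), "ms")
```

Under this reading, 1m is 60000 ms, 90s is 90000 ms, and a bare number such as the 2000 in "inter 2000" is already in milliseconds.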
I restarted haproxy and looked at the logs again. This is what I found:
192.100.2.41:40963 [30/Jun/2015:22:24:02.904] acdb_ms3306 backacdb_ms3306/web01 1/0/20509 9224 sD 2/1/1/0/0 0/0
192.100.2.41:41893 [30/Jun/2015:22:23:56.246] acdb_ms3306 backacdb_ms3306/web02 1/0/30066 11172 sD 2/1/1/0/0 0/0
The flag has changed to sD. What does that mean?
sD The server did not send nor acknowledge any data for as long as the
"timeout server" setting during the data phase. This is often caused
by too short timeouts on L4 equipments before the server (firewalls,
load-balancers, ...), as well as keep-alive sessions maintained
between the client and the server expiring first on haproxy.
All right, let's make timeout server longer as well:
timeout server 90s (the default configuration has timeout server 1m; same explanation as above)
Looking at the logs again: ha, everything is OK. After running for a while, the "could not open connection" error has not come back.
172.200.1.9:26387 [30/Jun/2015:21:59:40.275] acdb_ms3306 backacdb_ms3306/web01 1/0/4 160 -- 0/0/0/0/0 0/0
172.200.1.9:46303 [30/Jun/2015:21:59:40.280] acdb_ms3306 backacdb_ms3306/web02 1/0/11 6094 -- 0/0/0/0/0 0/0
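Rather than eyeballing individual lines, one way to confirm the fix is to tally the termination flags over a whole log. The Python sketch below does that under one assumption: every line ends with the same three fields as the entries in this post (termination state, connection counters, queue counters), so the flag is the third field from the end. The function name termination_states is hypothetical.

```python
from collections import Counter


def termination_states(lines):
    """Count termination flags (cD, sD, -- ...) across tcplog lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # A tcplog line as shown in this post has at least 9 fields.
        if len(fields) >= 9:
            counts[fields[-3]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "192.100.2.43:50597 [30/Jun/2015:21:57:40.094] acdb_ms3306 backacdb_ms3306/web01 1/1/83378 1350324 cD 2/2/2/1/0 0/0",
        "192.100.2.41:40963 [30/Jun/2015:22:24:02.904] acdb_ms3306 backacdb_ms3306/web01 1/0/20509 9224 sD 2/1/1/0/0 0/0",
        "172.200.1.9:26387 [30/Jun/2015:21:59:40.275] acdb_ms3306 backacdb_ms3306/web01 1/0/4 160 -- 0/0/0/0/0 0/0",
        "172.200.1.9:46303 [30/Jun/2015:21:59:40.280] acdb_ms3306 backacdb_ms3306/web02 1/0/11 6094 -- 0/0/0/0/0 0/0",
    ]
    for state, n in termination_states(sample).most_common():
        print(state, n)
```

If cD or sD entries still show up in noticeable numbers after the change, the timeouts are still too short for the way the application holds its connections.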