Copyright notice: this article may be freely reposted; when reposting, please include a hyperlink to the original article together with the author information and this copyright notice (Author: 张华, published: 2018-11-06)
The error "Got lower serial for" shows up in the neutron designate logs, and newly created zones stay in the PENDING state forever.
The following logs can be seen in minidns:
var/log/designate/designate-mdns.log.2.gz:2018-10-23 23:27:36.016 94713 INFO designate.mdns.handler [req-26d8910d-61d5-4fd6-bc6b-df7acaf36c12 - - - - -] NotFound, refusing. Question was xxx.openstack-au-east-2.oc.xxx.com. IN SOA
var/log/designate/designate-mdns.log.2.gz:2018-10-23 23:27:36.024 94713 WARNING designate.mdns.handler [req-fa5518dd-9506-44f8-a2ee-ee7d79ffaa3c - - - - -] ZoneNotFound while handling axfr request. Question was xxx.openstack-au-east-2.oc.xxx.com. IN AXFR: ZoneNotFound: Could not find Zone
According to the code [1], the DB query raises ZoneNotFound, so the SOA record cannot be built and no AXFR response can be constructed. As a result the minidns master and the bind9 slave cannot perform a zone transfer, the serial number on the bind9 slave is never updated, and eventually "Got lower serial for" appears when minidns sends NOTIFY.
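To confirm the stale serial on the bind9 slave, a quick sketch using dnspython can compare the SOA serial served by minidns with the one served by bind9. This is not from the original troubleshooting session; the loopback addresses and ports are assumptions (designate-mdns listens on 5354 by default, bind9 on 53), so adjust them to your deployment.
import dns.message
import dns.query
import dns.rdatatype

def get_soa_serial(server, port, zone):
    # Query the SOA record directly and return its serial, or None if the
    # server refuses / has no answer (as minidns does in the logs above).
    query = dns.message.make_query(zone, dns.rdatatype.SOA)
    response = dns.query.udp(query, server, port=port, timeout=5)
    return response.answer[0][0].serial if response.answer else None

zone = 'xxx.openstack-au-east-2.oc.xxx.com.'
print('mdns serial :', get_soa_serial('127.0.0.1', 5354, zone))
print('bind9 serial:', get_soa_serial('127.0.0.1', 53, zone))
A lower (or unchanged) serial on the bind9 side matches the "Got lower serial for" symptom described above.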
Near the ZoneNotFound occurrence the following log can be found:
[req-b48612b0-ed3e-46d9-8510-6634282ef0a2 - - - - -] Database connection was found disconnected; reconnecting: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1']
So the DB error above was reproduced with the following test program:
import time
from sqlalchemy.engine import create_engine

url = 'mysql+pymysql://root@localhost:3306/mysql'  # adjust the user/host to your environment
# pool_recycle=4: the pool only recycles connections that have been idle longer than 4s
conn = create_engine(url, pool_recycle=4).connect()
query = 'SELECT NOW();'
while True:
    print('Q1', conn.execute(query).fetchall())
    # Tell the server to drop this connection after 2s of idle time, i.e. sooner
    # than pool_recycle, then sleep past that limit.
    conn.execute('SET wait_timeout=2')
    time.sleep(3)
    print('Q2', conn.execute(query).fetchall())
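In this test the server-side wait_timeout is forced down to 2 seconds while the client sleeps for 3 seconds, so the server drops the idle connection before the pool (pool_recycle=4) would recycle it, and the second query fails with 'Lost connection to MySQL server during query'.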
Or use the following oslo.db test program:
mysql -u root -p -e "SET GLOBAL wait_timeout=5, slow_query_log=on, long_query_time=0.0;"
$ cat test.py
import time

from oslo_db.sqlalchemy import session as db_session
from sqlalchemy.sql.expression import select

# EngineFacade uses the oslo.db defaults (e.g. connection_recycle_time=3600)
_facade = db_session.EngineFacade("mysql://root@localhost/test")
session = _facade.get_session()
print(session.scalar(select([23])))
# Sleep past the server-side wait_timeout (lowered to 5s above), so the pooled
# connection is dropped by the server before it is used again
time.sleep(5)
print(session.scalar(select([23])))
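The mysql command above lowers the server-side wait_timeout to 5 seconds; the session then sleeps for 5 seconds, so the second scalar() call runs on a pooled connection that the server has already closed, which triggers the same DB error.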
The connection expiry time of the pool (pool_recycle, after which idle connections in the pool are released; the oslo default is 3600) must not be greater than the server-side wait_timeout (2 in the test above, 8 hours by default). So either wait_timeout should be set to a value greater than 3600, or the corresponding timeout in haproxy should be greater than 3600.
wait_timeout is a MySQL setting used to close unused database connections. When a connection has been idle for wait_timeout seconds, MySQL actively closes it so that other clients can connect to the database. The default is 28800, i.e. 8 hours. In MySQL it can be checked with: show variables like "%timeout%";
Besides, when the database closes a connection on its side, the MySQL client is not aware of it, so the program does not know the connection has become invalid. If on top of that the MySQL client does not support reconnecting, the two problems combined can make the connection pool hand out invalid connections.
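On the client side, a minimal sketch of the mitigation (not from the original post; it assumes SQLAlchemy >= 1.2 for pool_pre_ping) is to keep pool_recycle below the effective server-side idle timeout and to ping pooled connections before reuse; with oslo.db the corresponding option is connection_recycle_time.
from sqlalchemy import create_engine, text

engine = create_engine(
    'mysql+pymysql://root@localhost:3306/mysql',  # adjust the user/host to your environment
    pool_recycle=3600,    # must stay below MySQL wait_timeout (and any haproxy idle timeout)
    pool_pre_ping=True,   # test the pooled connection with a lightweight ping before reuse
)
with engine.connect() as conn:
    print(conn.execute(text('SELECT NOW()')).fetchall())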
If needed, the designate DB can be dumped first for backup or offline analysis:
mysqldump --single-transaction -u root -p designate --skip-extended-insert > /tmp/designate-$(date +%s).sql
For the error 'Lost connection to MySQL server during query':
wait_timeout is a MySQL setting used by the server side to proactively disconnect unused client connections. The default is 8 hours (28800); it can be checked with 'show variables like "%timeout%";' or 'juju config mysql wait-timeout'.
All of the timeouts below should be less than wait_timeout; if any of them is greater than wait_timeout, the error 'Lost connection to MySQL server during query' will happen.
1, the default value of oslo's connection_recycle_time is 3600
2, the timeout settings in haproxy.cfg
So we need to collect the following info for further analysis; a small sketch for comparing the collected values follows the list.
1, juju config mysql wait-timeout
2, mysql -unova -p -h
3, juju ssh neutron-api/0 -- sudo grep -r 'connection_recycle_time' /etc/neutron/
4, juju ssh neutron-api/0 -- sudo grep -r 'timeout' /etc/haproxy/
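A hedged comparison sketch (the numbers below are placeholders, not values from this environment; haproxy timeouts in haproxy.cfg are often expressed in milliseconds and need converting to seconds first):
# Placeholder values: fill in what the commands above actually return.
wait_timeout = 28800                        # seconds, from 'juju config mysql wait-timeout'
client_side_timeouts = {
    'oslo connection_recycle_time': 3600,   # seconds, from /etc/neutron/*.conf
    'haproxy timeout client': 3600,         # seconds, converted from haproxy.cfg
    'haproxy timeout server': 3600,         # seconds, converted from haproxy.cfg
}
for name, value in client_side_timeouts.items():
    if value >= wait_timeout:
        print('%s=%s should be less than wait_timeout=%s' % (name, value, wait_timeout))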
#This will show which openstack service is using the most mysql connections
select user, count(*) from information_schema.processlist group by user;
juju run --application mysql leader-get
juju run --application mysql "mysql -uroot -pChangeMe123 -e \"SELECT IFNULL(usr,'All Users') user,IFNULL(hst,'All Hosts') host,COUNT(1) Connections FROM (SELECT user usr,LEFT(host,LOCATE(':',host) - 1) hst FROM information_schema.processlist WHERE user NOT IN ('system user','root')) A GROUP BY usr,hst WITH ROLLUP;\""
All of the OpenStack services are showing a large number of errors like this:
var/log/mysql/error.log:2019-08-29T16:07:45.335406Z 420821 [Note] Aborted connection 420821 to db: 'keystone' user: 'keystone' host: '10.191.5.49' (Got timeout reading communication packets)
After the customer was asked to increase connect_timeout, net_read_timeout, net_write_timeout and interactive_timeout, "show global status like 'aborted%';" still showed the number of Aborted_clients growing.
show global variables like 'max_conn%';
show global variables like '%timeout%';
show global status like 'aborted%';
show processlist;
It was then found that at 21:55:48 all of the designate servers received the MySQL error. The MySQL query log showed that between 21:04 and 21:16 MySQL received 40 gnocchi queries, each of which returned about 500MB of data, and then the warning appeared: "InnoDB: Warning: difficult to find free blocks in the buffer pool (338 search iterations)!"
So it looks like an I/O swamp problem which in turn causes the query timeouts. Possible fixes:
1, Increase innodb_buffer_pool_size (controlled via dataset-size in juju)
2, Delete the related data in gnocchi
3, Investigate whether designate keeps retrying when it hits a MySQL timeout and thus floods the DB (see the sketch below)
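For item 3, one place to look is oslo.db's retry decorator. The sketch below is illustrative only (the function and its body are made up, not designate's actual code); it just shows the knobs that bound how often a DB call is retried after a disconnect, which is what determines whether such retries can flood the DB.
from oslo_db import api as oslo_db_api

@oslo_db_api.wrap_db_retry(max_retries=5,            # bound the number of retries
                           retry_interval=1,          # seconds before the first retry
                           inc_retry_interval=True,   # back off between attempts
                           retry_on_disconnect=True)  # retry on 'Lost connection ...'
def get_zone_serial(storage, context, zone_name):
    # Hypothetical helper for illustration only.
    return storage.find_zone(context, criterion={'name': zone_name}).serial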
[1] https://github.com/openstack/designate/blob/stable/queens/designate/mdns/handler.py#L233