引
端午假期回来,同事往账号中充钱,意外发现充钱失败
奇怪的是之前的各个测试流程都通过了,之前没有发现任何bug
bug初现
第一反应查看日志
ubuntu@ip-172-31-16-234:~$ tail -f /var/log/supervisor/poll_block-err.log
result = self._query(query)
File "/home/ubuntu/.virtualenvs/coucou/lib/python3.5/site-packages/pymysql/cursors.py", line 321, in _query
conn.query(q)
File "/home/ubuntu/.virtualenvs/coucou/lib/python3.5/site-packages/pymysql/connections.py", line 859, in query
self._execute_command(COMMAND.COM_QUERY, sql)
File "/home/ubuntu/.virtualenvs/coucou/lib/python3.5/site-packages/pymysql/connections.py", line 1096, in _execute_command
self._write_bytes(packet)
File "/home/ubuntu/.virtualenvs/coucou/lib/python3.5/site-packages/pymysql/connections.py", line 1048, in _write_bytes
"MySQL server has gone away (%r)" % (e,))
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2006, "MySQL server has gone away (BrokenPipeError(32, 'Broken pipe'))") [SQL: 'INSERT INTO user_presale_wallet_log_model (trans_type, phone_num, wallet_address, from_address, to_address, eth_num, txn_hash, dt_create) VALUES (%(trans_type)s, %(phone_num)s, %(wallet_address)s, %(from_address)s, %(to_address)s, %(eth_num)s, %(txn_hash)s, %(dt_create)s)'] [parameters: {'trans_type': 'in', 'wallet_address': '0x2d9272772978AAaC6690994D596811f1966CCC80', 'dt_create': datetime.datetime(2018, 6, 19, 15, 1, 57, 309124), 'phone_num': '+8618813002365', 'eth_num': Decimal('1'), 'to_address': '', 'txn_hash': '0xb6a7f52c9aef9f000dc979b8b1a58b92c0174d646156d50576079ee0d679e1a2', 'from_address': '0x23e96cA25136Fb6444B08FB24f94554197620c66'}] (Background on this error at: http://sqlalche.me/e/e3q8)
情况阐述,
此接口为监听用户打币情况接口,为异步操作,单开了一个app对象用来监听打币情况。
业务逻辑为,监听到有用户打币,那么用户的钱包地址中就增加相应的数量。
接下来继续测试,第二笔和第三笔都没有问题,都正常。
奇怪的地方在于,第一笔失败,在没有任何人操作的情况下,继续打了第二笔和第三笔,竟然成功。
bug分析
- 三次打币有一个失败,肯定是和打币的变量有关,可是打币的变量只有一个,那就是打币金额
首先来测试,跟失败的例子使用相同的打币金额会不会成功。
经测试,打相同数量的币并没有触发这个bug,奇怪。
- 另外一个变量就是,中途放了假,test服务器没有任何人去使用。难道。。。?跟一些超时参数有关?
果断搜索time out 相关的参数,意外发现:
参考 https://dev.mysql.com/doc/refman/5.7/en/gone-away.html
By default, the server closes the connection after eight hours if nothing has happened. You can change the time limit by setting the [wait_timeout](https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_wait_timeout) variable when you start [**mysqld**](https://dev.mysql.com/doc/refman/5.7/en/mysqld.html). See [Section 5.1.7, “Server System Variables”](https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html).
目前来看,八个小时如果没有任何操作会自动断掉connection,貌似有这个的可能。
bug复现
接下来尝试来复现这个bug
设置MySQL timeout 变量 https://blog.csdn.net/liuj2511981/article/details/8520912
把wait timeout 时间缩小到几秒钟。
set global wait_timeout=10;
show global variables like '%timeout%';
果然复现出来了
File "/home/ubuntu/.virtualenvs/coucou/local/lib/python3.5/site-packages/pymysql/connections.py", line 1037, in _read_bytes
CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: 'INSERT INTO user_presale_wallet_log_model (trans_type, phone_num, wallet_address, from_address, to_address, eth_num, txn_hash, dt_create) VALUES (%(trans_type)s, %(phone_num)s, %(wallet_address)s, %(from_address)s, %(to_address)s, %(eth_num)s, %(txn_hash)s, %(dt_create)s)'] [parameters: {'to_address': '', 'from_address': '0x5a86f9741b7D420dBe420Ab4FB880fC21516049d', 'wallet_address': '0x9cb9f536181Aa0310808248A2BC3a8d2878D14E1', 'dt_create': datetime.datetime(2018, 6, 20, 10, 55, 58, 846276), 'trans_type': 'in', 'eth_num': Decimal('0.1'), 'txn_hash': '0x6665b259dacf1de2df847f8812667e6af2f32ae8f5fef8f41a6dfdc3b9d9506b', 'phone_num': '+8613146963300'}] (Background on this error at: http://sqlalche.me/e/e3q8)
深度观察
虽然我们能准确的复现出来了,但是目前为止还不知道原因。
既然是跟超时有关,我们来查看程序与mysql的链接情况。
首先我们把时间改成2s
set global wait_timeout=2;
我们使用下面这个命令,来查看程序与mysql的连接产生和断开的情况:
mysqladmin -u dbu -p -i '1' processlist
+----+------+-----------+----+---------+------+----------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+-----------+----+---------+------+----------+------------------+
| 44 | dbu | localhost | | Query | 0 | starting | show processlist |
+----+------+-----------+----+---------+------+----------+------------------+
+----+------+-----------+----+---------+------+----------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+-----------+----+---------+------+----------+------------------+
| 44 | dbu | localhost | | Query | 0 | starting | show processlist |
+----+------+-----------+----+---------+------+----------+------------------+
+----+------+-----------------+--------+---------+------+----------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+-----------------+--------+---------+------+----------+------------------+
| 44 | dbu | localhost | | Query | 0 | starting | show processlist |
| 45 | dbu | localhost:52478 | master | Sleep | 1 | | |
+----+------+-----------------+--------+---------+------+----------+------------------+
+----+------+-----------------+--------+---------+------+----------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+-----------------+--------+---------+------+----------+------------------+
| 44 | dbu | localhost | | Query | 0 | starting | show processlist |
| 45 | dbu | localhost:52478 | master | Sleep | 2 | | |
+----+------+-----------------+--------+---------+------+----------+------------------+
也就是说,这个连接的存活期只有2s,当2后这个链接自动关闭掉了。
而之前没有碰到过这个情况的原因是,很少有程序都是即时返回的。程序的生存周期只有不到一秒,
而这个脚本是要跨超长时间的,需要一直维持着mysql的连接,而且从开始运行到监听到用户打币都是默认使用的一个连接。
当正在检测到了用户打币的时候,之前的链接早已超时,
于是产生了报错:
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: 'INSERT INTO user_presale_wallet_log_model (trans_type, phone_num, wallet_address, from_address, to_address, eth_num, txn_hash, dt_create) VALUES (%(trans_type)s, %(phone_num)s, %(wallet_address)s, %(from_address)s, %(to_address)s, %(eth_num)s, %(txn_hash)s, %(dt_create)s)'] [parameters: {'txn_hash': '0xf02316210853e2b791c9fb67535a210d85dcd61676277aab7953c10a91d6c138', 'to_address': '', 'from_address': '0x5a86f9741b7D420dBe420Ab4FB880fC21516049d', 'wallet_address': '0x9cb9f536181Aa0310808248A2BC3a8d2878D14E1', 'dt_create': datetime.datetime(2018, 6, 21, 11, 25, 22, 222526), 'trans_type': 'in', 'phone_num': '+8613146963300', 'eth_num': Decimal('0.1')}] (Background on this error at: http://sqlalche.me/e/e3q8)
同样的,很多用户执行长时间sql脚本的时候也经常碰到这个bug。
解决方式
1 poll_block进程中,循环去维持一个跟数据库的连接。
2 把wait timeout 值设大些。
延伸。mysql 连接相关。
status 里面的connects实践。
当我业务程序没有启动的时候,用linux客户端进入mysql
mysql dbu@(none):coucou> show processlist;
+----+------+-----------+--------+---------+------+----------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+-----------+--------+---------+------+----------+------------------+
| 9 | dbu | localhost | coucou | Query | 0 | starting | show processlist |
+----+------+-----------+--------+---------+------+----------+------------------+
可以看到,当前只有一个连接,那就是本客户端的链接,而且id为9
当启动业务程序后:
gunicorn 'coucou.application:create_app()' -w 8 -b 0.0.0.0:5000
可以看到,我开了8 个线程。这时候再去看数据库的链接。
mysql dbu@(none):coucou> show processlist;
+----+------+-----------+--------+---------+------+----------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+-----------+--------+---------+------+----------+------------------+
| 9 | dbu | localhost | coucou | Query | 0 | starting | show processlist |
+----+------+-----------+--------+---------+------+----------+------------------+
还是只有一个啊,因为这个时候app 与 mysql还没有交互。
我试着走一个接口。
mysql dbu@(none):coucou> show processlist;
+----+------+-----------------+--------+---------+------+----------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+-----------------+--------+---------+------+----------+------------------+
| 9 | dbu | localhost | coucou | Query | 0 | starting | show processlist |
| 19 | dbu | localhost:52392 | master | Sleep | 2 | | |
+----+------+-----------------+--------+---------+------+----------+------------------+
2 rows in set
会发现多出来了一个连接。
然后我再多走2个接口,发现不是每个接口都会增长一个连接。
最后我走一百个接口,发现,最多也就是9个连接,不会再增长了。
现象分析
每个请求会被分配到gunicoin的一个线程里面,每个gunicoin所托起的app一旦开始运行都会维持一个跟mysql的连接。
因为每个请求分到线程都是随机的。也就是不一定每个接口都会增长一个连接。
因为最多有8个线程,也就是最多会有8+1 个连接,其中1是客户端的链接。
status命令确认
mysql dbu@(none):coucou> status
--------------
mycli 1.16.0, running on CPython 2.7.12
+----------------------+----------------------------------+
| | |
+----------------------+----------------------------------+
| Connection id: | 9 |
| Current database: | coucou |
| Current user: | dbu@localhost |
| Current pager: | System default |
| Server version: | 5.7.22-0ubuntu0.16.04.1 (Ubuntu) |
| Protocol version: | 10 |
| Connection: | Localhost via UNIX socket |
| Server characterset: | utf8mb4 |
| Db characterset: | utf8mb4 |
| Client characterset: | utf8 |
| Conn. characterset: | utf8 |
| UNIX socket: | /var/run/mysqld/mysqld.sock |
| Uptime: | 14 hours 20 min 53 sec |
+----------------------+----------------------------------+
Connections: 9 Queries: 1236 Slow queries: 0 Opens: 220 Flush tables: 1 Open tables: 129 Queries per second avg: 0.000
Statsu显示页面里面 也是显示当前连接有9个。而且客户端的连接id为9;