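Bug notes: a bug caused by a three-day holiday

Back from the Dragon Boat Festival holiday, a colleague tried to deposit funds into an account and the deposit unexpectedly failed.

Strangely, every earlier test run had passed and no bug had turned up before.

The bug appears

First reaction: check the logs.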


ubuntu@ip-172-31-16-234:~$ tail -f /var/log/supervisor/poll_block-err.log

 result = self._query(query)

 File "/home/ubuntu/.virtualenvs/coucou/lib/python3.5/site-packages/pymysql/cursors.py", line 321, in _query

 conn.query(q)

 File "/home/ubuntu/.virtualenvs/coucou/lib/python3.5/site-packages/pymysql/connections.py", line 859, in query

 self._execute_command(COMMAND.COM_QUERY, sql)

 File "/home/ubuntu/.virtualenvs/coucou/lib/python3.5/site-packages/pymysql/connections.py", line 1096, in _execute_command

 self._write_bytes(packet)

 File "/home/ubuntu/.virtualenvs/coucou/lib/python3.5/site-packages/pymysql/connections.py", line 1048, in _write_bytes

 "MySQL server has gone away (%r)" % (e,))

sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2006, "MySQL server has gone away (BrokenPipeError(32, 'Broken pipe'))") [SQL: 'INSERT INTO user_presale_wallet_log_model (trans_type, phone_num, wallet_address, from_address, to_address, eth_num, txn_hash, dt_create) VALUES (%(trans_type)s, %(phone_num)s, %(wallet_address)s, %(from_address)s, %(to_address)s, %(eth_num)s, %(txn_hash)s, %(dt_create)s)'] [parameters: {'trans_type': 'in', 'wallet_address': '0x2d9272772978AAaC6690994D596811f1966CCC80', 'dt_create': datetime.datetime(2018, 6, 19, 15, 1, 57, 309124), 'phone_num': '+8618813002365', 'eth_num': Decimal('1'), 'to_address': '', 'txn_hash': '0xb6a7f52c9aef9f000dc979b8b1a58b92c0174d646156d50576079ee0d679e1a2', 'from_address': '0x23e96cA25136Fb6444B08FB24f94554197620c66'}] (Background on this error at: http://sqlalche.me/e/e3q8)

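Background:

This interface listens for users' coin deposits. It runs asynchronously, with a separate app object dedicated to watching for deposits.

The business logic: when a deposit from a user is detected, the corresponding amount is added to the user's wallet address.

Testing continued, and the second and third deposits both went through normally.

The odd part: the first deposit failed, yet with nobody touching anything in between, the second and third succeeded.

Bug analysis

  • One of three deposits failed, so the failure should be tied to something that varied between deposits, but the only variable is the deposit amount.

First, test whether a deposit with the same amount as the failed one succeeds.

Depositing the same amount did not trigger the bug. Strange.

  • The other variable: there was a holiday in between, and nobody used the test server. Could it be... related to some timeout parameter?

A search for timeout-related parameters turned up this:

See https://dev.mysql.com/doc/refman/5.7/en/gone-away.html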


By default, the server closes the connection after eight hours if nothing has happened. You can change the time limit by setting the [wait_timeout](https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_wait_timeout) variable when you start [**mysqld**](https://dev.mysql.com/doc/refman/5.7/en/mysqld.html). See [Section 5.1.7,  “Server System Variables”](https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html).

So far this looks plausible: a connection with no activity for eight hours is closed automatically by the server.
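A quick sanity check of this hypothesis, as a minimal sketch. It assumes the server runs with the default wait_timeout of 28800 seconds (the eight hours quoted above) and that the poll_block connection sat completely idle for the whole three-day holiday; the function and constant names are made up for illustration.

```python
# Compare the idle time over the holiday with MySQL's default wait_timeout.
DEFAULT_WAIT_TIMEOUT_S = 8 * 60 * 60   # 28800 s, the eight hours quoted above
HOLIDAY_IDLE_S = 3 * 24 * 60 * 60      # assumption: idle for the full three days


def connection_expired(idle_seconds, wait_timeout=DEFAULT_WAIT_TIMEOUT_S):
    """Return True if the server would have closed a connection idle this long."""
    return idle_seconds > wait_timeout


print(connection_expired(HOLIDAY_IDLE_S))
```

Three idle days is roughly nine times the default limit, so the timeout theory is at least consistent with the timeline.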

Reproducing the bug

Next, try to reproduce the bug.

Set the MySQL timeout variable: https://blog.csdn.net/liuj2511981/article/details/8520912

Reduce wait_timeout to a few seconds.


set global wait_timeout=10;

show global variables like '%timeout%';

Sure enough, the bug reproduced.


 File "/home/ubuntu/.virtualenvs/coucou/local/lib/python3.5/site-packages/pymysql/connections.py", line 1037, in _read_bytes

 CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")

sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: 'INSERT INTO user_presale_wallet_log_model (trans_type, phone_num, wallet_address, from_address, to_address, eth_num, txn_hash, dt_create) VALUES (%(trans_type)s, %(phone_num)s, %(wallet_address)s, %(from_address)s, %(to_address)s, %(eth_num)s, %(txn_hash)s, %(dt_create)s)'] [parameters: {'to_address': '', 'from_address': '0x5a86f9741b7D420dBe420Ab4FB880fC21516049d', 'wallet_address': '0x9cb9f536181Aa0310808248A2BC3a8d2878D14E1', 'dt_create': datetime.datetime(2018, 6, 20, 10, 55, 58, 846276), 'trans_type': 'in', 'eth_num': Decimal('0.1'), 'txn_hash': '0x6665b259dacf1de2df847f8812667e6af2f32ae8f5fef8f41a6dfdc3b9d9506b', 'phone_num': '+8613146963300'}] (Background on this error at: http://sqlalche.me/e/e3q8)

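A closer look

We can now reproduce the bug reliably, but we still do not know the underlying cause.

Since it is timeout-related, let's look at the connections between the program and MySQL.

First, change the timeout to 2s: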


 set global wait_timeout=2;

Use the following command to watch the program's MySQL connections being opened and closed:


mysqladmin -u dbu -p -i '1' processlist


+----+------+-----------+----+---------+------+----------+------------------+

| Id | User | Host | db | Command | Time | State | Info  |

+----+------+-----------+----+---------+------+----------+------------------+

| 44 | dbu | localhost | | Query  | 0 | starting | show processlist |

+----+------+-----------+----+---------+------+----------+------------------+

+----+------+-----------+----+---------+------+----------+------------------+

| Id | User | Host | db | Command | Time | State | Info  |

+----+------+-----------+----+---------+------+----------+------------------+

| 44 | dbu | localhost | | Query  | 0 | starting | show processlist |

+----+------+-----------+----+---------+------+----------+------------------+

+----+------+-----------------+--------+---------+------+----------+------------------+

| Id | User | Host | db  | Command | Time | State | Info  |

+----+------+-----------------+--------+---------+------+----------+------------------+

| 44 | dbu | localhost  | | Query  | 0 | starting | show processlist |

| 45 | dbu | localhost:52478 | master | Sleep  | 1 | | |

+----+------+-----------------+--------+---------+------+----------+------------------+

+----+------+-----------------+--------+---------+------+----------+------------------+

| Id | User | Host | db  | Command | Time | State | Info  |

+----+------+-----------------+--------+---------+------+----------+------------------+

| 44 | dbu | localhost  | | Query  | 0 | starting | show processlist |

| 45 | dbu | localhost:52478 | master | Sleep  | 2 | | |

+----+------+-----------------+--------+---------+------+----------+------------------+

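In other words, this connection lives for only 2s; once it has been idle for 2s, the server closes it automatically.

We had not hit this before because most of our programs return almost immediately, with a lifetime of under a second,

whereas this script runs for a very long time and has to hold a MySQL connection throughout. By default it uses one and the same connection from startup until a deposit is detected.

By the time a deposit is actually detected, that connection has long since timed out,

which produces the error: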


sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: 'INSERT INTO user_presale_wallet_log_model (trans_type, phone_num, wallet_address, from_address, to_address, eth_num, txn_hash, dt_create) VALUES (%(trans_type)s, %(phone_num)s, %(wallet_address)s, %(from_address)s, %(to_address)s, %(eth_num)s, %(txn_hash)s, %(dt_create)s)'] [parameters: {'txn_hash': '0xf02316210853e2b791c9fb67535a210d85dcd61676277aab7953c10a91d6c138', 'to_address': '', 'from_address': '0x5a86f9741b7D420dBe420Ab4FB880fC21516049d', 'wallet_address': '0x9cb9f536181Aa0310808248A2BC3a8d2878D14E1', 'dt_create': datetime.datetime(2018, 6, 21, 11, 25, 22, 222526), 'trans_type': 'in', 'phone_num': '+8613146963300', 'eth_num': Decimal('0.1')}] (Background on this error at: http://sqlalche.me/e/e3q8)

Likewise, many users run into this bug when executing long-running SQL scripts.
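The failure mode can be sketched with a toy model. This is not pymysql or MySQL: `FakeServer` and `FakeConnection` are hypothetical stand-ins with a simulated clock. The step that swaps in a fresh connection after the failure is an assumption about what the connection pool does; if it holds, it would explain why the first deposit failed while the second and third succeeded.

```python
class FakeServer:
    """Toy stand-in for mysqld: drops connections idle longer than wait_timeout."""

    def __init__(self, wait_timeout):
        self.wait_timeout = wait_timeout
        self.now = 0.0  # simulated clock, in seconds

    def sleep(self, seconds):
        self.now += seconds


class FakeConnection:
    """Toy client connection that remembers when it was last used."""

    def __init__(self, server):
        self.server = server
        self.last_used = server.now

    def execute(self, sql):
        idle = self.server.now - self.last_used
        if idle > self.server.wait_timeout:
            raise ConnectionError("MySQL server has gone away")
        self.last_used = self.server.now
        return "ok"


server = FakeServer(wait_timeout=2)
conn = FakeConnection(server)       # opened when the poller starts
server.sleep(10)                    # nothing happens for a while

results = []
try:
    conn.execute("INSERT ...")      # first deposit hits the stale connection
except ConnectionError as exc:
    results.append("failed: %s" % exc)
    conn = FakeConnection(server)   # assumed: a fresh connection replaces the dead one

results.append(conn.execute("INSERT ..."))  # second deposit
results.append(conn.execute("INSERT ..."))  # third deposit
print(results)
```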

Solutions

1 In the poll_block process, loop to keep the database connection alive.

2 Increase the wait_timeout value.
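A variant of the first option is to reopen the connection whenever it has been idle for longer than some threshold below wait_timeout, instead of pinging in a loop. The sketch below is a hypothetical wrapper, not the author's actual fix: `RecyclingConnection`, `EchoConnection` and the injected clock are invented for illustration. In the real application, SQLAlchemy's `pool_recycle` and `pool_pre_ping` engine options cover the same ground.

```python
import time


class RecyclingConnection:
    """Reopen the underlying connection if it has been idle for too long."""

    def __init__(self, connect, recycle_after, clock=time.monotonic):
        self._connect = connect              # callable returning a new connection
        self._recycle_after = recycle_after  # seconds; keep this below wait_timeout
        self._clock = clock
        self._conn = connect()
        self._last_used = clock()
        self.reconnects = 0

    def execute(self, sql):
        now = self._clock()
        if now - self._last_used >= self._recycle_after:
            # The server may already have dropped the old connection: start fresh.
            self._conn = self._connect()
            self.reconnects += 1
        self._last_used = now
        return self._conn.execute(sql)


class EchoConnection:
    """Hypothetical stand-in for a real database connection."""

    def execute(self, sql):
        return "ok"


fake_now = [0.0]  # simulated clock so the example runs instantly
db = RecyclingConnection(EchoConnection, recycle_after=3600,
                         clock=lambda: fake_now[0])
db.execute("SELECT 1")
fake_now[0] += 3 * 24 * 3600   # three idle days
status = db.execute("INSERT ...")
print(status, db.reconnects)
```

Raising wait_timeout alone only moves the deadline; recycling on the client side keeps working whatever the idle period turns out to be.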

Extension: MySQL connections.

Experimenting with the Connections value in status.

With the application not running, connect to MySQL from a client on Linux:


mysql dbu@(none):coucou> show processlist;

+----+------+-----------+--------+---------+------+----------+------------------+

| Id | User | Host | db  | Command | Time | State | Info  |

+----+------+-----------+--------+---------+------+----------+------------------+

| 9 | dbu | localhost | coucou | Query  | 0 | starting | show processlist |

+----+------+-----------+--------+---------+------+----------+------------------+

As shown, there is currently only one connection, this client's own, with id 9.

After starting the application:


gunicorn 'coucou.application:create_app()' -w 8 -b 0.0.0.0:5000

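As shown, I started 8 workers (gunicorn's -w 8 starts worker processes, not threads). Now look at the database connections again.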


mysql dbu@(none):coucou> show processlist;

+----+------+-----------+--------+---------+------+----------+------------------+

| Id | User | Host | db  | Command | Time | State | Info  |

+----+------+-----------+--------+---------+------+----------+------------------+

| 9 | dbu | localhost | coucou | Query  | 0 | starting | show processlist |

+----+------+-----------+--------+---------+------+----------+------------------+

Still only one, because the app has not yet interacted with MySQL.

I try calling one endpoint.


mysql dbu@(none):coucou> show processlist;

+----+------+-----------------+--------+---------+------+----------+------------------+

| Id | User | Host | db  | Command | Time | State | Info  |

+----+------+-----------------+--------+---------+------+----------+------------------+

| 9 | dbu | localhost  | coucou | Query  | 0 | starting | show processlist |

| 19 | dbu | localhost:52392 | master | Sleep  | 2 | |   |

+----+------+-----------------+--------+---------+------+----------+------------------+

2 rows in set

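One more connection has appeared.

Then I call 2 more endpoints and find that not every call adds a connection.

Finally I make a hundred calls and find at most 9 connections; the count does not grow any further.

Analysis

Each request is handed to one gunicorn worker, and once the app hosted by a worker starts handling requests it keeps one connection to MySQL open.

Which worker receives a given request is effectively arbitrary, so not every call adds a connection.

With at most 8 workers, there are at most 8+1 connections, where the 1 is the client's own connection.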
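The 8+1 ceiling can be illustrated with a small simulation. It is a sketch under two assumptions: each request lands on an arbitrary worker, and each worker lazily opens exactly one connection the first time it touches the database. `simulate` and `NUM_WORKERS` are names invented for this example.

```python
import random

NUM_WORKERS = 8  # matches `-w 8` in the gunicorn command


def simulate(num_requests, seed=0):
    """Return how many workers hold a connection after num_requests requests."""
    rng = random.Random(seed)
    workers_with_connection = set()
    for _ in range(num_requests):
        worker = rng.randrange(NUM_WORKERS)   # assumed: an arbitrary worker takes the request
        workers_with_connection.add(worker)   # opened lazily, at most once per worker
    return len(workers_with_connection)


for n in (1, 3, 100):
    print(n, "requests ->", simulate(n) + 1, "connections including the client")
```

However many requests are sent, the count can never exceed NUM_WORKERS + 1, which matches the plateau at 9 seen in processlist.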

Confirming with the status command


mysql dbu@(none):coucou> status

--------------

mycli 1.16.0, running on CPython 2.7.12

+----------------------+----------------------------------+

| | |

+----------------------+----------------------------------+

| Connection id:  | 9 |

| Current database: | coucou  |

| Current user: | dbu@localhost |

| Current pager:  | System default  |

| Server version: | 5.7.22-0ubuntu0.16.04.1 (Ubuntu) |

| Protocol version: | 10  |

| Connection: | Localhost via UNIX socket |

| Server characterset: | utf8mb4 |

| Db characterset:  | utf8mb4 |

| Client characterset: | utf8  |

| Conn. characterset: | utf8  |

| UNIX socket:  | /var/run/mysqld/mysqld.sock |

| Uptime: | 14 hours 20 min 53 sec  |

+----------------------+----------------------------------+

Connections: 9 Queries: 1236 Slow queries: 0 Opens: 220 Flush tables: 1 Open tables: 129 Queries per second avg: 0.000

The status output also shows 9 current connections, and the client's connection id is 9.
