pt-online-schema-change
I recently used pt-online-schema-change to run online DDL against large production tables and ran into a few problems.
The command I used was:
pt-online-schema-change --user=root --password="xxxxx" --host=192.168.xx.xx D=M_xx,t=T_xx --alter "ADD Fxxxxx'" --charset=utf8 --no-check-replication-filters --alter-foreign-keys-method=auto --recursion-method=none --print --execute
During execution, the run against one of the databases was aborted, with the following output:
Copying `table_01005`.`T_xxx`: 19% 16:30 remain
Copying `table_01005`.`T_xxx`: 21% 16:21 remain
Copying `table_01005`.`T_xxx`: 22% 16:58 remain
2014-11-04T18:20:25 Dropping triggers...
DROP TRIGGER IF EXISTS `table_01005`.`pt_osc_table_01005_T_xxx_del`;
DROP TRIGGER IF EXISTS `table_01005`.`pt_osc_table_01005_T_xxx_upd`;
DROP TRIGGER IF EXISTS `table_01005`.`pt_osc_table_01005_T_xxx_ins`;
2014-11-04T18:20:28 Dropped triggers OK.
2014-11-04T18:20:28 Dropping new table...
DROP TABLE IF EXISTS `table_01005`.`_T_xxx_new`;
2014-11-04T18:20:30 Dropped new table OK.
`table_01005`.`T_xxx` was not altered.
2014-11-04T18:20:25 Error copying rows from `table_01005`.`T_xxx` to `table_01005`.`_T_xxx_new`: Threads_running=199 exceeds its critical threshold 50
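The run was then aborted. Threads_running is the number of threads that are actively executing (not sleeping). Following the error message, I first checked the Percona documentation
and searched the bug list, where I found a report whose symptoms matched and which blamed the tool itself. It also affected the version we were running,
so I upgraded and tried again. The failure reproduced, which showed it was not simply a bug. Going back to the error message: it says Threads_running exceeded its critical threshold. If it is a threshold,
it should be configurable. A closer read of the official documentation turned up the relevant option, --critical-load. The documentation describes it as follows: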
--critical-load
type: Array; default: Threads_running=50
Examine SHOW GLOBAL STATUS after every chunk,
and abort if the load is too high. The option accepts a comma-separated list of MySQL status variables and thresholds.
An optional =MAX_VALUE (or :MAX_VALUE) can follow each variable. If not given,
the tool determines a threshold by examining the current value at startup and doubling it.
See --max-load for further details. These options work similarly,
except that this option will abort the tool’s operation instead of pausing it,
and the default value is computed differently if you specify no threshold.
The reason for this option is as a safety check in case the triggers on the
original table add so much load to the server that it causes downtime.
There is probably no single value of Threads_running that is wrong for
every server, but a default of 50 seems likely to be unacceptably high
for most servers, indicating that the operation should be canceled immediately.
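Roughly, this means:
After each chunk is copied, the tool runs SHOW GLOBAL STATUS and checks the specified status variables; by default it checks Threads_running against a threshold of 50.
The purpose is safety: it guards against the triggers on the original table driving server load too high, which limits the impact of the online DDL on production traffic.
If a value exceeds its threshold, the tool aborts and the online DDL is cancelled, producing the error shown above.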
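To make the threshold rule concrete, here is a minimal Python sketch of how a --critical-load value could be turned into per-variable limits, following only the documentation quoted above. It is not the tool's actual (Perl) implementation; the function name critical_thresholds and the startup_status dict are hypothetical names introduced for illustration.

```python
def critical_thresholds(spec, startup_status):
    """Map a --critical-load style spec to {variable: threshold}.

    spec           -- comma-separated items: VAR, VAR=MAX_VALUE or VAR:MAX_VALUE
    startup_status -- hypothetical dict of SHOW GLOBAL STATUS values
                      sampled when the tool starts
    """
    thresholds = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        # "=" and ":" are both accepted between the name and MAX_VALUE.
        name, sep, value = item.replace(":", "=", 1).partition("=")
        if sep:
            # Explicit threshold given by the user.
            thresholds[name] = int(value)
        else:
            # No threshold given: double the value seen at startup,
            # as the quoted documentation describes.
            thresholds[name] = startup_status[name] * 2
    return thresholds


# Explicit limit for one variable, derived limit for the other.
limits = critical_thresholds(
    "Threads_running:200,Threads_connected",
    {"Threads_connected": 100},
)
```

Under these assumptions, a variable listed without a value gets twice its startup reading, while "Threads_running=200" and "Threads_running:200" are treated identically.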
A similar option is --max-load:
--max-load
type: Array; default: Threads_running=25
Examine SHOW GLOBAL STATUS after every chunk, and pause if any status variables are higher than their thresholds.
The option accepts a comma-separated list of MySQL status variables. An optional =MAX_VALUE (or :MAX_VALUE) can
follow each variable. If not given, the tool determines a threshold by examining the current value and increasing it by 20%.
For example, if you want the tool to pause when Threads_connected gets too high, you can specify “Threads_connected”,
and the tool will check the current value when it starts working and add 20% to that value. If the current value is 100,
then the tool will pause when Threads_connected exceeds 120, and resume working when it is below 120 again. If you want to
specify an explicit threshold, such as 110, you can use either “Threads_connected:110” or “Threads_connected=110”.
The purpose of this option is to prevent the tool from adding too much load to the server. If the data-copy queries are
intrusive, or if they cause lock waits, then other queries on the server will tend to block and queue. This will typically
cause Threads_running to increase, and the tool can detect that by running SHOW GLOBAL STATUS immediately after each query finishes.
If you specify a threshold for this variable, then you can instruct the tool to wait until queries are running normally again. This will
not prevent queueing, however; it will only give the server a chance to recover from the queueing. If you notice queueing, it is best to decrease the chunk time.
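The --max-load option defines thresholds: after each chunk, the tool checks whether the SHOW GLOBAL STATUS values are higher than the specified thresholds. The option takes a comma-separated list of MySQL status variables, each with an optional threshold;
if no threshold is given, the tool uses the current value plus 20%.
Note that, unlike --critical-load, this option does not abort the operation; it only pauses it. Once the status value drops back below the threshold, the copy resumes.
Pausing versus aborting is the difference between --max-load and --critical-load.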
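The pause-versus-abort distinction can be sketched in a few lines of Python. This is only an illustration of the behaviour described in the documentation, not the tool's own code; check_load, replay, and the sample values are hypothetical, while the limits 25 and 50 are the documented defaults.

```python
def check_load(status, max_load, critical_load):
    """Decide what to do after one chunk, given one status sample.

    Returns "abort" if any variable exceeds its critical threshold,
    "pause" if any variable exceeds its max-load threshold,
    and "continue" otherwise.
    """
    for name, limit in critical_load.items():
        if status.get(name, 0) > limit:
            return "abort"
    for name, limit in max_load.items():
        if status.get(name, 0) > limit:
            return "pause"
    return "continue"


def replay(samples, max_load, critical_load):
    """Return the action taken after each chunk, stopping at the first abort."""
    actions = []
    for status in samples:
        action = check_load(status, max_load, critical_load)
        actions.append(action)
        if action == "abort":
            break
    return actions


# Documented defaults: --max-load Threads_running=25, --critical-load Threads_running=50.
MAX_LOAD = {"Threads_running": 25}
CRITICAL_LOAD = {"Threads_running": 50}

# Hypothetical readings taken after four consecutive chunks.
samples = [
    {"Threads_running": 10},
    {"Threads_running": 30},
    {"Threads_running": 199},
    {"Threads_running": 5},
]
actions = replay(samples, MAX_LOAD, CRITICAL_LOAD)
```

With the default limits, a reading of 199 such as the one in the error log is an abort, whereas a reading of 30 would only pause the copy until the load drops again.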
The value is a list, and any status variable reported by SHOW GLOBAL STATUS can be used, for example Threads_connected.
The format is --critical-load="Threads_running=200" or --critical-load="Threads_running:200".
pt-online-schema-change --host=xxxxx -P 3306 --charset=utf8 -u root -p 'xxxxxx;' --alter='
add column door_no varchar(200) comment "居住门楼牌"
' --print --execute D=lzmh_wlw_db,t=wlw_room --critical-load="Threads_running=200"
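After raising the Threads_running threshold as the option allows, the change ran successfully. The takeaway is that documentation matters: read it before using a tool,
rather than just running the tool and moving on, and be prepared to handle the errors that can come up along the way.
A few things to note: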
The tool refuses to operate if it detects replication filters. See --[no]check-replication-filters for details.
The tool pauses the data copy operation if it observes any replicas that are delayed in replication. See --max-lag for details.
The tool pauses or aborts its operation if it detects too much load on the server. See --max-load and --critical-load for details.
The tool sets its lock wait timeout to 1 second so that it is more likely to be the victim of any lock contention, and less likely to disrupt other transactions. See --lock-wait-timeout for details.
The tool refuses to alter the table if foreign key constraints reference it, unless you specify --alter-foreign-keys-method.
The tool cannot alter MyISAM tables on “Percona XtraDB Cluster” nodes.
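Although pt-online-schema-change is an online DDL tool and does not lock the table, it still has some impact on performance, because it reads the entire table once while copying rows.
That scan can push the hot data out of the buffer cache (the InnoDB buffer pool), so requests for hot data have to be served from disk, which increases the number of slow queries and raises file_reads.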
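One way to watch for this effect is to compare two SHOW GLOBAL STATUS snapshots taken before and during the copy. The sketch below assumes InnoDB tables and uses the standard counters Innodb_buffer_pool_read_requests (logical reads) and Innodb_buffer_pool_reads (reads that had to go to disk); the function name and the sample numbers are hypothetical, and how the snapshots are collected is left out.

```python
def buffer_pool_miss_ratio(before, after):
    """Fraction of logical reads that went to disk between two snapshots.

    before, after -- dicts holding integer values of the InnoDB status
                     counters, e.g. parsed from SHOW GLOBAL STATUS
    """
    requests = (after["Innodb_buffer_pool_read_requests"]
                - before["Innodb_buffer_pool_read_requests"])
    disk_reads = (after["Innodb_buffer_pool_reads"]
                  - before["Innodb_buffer_pool_reads"])
    if requests <= 0:
        # No read activity in the interval (or counters were reset).
        return 0.0
    return disk_reads / requests


# Hypothetical snapshots: 1000 logical reads in the interval, 100 of them from disk.
before = {"Innodb_buffer_pool_read_requests": 1000, "Innodb_buffer_pool_reads": 10}
after = {"Innodb_buffer_pool_read_requests": 2000, "Innodb_buffer_pool_reads": 110}
ratio = buffer_pool_miss_ratio(before, after)
```

A ratio that climbs noticeably while the copy is running would be consistent with hot pages being displaced, and is a signal to consider a smaller chunk time or a quieter time window.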