Alibaba Cloud RDS for MySQL: adding columns to a large table breaks replication to a self-managed replica, and how it was resolved

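The table holds about 36 million rows, and 12 columns were added in a single statement.
The change used MySQL 5.6 online DDL.
On RDS the statement took a little over 70 minutes to complete.
It used roughly 70 GB of temporary space.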

While monitoring the self-managed replica, the statement ran for about an hour and then failed with the following errors:
2018-08-06 21:16:13 7fcb4613d700 InnoDB: Error: Write to file (merge) failed at offset 31842107392.
InnoDB: 1048576 bytes should have been written, only 90112 were written.
InnoDB: Operating system error number 0.
InnoDB: Check that your OS and file system support files of this size.
InnoDB: Check also that the disk is not full or a disk quota exceeded.
InnoDB: Error number 0 means 'Success'.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/operating-system-error-codes.html
2018-08-06 21:16:13 19593 [ERROR] Slave SQL: Error 'Temporary file write failure.' on query. Default database: 'ee_ertt'. Query: 'ALTER TABLE da44sdfdftle ADD `shiftInNum` DOUBLE (16, 5) DEFAULT 0 COMMENT '璋妯?
',^M伴
 ADD `shiftInCost` DOUBLE (16, 5) DEFAULT 0.00000 COMMENT '璋妯?
 ADD `shiftOutNum` DOUBLE (16, 5) DEFAULT 0.00000 COMMENT '璋妯搴搴姘?,^M
 ADD `shiftOutCost` DOUBLE (16, 5) DEFAULT 0.00000 COMMENT '璋妯搴搴殚?^M
殒惰揣姘?,^MceiptNum` DOUBLE (16, 5) DEFAULT 0.00000 COMMENT '?
殒惰揣殚?^MeceiptCost` DOUBLE (16, 5) DEFAULT 0.00000 COMMENT '?
娑璐ф伴',^MReceiptNum` DOUBLE (16, 5) DEFAULT 0.00000 COMMENT '杩?
娑璐ч棰,^MnReceiptCost` DOUBLE (16, 5) DEFAULT 0.00000 COMMENT '杩?
彖揣姘?,^MnDeliveryNum` DOUBLE (16, 5) DEFAULT 0.00000 COMMENT '杩?
彖揣殚?^MrnDeliveryCost` DOUBLE (16, 5) DEFAULT 0.00000 COMMENT '杩?
 ADD `useReturnNum` DOUBLE (16, 5) DEFAULT 0.00000 COMMENT '棰绋?
2018-08-06 21:16:13 19593 [Warning] Slave: Temporary file write failure. Error_code: 1878
2018-08-06 21:16:13 19593 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000810' position 137229440
2018-08-06 21:30:18 7fcb3cad7700 InnoDB: Error: Write to file (merge) failed at offset 31847350272.
InnoDB: 1048576 bytes should have been written, only 978944 were written.
InnoDB: Operating system error number 0.
InnoDB: Check that your OS and file system support files of this size.
InnoDB: Check also that the disk is not full or a disk quota exceeded.
InnoDB: Error number 0 means 'Success'.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/operating-system-error-codes.html

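The errors indicate that the server ran out of space.
Running show variables like '%tmpdir%'; shows that tmpdir points to /tmp,
but the root filesystem / is only 40 GB in total.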
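
As a rough cross-check, the failure offset in the log can be converted to a size. The Python sketch below is only an illustration (the function name failed_offset_gib is hypothetical; the log line is the one quoted above). It suggests the merge file had grown to almost 30 GiB when the write failed, which is plausible for a 40 GB root filesystem that also holds other data.

```python
import re

def failed_offset_gib(log_line):
    """Return the 'failed at offset N' value of an InnoDB error line, in GiB."""
    match = re.search(r"failed at offset (\d+)", log_line)
    if match is None:
        raise ValueError("no 'failed at offset' value found in log line")
    return int(match.group(1)) / 2**30

line = ("2018-08-06 21:16:13 7fcb4613d700 InnoDB: Error: "
        "Write to file (merge) failed at offset 31842107392.")
print(round(failed_offset_gib(line), 2))  # prints 29.66
```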

Point tmpdir at a directory on a larger volume. This parameter cannot be changed online, so add it to the configuration file and restart MySQL:
tmpdir =/data/tmp

Start replication again with start slave;
At 00:51 the statement failed again, and this time mysqld crashed:
Next activation : never
2018-08-06 23:21:52 3588 [Warning] IP address '183.159.182.171' could not be resolved: Name or service not known
2018-08-06 23:22:21 3588 [Warning] IP address '118.31.44.222' could not be resolved: Name or service not known
2018-08-06 23:36:51 3588 [Warning] IP address '58.100.103.136' could not be resolved: Name or service not known
2018-08-07 00:51:45 7fc519552700 InnoDB: Error: Write to file ./ybl_erp/#sql-ib1166.ibd failed at offset 35306602496.
InnoDB: 1048576 bytes should have been written, only 307200 were written.
InnoDB: Operating system error number 0.
InnoDB: Check that your OS and file system support files of this size.
InnoDB: Check also that the disk is not full or a disk quota exceeded.
InnoDB: Error number 0 means 'Success'.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/operating-system-error-codes.html
16:51:45 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

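The exact cause is unknown, but the failure again happened while writing the temporary table for the column addition. This time the failing file is the intermediate table ./ybl_erp/#sql-ib1166.ibd in the data directory, at an offset of about 35 GB, which suggests the data volume itself was short of space. Evidently this replica cannot run the ALTER the normal way.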

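Restart MySQL.
The data in this large table does not change in real time; rows are inserted only once a day, at 23:59:59.
So the procedure was:
1. Empty the large table on the replica (truncate table).
2. Start replication so that the ALTER statement is applied to the now-empty table (start slave).
3. Stop replication and migrate the large table's data from the primary to the self-managed replica separately (with Alibaba Cloud DTS, migrating the single table of 36 million rows took about 94 minutes).
Note: binary logging on the replica was left enabled during the migration; disabling it would have made the migration faster. Also watch the binlog size during the migration and purge binlogs promptly. About 50 GB of binlogs were generated, enough to fill the disk if left alone.
4. After the migration, delete the rows that were inserted on the primary after the column addition, because replication will replay those inserts once it resumes.
5. Start replication again (start slave).
6. Verify that replication is healthy.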

Note: this approach does not work if the table's data changes in real time.

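An alternative approach:
1. Rename the table on the self-managed replica.
2. Create a new, empty table with the same structure under the original name.
3. Start replication so that the ALTER statement is applied to the new table (or create the new table with the added columns already in place).
4. Migrate the data from the renamed table back into the new table.
5. Start replication.
I did not use this approach because I did not have enough disk space: it needs more than twice the size of the large table.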

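Things to consider before adding columns to a large table:
1. Check that there is enough free space.
2. Test and estimate how long the change will take.
3. Account for replication lag on the replica.
A direct alter table can easily lock the table.
On a highly concurrent system, a direct alter table may fail to acquire its lock, which can lead to a server crash.
MySQL 5.6 supports online DDL, but it easily causes replication lag.
For any alter table operation, check the documentation for your specific version to confirm whether the operation only changes dictionary metadata or copies the table, and whether it can run online.
When in doubt, use pt-osc (pt-online-schema-change).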
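
The first item on the checklist can be automated before the ALTER runs. The following Python sketch is a rough illustration, not a definitive check: the function name has_room and the 1.2 headroom factor are assumptions, and the 70 GB figure is simply the temporary space observed on the primary in this incident. Both tmpdir and the data directory should be checked, since the rebuild writes to both.

```python
import shutil

def has_room(path, needed_bytes, headroom=1.2):
    """Return True if the filesystem holding `path` has at least
    needed_bytes * headroom bytes free (headroom is an assumed safety margin)."""
    free = shutil.disk_usage(path).free
    return free >= needed_bytes * headroom

# Roughly 70 GB of temporary space was used on the primary in this incident.
NEEDED = 70 * 10**9

for directory in ("/tmp", "/data/tmp"):
    try:
        print(directory, has_room(directory, NEEDED))
    except OSError:
        # The directory is missing or unreadable on this host.
        print(directory, "cannot be checked on this host")
```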
Replication status after recovery:
mysql> show slave status \G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: rm-bp1q8m0fbf5sjpl11ii.mysql.rds.aliyuncs.com
                  Master_User: rplslave
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000813
          Read_Master_Log_Pos: 176958300
               Relay_Log_File: relay-bin.000344
                Relay_Log_Pos: 176958470
        Relay_Master_Log_File: mysql-bin.000813
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 176958300
              Relay_Log_Space: 176958755
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 985930441
                  Master_UUID: 9c182817-050c-11e8-be72-7cd30ae00a0e
             Master_Info_File: /data/mysqldata/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 9c182817-050c-11e8-be72-7cd30ae00a0e:10322657-16590246
            Executed_Gtid_Set: 2a02886b-9373-11e7-99b0-7cd30ac33424:1-49891487,
9c182817-050c-11e8-be72-7cd30ae00a0e:1-16590246,
cabfbbf6-7ea0-11e8-bf60-00163e11a4fe:1-213319
                Auto_Position: 1
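
The health check in the last step can also be scripted. The Python sketch below parses the vertical output shown above and tests the fields that matter most. The function names parse_slave_status and replication_healthy are hypothetical, and the definition of "healthy" used here (both threads running, no error, zero lag) is an assumption that may be too strict for busy systems.

```python
import re

# A field line looks like "   Slave_IO_Running: Yes"; GTID continuation
# lines start with hex digits and hyphens, so they do not match.
FIELD = re.compile(r"^\s*([A-Za-z_]+):\s*(.*)$")

def parse_slave_status(text):
    """Parse the vertical output of show slave status into a dict."""
    status = {}
    for line in text.splitlines():
        m = FIELD.match(line)
        if m:
            status[m.group(1)] = m.group(2).strip()
    return status

def replication_healthy(status):
    """True if both threads run, no error is recorded, and lag is zero."""
    return (status.get("Slave_IO_Running") == "Yes"
            and status.get("Slave_SQL_Running") == "Yes"
            and status.get("Last_Errno") == "0"
            and status.get("Seconds_Behind_Master") == "0")

sample = """\
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
                   Last_Errno: 0
        Seconds_Behind_Master: 0
            Executed_Gtid_Set: 2a02886b-9373-11e7-99b0-7cd30ac33424:1-49891487,
9c182817-050c-11e8-be72-7cd30ae00a0e:1-16590246
"""
print(replication_healthy(parse_slave_status(sample)))  # prints True
```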
