MySQL to Greenplum Data Migration Guide

Use the open-source component mysql2pgsql to migrate data from MySQL to Greenplum (GP):

  1. Download mysql2pgsql.
    Download link

  2. Fill in the MySQL and GP connection information in the configuration file; for reference, see:
    Configuration file

The table potluck is used as the migration example throughout.

The version tested here is MySQL 5.7; the latest MySQL release at the time of writing is 8.0.

A. Full Data Migration

  1. Log in to the database: mysql -u root -p

  2. Switch to the test1 database with use test1 and look up the information of the table to be migrated:

MariaDB [test1]> DESCRIBE potluck;
+-------------+-------------+------+-----+---------+----------------+
| Field       | Type        | Null | Key | Default | Extra          |
+-------------+-------------+------+-----+---------+----------------+
| id          | int(11)     | NO   | PRI | NULL    | auto_increment |
| name        | varchar(20) | YES  |     | NULL    |                |
| food        | varchar(30) | YES  |     | NULL    |                |
| confirmed   | char(1)     | YES  |     | NULL    |                |
| signup_date | date        | YES  |     | NULL    |                |
+-------------+-------------+------+-----+---------+----------------+

MariaDB [test1]> select * from potluck;
+----+-------+----------------+-----------+-------------+
| id | name  | food           | confirmed | signup_date |
+----+-------+----------------+-----------+-------------+
|  1 | John  | Casserole      | Y         | 2012-04-11  |
|  2 | Sandy | Key Lime Tarts | N         | 2012-04-14  |
|  3 | Tom   | BBQ            | Y         | 2012-04-18  |
|  4 | Tina  | Salad          | Y         | 2012-04-10  |
+----+-------+----------------+-----------+-------------+
4 rows in set (0.00 sec)
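If the full source DDL is also wanted for reference, it can be retrieved with SHOW CREATE TABLE. Below is only a minimal sketch of DDL consistent with the DESCRIBE output above; the actual statement on your instance may differ.

-- Assumed MySQL source DDL, reconstructed from the DESCRIBE output above (illustrative only)
CREATE TABLE potluck (
  id          int(11) NOT NULL AUTO_INCREMENT,
  name        varchar(20) DEFAULT NULL,
  food        varchar(30) DEFAULT NULL,
  confirmed   char(1)     DEFAULT NULL,
  signup_date date        DEFAULT NULL,
  PRIMARY KEY (id)
);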

  3. In the bin directory of mysql2pgsql, locate and edit the my.cfg file, filling in the corresponding MySQL and Greenplum connection information:

[gpadmin@vmcentos7 bin]$ cat my.cfg

-- Connection information for the MySQL source
[src.mysql]
host = "10.132.17.64"
port = "3306"
user = "root"
password = "changeme"
db = "test1"
encodingdir = "share"
encoding = "utf8"
binlogfile = "mysql-bin.000001"
binlogfile_offset = "4"
serverid = 1

[src.pgsql]
connect_string = "host=10.132.17.65 dbname=gpnrt port=5432 user=gpadmin password=gpadmin"

[local.pgsql]
connect_string = "dbname=test port=5888 user=test password=pgsql"

-- Connection information for the Greenplum target
[desc.pgsql]
connect_string = "host=10.132.17.65 dbname=gpnrt port=5432 user=gpadmin password=gpadmin"
target_schema = "public"
ignore_copy_error_count_each_table = "0"

[binlogloader]
loader_table_list = "loader_table_list.txt"
load_batch = 10
load_batch_gap = 10

  4. Add the names of the tables to be migrated to the loader list file:

[gpadmin@vmcentos7 bin]$ cat loader_table_list.txt
potluck

  5. First generate the DDL statements for the tables to be loaded, then adjust them:
    [gpadmin@vmcentos7 bin]$ ./mysql2pgsql -l loader_table_list.txt -d -n -j 1 -s public ignore copy error count 0 each table

CREATE TABLE public.potluck (id int4, name text, food text, confirmed text, signup_date timestamp) with (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=zlib, COMPRESSLEVEL=1, BLOCKSIZE=1048576, OIDS=false) DISTRIBUTED BY ();

-- Number of tables: 1

Add the distribution key to the statement, then run the DDL in Greenplum to create the table:
CREATE TABLE public.potluck (id int4, name text, food text, confirmed text, signup_date timestamp) with (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=zlib, COMPRESSLEVEL=1, BLOCKSIZE=1048576, OIDS=false) DISTRIBUTED BY (id);
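The distribution key should normally be the primary key or another evenly distributed, high-cardinality column so that rows spread uniformly across segments. If no suitable column exists, random distribution is a possible fallback; the statement below is only an alternative sketch, not what the rest of this walkthrough uses.

-- Alternative sketch: random distribution when no natural distribution key exists
CREATE TABLE public.potluck (id int4, name text, food text, confirmed text, signup_date timestamp)
with (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=zlib, COMPRESSLEVEL=1, BLOCKSIZE=1048576, OIDS=false)
DISTRIBUTED RANDOMLY;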

  6. Run the data loading program:
    [gpadmin@vmcentos7 bin]$ ./mysql2pgsql -l loader_table_list.txt -j 1 -s public ignore copy error count 0 each table
    -- Adding table: potluck
    Starting data sync
    Query to get source data for target table potluck: select * from test1.potluck
    -- Reference DDL to create the target table:
    CREATE TABLE public.potluck (id int4, name text, food text, confirmed text, signup_date timestamp) with (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=zlib, COMPRESSLEVEL=1, BLOCKSIZE=1048576, OIDS=false) DISTRIBUTED BY () PARTITION BY RANGE () (START (date '') INCLUSIVE END (date '') EXCLUSIVE EVERY (INTERVAL '<1 month>' ));

thread 0 migrate task 0 table test1.potluck 4 rows complete, time cost 71.990 ms
Number of rows migrated: 4 (number of source tables' rows: 4)
Data sync time cost 184.033 ms

  7. The data load is complete and the data is now in Greenplum:

gpadmin=# select * from potluck;
 id | name  |      food      | confirmed |     signup_date
----+-------+----------------+-----------+---------------------
  2 | Sandy | Key Lime Tarts | N         | 2012-04-14 00:00:00
  4 | Tina  | Salad          | Y         | 2012-04-10 00:00:00
  1 | John  | Casserole      | Y         | 2012-04-11 00:00:00
  3 | Tom   | BBQ            | Y         | 2012-04-18 00:00:00
(4 rows)
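A quick sanity check is to compare row counts on both sides. A minimal sketch: run the first statement in the MySQL client and the second in psql against Greenplum; both should return 4 here.

-- On the MySQL source
SELECT count(*) FROM test1.potluck;

-- On the Greenplum target
SELECT count(*) FROM public.potluck;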

B. Incremental Data Migration

  1. Incremental data migration requires two free components, binlog_miner and binlog_loader.

  2. First, set the following parameter in MySQL so that the binlog is written in ROW format, which is required for parsing:

binlog_format = ROW
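The setting can be verified on the running instance, and SHOW MASTER STATUS also reports the current binlog file and position, which helps when filling in binlogfile and binlogfile_offset in my.cfg. Note that the binary log itself must be enabled (log_bin) for SHOW MASTER STATUS to return anything.

-- Run in the MySQL client
SHOW VARIABLES LIKE 'binlog_format';   -- should report ROW
SHOW MASTER STATUS;                    -- current binlog file name and offset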

  3. Incremental synchronization works as follows:

A temporary PG database is started on the client host (it can also be deployed on another host) to temporarily hold the binlog data pulled from MySQL.
binlog_miner starts from one binlog file of the source MySQL, pulls and parses the binlog, and stores the result in the temporary PG.
binlog_loader reads the incremental data from the temporary PG, processes it as needed, and finally writes it in batches to the target PostgreSQL or Greenplum.

The settings that specifically need to be configured here are:

my.cfg

[src.mysql]
host = "192.168.1.1"
port = "3301"
user = "test"
password = "123456"
db = "test"
encodingdir = "share"
encoding = "utf8"
binlogfile = "mysql-bin.000001"   -- name of the binlog file to start from
binlogfile_offset = "4"           -- record offset within the binlog; normally starts at 4

Connection information for the temporary database:

my.cfg

[local.pgsql]
connect_string = "dbname=test port=5432 user=test password=pgsql"
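Assuming a local PostgreSQL instance is already running on the port above, the temporary database and role referenced by [local.pgsql] could be prepared along these lines. The names are taken from the connect string; this is a sketch, not part of the original tooling.

-- Run once against the temporary local PostgreSQL instance
CREATE USER test WITH PASSWORD 'pgsql';
CREATE DATABASE test OWNER test;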

  4. To use real-time synchronization, the target table must have a primary key. The corresponding Greenplum DDL is:

CREATE TABLE public.potluck (id int4 PRIMARY KEY, name text, food text, confirmed text, signup_date timestamp) DISTRIBUTED BY (id);

  5. Start the binlog pulling process:
    nohup ./binlog_miner 1>miner.log 2>&1 &
    Then start the binlog writing process:
    nohup ./binlog_loader 1>loader.log 2>&1 &

With this in place, every kind of data change made to the specified tables in MySQL is synchronized to the target DB, including insert, update, and delete of the corresponding rows.
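For example, changes like the following on the MySQL side (hypothetical rows, shown only for illustration) would be read from the binlog by binlog_miner and applied to public.potluck in Greenplum by binlog_loader.

-- Executed on the MySQL source; the sync processes replay the effects on the target
INSERT INTO test1.potluck (name, food, confirmed, signup_date)
VALUES ('Ryan', 'Casserole', 'N', '2012-04-11');
UPDATE test1.potluck SET confirmed = 'Y' WHERE name = 'Ryan';
DELETE FROM test1.potluck WHERE id = 4;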

  6. All synchronization operations performed by these two processes are recorded in the intermediate Postgres database. Even if the pulling or writing process restarts due to an exception, the programs can resume from where they left off and continue synchronizing the tables between MySQL and GPDB.

C. Accessing MySQL Tables Directly from GPDB via External Table Mapping

  1. GPDB's external table feature combined with gpfdist's transform capability can map a MySQL table directly as a GPDB external table. The detailed steps are as follows.

  2. Write a script that uses the mysql command to output the contents of the corresponding MySQL table:

[gpadmin@vmcentos7 mysql]$ cat mysql2gp.sh

#!/bin/bash
# Called by gpfdist with the request path as the only argument, e.g. /mysql/potluck
PARA=$1
# Strip the directory part to get the table name
TBNM=${PARA##*/}
cd $(cd "$(dirname "$0")"; pwd)
rm -f /mysql/${TBNM}.csv
. ~/.bashrc
# Dump the table to CSV via SELECT ... INTO OUTFILE, stream it to stdout for gpfdist, then clean up
echo "select * from ${TBNM} into outfile '/mysql/${TBNM}.csv' fields terminated by ',' lines terminated by '\r\n'" > ${TBNM}.sql
mysql -uroot -pchangeme test1 < ${TBNM}.sql
cat /mysql/${TBNM}.csv
rm -f /mysql/${TBNM}.csv
  3. Write the transform file used by gpfdist:
    [gpadmin@vmcentos7 mysql]$ cat mysql.yaml
VERSION: 1.0
TRANSFORMATIONS:
  mysql:
    TYPE: input
    CONTENT: data
    COMMAND: /bin/bash /mysql/mysql2gp.sh %filename%
  4. Create the corresponding external table in GPDB:
    create external table ext_mysql_potluck( like potluck)
    location( 'gpfdist://192.168.234.130:8080/potluck#transform=mysql') FORMAT 'CSV' (DELIMITER ',') ENCODING 'GBK';

  5. Start gpfdist:
    gpfdist -c mysql.yaml -p 8080 -l gpfdist.log &

  6. Query the MySQL table directly through the external table:
    gpadmin_1=# select * from ext_mysql_potluck;
     id | name  |      food      | confirmed |     signup_date
    ----+-------+----------------+-----------+---------------------
      1 | John  | Casserole      | Y         | 2012-04-11 00:00:00
      2 | Sandy | Key Lime Tarts | N         | 2012-04-14 00:00:00
      3 | Tom   | BBQ            | Y         | 2012-04-18 00:00:00
      4 | Tina  | Salad          | Y         | 2012-04-10 00:00:00
      5 | ryan  | Casserole      | N         | 2012-04-11 00:00:00
      6 | Alice | Casserole      | N         | 2012-04-11 00:00:00
      7 | ryan  | Casserole      | N         | 2012-04-11 00:00:00
      8 | ryan  | Casserole      | N         | 2012-04-11 00:00:00
      9 | ryan  | Casserole      | N         | 2012-04-11 00:00:00
     10 | Alice | Casserole      | Y         | 2012-04-11 00:00:00
     11 | ryan  | Casserole      | N         | 2012-04-11 00:00:00
    (11 rows)
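Because the external table can be read like an ordinary table, the MySQL data can also be materialized into a regular Greenplum table in one statement. A sketch is shown below; potluck_copy is a hypothetical target name.

-- Materialize the external table's current contents into a regular Greenplum table
CREATE TABLE public.potluck_copy AS
SELECT * FROM ext_mysql_potluck
DISTRIBUTED BY (id);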
