AWS in Action: Syncing Data from Aurora to Redshift

What is Aurora

  • Amazon Aurora is a fully managed, cloud-based relational database service that is compatible with MySQL and PostgreSQL. Fully managed means that database administration - backups, hardware provisioning, software updates, and so on - is handled automatically.

  • Amazon Aurora delivers enterprise-grade performance.

  • Amazon Aurora offers a range of instance types with different CPU, memory, and network bandwidth configurations, so you can pick the one that fits your workload.

  • Amazon Aurora storage is elastic: it grows or shrinks automatically as needed, scaling from a minimum of 10 GB up to 64 TB in 10 GB increments without affecting database performance.

What is Redshift

  • Amazon Redshift is a fully managed, cloud-based data warehouse service. Developers can use Redshift to build, query, and maintain petabytes of data or more.

  • Every Redshift data warehouse is fully managed, which means administrative tasks such as backup creation, security, and configuration are automated.

  • Each Redshift data warehouse is made up of a group of nodes. These nodes are organized into clusters, which makes it easy to run analytical queries over large volumes of data.

  • Many existing SQL-based clients, as well as a wide range of data sources and data analysis tools, can be used with Redshift. Its stable architecture makes it easy to integrate with a variety of business intelligence tools.

  • Because Redshift is designed to handle large amounts of data, its modular design lets it scale easily, and its multi-layered structure makes it simple to process many queries at once.

  • Slices can be created from a Redshift cluster, allowing data sets to be examined at a finer granularity.

  • Redshift exposes performance metrics that can be used to track the health of the databases.

Why sync data from Aurora to Redshift

  • Aurora is a row-based database, which makes it best suited for transactional (OLTP) workloads such as web applications. Want to look up a user name by user ID? That is easy with Aurora.

  • Redshift is a column-based database designed to run complex queries over large amounts of data efficiently, which makes it best suited for analytical (OLAP) workloads such as analyzing user behavior on an e-commerce site.

  • For example, we can use Aurora to implement the core features of an e-commerce site - user management, order management, product management, and so on - and use Redshift for the query-heavy analytics, such as user behavior analysis and product sales analysis over massive data sets. The sketch below contrasts the two query styles.
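
A minimal, illustrative sketch of that contrast, using the person table that is created later in this walkthrough (the endpoints and credentials are placeholders, and the Redshift copy of the table only exists once the sync below is running):

# OLTP: point lookup by key against Aurora (MySQL protocol)
mysql -h {aurora-cluster}.ap-southeast-1.rds.amazonaws.com -P 3306 -u admin -p \
  -e "SELECT first_name, last_name FROM dev.person WHERE person_id = 1;"

# OLAP: aggregation over the synced table against Redshift (PostgreSQL protocol)
psql -h {redshift-cluster}.ap-southeast-1.redshift.amazonaws.com -p 5439 -U admin -d dev \
  -c "SELECT city, COUNT(*) AS persons FROM dev.person GROUP BY city ORDER BY persons DESC;"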

AWS Database Migration Service

  • AWS Database Migration Service (AWS DMS) is used to manage and migrate databases in the cloud.

  • With AWS DMS, customers can sync data across many popular databases, including Microsoft SQL Server, Amazon Aurora, and MySQL.

  • With AWS DMS, users can run a one-time migration or set up ongoing replication.

  • Before syncing databases, you need to create a replication instance in the cloud. This instance reads the source data and loads it into the target. The CLI sketch below shows a quick way to list what DMS supports.
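
As a quick sanity check before creating anything, the DMS API can list the endpoint engines it supports and the replication instance classes that can be provisioned. A minimal sketch (the region is an assumption):

# List the engines DMS supports as source/target endpoints (e.g. aurora, mysql, redshift)
aws dms describe-endpoint-types --region ap-southeast-1

# List the replication instance classes that can be ordered in this region
aws dms describe-orderable-replication-instances --region ap-southeast-1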

How to sync data from Aurora to Redshift with AWS DMS

  • In a typical deployment, the web servers sit in the VPC's public subnets and accept traffic from any client.

  • The user databases usually sit in the VPC's private subnets and only accept connections from the web servers in the same VPC, which keeps the databases secure.

  • The replication instance DMS uses for the data sync also only needs to live in the private subnets; it does not need to be exposed to the internet.

  • So the first thing we need is a VPC. Its architecture looks like this:

[Figure 1: VPC architecture for the Aurora-to-Redshift sync]

The following sections walk through the resources that need to be created for each part.

IAM

If you hit a permission error like the one below while creating resources, you need to grant the corresponding permissions to your user on the IAM page of the AWS web console.

User: arn:aws:iam::****:user/*** is not authorized to perform: dms:CreateReplicationSubnetGroup on resource: arn:aws:dms:ap-southeast-1:****:subgrp:aurora-to-redshift-instance because no identity-based policy allows the dms:CreateReplicationSubnetGroup action (Service: AWSDatabaseMigrationService; Status Code: 400; Error Code: AccessDeniedException; ; Proxy: null)

The permissions used in this article are:

  • AmazonS3FullAccess: AWS Managed Policy

  • AmazonVPCFullAccess: AWS Managed Policy

  • AmazonRDSFullAccess: AWS Managed Policy

  • AmazonRedshiftFullAccess: AWS Managed Policy

  • There is no AWS managed policy for DMS, so you need to create one yourself with the following content (the CLI sketch after the policy document shows one way to create and attach it):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "dms:*",
            "Resource": "*"
        }
    ]
}
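
A minimal sketch of creating that policy and attaching it from the CLI; the policy name DMSFullAccess, the file name, and the user name are placeholders:

# Create the customer-managed policy from the JSON document above (saved as dms-full-access.json)
aws iam create-policy \
  --policy-name DMSFullAccess \
  --policy-document file://dms-full-access.json

# Attach it, plus the AWS managed policies listed above, to the deploying user
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws iam attach-user-policy \
  --user-name your-iam-user \
  --policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/DMSFullAccess
aws iam attach-user-policy \
  --user-name your-iam-user \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess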

VPC

  • vpc
    • CIDR block: 10.10.0.0/16
  • internet gateway
  • public subnets
    • one per Availability Zone, three in total
  • private subnets
    • one per Availability Zone, three in total
  • public route table: the route table associated with the public subnets
    • destination: 0.0.0.0/0, target: internet-gateway-id (traffic to and from the internet)
    • destination: 10.10.0.0/16, target: local (traffic inside the VPC)
  • private route table: the route table associated with the private subnets
    • destination: 10.10.0.0/16, target: local (traffic inside the VPC)
  • web server security group
    • allow any IP to access port 443
    • allow your own IP to access port 22, so you can ssh to the server and insert data into the database
  • private dms security group
    • only allow access from inside the VPC
  • private aurora security group
    • allow the web server security group to access port 3306
    • allow the dms security group to access port 3306
  • private redshift security group
    • allow the web server security group to access port 5439
    • allow the dms security group to access port 5439

Serverless file for all of the VPC resources:

  • replace custom:bucketNamePrefix with your own bucket name

  • replace resources:Parameters:MyIP with your own public IP

service: aurora-to-redshift-vpc

custom:
  bucketNamePrefix: "bucketName"

provider:
  name: aws
  region: ${opt:region, "ap-southeast-1"}
  stackName: ${self:service}
  deploymentBucket:
    name: com.${self:custom.bucketNamePrefix}.deploy-bucket
    serverSideEncryption: AES256

resources:
  Parameters:
    VpcName:
      Type: String
      Default: "test-vpc"
    MyIP:
      Type: String
      Default: "your public ip"

  Resources:
    VPC:
      Type: "AWS::EC2::VPC"
      Properties:
        CidrBlock: "10.10.0.0/16"
        EnableDnsSupport: true
        EnableDnsHostnames: true
        InstanceTenancy: default
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}"
    # Internet Gateway
    InternetGateway:
      Type: "AWS::EC2::InternetGateway"
      Properties:
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_InternetGateway"
    VPCGatewayAttachment:
      Type: "AWS::EC2::VPCGatewayAttachment"
      Properties:
        VpcId: !Ref VPC
        InternetGatewayId: !Ref InternetGateway

    # web server security group
    WebServerSecurityGroup:
      Type: AWS::EC2::SecurityGroup
      Properties:
        GroupDescription: Allow access from public
        VpcId: !Ref VPC
        SecurityGroupIngress:
          - IpProtocol: tcp
            FromPort: 443
            ToPort: 443
            CidrIp: "0.0.0.0/0"
          - IpProtocol: tcp
            FromPort: 22
            ToPort: 22
            CidrIp: !Sub "${MyIP}/32"
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_WebServerSecurityGroup"

    # public route table
    RouteTablePublic:
      Type: "AWS::EC2::RouteTable"
      Properties:
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_RouteTablePublic"
    RouteTablePublicInternetRoute:
      Type: "AWS::EC2::Route"
      DependsOn: VPCGatewayAttachment
      Properties:
        RouteTableId: !Ref RouteTablePublic
        DestinationCidrBlock: "0.0.0.0/0"
        GatewayId: !Ref InternetGateway

    # public subnet
    SubnetAPublic:
      Type: "AWS::EC2::Subnet"
      Properties:
        AvailabilityZone: !Select [0, !GetAZs ""]
        CidrBlock: "10.10.0.0/24"
        MapPublicIpOnLaunch: true
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_SubnetAPublic"
    RouteTableAssociationAPublic:
      Type: "AWS::EC2::SubnetRouteTableAssociation"
      Properties:
        SubnetId: !Ref SubnetAPublic
        RouteTableId: !Ref RouteTablePublic

    SubnetBPublic:
      Type: "AWS::EC2::Subnet"
      Properties:
        AvailabilityZone: !Select [1, !GetAZs ""]
        CidrBlock: "10.10.32.0/24"
        MapPublicIpOnLaunch: true
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_SubnetBPublic"
    RouteTableAssociationBPublic:
      Type: "AWS::EC2::SubnetRouteTableAssociation"
      Properties:
        SubnetId: !Ref SubnetBPublic
        RouteTableId: !Ref RouteTablePublic

    SubnetCPublic:
      Type: "AWS::EC2::Subnet"
      Properties:
        AvailabilityZone: !Select [2, !GetAZs ""]
        CidrBlock: "10.10.64.0/24"
        MapPublicIpOnLaunch: true
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_SubnetCPublic"
    RouteTableAssociationCPublic:
      Type: "AWS::EC2::SubnetRouteTableAssociation"
      Properties:
        SubnetId: !Ref SubnetCPublic
        RouteTableId: !Ref RouteTablePublic

    # instance security group
    PrivateDMSInstanceSecurityGroup:
      Type: AWS::EC2::SecurityGroup
      Properties:
        GroupDescription: Allow HTTPS access from inside the VPC
        VpcId: !Ref VPC
        SecurityGroupIngress:
          - IpProtocol: tcp
            FromPort: 443
            ToPort: 443
            CidrIp: "10.10.0.0/16"
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_PrivateDMSInstanceSecurityGroup"
    # aurora security group
    PrivateAuroraSecurityGroup:
      Type: AWS::EC2::SecurityGroup
      Properties:
        GroupDescription: Allow access from the dms instance and web server security groups
        VpcId: !Ref VPC
        SecurityGroupIngress:
          - IpProtocol: tcp
            FromPort: 3306
            ToPort: 3306
            SourceSecurityGroupId: !Ref PrivateDMSInstanceSecurityGroup
          - IpProtocol: tcp
            FromPort: 3306
            ToPort: 3306
            SourceSecurityGroupId: !Ref WebServerSecurityGroup
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_PrivateAuroraSecurityGroup"
    # redshift security group
    PrivateRedshiftSecurityGroup:
      Type: AWS::EC2::SecurityGroup
      Properties:
        GroupDescription: Allow access from the dms instance and web server security groups
        VpcId: !Ref VPC
        SecurityGroupIngress:
          - IpProtocol: tcp
            FromPort: 5439
            ToPort: 5439
            SourceSecurityGroupId: !Ref PrivateDMSInstanceSecurityGroup
          - IpProtocol: tcp
            FromPort: 5439
            ToPort: 5439
            SourceSecurityGroupId: !Ref WebServerSecurityGroup
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_PrivateRedshiftSecurityGroup"

    # private route table
    RouteTablePrivate:
      Type: "AWS::EC2::RouteTable"
      Properties:
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_RouteTablePrivate"

    # private subnet
    SubnetAPrivate:
      Type: "AWS::EC2::Subnet"
      Properties:
        AvailabilityZone: !Select [0, !GetAZs ""]
        CidrBlock: "10.10.16.0/24"
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_SubnetAPrivate"
    RouteTableAssociationAPrivate:
      Type: "AWS::EC2::SubnetRouteTableAssociation"
      Properties:
        SubnetId: !Ref SubnetAPrivate
        RouteTableId: !Ref RouteTablePrivate
    SubnetBPrivate:
      Type: "AWS::EC2::Subnet"
      Properties:
        AvailabilityZone: !Select [1, !GetAZs ""]
        CidrBlock: "10.10.48.0/24"
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_SubnetBPrivate"
    RouteTableAssociationBPrivate:
      Type: "AWS::EC2::SubnetRouteTableAssociation"
      Properties:
        SubnetId: !Ref SubnetBPrivate
        RouteTableId: !Ref RouteTablePrivate
    SubnetCPrivate:
      Type: "AWS::EC2::Subnet"
      Properties:
        AvailabilityZone: !Select [2, !GetAZs ""]
        CidrBlock: "10.10.80.0/24"
        VpcId: !Ref VPC
        Tags:
          - Key: Name
            Value: !Sub "VPC_${VpcName}_SubnetCPrivate"
    RouteTableAssociationCPrivate:
      Type: "AWS::EC2::SubnetRouteTableAssociation"
      Properties:
        SubnetId: !Ref SubnetCPrivate
        RouteTableId: !Ref RouteTablePrivate

  Outputs:
    VPC:
      Description: "VPC."
      Value: !Ref VPC
      Export:
        Name: !Sub "${self:provider.stackName}"
    SubnetsPublic:
      Description: "Subnets public."
      Value:
        !Join [
          ",",
          [!Ref SubnetAPublic, !Ref SubnetBPublic, !Ref SubnetCPublic],
        ]
      Export:
        Name: !Sub "${self:provider.stackName}-PublicSubnets"
    SubnetsPrivate:
      Description: "Subnets private."
      Value:
        !Join [
          ",",
          [!Ref SubnetAPrivate, !Ref SubnetBPrivate, !Ref SubnetCPrivate],
        ]
      Export:
        Name: !Sub "${self:provider.stackName}-PrivateSubnets"
    DefaultSecurityGroup:
      Description: "VPC Default Security Group"
      Value: !GetAtt VPC.DefaultSecurityGroup
      Export:
        Name: !Sub "${self:provider.stackName}-DefaultSecurityGroup"
    WebServerSecurityGroup:
      Description: "VPC Web Server Security Group"
      Value: !Ref WebServerSecurityGroup
      Export:
        Name: !Sub "${self:provider.stackName}-WebServerSecurityGroup"
    PrivateDMSInstanceSecurityGroup:
      Description: "The id of the InstanceSecurityGroup"
      Value: !Ref PrivateDMSInstanceSecurityGroup
      Export:
        Name: !Sub "${self:provider.stackName}-PrivateDMSInstanceSecurityGroup"
    PrivateAuroraSecurityGroup:
      Description: "The id of the AuroraSecurityGroup"
      Value: !Ref PrivateAuroraSecurityGroup
      Export:
        Name: !Sub "${self:provider.stackName}-PrivateAuroraSecurityGroup"
    PrivateRedshiftSecurityGroup:
      Description: "The id of the RedshiftSecurityGroup"
      Value: !Ref PrivateRedshiftSecurityGroup
      Export:
        Name: !Sub "${self:provider.stackName}-PrivateRedshiftSecurityGroup"
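
Each serverless file in this article is deployed the same way. A minimal sketch, assuming the Serverless Framework is installed and the deployment bucket (com.<bucketNamePrefix>.deploy-bucket) already exists; the resulting stack can then be checked with the AWS CLI:

# Deploy the VPC stack (run from the directory containing this serverless.yml)
serverless deploy --region ap-southeast-1

# Confirm the VPC was created and list the values exported for the other stacks
aws ec2 describe-vpcs --filters "Name=tag:Name,Values=VPC_test-vpc" \
  --query "Vpcs[].VpcId" --output text
aws cloudformation list-exports \
  --query "Exports[?starts_with(Name, 'aurora-to-redshift-vpc')].[Name,Value]" --output table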

Aurora

  • DB Cluster Parameter Group
    • To sync data out of Aurora with DMS, the binlog-related settings must be enabled. In the default cluster parameter group the binlog is disabled, and the default group cannot be modified.
    • So we have to create our own cluster parameter group and set the binlog parameters (a quick way to verify these settings from a mysql session is sketched after the data-insertion SQL below).
    • Family: aurora-mysql8.0
    • Parameters:
      • binlog_format: "ROW"
      • binlog_row_image: "Full"
      • binlog_checksum: "NONE"
  • DB Parameter Group
    • Family: aurora-mysql8.0
  • Subnet group
    • create a private subnet group containing the private subnets
  • DB Cluster
    • subnet group: the private subnet group
    • VpcSecurityGroupIds: the private aurora security group
    • Engine: aurora-mysql
    • EngineVersion: 8.0.mysql_aurora.3.02.2
    • the supported engine versions can be listed with:

aws rds describe-db-engine-versions --engine aurora-mysql --query "DBEngineVersions[].EngineVersion"

  • DB Cluster Instance
    • Engine: aurora-mysql
    • DBInstanceClass: db.t3.medium
    • the supported instance classes can be listed with:

aws rds describe-orderable-db-instance-options --engine aurora-mysql --engine-version "8.0.mysql_aurora.3.02.2" --query "OrderableDBInstanceOptions[].DBInstanceClass"

Serverless file for all of the Aurora resources:

  • replace custom:bucketNamePrefix with your own bucket name

service: aurora-to-redshift-aurora

custom:
  bucketNamePrefix: "bucketName"

provider:
  name: aws
  region: ${opt:region, "ap-southeast-1"}
  stackName: ${self:service}
  deploymentBucket:
    name: com.${self:custom.bucketNamePrefix}.deploy-bucket
    serverSideEncryption: AES256

resources:
  Parameters:
    ServiceName:
      Type: String
      Default: aurora-to-redshift

  Resources:
    DBClusterParameterGroup:
      Type: "AWS::RDS::DBClusterParameterGroup"
      Properties:
        DBClusterParameterGroupName: cluster-aurora-mysql8-with-binlog
        Description: DB Cluster Parameter Group with aurora-mysql8.0 engine, and binlog enabled
        Family: aurora-mysql8.0
        Parameters:
          binlog_format: "ROW"
          binlog_row_image: "Full"
          binlog_checksum: "NONE"
    DBParameterGroup:
      Type: "AWS::RDS::DBParameterGroup"
      Properties:
        DBParameterGroupName: instance-mysql8
        Description: DB Instance Parameter Group
        Family: aurora-mysql8.0
    PrivateDBSubnetGroup:
      Type: "AWS::RDS::DBSubnetGroup"
      Properties:
        DBSubnetGroupName: private-subnet
        DBSubnetGroupDescription: "private subnet"
        SubnetIds:
          Fn::Split:
            - ","
            - Fn::ImportValue: !Sub "${ServiceName}-vpc-PrivateSubnets"
    RDSCluster:
      Type: "AWS::RDS::DBCluster"
      Properties:
        DBClusterIdentifier: test-data-sync-aurora
        DBClusterParameterGroupName:
          Ref: DBClusterParameterGroup
        DBSubnetGroupName:
          Ref: PrivateDBSubnetGroup
        VpcSecurityGroupIds:
          - Fn::ImportValue: !Sub "${ServiceName}-vpc-PrivateAuroraSecurityGroup"
        Engine: aurora-mysql
        EngineVersion: 8.0.mysql_aurora.3.02.2
        DatabaseName: dev
        MasterUserPassword: aurora_admin_2022
        MasterUsername: admin
    RDSDBInstance:
      Type: "AWS::RDS::DBInstance"
      Properties:
        DBClusterIdentifier:
          Ref: RDSCluster
        DBInstanceIdentifier: test-data-sync-aurora-instance
        DBInstanceClass: db.t3.medium
        DBParameterGroupName:
          Ref: DBParameterGroup
        DBSubnetGroupName:
          Ref: PrivateDBSubnetGroup
        Engine: aurora-mysql
  Outputs:
    AuroraEndpoint:
      Description: "Aurora endpoint"
      Value: !GetAtt RDSCluster.Endpoint.Address
      Export:
        Name: !Sub "${self:provider.stackName}-AuroraEndpoint"
    AuroraPort:
      Description: "Aurora port"
      Value: !GetAtt RDSCluster.Endpoint.Port
      Export:
        Name: !Sub "${self:provider.stackName}-AuroraPort"

  • After the resources are created, connect to the database and run the SQL below to create some data.

  • Connecting to the database requires a server in one of the VPC's public subnets. The creation steps are documented here: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_Tutorials.WebServerDB.LaunchEC2.html
    • vpc: the VPC we created in the steps above
    • subnet: any public subnet
    • security group: the web server security group
    • key pair: create a key pair named tutorial-ec2-instance-web-server and choose the pem (ssh) format

  • Connect to the web server (see https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Connecting.html#Aurora.Connecting.AuroraMySQL):
    • open a terminal or an SSH client
    • locate the file downloaded when the key pair was created: tutorial-ec2-instance-web-server.pem
    • run the following command to restrict the key file's permissions

chmod 400 tutorial-ec2-instance-web-server.pem

  • Connect to the EC2 instance using its public DNS. You can open the instance's page in the AWS web console and click Connect in the top-right corner to see the exact connection command.

ssh -i "tutorial-ec2-instance-web-server.pem" ec2-user@${ec2_public_dns}.ap-southeast-1.compute.amazonaws.com

[Figure 2: connecting to the EC2 instance]

  • Once connected, install the mysql client

sudo yum install mysql

  • Connect to Aurora with the following command and enter the password when prompted

mysql -h {aurora-cluster}.ap-southeast-1.rds.amazonaws.com -P 3306 -u admin -p

  • Run the following SQL to insert data

USE dev;
CREATE TABLE person (
    person_id int,
    last_name varchar(255),
    first_name varchar(255),
    city varchar(255)
);
INSERT INTO person VALUES (1, 'win', 'jessica', 'sh');
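
Since DMS change data capture depends on the binlog settings configured in the custom cluster parameter group above, it can be worth confirming them from the same web server before moving on; a minimal check:

# Expect binlog_format=ROW, binlog_row_image=FULL, binlog_checksum=NONE
mysql -h {aurora-cluster}.ap-southeast-1.rds.amazonaws.com -P 3306 -u admin -p \
  -e "SHOW VARIABLES WHERE Variable_name IN ('binlog_format','binlog_row_image','binlog_checksum');"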

Redshift

  • Cluster subnet group
    • create a private subnet group containing the private subnets
  • Cluster
    • ClusterSubnetGroupName: the private subnet group
    • VpcSecurityGroupIds: the private redshift security group
    • NodeType: dc2.large
    • ClusterType: single-node

Serverless file for all of the Redshift resources:

  • replace custom:bucketNamePrefix with your own bucket name

service: aurora-to-redshift-redshift

custom:
  bucketNamePrefix: "bucketName"

provider:
  name: aws
  region: ${opt:region, "ap-southeast-1"}
  stackName: ${self:service}
  deploymentBucket:
    name: com.${self:custom.bucketNamePrefix}.deploy-bucket
    serverSideEncryption: AES256

resources:
  Parameters:
    ServiceName:
      Type: String
      Default: aurora-to-redshift

  Resources:
    ClusterSubnetGroup:
      Type: "AWS::Redshift::ClusterSubnetGroup"
      Properties:
        Description: Private Cluster Subnet Group
        SubnetIds:
          Fn::Split:
            - ","
            - Fn::ImportValue: !Sub ${ServiceName}-vpc-PrivateSubnets
        Tags:
          - Key: Name
            Value: private-subnet
    Cluster:
      Type: "AWS::Redshift::Cluster"
      Properties:
        ClusterIdentifier: test-data-sync-redshift
        ClusterSubnetGroupName: !Ref ClusterSubnetGroup
        VpcSecurityGroupIds:
          - Fn::ImportValue: !Sub ${ServiceName}-vpc-PrivateRedshiftSecurityGroup
        DBName: dev
        MasterUsername: admin
        MasterUserPassword: Redshift_admin_2022
        NodeType: dc2.large
        ClusterType: single-node
  Outputs:
    RedshiftEndpoint:
      Description: "Redshift endpoint"
      Value: !GetAtt Cluster.Endpoint.Address
      Export:
        Name: !Sub "${self:provider.stackName}-RedshiftEndpoint"
    RedshiftPort:
      Description: "Redshift port"
      Value: !GetAtt Cluster.Endpoint.Port
      Export:
        Name: !Sub "${self:provider.stackName}-RedshiftPort"
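
Once the Redshift stack is deployed, you can confirm the cluster endpoint from the CLI and, from the EC2 web server created earlier, check that the cluster is reachable inside the VPC. A minimal sketch (the PostgreSQL client package name may differ depending on the AMI):

# Look up the cluster endpoint and status
aws redshift describe-clusters --cluster-identifier test-data-sync-redshift \
  --query "Clusters[0].[Endpoint.Address,Endpoint.Port,ClusterStatus]" --output table

# From the EC2 web server: install a PostgreSQL client and connect to Redshift
sudo yum install postgresql
psql -h {redshift-cluster}.ap-southeast-1.redshift.amazonaws.com -p 5439 -U admin -d dev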

DMS

  • Subnet group
    • create a private subnet group containing the private subnets
  • Replication instance
    • ReplicationInstanceClass: dms.t2.small
    • ReplicationSubnetGroupIdentifier: the private subnet group
    • VpcSecurityGroupIds: the dms security group
  • Source endpoint
    • EndpointType: source
    • EngineName: aurora
    • ServerName: the Aurora cluster endpoint
    • Port: 3306
  • Target endpoint
    • EndpointType: target
    • EngineName: redshift
    • ServerName: the Redshift cluster endpoint
    • Port: 5439
  • Database migration task
    • MigrationType: full-load-and-cdc
    • ReplicationInstanceArn
    • TableMappings: "{ \"rules\": [ { \"rule-type\": \"selection\", \"rule-id\": \"1\", \"rule-name\": \"1\", \"object-locator\": { \"schema-name\": \"dev\", \"table-name\": \"%\" }, \"rule-action\": \"include\" } ] }"
    • ReplicationTaskSettings: "{ \"FullLoadSettings\": {\"TargetTablePrepMode\": \"DO_NOTHING\"}}"

Serverless file for all of the DMS resources:

  • replace custom:bucketNamePrefix with your own bucket name

service: aurora-to-redshift-dms

custom:
  bucketNamePrefix: "bucketName"

provider:
  name: aws
  region: ${opt:region, "ap-southeast-1"}
  stackName: ${self:service}
  deploymentBucket:
    name: com.${self:custom.bucketNamePrefix}.deploy-bucket
    serverSideEncryption: AES256

resources:
  Parameters:
    ServiceName:
      Type: String
      Default: aurora-to-redshift

  Resources:
    ReplicationSubnetGroup:
      Type: "AWS::DMS::ReplicationSubnetGroup"
      Properties:
        ReplicationSubnetGroupDescription: instance to sync data from aurora to redshift
        ReplicationSubnetGroupIdentifier: aurora-to-redshift-instance
        SubnetIds:
          Fn::Split:
            - ","
            - Fn::ImportValue: !Sub ${ServiceName}-vpc-PrivateSubnets
    ReplicationInstance:
      Type: "AWS::DMS::ReplicationInstance"
      Properties:
        ReplicationInstanceIdentifier: aurora-to-redshift-instance
        ReplicationInstanceClass: dms.t2.small
        ReplicationSubnetGroupIdentifier: !Ref ReplicationSubnetGroup
        PubliclyAccessible: false
        VpcSecurityGroupIds:
          - Fn::ImportValue: !Sub ${ServiceName}-vpc-PrivateDMSInstanceSecurityGroup
    AuroraSourceEndpoint:
      Type: "AWS::DMS::Endpoint"
      Properties:
        EndpointIdentifier: aurora
        EndpointType: source
        ServerName:
          Fn::ImportValue: !Sub ${ServiceName}-aurora-AuroraEndpoint
        Port:
          Fn::ImportValue: !Sub ${ServiceName}-aurora-AuroraPort
        Username: admin
        Password: aurora_admin_2022
        EngineName: aurora
    RedshiftTargetEndpoint:
      Type: "AWS::DMS::Endpoint"
      Properties:
        EndpointIdentifier: redshift
        EndpointType: target
        ServerName:
          Fn::ImportValue: !Sub ${ServiceName}-redshift-RedshiftEndpoint
        Port:
          Fn::ImportValue: !Sub ${ServiceName}-redshift-RedshiftPort
        DatabaseName: dev
        Username: admin
        Password: Redshift_admin_2022
        EngineName: redshift
    ReplicationTask:
      Type: "AWS::DMS::ReplicationTask"
      Properties:
        ReplicationTaskIdentifier: aurora-to-redshift-task
        MigrationType: full-load-and-cdc
        ReplicationInstanceArn: !Ref ReplicationInstance
        SourceEndpointArn: !Ref AuroraSourceEndpoint
        TableMappings: '{ "rules": [ { "rule-type": "selection", "rule-id": "1", "rule-name": "1", "object-locator": { "schema-name": "dev", "table-name": "%" }, "rule-action": "include" } ] }'
        TargetEndpointArn: !Ref RedshiftTargetEndpoint
        ReplicationTaskSettings: '{ "FullLoadSettings": {"TargetTablePrepMode": "DO_NOTHING"}}'
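
Before running the migration task, it can be useful to verify that the replication instance can reach both endpoints. A minimal sketch using the DMS test-connection API; the ARNs are placeholders and can be looked up with aws dms describe-replication-instances and aws dms describe-endpoints, or in the DMS console:

# Kick off connection tests from the replication instance to each endpoint
aws dms test-connection \
  --replication-instance-arn ${REPLICATION_INSTANCE_ARN} \
  --endpoint-arn ${AURORA_SOURCE_ENDPOINT_ARN}
aws dms test-connection \
  --replication-instance-arn ${REPLICATION_INSTANCE_ARN} \
  --endpoint-arn ${REDSHIFT_TARGET_ENDPOINT_ARN}

# Poll the results; Status should become "successful"
aws dms describe-connections \
  --query "Connections[].[EndpointIdentifier,Status,LastFailureMessage]" --output table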

Data sync

  • After all of the resources above are created, you can run the DMS database migration task from the AWS web console; once it succeeds, its status will be "Load complete, replication ongoing". A CLI alternative is sketched after this list.

[Figure 3: DMS migration task status]

  • After the full load completes, you can use the Redshift query editor to confirm that the data has been synced.

  • Keep modifying, adding, and deleting MySQL data, then check in the Redshift query editor that the changes are replicated as well.
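
The same task can also be started and monitored from the CLI; a minimal sketch, with the task ARN as a placeholder and the final check run from the EC2 web server via psql (by default DMS maps the source schema dev to a Redshift schema of the same name):

# Start the full-load-and-cdc task
aws dms start-replication-task \
  --replication-task-arn ${REPLICATION_TASK_ARN} \
  --start-replication-task-type start-replication

# Watch the task and per-table progress
aws dms describe-replication-tasks \
  --query "ReplicationTasks[].[ReplicationTaskIdentifier,Status]" --output table
aws dms describe-table-statistics --replication-task-arn ${REPLICATION_TASK_ARN}

# Verify the data on the Redshift side (from the EC2 web server)
psql -h {redshift-cluster}.ap-southeast-1.redshift.amazonaws.com -p 5439 -U admin -d dev \
  -c "SELECT * FROM dev.person;"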

Other sync options

Amazon Aurora zero-ETL integration with Amazon Redshift

  • Amazon is about to release a new Aurora feature: zero-ETL integration with Amazon Redshift. It has not been generally released yet; it is in preview and available in the US East (N. Virginia) Region. The following is taken from Amazon's announcement:

  • Amazon Aurora now supports zero-ETL integration with Amazon Redshift, enabling near real-time analytics and machine learning (ML) with Amazon Redshift on petabytes of transactional data from Aurora. Within seconds of transactional data being written to Aurora, the data is available in Amazon Redshift, so you do not have to build and maintain complex data pipelines to perform extract, transform, and load (ETL) operations.

  • This zero-ETL integration also lets you analyze data from multiple Aurora database clusters in the same new or existing Amazon Redshift instance to derive holistic insights across many applications or partitions. With near real-time access to transactional data, you can leverage Amazon Redshift's analytics and capabilities, such as built-in ML, materialized views, data sharing, and federated access to multiple data stores and data lakes, to derive insights from transactional and other data.

  • Amazon Aurora zero-ETL integration with Amazon Redshift is available in limited preview for Amazon Aurora MySQL 3 with MySQL 8.0 compatibility in the US East (N. Virginia) Region.

AWS Data Pipeline

  • AWS Data Pipeline also supports ongoing sync from Aurora to Redshift.

  • Disadvantages compared with DMS:
    • AWS Data Pipeline is currently only available in the following regions:

[Figure 4: regions where AWS Data Pipeline is available]

    • one AWS Data Pipeline can only sync a single Aurora table, and the Aurora table name must match exactly

  • Advantages compared with DMS:
    • the sync supports a custom query, so the source table data can be filtered before it is written to Redshift
AWS Glue

  • AWS Glue also supports syncing from Aurora to Redshift.

  • Disadvantages compared with DMS:
    • one AWS Glue job can only sync a single Aurora table
    • ongoing sync is not supported, only full-table loads
    • the schemas of the source and target tables have to be managed in Glue

  • Advantages compared with DMS:
    • mapping rules can be applied to columns during the sync: column values can be transformed according to rules, columns can be dropped (the target table can have fewer columns than the source), and target columns can be named differently from the source columns

How to choose

  • If you need ongoing sync and the region you are using does not support AWS Data Pipeline, AWS DMS is the only option.

  • If you need ongoing sync and the region does support AWS Data Pipeline, pick the service based on your other requirements; for example, if you need to filter the data, AWS Data Pipeline is a good fit.

  • If you only need a one-time full-table sync with no data transformation, any of the three will do.

  • If you only need a one-time full-table sync but need to transform the data along the way, choose AWS Glue.

GitHub

https://github.com/JessicaWin/aws-in-action/tree/master/aurora-to-redshift
