Build multiclass classification models with Amazon Redshift ML

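Amazon Redshift ML simplifies machine learning (ML) by letting you create and train ML models on data in Amazon Redshift using simple SQL statements. You can use Amazon Redshift ML to solve binary classification, multiclass classification, and regression problems, and you can use techniques such as AutoML or XGBoost directly.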

Amazon Redshift ML
https://aws.amazon.com/redshi...
Amazon Redshift
http://aws.amazon.com/redshift

This post is part of a series on Amazon Redshift ML. For more information about building regression models with Amazon Redshift ML, see Build regression models with Amazon Redshift ML.

Build regression models with Amazon Redshift ML
https://aws.amazon.com/blogs/...

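You can use Amazon Redshift ML to automate data preparation, preprocessing, and the selection of the problem type, as described in this blog post. We assume that you know your data well and know which problem type best fits your use case. This post focuses on creating models in Amazon Redshift using the multiclass classification problem type, which involves at least three classes. For example, you can predict whether a transaction will be fraudulent, failed, or successful; whether a customer will remain active for 3, 6, 9, or 12 months; or whether a news item should be tagged as sports, world news, or business.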

Blog post
https://aws.amazon.com/blogs/...

Prerequisites

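As a prerequisite for implementing this solution, you need to set up an Amazon Redshift cluster with machine learning (ML) enabled. For the preparation steps, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML: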
https://aws.amazon.com/blogs/...

Use case

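In our use case, we want to identify the most active customers for a special customer loyalty program. We use Amazon Redshift ML and a multiclass classification model to predict how many months out of a 13-month period a customer will be active. This translates into up to 13 possible classes, which makes it a good fit for multiclass classification. Customers predicted to be active for 7 months or more are the target group for the special loyalty program.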

Input raw data

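To prepare the raw data for this model, we populate the ecommerce_sales table in Amazon Redshift with the public E-Commerce Sales Forecast dataset, which includes sales data from a UK-based online retailer.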

E-Commerce Sales Forecast
https://www.kaggle.com/alluni...

Run the following statements to load the data into Amazon Redshift:

CREATE TABLE IF NOT EXISTS ecommerce_sales
(
    invoiceno VARCHAR(30)   
    ,stockcode VARCHAR(30)   
    ,description VARCHAR(60)    
    ,quantity DOUBLE PRECISION   
    ,invoicedate VARCHAR(30)    
    ,unitprice    DOUBLE PRECISION
    ,customerid BIGINT    
    ,country VARCHAR(25)    
)
;

COPY ecommerce_sales
FROM 's3://redshift-ml-multiclass/ecommerce_data.txt'
IAM_ROLE '<your-amazon-redshift-sagemaker-iam-role-arn>' DELIMITER '\t' IGNOREHEADER 1 REGION 'us-east-1' MAXERROR 100;

To reproduce this script in your environment, replace your-amazon-redshift-sagemaker-iam-role-arn with the ARN of the AWS Identity and Access Management (IAM) role for your Amazon Redshift cluster.
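Because the COPY command above tolerates up to 100 bad rows (maxerror 100), it can be worth confirming how many rows were loaded and whether any were rejected. The following is a minimal sketch and not part of the original walkthrough; it assumes the standard Amazon Redshift system table STL_LOAD_ERRORS is visible to your user.

```sql
-- Count the rows that COPY loaded into the raw table.
select count(*) as loaded_rows
from ecommerce_sales;

-- List the most recent rejected rows, if any, for the source file.
-- Assumption: your user can read stl_load_errors.
select starttime, filename, line_number, colname, err_reason
from stl_load_errors
where filename like '%ecommerce_data.txt%'
order by starttime desc
limit 20;
```

If the second query returns rows, the err_reason column usually indicates which column or value caused the rejection.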

AWS Identity and Access Management
http://aws.amazon.com/iam

Data preparation for the machine learning (ML) model

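Now that our dataset is loaded, we can optionally split the data into three sets for training (80%), validation (10%), and prediction (10%). Note that Amazon Redshift ML Autopilot automatically splits the data into training and validation sets, but splitting it here as well lets you independently verify the accuracy of the model. In addition, we calculate the number of months each customer has been active, because we want the model to predict that value for new data. We use the random function in our SQL statements to split the data. See the following code: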

create table ecommerce_sales_data as (
  select
    t1.stockcode,
    t1.description,
    t1.invoicedate,
    t1.customerid,
    t1.country,
    t1.sales_amt,
    cast(random() * 100 as int) as data_group_id
  from
    (
      select
        stockcode,
        description,
        invoicedate,
        customerid,
        country,
        sum(quantity * unitprice) as sales_amt
      from
        ecommerce_sales
      group by
        1,
        2,
        3,
        4,
        5
    ) t1
);
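As a quick sanity check on the split, the row share of each bucket can be counted before the three tables are created. This is a sketch, not part of the original walkthrough; the bucket boundaries mirror the filters used for the training, validation, and prediction tables. Because group IDs are assigned at random and the validation filter (between 80 and 90) includes both endpoints, expect shares that are only approximately 80/10/10.

```sql
-- Count rows per split bucket and express each as a percentage of all rows.
with bucket_counts as (
  select
    case
      when data_group_id < 80 then 'training'
      when data_group_id between 80 and 90 then 'validation'
      else 'prediction'
    end as split_name,
    count(*) as row_count
  from ecommerce_sales_data
  group by 1
)
select
  split_name,
  row_count,
  round(100.0 * row_count / (select sum(row_count) from bucket_counts), 2) as pct_of_rows
from bucket_counts
order by split_name;
```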

Training set

create table ecommerce_sales_training as (
  select
    a.customerid,
    a.country,
    a.stockcode,
    a.description,
    a.invoicedate,
    a.sales_amt,
    (b.nbr_months_active) as nbr_months_active
  from
    ecommerce_sales_data a
    inner join (
      select
        customerid,
        count(
          distinct(
            DATE_PART(y, cast(invoicedate as date)) || '-' || LPAD(
              DATE_PART(mon, cast(invoicedate as date)),
              2,
              '00'
            )
          )
        ) as nbr_months_active
      from
        ecommerce_sales_data
      group by
        1
    ) b on a.customerid = b.customerid
  where
    a.data_group_id < 80
);
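Since the target can take up to 13 distinct values, it may help to look at how customers are distributed across those classes before training. The query below is a sketch that uses only the training table created above; heavily under-represented classes are something to keep in mind when interpreting accuracy later.

```sql
-- Number of distinct customers and rows per target class in the training set.
select
  nbr_months_active,
  count(distinct customerid) as customers,
  count(*) as training_rows
from ecommerce_sales_training
group by nbr_months_active
order by nbr_months_active;
```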

Validation set

create table ecommerce_sales_validation as (
  select
    a.customerid,
    a.country,
    a.stockcode,
    a.description,
    a.invoicedate,
    a.sales_amt,
    (b.nbr_months_active) as nbr_months_active
  from
    ecommerce_sales_data a
    inner join (
      select
        customerid,
        count(
          distinct(
            DATE_PART(y, cast(invoicedate as date)) || '-' || LPAD(
              DATE_PART(mon, cast(invoicedate as date)),
              2,
              '00'
            )
          )
        ) as nbr_months_active
      from
        ecommerce_sales_data
      group by
        1
    ) b on a.customerid = b.customerid
  where
    a.data_group_id between 80
    and 90
);

Prediction set

create table ecommerce_sales_prediction as (
  select
    customerid,
    country,
    stockcode,
    description,
    invoicedate,
    sales_amt
  from
    ecommerce_sales_data
  where
    data_group_id > 90);

Create the model in Amazon Redshift

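Now that we have created the training and validation datasets, we can use the create model statement in Amazon Redshift to create our ML model with the MULTICLASS_CLASSIFICATION problem type. We specify the problem type and let AutoML take care of everything else. In this model, the target we want to predict is nbr_months_active. Amazon SageMaker creates the function predict_customer_activity, which we use for inference in Amazon Redshift. See the following code: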

create model ecommerce_customer_activity
from
  (
    select
      customerid,
      country,
      stockcode,
      description,
      invoicedate,
      sales_amt,
      nbr_months_active
    from ecommerce_sales_training
  )
  TARGET nbr_months_active FUNCTION predict_customer_activity
  IAM_ROLE '<your-amazon-redshift-sagemaker-iam-role-arn>'
  PROBLEM_TYPE MULTICLASS_CLASSIFICATION
  SETTINGS (
    S3_BUCKET '<your-amazon-s3-bucket-name>',
    S3_GARBAGE_COLLECT OFF
  );

To reproduce this script in your environment, replace your-amazon-redshift-sagemaker-iam-role-arn with the ARN of your cluster's IAM role.

create model
https://docs.aws.amazon.com/r...
Amazon SageMaker
https://aws.amazon.com/sagema...

Validate predictions

In this step, we evaluate the accuracy of the ML model against the validation data.

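When the model is created, Amazon SageMaker Autopilot automatically splits the input data into training and validation sets and selects the model with the best objective metric, which is then deployed in the Amazon Redshift cluster. You can use the show model statement in your cluster to view various metrics, including the accuracy score. If you don't specify an objective explicitly, Amazon SageMaker automatically uses accuracy as the objective for this problem type. See the following code: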

Show model ecommerce_customer_activity;

Amazon SageMaker Autopilot
https://aws.amazon.com/sagema...

As shown in the output, our model has an accuracy of 0.996580.

Let's run an inference query against the validation data with the following SQL code:

select 
 cast(sum(t1.match)as decimal(7,2)) as predicted_matches
,cast(sum(t1.nonmatch) as decimal(7,2)) as predicted_non_matches
,cast(sum(t1.match + t1.nonmatch) as decimal(7,2))  as total_predictions
,predicted_matches / total_predictions as pct_accuracy
from 
(select   
  customerid,
  country,
  stockcode,
  description,
  invoicedate,
  sales_amt,
  nbr_months_active,
  predict_customer_activity(customerid, country, stockcode, description, invoicedate, sales_amt) as predicted_months_active,
  case when nbr_months_active = predicted_months_active then 1
      else 0 end as match,
  case when nbr_months_active <> predicted_months_active then 1
    else 0 end as nonmatch
  from ecommerce_sales_validation
  )t1;

We can see that the prediction accuracy on our validation dataset is 99.74%, which is consistent with the accuracy reported by show model.
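A single overall accuracy figure can hide classes that the model predicts poorly. As a sketch that reuses the predict_customer_activity function and the validation table from above, the following query breaks the matches down by actual class:

```sql
-- Per-class agreement between actual and predicted months active.
select
  nbr_months_active,
  count(*) as total_rows,
  sum(case when nbr_months_active = predicted_months_active then 1 else 0 end) as matched_rows
from (
  select
    nbr_months_active,
    predict_customer_activity(customerid, country, stockcode, description, invoicedate, sales_amt) as predicted_months_active
  from ecommerce_sales_validation
) t1
group by nbr_months_active
order by nbr_months_active;
```

Classes where matched_rows falls well below total_rows are the ones worth inspecting more closely.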

Now let's run a query to see which customers qualify for our customer loyalty program, using at least 7 months of activity as the criterion:

select 
  customerid, 
  predict_customer_activity(customerid, country, stockcode, description, invoicedate, sales_amt) as predicted_months_active
  from ecommerce_sales_prediction
 where predicted_months_active >=7
 group by 1,2
 limit 10;

The following table shows our output.
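The query above returns one row per customer and predicted value, limited to 10 rows. To size the loyalty program's target group as a whole, a count of distinct qualifying customers may be more useful. This is a sketch under the same 7-month threshold; because predictions are made per row, a customer is counted here if at least one of their rows is predicted at 7 months or more.

```sql
-- Number of distinct customers with at least one row predicted at 7+ active months.
select count(distinct customerid) as qualifying_customers
from (
  select
    customerid,
    predict_customer_activity(customerid, country, stockcode, description, invoicedate, sales_amt) as predicted_months_active
  from ecommerce_sales_prediction
) p
where predicted_months_active >= 7;
```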

Troubleshooting

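Although the create model statement in Amazon Redshift automatically initiates the Amazon SageMaker Autopilot process to build, train, and tune the best ML model and deploy it in Amazon Redshift, you can view the intermediate steps performed in this process, which can also help with troubleshooting if something goes wrong. You can also retrieve the AutoML Job Name from the output of the show model command.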

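When you create the model, you need to set an Amazon Simple Storage Service (Amazon S3) bucket name as the value of the s3_bucket parameter. This bucket is used to share training data and artifacts between Amazon Redshift and Amazon SageMaker. Amazon Redshift creates a subfolder in this bucket to hold the training data. When training is complete, it deletes the subfolder and its contents unless you set the s3_garbage_collect parameter to off, which can be useful for troubleshooting. For more information, see CREATE MODEL.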
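Besides show model, the state of a model can also be checked from SQL while training is still in progress. The following sketch is not part of the original walkthrough; it assumes the Amazon Redshift system table STV_ML_MODEL_INFO is available on your cluster and visible to your user.

```sql
-- Look up the current state of the model created earlier.
-- Assumption: stv_ml_model_info is available; trim() guards against padded char columns.
select schema_name, model_name, model_state
from stv_ml_model_info
where trim(model_name) = 'ecommerce_customer_activity';
```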