Data Analysis at Scale with Amazon Mobile Analytics and Amazon Redshift

In today's post I will introduce a few practical tools that can genuinely simplify the analysis of app user engagement, customer retention, usage habits, item purchases, and other important metrics. Along with step-by-step guidance, I will also include sample SQL queries to help you follow along.

The tools covered today are Amazon Mobile Analytics Auto Export and Amazon Redshift. If you are not yet familiar with them, here is a brief introduction. Amazon Mobile Analytics is a service that helps you collect and analyze your app usage data conveniently and cost-effectively. Besides giving you summary charts of usage for quick reference, Amazon Mobile Analytics lets you automatically export the data to Amazon S3 and from there feed it into other analysis tools such as Amazon Redshift. With your mobile analytics events (including custom attributes and metrics) at your fingertips and a powerful data warehouse such as Amazon Redshift behind them, you can run queries directly or use third-party visualization tools, such as Tableau or other Amazon Redshift partner products, to get a clear picture of how your app is being used.


Once the data is in Amazon Redshift, we can gain deep insight into how our business is doing. For example, you can use the data to:

  • Get customer engagement, retention, and user-activity insights broken down by your own predefined custom categories.
  • List the top-selling purchase items in the app stores for a specific group of users.
  • Understand how users navigate within your app (custom funnels).

Once you have completed the steps outlined below, you will be able to run queries against your events in Amazon Redshift and keep the data in your Redshift tables up to date as new events arrive. Here is a summary of the key steps:

  1. Integrate Amazon Mobile Analytics into your app.
  2. Enable Auto Export to Amazon S3 in Amazon Mobile Analytics.
  3. Create and configure a new Amazon Redshift cluster (or use an existing one).
  4. Create an AWS Identity and Access Management (IAM) user to access the events in your Amazon S3 bucket.
  5. Configure a "jsonpaths" file that maps the event details to the columns of your Amazon Redshift tables.
  6. Tell Amazon Redshift to load the events from Amazon S3 using the "jsonpaths" file and the IAM credentials.
  7. Query your data!

Step 1: Integrate Amazon Mobile Analytics into your app

We covered how to get started with Amazon Mobile Analytics in your own app in a previous post. Simply add the AWS Mobile SDK to your iOS or Android/Fire OS app, or use the Amazon Mobile Analytics REST API, then visit the Amazon Mobile Analytics dashboard in the AWS console to see how your app is being used.

Step 2: Export your data to Amazon S3 with Amazon Mobile Analytics Auto Export

Once your app is integrated with Amazon Mobile Analytics and is sending events, the next step is to enable Auto Export to Amazon S3. Follow the steps here to do so. Be sure to note the name of the Amazon S3 bucket you use, because you will need it again in later steps.

Step 3: Create an Amazon Redshift cluster

If you do not already have an Amazon Redshift cluster, follow the steps here to create one (and make sure you read the pricing information and the free options carefully). To get started, you can create a small Amazon Redshift cluster and later scale up by adjusting the number of nodes and the cluster type as your needs grow. You can also recreate your Amazon Redshift cluster and reload all of your events at any time. Once your Amazon Redshift cluster is up and running, install SQL Workbench or another SQL client that you can use to run queries against Amazon Redshift.
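
Before moving on, it can help to confirm that your SQL client can actually reach the cluster. The two statements below are just a minimal sanity check (an addition to the walkthrough, not a required step); if they return a version string and your database name, the connection and credentials are working.

--Quick connectivity check from SQL Workbench (or any SQL client)
SELECT version();
SELECT current_database(), current_user;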

Step 4: Create an AWS IAM user with access to your Amazon S3 bucket
  1. Create an AWS IAM user and generate an access key for it from the "Identity and Access Management" section of the AWS console. See here for detailed instructions.
  2. Download the access key and secret key for the user.
  3. Using the template below, attach a user policy to the user you created in step 1. This policy gives the user read access to your Amazon S3 bucket (so that Amazon Redshift can read the events). Replace <YOUR-BUCKET-NAME> with the name of the Amazon S3 bucket your events are exported to.
{
  "Statement": [
    {
      "Resource": [
        "arn:aws:s3:::<YOUR-BUCKET-NAME>*"
      ],
      "Action": [
        "s3:ListBucket", "s3:GetObject*"
      ],
      "Effect": "Allow"
    }
  ],
  "Version": "2012-10-17"
}

Step 5: Create the tables and the mapping file (using SQL Workbench)

  1. Create a folder named "jsonpaths" in your Amazon S3 bucket.
  2. Add the text below to a file named eventmapping.json and upload it to:
    <YOUR BUCKET NAME>/jsonpaths/eventmapping.json
  3. Place the names of any custom attributes and metrics at the end (update $['attributes']['{custom attribute name}'] and $['metrics']['{custom metric name}']).
{
  "jsonpaths": [
    "$['event_type']", 
    "$['event_timestamp']", 
    "$['arrival_timestamp']", 
    "$['event_version']", 
    "$['application']['app_id']", 
    "$['application']['package_name']", 
    "$['application']['version_name']", 
    "$['application']['version_code']", 
    "$['application']['title']", 
    "$['application']['cognito_identity_pool_id']", 
    "$['application']['sdk']['name']", 
    "$['application']['sdk']['version']", 
    "$['client']['client_id']", 
    "$['client']['cognito_id']", 
    "$['device']['model']", 
    "$['device']['make']", 
    "$['device']['platform']['name']", 
    "$['device']['platform']['version']", 
    "$['device']['locale']['code']", 
    "$['device']['locale']['language']", 
    "$['device']['locale']['country']", 
    "$['session']['session_id']", 
    "$['session']['start_timestamp']", 
    "$['session']['stop_timestamp']", 
    "$['monetization']['transaction']['transaction_id']", 
    "$['monetization']['transaction']['store']", 
    "$['monetization']['transaction']['item_id']", 
    "$['monetization']['transaction']['quantity']", 
    "$['monetization']['transaction']['price']['reported_price']", 
    "$['monetization']['transaction']['price']['amount']", 
    "$['monetization']['transaction']['price']['currency']['code']", 
    "$['monetization']['transaction']['price']['currency']['symbol']", 
    "$['attributes']['class']", 
    "$['attributes']['level']", 
    "$['attributes']['name']", 
    "$['attributes']['paying customer']", 
    "$['metrics']['score']", 
    "$['metrics']['time played']", 
    "$['metrics']['total spent']"
  ]
}

Create the tables that will hold the events and a view to access them. Modify and run the following script in Amazon Redshift to create the tables and the view.

  1. Update the columns prefixed with a_ and m_ to the custom attributes and custom metrics you want to include in your Amazon Redshift tables. These columns map to the attributes and metrics in the jsonpaths file shown above. (If you add a custom attribute or metric later, see the sketch after the script.)
  2. We recommend prefixing all custom attributes with "a_" and all custom metrics with "m_"; this makes them easier to identify and avoids naming collisions with the other columns.
CREATE schema AWSMA;
  
--This table temporarily holds the events loaded from Amazon S3
CREATE TABLE AWSMA.event_staging(
   event_type VARCHAR(256) NOT NULL ENCODE LZO,
   event_timestamp TIMESTAMP NOT NULL ENCODE LZO,
   arrival_timestamp TIMESTAMP NOT NULL ENCODE LZO,
   event_version CHAR(12) NULL ENCODE LZO,
   application_app_id VARCHAR(64) NOT NULL ENCODE LZO,
   application_package_name VARCHAR(256) NULL ENCODE LZO,
   application_version_name VARCHAR(256) NULL ENCODE LZO,
   application_version_code VARCHAR(256) NULL ENCODE LZO,
   application_title VARCHAR(256) NULL ENCODE LZO,
   application_cognito_identity_pool_id VARCHAR(64) NULL ENCODE LZO,
   application_sdk_name VARCHAR(256) NULL ENCODE LZO,
   application_sdk_version VARCHAR(256) NULL ENCODE LZO,
   client_id VARCHAR(64) NOT NULL DISTKEY ENCODE LZO,
   client_cognito_id VARCHAR(64) NULL ENCODE LZO,
   device_model VARCHAR(256) NULL ENCODE LZO,
   device_make VARCHAR(256) NULL ENCODE LZO,
   device_platform_name VARCHAR(256) NULL ENCODE LZO,
   device_platform_version VARCHAR(256) NULL ENCODE LZO,
   device_locale_code VARCHAR(256) NULL ENCODE LZO,
   device_locale_language VARCHAR(64) NULL ENCODE LZO,
   device_locale_country VARCHAR(64) NULL ENCODE LZO,
   session_id VARCHAR(64) NULL ENCODE LZO,
   session_start_timestamp TIMESTAMP NULL ENCODE LZO,
   session_stop_timestamp TIMESTAMP NULL ENCODE LZO,
   monetization_transaction_id VARCHAR(64) NULL ENCODE LZO,
   monetization_transaction_store VARCHAR(64) NULL ENCODE LZO,
   monetization_transaction_item_id VARCHAR(64) NULL ENCODE LZO,
   monetization_transaction_quantity FLOAT8 NULL,
   monetization_transaction_price_reported VARCHAR(64) NULL ENCODE LZO,
   monetization_transaction_price_amount FLOAT8 NULL,
   monetization_transaction_price_currency_code VARCHAR(16) NULL ENCODE LZO,
   monetization_transaction_price_currency_symbol VARCHAR(32) NULL ENCODE LZO,
   a_class             VARCHAR(4000),
   a_level             VARCHAR(4000),
   a_name              VARCHAR(4000),
   "a_paying customer" VARCHAR(4000),
   m_score             float8,
   "m_time played"     float8,
   "m_total spent"     float8
)
SORTKEY ( application_app_id, event_timestamp, event_type);
  
--This table holds all of the events
CREATE TABLE AWSMA.event(
   event_type VARCHAR(256) NOT NULL ENCODE LZO,
   event_timestamp TIMESTAMP NOT NULL ENCODE LZO,
   arrival_timestamp TIMESTAMP NOT NULL ENCODE LZO,
   event_version CHAR(12) NULL ENCODE LZO,
   application_app_id VARCHAR(64) NOT NULL ENCODE LZO,
   application_package_name VARCHAR(256) NULL ENCODE LZO,
   application_version_name VARCHAR(256) NULL ENCODE LZO,
   application_version_code VARCHAR(256) NULL ENCODE LZO,
   application_title VARCHAR(256) NULL ENCODE LZO,
   application_cognito_identity_pool_id VARCHAR(64) NULL ENCODE LZO,
   application_sdk_name VARCHAR(256) NULL ENCODE LZO,
   application_sdk_version VARCHAR(256) NULL ENCODE LZO,
   client_id VARCHAR(64) NOT NULL DISTKEY ENCODE LZO,
   client_cognito_id VARCHAR(64) NULL ENCODE LZO,
   device_model VARCHAR(256) NULL ENCODE LZO,
   device_make VARCHAR(256) NULL ENCODE LZO,
   device_platform_name VARCHAR(256) NULL ENCODE LZO,
   device_platform_version VARCHAR(256) NULL ENCODE LZO,
   device_locale_code VARCHAR(256) NULL ENCODE LZO,
   device_locale_language VARCHAR(64) NULL ENCODE LZO,
   device_locale_country VARCHAR(64) NULL ENCODE LZO,
   session_id VARCHAR(64) NULL ENCODE LZO,
   session_start_timestamp TIMESTAMP NULL ENCODE LZO,
   session_stop_timestamp TIMESTAMP NULL ENCODE LZO,
   monetization_transaction_id VARCHAR(64) NULL ENCODE LZO,
   monetization_transaction_store VARCHAR(64) NULL ENCODE LZO,
   monetization_transaction_item_id VARCHAR(64) NULL ENCODE LZO,
   monetization_transaction_quantity FLOAT8 NULL,
   monetization_transaction_price_reported VARCHAR(64) NULL ENCODE LZO,
   monetization_transaction_price_amount FLOAT8 NULL,
   monetization_transaction_price_currency_code VARCHAR(16) NULL ENCODE LZO,
   monetization_transaction_price_currency_symbol VARCHAR(32) NULL ENCODE LZO,
   a_class             VARCHAR(4000),
   a_level             VARCHAR(4000),
   a_name              VARCHAR(4000),
   "a_paying customer" VARCHAR(4000),
   m_score             float8,
   "m_time played"     float8,
   "m_total spent"     float8
)
SORTKEY ( application_app_id, event_timestamp, event_type);
  
--Create a view
CREATE OR REPLACE VIEW AWSMA.v_event AS SELECT * FROM AWSMA.event;
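
If you introduce a new custom attribute or metric after the tables exist, you do not need to recreate them. The sketch below shows one way to extend the schema; "a_difficulty" is a hypothetical attribute used only for illustration, and you would also add a matching entry to eventmapping.json and to the column lists in the COPY and INSERT statements that follow.

--Hedged sketch: add a hypothetical custom attribute column to both tables
ALTER TABLE AWSMA.event_staging ADD COLUMN a_difficulty VARCHAR(4000);
ALTER TABLE AWSMA.event ADD COLUMN a_difficulty VARCHAR(4000);
--Recreate the view so it picks up the new column
CREATE OR REPLACE VIEW AWSMA.v_event AS SELECT * FROM AWSMA.event;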

Step 6: Load the events from Amazon S3 into Amazon Redshift (using SQL Workbench)

Loading the events from Amazon S3 into the AWSMA.event table is a two-step process:

1. Copy events for a given time period from Amazon S3 into the AWSMA.event_staging table using the COPY command.
2. Insert the events from the AWSMA.event_staging table into the AWSMA.event table, making sure that no duplicates are inserted into AWSMA.event.

Copy events from Amazon S3
  1. To load the data into Amazon Redshift, use the COPY script provided below. Before running the script, make the following changes. Replace <YOUR BUCKET NAME> with the name of the Amazon S3 bucket your events are exported to.
  2. Replace <APP ID> with the app ID of the application you want to import.
  3. Replace <ACCESS KEY> with the access key of the IAM user that has access to your Amazon S3 bucket.
  4. Replace <SECRET KEY> with the secret key of the IAM user that has access to your Amazon S3 bucket.
  5. Pick the time period you want to load. <YEAR>/<MONTH>/<DAY> specify the date prefix of the folders that contain the events to import. Depending on when the events you want to load were sent and how many there are, choose the most suitable prefix:
    • 2015/01/20 loads all events sent on January 20, 2015.
    • 2015/0 loads all events sent between January and September 2015, inclusive.
    • 2015 loads all events sent during 2015.

Update the column list to include the names of the custom attributes and metrics you want to load. The column list specified here must be in the same order as in the jsonpaths file.

--Truncate the staging table first
TRUNCATE AWSMA.event_staging;
  
--Copy the events from Amazon S3
COPY 
   AWSMA.event_staging(
      "event_type",
      "event_timestamp",
      "arrival_timestamp",
      "event_version",
      "application_app_id",
      "application_package_name",
      "application_version_name",
      "application_version_code",
      "application_title",
      "application_cognito_identity_pool_id",
      "application_sdk_name",
      "application_sdk_version",
      "client_id",
      "client_cognito_id",
      "device_model",
      "device_make",
      "device_platform_name",
      "device_platform_version",
      "device_locale_code",
      "device_locale_language",
      "device_locale_country",
      "session_id",
      "session_start_timestamp",
      "session_stop_timestamp",
      "monetization_transaction_id",
      "monetization_transaction_store",
      "monetization_transaction_item_id",
      "monetization_transaction_quantity",
      "monetization_transaction_price_reported",
      "monetization_transaction_price_amount",
      "monetization_transaction_price_currency_code",
      "monetization_transaction_price_currency_symbol",
      "a_class",
      "a_level",
      "a_name",
      "a_paying customer",
      "m_score",
      "m_time played",
      "m_total spent"
   )
FROM
   's3://<YOUR BUCKET NAME>/awsma/events/<APP ID>/<YEAR>/<MONTH>/<DAY>'
credentials 
   'aws_access_key_id=<ACCESS KEY>;aws_secret_access_key=<SECRET KEY>'
EMPTYASNULL gzip TIMEFORMAT AS 'epochmillisecs' MAXERROR AS 0 
JSON AS 's3://<YOUR BUCKET NAME>/jsonpaths/eventmapping.json';
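
Because the COPY runs with MAXERROR AS 0, it aborts on the first record that fails to load. If that happens, the check below (an added sketch that queries the standard STL_LOAD_ERRORS system table) shows what went wrong, and the row count confirms how much data landed in the staging table.

--Inspect the most recent load errors, if any
SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;

--Confirm how many events were staged
SELECT COUNT(*) AS staged_events FROM AWSMA.event_staging;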

Insert the loaded events into the AWSMA.event table

Now that the events are in the AWSMA.event_staging table, the next step is to insert into the AWSMA.event table only the events that are new (that is, those not already in AWSMA.event). This lets you reload events for the current period (for example, to pick up the latest events for this year) while leaving the other events in the AWSMA.event table untouched.

Use the following command to load the data from the AWSMA.event_staging table into the AWSMA.event table:

Update the column list to include the names of the custom attributes and metrics you want to load.

--Load the events from the staging table that are not already in the AWSMA.event table
INSERT INTO AWSMA.event (
      "event_type",
      "event_timestamp",
      "arrival_timestamp",
      "event_version",
      "application_app_id",
      "application_package_name",
      "application_version_name",
      "application_version_code",
      "application_title",
      "application_cognito_identity_pool_id",
      "application_sdk_name",
      "application_sdk_version",
      "client_id",
      "client_cognito_id",
      "device_model",
      "device_make",
      "device_platform_name",
      "device_platform_version",
      "device_locale_code",
      "device_locale_language",
      "device_locale_country",
      "session_id",
      "session_start_timestamp",
      "session_stop_timestamp",
      "monetization_transaction_id",
      "monetization_transaction_store",
      "monetization_transaction_item_id",
      "monetization_transaction_quantity",
      "monetization_transaction_price_reported",
      "monetization_transaction_price_amount",
      "monetization_transaction_price_currency_code",
      "monetization_transaction_price_currency_symbol",
      "a_class",
      "a_level",
      "a_name",
      "a_paying customer",
      "m_score",
      "m_time played",
      "m_total spent")
SELECT
      staging."event_type",
      staging."event_timestamp",
      staging."arrival_timestamp",
      staging."event_version",
      staging."application_app_id",
      staging."application_package_name",
      staging."application_version_name",
      staging."application_version_code",
      staging."application_title",
      staging."application_cognito_identity_pool_id",
      staging."application_sdk_name",
      staging."application_sdk_version",
      staging."client_id",
      staging."client_cognito_id",
      staging."device_model",
      staging."device_make",
      staging."device_platform_name",
      staging."device_platform_version",
      staging."device_locale_code",
      staging."device_locale_language",
      staging."device_locale_country",
      staging."session_id",
      staging."session_start_timestamp",
      staging."session_stop_timestamp",
      staging."monetization_transaction_id",
      staging."monetization_transaction_store",
      staging."monetization_transaction_item_id",
      staging."monetization_transaction_quantity",
      staging."monetization_transaction_price_reported",
      staging."monetization_transaction_price_amount",
      staging."monetization_transaction_price_currency_code",
      staging."monetization_transaction_price_currency_symbol",
      staging."a_class",
      staging."a_level",
      staging."a_name",
      staging."a_paying customer",
      staging."m_score",
      staging."m_time played",
      staging."m_total spent"
FROM AWSMA.event_staging staging
LEFT JOIN
   AWSMA.event events ON events.event_timestamp = staging.event_timestamp 
   AND events.application_app_id = staging.application_app_id 
   AND events.client_id = staging.client_id
WHERE
   events.client_id IS NULL;

Congratulations! Your events are now ready to be queried.

If you want to load more events into Amazon Redshift later, you probably only need to run the last three statements (truncate, copy, and insert), specifying the date (that is, the Amazon S3 folder name) of the data you want to load. When you reload events for the same time period more than once, only the new events are added to the AWSMA.event table, so no events are ever duplicated.
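
To confirm that an incremental load behaved as expected, a quick check like the one below (an added sketch, not part of the original walkthrough) compares the number of staged rows to the total and shows the most recent event that reached AWSMA.event.

--Sanity check after an incremental load
SELECT
   (SELECT COUNT(*) FROM AWSMA.event_staging)      AS staged_events,
   (SELECT COUNT(*) FROM AWSMA.event)              AS total_events,
   (SELECT MAX(event_timestamp) FROM AWSMA.event)  AS latest_event;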

Step 7: Create Amazon Redshift users for querying the data (using SQL Workbench)

Although you can use the same user for both inserting and querying data, it is a best practice to create separate Amazon Redshift users, grouped together, that have access to the AWSMA.v_event view, and to use those users when querying the data from clients such as Tableau. These users have read-only access and cannot modify or delete the data.

--Create the eventReaders group
CREATE GROUP eventReaders;
--Configure AWSMA schema privileges
GRANT usage ON schema AWSMA TO GROUP eventReaders;
GRANT SELECT ON AWSMA.v_event TO GROUP eventReaders;
--Create an eventReader user
CREATE USER <username> IN GROUP eventReaders PASSWORD '<password>';

You are now ready to run your queries.

Active devices and users by day for the last 30 days.

SELECT
    application_app_id AS "app id", 
    COUNT(DISTINCT client_id) AS "devices", 
    COUNT(DISTINCT client_cognito_id) AS "users",
    date_trunc('day', event_timestamp) AS "day"
FROM
    AWSMA.v_event
WHERE
    event_type = '_session.start' AND
    event_timestamp BETWEEN getdate() - 30 AND getdate() + 1
GROUP BY
    "app id", 
    "day"
ORDER BY
    "app id" ASC, 
    "day" DESC

Active devices and users by device language for the last 30 days.

SELECT
    application_app_id AS "app id", 
    COUNT(DISTINCT client_id) AS "devices", 
    COUNT(DISTINCT client_cognito_id) AS "users",
    device_locale_language AS "language"
FROM
    AWSMA.v_event
WHERE
    event_type = '_session.start' AND
    event_timestamp BETWEEN getdate() - 30 AND getdate() + 1
GROUP BY
    "app id", 
    "language"
ORDER BY
    "app id" ASC, 
    "devices" DESC,
    "language"
;

Active devices and users by app version for the last 30 days.

SELECT
    application_app_id AS "app id", 
    device_platform_name AS "platform",
    application_version_name AS "version name",
    application_version_code AS "version code",
    COUNT(DISTINCT client_id) AS "devices", 
    COUNT(DISTINCT client_cognito_id) AS "users"
FROM
    AWSMA.v_event
WHERE
    event_type = '_session.start' AND
    event_timestamp BETWEEN getdate() - 30 AND getdate() + 1
GROUP BY
    "app id", 
    "platform",
    "version name",
    "version code"
ORDER BY
    "app id" ASC, 
    "platform" ASC,
    "devices" DESC,
    "version name" DESC,
    "version code" DESC
;
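
The retention analysis mentioned at the start of this post can be approached in the same way. The query below is only a hedged sketch of a simple week-over-week check: it counts devices that started a session in the last 7 days and had also started one in the 7 days before that. Adjust the windows (or switch client_id to client_cognito_id) to match how you define retention.

--Sketch: devices active this week that were also active the previous week
WITH recent AS (
    SELECT DISTINCT application_app_id, client_id
    FROM AWSMA.v_event
    WHERE event_type = '_session.start'
      AND event_timestamp BETWEEN getdate() - 7 AND getdate() + 1
),
prior AS (
    SELECT DISTINCT application_app_id, client_id
    FROM AWSMA.v_event
    WHERE event_type = '_session.start'
      AND event_timestamp BETWEEN getdate() - 14 AND getdate() - 7
)
SELECT
    recent.application_app_id AS "app id",
    COUNT(*) AS "retained devices"
FROM recent
JOIN prior
  ON prior.application_app_id = recent.application_app_id
 AND prior.client_id = recent.client_id
GROUP BY
    recent.application_app_id
ORDER BY
    "app id" ASC
;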

Top-selling items for the last 30 days.

SELECT
    application_app_id AS "app id", 
    monetization_transaction_item_id AS "item id",
    monetization_transaction_store AS "store",
    COUNT(DISTINCT client_id) AS "devices", 
    COUNT(DISTINCT client_cognito_id) AS "users",
    SUM(monetization_transaction_quantity) AS "quantity",
    SUM(monetization_transaction_price_amount) "amount (Apple only)",
    monetization_transaction_price_currency_code AS "currency (Apple only)"
FROM
    AWSMA.v_event
WHERE
    event_type = '_monetization.purchase' AND
    event_timestamp BETWEEN getdate() - 30 AND getdate() + 1
GROUP BY
    "app id",
    "item id",
    "currency (Apple only)",
    "store"
ORDER BY
    "app id" ASC, 
    "item id" ASC,
    "quantity" DESC,
    "store",
    "devices" DESC
;
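
Finally, here is a sketch of the custom funnel idea mentioned at the start of the post. The event types 'checkout_started' and 'checkout_completed' are hypothetical; substitute the custom event types your app actually records. This version simply counts how many devices reached each step in the last 30 days rather than enforcing a strict ordering of the steps.

--Sketch: devices reaching each step of a hypothetical two-step funnel
SELECT
    application_app_id AS "app id",
    COUNT(DISTINCT CASE WHEN event_type = 'checkout_started'   THEN client_id END) AS "step 1 devices",
    COUNT(DISTINCT CASE WHEN event_type = 'checkout_completed' THEN client_id END) AS "step 2 devices"
FROM
    AWSMA.v_event
WHERE
    event_type IN ('checkout_started', 'checkout_completed') AND
    event_timestamp BETWEEN getdate() - 30 AND getdate() + 1
GROUP BY
    "app id"
ORDER BY
    "app id" ASC
;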

Thanks for reading, and we look forward to your comments and feedback. Please share your thoughts with us on our forums here.

Original post: http://mobile.awsblog.com/post/Tx2868860KKAM0E/Score-big-with-Amazon-Mobile-Analytics-and-Amazon-Redshift
