CREATE TABLE dim_product
Define slowly changing dimensions (SCD) to support history tracking.
-- Star schema example
CREATE TABLE fact_sales (
product_sk INT,
time_sk INT,
amount DECIMAL(18,2)
) PARTITIONED BY (dt STRING);
INSERT INTO dwd_order
SELECT
order_id,
COALESCE(user_id, -1) AS user_id, -- null handling
CAST(amount AS DECIMAL(16,2)) AS amount -- explicit type cast
FROM ods_order
WHERE dt='2023-08-20';
Component type | Open-source option | Cloud-native option |
---|---|---|
Stream processing engine | Flink | Kinesis Data Analytics |
Real-time storage | Apache Druid | Amazon Timestream |
Visualization tool | Apache Superset | QuickSight |
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_be_between("amount", 0, 1000000)
<policy name="Sales-Data-Access">
  <resources><table>fact_orders</table></resources>
  <accessTypes>SELECT</accessTypes>
  <roles>BI-Analyst</roles>
</policy>
Technical approach:
[App logs] -> [Kafka] -> [Flink real-time compute] -> [ClickHouse]
                      -> [Spark offline ETL] -> [Hive DWD]
Key metric SQL:
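-- Day-7 retention: share of users who launched on dt and launched again 7 days later.
WITH launch AS (
  SELECT DISTINCT dt, user_id
  FROM dwd_user_behavior
  WHERE event = 'launch'
)
SELECT a.dt,
       ROUND(COUNT(DISTINCT b.user_id) * 100.0 / COUNT(DISTINCT a.user_id), 2) AS retention_7d
FROM launch a
LEFT JOIN launch b
  ON b.user_id = a.user_id AND b.dt = DATE_ADD(a.dt, 7)
GROUP BY a.dt;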
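A user-level retention calculation is easy to get wrong, so it helps to check it on a handful of rows first. The following is a minimal sketch using Python's built-in `sqlite3` module. It assumes `dwd_user_behavior` has just the three columns `dt`, `user_id`, and `event`, and it uses SQLite's `DATE(dt, '+7 day')` in place of `DATE_ADD`; the function name `retention_7d` and the sample rows are made up for illustration.

```python
import sqlite3


def retention_7d(rows):
    """Return {dt: retention_pct} for rows of (dt, user_id, event)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dwd_user_behavior (dt TEXT, user_id INTEGER, event TEXT)"
    )
    conn.executemany("INSERT INTO dwd_user_behavior VALUES (?, ?, ?)", rows)
    sql = """
        WITH launch AS (
            SELECT DISTINCT dt, user_id
            FROM dwd_user_behavior
            WHERE event = 'launch'
        )
        SELECT a.dt,
               ROUND(COUNT(DISTINCT b.user_id) * 100.0
                     / COUNT(DISTINCT a.user_id), 2)
        FROM launch a
        LEFT JOIN launch b
               ON b.user_id = a.user_id
              AND b.dt = DATE(a.dt, '+7 day')  -- same user, 7 days later
        GROUP BY a.dt
        ORDER BY a.dt
    """
    result = dict(conn.execute(sql).fetchall())
    conn.close()
    return result


# Hypothetical sample: four users launch on 08-13, only user 1 returns on 08-20.
sample = [
    ("2023-08-13", 1, "launch"),
    ("2023-08-13", 2, "launch"),
    ("2023-08-13", 3, "launch"),
    ("2023-08-13", 4, "launch"),
    ("2023-08-20", 1, "launch"),
    ("2023-08-20", 5, "launch"),
    ("2023-08-20", 2, "click"),  # not a launch, so it does not count
]
print(retention_7d(sample))
```

The join is on both `user_id` and the shifted date, so the numerator counts only users who actually came back, rather than comparing two independent daily totals.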
Unified stream-batch architecture:
SCD Type 2 implementation:
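MERGE INTO dim_user AS target
USING (
  -- Assumes end_dt IS NULL marks the current version. Keyed rows close it or add a new user.
  SELECT user_id AS merge_key, user_id, address FROM staging
  UNION ALL
  -- NULL-keyed rows never match, so each changed user also gets a new current version.
  SELECT NULL AS merge_key, s.user_id, s.address
  FROM staging s JOIN dim_user d
    ON d.user_id = s.user_id AND d.end_dt IS NULL AND d.address <> s.address
) AS source
ON target.user_id = source.merge_key AND target.end_dt IS NULL
WHEN MATCHED AND target.address <> source.address THEN
  UPDATE SET end_dt = CURRENT_DATE
WHEN NOT MATCHED THEN
  INSERT (user_id, address, start_dt)
  VALUES (source.user_id, source.address, CURRENT_DATE);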
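The core rule of SCD Type 2 is that a changed attribute closes the current row and opens a new one, while unchanged rows are left alone. Here is a rough in-memory sketch of that rule in plain Python. It assumes that `end_dt` is `None` on the current version of each user; the function name `apply_scd2` and the sample data are hypothetical.

```python
from datetime import date


def apply_scd2(dim_rows, staging_rows, today):
    """Apply SCD Type 2 changes and return the new list of dimension rows.

    Assumption of this sketch: end_dt is None on the current version of a user.
    """
    result = [dict(row) for row in dim_rows]  # copy so the input is not mutated
    current = {row["user_id"]: row for row in result if row["end_dt"] is None}

    for src in staging_rows:
        cur = current.get(src["user_id"])
        if cur is not None and cur["address"] == src["address"]:
            continue  # attribute unchanged: keep the current version as-is
        if cur is not None:
            cur["end_dt"] = today  # close the outdated version
        new_row = {
            "user_id": src["user_id"],
            "address": src["address"],
            "start_dt": today,
            "end_dt": None,
        }
        result.append(new_row)  # open a new current version
        current[src["user_id"]] = new_row
    return result


dim_user = [
    {"user_id": 1, "address": "Beijing", "start_dt": date(2023, 1, 1), "end_dt": None},
    {"user_id": 2, "address": "Shanghai", "start_dt": date(2023, 1, 1), "end_dt": None},
]
staging = [
    {"user_id": 1, "address": "Shenzhen"},  # changed address
    {"user_id": 2, "address": "Shanghai"},  # unchanged
    {"user_id": 3, "address": "Hangzhou"},  # new user
]
updated = apply_scd2(dim_user, staging, today=date(2023, 8, 20))
for row in updated:
    print(row)
```

After the run, user 1 has two rows (the closed Beijing version and the open Shenzhen version), user 2 is untouched, and user 3 appears as a new open row.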
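Question: What is the core difference between the star schema and the snowflake schema?
Answer: The star schema improves query performance by keeping redundant (denormalized) attributes in its dimension tables; the snowflake schema normalizes dimensions to save storage, at the cost of more complex joins.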
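To make the trade-off concrete, here is a small sketch using Python's built-in `sqlite3` module. The tables `dim_product_star`, `dim_product_snow`, and `dim_category`, along with their columns and rows, are invented for illustration. Both queries return the same totals, but the snowflake version needs one more join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (product_sk INT, amount REAL);
    -- Star: the category name is stored redundantly on the product dimension.
    CREATE TABLE dim_product_star (product_sk INT, product_name TEXT, category_name TEXT);
    -- Snowflake: the category is normalized into its own table.
    CREATE TABLE dim_product_snow (product_sk INT, product_name TEXT, category_sk INT);
    CREATE TABLE dim_category (category_sk INT, category_name TEXT);

    INSERT INTO fact_sales VALUES (1, 10.0), (2, 20.0), (3, 5.0);
    INSERT INTO dim_product_star VALUES
        (1, 'pen', 'office'), (2, 'ink', 'office'), (3, 'tea', 'food');
    INSERT INTO dim_product_snow VALUES (1, 'pen', 100), (2, 'ink', 100), (3, 'tea', 200);
    INSERT INTO dim_category VALUES (100, 'office'), (200, 'food');
""")

# Star schema: one join from the fact table to the dimension.
star_sql = """
    SELECT p.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_star p ON p.product_sk = f.product_sk
    GROUP BY p.category_name
    ORDER BY p.category_name
"""

# Snowflake schema: a second join is needed to reach the category name.
snow_sql = """
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_snow p ON p.product_sk = f.product_sk
    JOIN dim_category c ON c.category_sk = p.category_sk
    GROUP BY c.category_name
    ORDER BY c.category_name
"""

star_result = conn.execute(star_sql).fetchall()
snow_result = conn.execute(snow_sql).fetchall()
print(star_result)
print(snow_result)
```

In the star layout the string `'office'` is repeated on every office product, which is the storage cost paid for the shorter join path.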
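Question: What data quality problems are common in ETL?
Answer: Null-value anomalies (8.3%), out-of-range values (e.g. negative amounts), and enum mismatches (e.g. invalid status codes). Detect them with tools such as Great Expectations.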
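The three problem types can be illustrated without any library. The sketch below is a hand-rolled stand-in, not the Great Expectations API. The `status` field, the set of valid statuses, and the sample rows are assumptions, and the amount bounds reuse the 0 to 1000000 range from the earlier validator call.

```python
def check_quality(rows, valid_statuses, amount_min=0, amount_max=1_000_000):
    """Count three common data quality problems in a list of row dicts."""
    total = len(rows)
    # Null check: rows with no user_id.
    null_user_ids = sum(1 for r in rows if r.get("user_id") is None)
    # Range check: amounts outside [amount_min, amount_max]; nulls are skipped here.
    out_of_range = sum(
        1
        for r in rows
        if r.get("amount") is not None
        and not (amount_min <= r["amount"] <= amount_max)
    )
    # Enum check: status values outside the allowed set.
    invalid_status = sum(1 for r in rows if r.get("status") not in valid_statuses)
    return {
        "null_user_id_rate": null_user_ids / total if total else 0.0,
        "amount_out_of_range": out_of_range,
        "invalid_status": invalid_status,
    }


# Hypothetical sample: one row per problem type, plus one clean row.
orders = [
    {"user_id": 1, "amount": 99.5, "status": "PAID"},
    {"user_id": None, "amount": 10.0, "status": "PAID"},
    {"user_id": 2, "amount": -5.0, "status": "CREATED"},
    {"user_id": 3, "amount": 20.0, "status": "UNKNOWN"},
]
report = check_quality(orders, valid_statuses={"CREATED", "PAID", "CANCELLED"})
print(report)
```

Reporting the null rate as a ratio rather than a count makes it comparable across daily partitions of different sizes.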
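Question: How do you evaluate the feasibility of a real-time BI architecture?
Answer: Stress-test it along three dimensions: data latency (<1s), throughput (100,000+ TPS), and failure recovery time (<30s).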
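Once a stress test has produced measurements, the three thresholds can be checked mechanically. The sketch below is one possible way to do that. Using the 99th-percentile latency (nearest-rank method) is an assumption of this example, since the text does not say which latency statistic to compare; the function names and sample numbers are also made up.

```python
import math

# Thresholds from the text: latency < 1 s, throughput 100,000+ TPS, recovery < 30 s.
MAX_LATENCY_S = 1.0
MIN_THROUGHPUT_TPS = 100_000
MAX_RECOVERY_S = 30.0


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


def evaluate(latencies_s, throughput_tps, recovery_s):
    """Return a pass/fail flag per dimension plus an overall verdict."""
    checks = {
        "latency": p99(latencies_s) < MAX_LATENCY_S,
        "throughput": throughput_tps >= MIN_THROUGHPUT_TPS,
        "recovery": recovery_s < MAX_RECOVERY_S,
    }
    checks["feasible"] = all(checks.values())
    return checks


# Hypothetical measurements: fast and high-throughput, but slow to recover.
verdict = evaluate(latencies_s=[0.2, 0.4, 0.9], throughput_tps=120_000, recovery_s=45.0)
print(verdict)
```

Keeping a separate flag per dimension shows which requirement failed instead of only giving an overall yes or no.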
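Coming next: "Kimball Dimensional Modeling"
Discussion topic: what pitfalls have you run into while learning SQL? Share them in the comments!
A note from the author: I'm [随缘而动,随遇而安], a developer who likes to explain technology through everyday examples.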