Installing ClickHouse (Standalone and Cluster) and Basic Usage


Table of Contents

  • 1 Overview
    • 1.1 When to Use ClickHouse
    • 1.2 When Not to Use ClickHouse
    • 1.3 Features
      • 1.3.1 Blazing Fast
      • 1.3.2 Linearly Scalable
      • 1.3.3 Hardware Efficient
      • 1.3.4 Fault Tolerant
      • 1.3.5 Feature Rich
      • 1.3.6 Highly Reliable
      • 1.3.7 Simple and Handy
  • 2 Architecture Overview
  • 3 Installation and Deployment
    • 3.1 Before Installing
    • 3.2 Single-Node Setup
    • 3.3 Cluster Setup
  • 4 Client Tools
    • 4.1 clickhouse-client
    • 4.2 DBeaver
    • 4.3 JDBC
  • 5 Data Types
  • 6 Table Engines
  • 7 Database Engines
  • 8 Examples
    • 8.1 Getting the Data
    • 8.2 Inserting the Data
    • 8.3 Creating a Distributed Table and Loading Data
    • 8.4 Running SQL Queries
    • 8.5 Comparison with Single-Node Queries
  • 9 Data Import and Export
    • 9.1 Getting the Benchmark Data
    • 9.2 Creating Tables and Loading the Data
    • 9.3 Exporting Data
    • 9.4 Importing Data
  • 10 Movie Data Analysis
  • 11 ClickHouse-China/ClickhouseMeetup



1 Overview

ClickHouse is a column-oriented database management system (DBMS) for online analytical processing (OLAP), open-sourced in 2016 by the Russian company Yandex (whose name, by one account, derives from "Yet Another Indexer"). It can generate analytical data reports in real time using SQL queries.

ClickHouse lets us think and decide faster:

  • Run more queries in the same amount of time
  • Test more hypotheses
  • Slice and dice your data in many more new ways
  • Look at your data from new angles
  • Discover new dimensions

1.1 When to Use ClickHouse

For analysis or cleansing of structured, immutable event or log data. It is recommended to put each such stream into a single wide fact table with pre-joined dimensions.
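For instance, a minimal sketch of such a wide, pre-joined fact table (the table and column names are hypothetical; the old-style MergeTree parameters match the examples later in this article):

CREATE TABLE events (
    EventDate   Date,
    EventTime   DateTime,
    UserID      UInt64,
    Country     String,    -- dimension attributes stored inline rather than joined at query time
    DeviceModel String,
    Revenue     Float64
) ENGINE = MergeTree(EventDate, (EventDate, UserID), 8192);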

Some viable application examples:

  • Web and application analytics
  • Advertising networks and real-time bidding (RTB)
  • Telecommunications
  • E-commerce and finance
  • Information security
  • Monitoring and telemetry
  • Time series
  • Business intelligence
  • Online games
  • Internet of Things

1.2 When Not to Use ClickHouse

  • Transactional workloads (OLTP)
  • Key-value access at high request rates
  • Blob or document storage
  • Highly normalized data

1.3 Features

1.3.1 Blazing Fast

ClickHouse's performance exceeds that of comparable column-oriented DBMSs currently on the market. It processes hundreds of millions to more than a billion rows, and tens of gigabytes of data, per server per second. ClickHouse makes full use of all available hardware to process each query as fast as possible. Peak processing performance for a single query (after decompression, using only the columns involved) exceeds 2 TB per second. ClickHouse works 100-1000x faster than the traditional approach. Unlike common data management methods, in which vast amounts of raw data in its native format sit in a "data lake" available for any given query, ClickHouse delivers near-instant results in most cases: it processes the data faster than the data can be created.

1.3.2 Linearly Scalable

ClickHouse allows companies to add servers to their cluster when necessary, without spending time or money on any additional DBMS modification. The system has successfully served Yandex.Metrica, whose main production cluster grew from 60 to 394 servers within two years, spread across six geographically distributed data centers. ClickHouse scales well both vertically and horizontally, and is easily adapted to run on clusters of hundreds of nodes, on a single server, or even on a small virtual machine. Current installations exceed two trillion rows per node, with storage of 100 TB per node.

1.3.3 Hardware Efficient

ClickHouse processes typical analytical queries two to three orders of magnitude faster than traditional row-oriented systems with the same available I/O throughput. The system's columnar storage format allows more hot data to fit in RAM, which shortens response times. ClickHouse minimizes the number of seeks for range queries, which improves efficiency on rotating disk drives, because it maintains locality of reference for continuously stored data. ClickHouse is CPU-efficient because its vectorized query execution involves the relevant processor instructions together with runtime code generation. By minimizing data transfers for most query types, ClickHouse lets companies manage their data and build reports without using a dedicated network aimed at high-performance computing.

1.3.4 Fault Tolerant

ClickHouse supports multi-master asynchronous replication and can be deployed across multiple data centers. Downtime of a single node or the whole data center does not affect the system's read or write availability. Distributed reads are automatically balanced to live replicas to avoid increasing latency. Replicated data is synchronized automatically or semi-automatically after a server comes back from downtime.

1.3.5 Feature Rich

ClickHouse features a user-friendly SQL query dialect with a number of built-in analytics capabilities. For example, it includes probabilistic data structures for fast, memory-efficient calculation of cardinalities and quantiles. There are functions for working with dates, times, and time zones, as well as some specialized ones such as functions for working with URLs and IP addresses (both IPv4 and IPv6), and more. The data organization options available in ClickHouse, such as arrays, array joins, tuples, and nested data structures, are extremely efficient for managing denormalized data.

Because the system supports both local and distributed joins, ClickHouse can join distributed data as well as co-located data. It also offers the ability to seamlessly connect, with simple syntax, to external dictionaries (dimension tables loaded from external sources such as MySQL). ClickHouse supports approximate query processing: you can get results as quickly as you need, which is indispensable when working with terabytes and petabytes of data. The system's conditional aggregate functions, and its calculation of totals and extremes, let us obtain results with a single query rather than running several of them.
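A small sketch of a few of these capabilities (visits is a hypothetical table whose Tags column is assumed to be Array(String); uniq() and quantile() are ClickHouse's approximate aggregate functions):

-- Approximate aggregates backed by probabilistic data structures
SELECT
    uniq(UserID)             AS approx_visitors,
    quantile(0.95)(Duration) AS p95_duration
FROM visits;

-- ARRAY JOIN unrolls an array column of a denormalized row
SELECT UserID, tag
FROM visits
ARRAY JOIN Tags AS tag;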

1.3.6 Highly Reliable

ClickHouse has been managing petabytes of data serving a number of high-load, mass-audience services of Yandex, Russia's leading search provider and one of Europe's largest IT companies. Since 2012, ClickHouse has provided robust database management for the company's web analytics service, comparison e-commerce platform, public email service, online advertising platform, business intelligence tools, and infrastructure monitoring. ClickHouse can be configured as a purely distributed system located on independent nodes, without any single points of failure.

Software or hardware failures and misconfigurations do not result in data loss. Instead of deleting "broken" data, ClickHouse saves it, or asks you what to do before startup. All data is checksummed before every read from, or write to, disk or the network. It is nearly impossible to delete data by accident, as there are safeguards even against human error.

ClickHouse provides flexible limits on query complexity and resource usage, which can be fine-tuned with settings. It can simultaneously serve a number of high-priority, low-latency requests alongside some long-running queries at background priority.

1.3.7 Simple and Handy

ClickHouse streamlines all of your data processing: it is easy to use, ingests all of your structured data into the system, and makes it instantly available for building reports. New columns for new properties or dimensions can easily be added to the system at any time without slowing it down. ClickHouse is simple and works out of the box. Besides running on clusters of hundreds of nodes, the system can just as easily be installed on a single server or even a virtual machine. Installing ClickHouse requires no development experience or coding skills.


2 Architecture Overview

ClickHouse is a true column-oriented database management system (DBMS). In ClickHouse, data is always stored by columns, including during vectorized (vector, or chunk-of-columns) execution. Whenever possible, operations are dispatched on vectors rather than on individual values; this is called "vectorized query execution", and it helps lower the cost of actual data processing.

There are generally two different approaches to accelerating query processing: vectorized query execution and runtime code generation. With the latter, code is generated dynamically for each kind of query, removing indirection and dynamic dispatch. Neither approach is strictly better than the other. Runtime code generation can be better when it fuses many operations together, thus fully utilizing CPU execution units and the pipeline. Vectorized query execution can be less practical because it involves temporary vectors that must be written to cache and read back; if the temporary data does not fit in the L2 cache, this becomes an issue. But vectorized query execution more easily exploits the SIMD (Single Instruction, Multiple Data) capabilities of the CPU. Research on combining the two approaches can be found in the paper "Vectorization vs. Compilation in Query Execution". ClickHouse uses vectorized query execution, along with initial, limited support for runtime code generation.

To understand ClickHouse's architecture, we need to grasp the following 17 concepts: Columns, Field, Leaky Abstractions, Data Types, Block, Block Streams, Formats, I/O, Tables, Parsers, Interpreters, Functions, Aggregate Functions, Server, Distributed Query Execution, Merge Tree, Replication. For detailed explanations, the official English documentation "Overview of ClickHouse Architecture" is recommended.

3 Installation and Deployment

3.1 Before Installing

First, make sure the system is Linux on the x86_64 processor architecture and that it supports the SSE 4.2 instruction set. Run the following command to check; if it prints "SSE 4.2 supported", you can proceed with the installation. The terminal must also use UTF-8 encoding.

grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"

We should also raise the CentOS limits on open files by appending the following to both /etc/security/limits.conf and /etc/security/limits.d/*-nproc.conf.

*               soft    nofile          65536
*               hard    nofile          65536
*               soft    nproc          131072
*               hard    nproc          131072

After making the changes, reconnect with your SSH client and log in again, then run the following command; if it prints the value we set, the change has taken effect.

# Check
ulimit -n

3.2 Single-Node Setup

If the server has network access, you can install directly via yum with the following commands (a regular user needs sudo privileges). They automatically download the resources and install the stable release of ClickHouse; to install the latest build instead, replace stable with testing.

# CentOS / RedHat
sudo yum install yum-utils
sudo rpm --import https://repo.yandex.ru/clickhouse/CLICKHOUSE-KEY.GPG
sudo yum-config-manager --add-repo https://repo.yandex.ru/clickhouse/rpm/stable/x86_64
sudo yum install clickhouse-server clickhouse-client

The Yandex ClickHouse team recommends the official pre-built rpm packages for CentOS, RedHat, and all other rpm-based Linux distributions. This approach also suits production environments without internet access, so it is the one we mainly use for installation and deployment below. The ClickHouse rpm packages for the desired version can be fetched from the Download page; download and install them with the following commands.

# 1 Download
wget --content-disposition https://packagecloud.io/Altinity/clickhouse/packages/el/7/clickhouse-server-19.16.3.6-1.el7.x86_64.rpm/download.rpm
wget --content-disposition https://packagecloud.io/Altinity/clickhouse/packages/el/7/clickhouse-client-19.16.3.6-1.el7.x86_64.rpm/download.rpm
wget --content-disposition https://packagecloud.io/Altinity/clickhouse/packages/el/7/clickhouse-common-static-19.16.3.6-1.el7.x86_64.rpm/download.rpm
wget --content-disposition https://packagecloud.io/Altinity/clickhouse/packages/el/7/clickhouse-server-common-19.16.3.6-1.el7.x86_64.rpm/download.rpm

# 2 Install
rpm -ivh clickhouse-*-19.16.3.6-1.el7.x86_64.rpm

The commands to start and stop the ClickHouse server are listed below. Let's start the ClickHouse service first.

# 1 Start.
# Logs are written under /var/log/clickhouse-server/.
#sudo /etc/init.d/clickhouse-server start
systemctl start clickhouse-server

# 2 Check the status
systemctl status clickhouse-server

# 3 Restart
systemctl restart clickhouse-server

# 4 Stop
systemctl stop clickhouse-server

Start the ClickHouse client to verify that the installation succeeded. If the following runs normally, we are ready to start exploring ClickHouse.

# 1 When no password is set
clickhouse-client
-- 2 Run a simple SQL statement. It parses and executes normally.
cdh1 :) SELECT 1;
SELECT 1
┌─1─┐
│ 1 │
└───┘

3.3 Cluster Setup

As an example, we will install and deploy ClickHouse on three nodes (cdh1, cdh2, cdh3) running CentOS 7, with the firewall and SELinux already disabled or the required ports opened. The cluster setup depends on ZooKeeper, so make sure the ZooKeeper service is up and running first. The rest of the installation is much like the single-node case, except that a cluster configuration file is added.

First install the ClickHouse service via rpm on all three nodes (cdh1, cdh2, cdh3). Then on cdh1 create and edit /etc/metrika.xml; this file declares the host and port of each ClickHouse server and each node of the ZooKeeper ensemble. Configure cdh2 and cdh3 the same way, but change the value of the <replica> tag under <macros> to each node's own hostname or IP.

<yandex>

    <clickhouse_remote_servers>

        <perftest_3shards_1replicas>

            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>cdh1</host>
                    <port>9000</port>
                </replica>
            </shard>

            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>cdh2</host>
                    <port>9000</port>
                </replica>
            </shard>

            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>cdh3</host>
                    <port>9000</port>
                </replica>
            </shard>
        </perftest_3shards_1replicas>
    </clickhouse_remote_servers>

    <zookeeper-servers>
        <node index="1">
            <host>cdh1</host>
            <port>2181</port>
        </node>
        <node index="2">
            <host>cdh2</host>
            <port>2181</port>
        </node>
        <node index="3">
            <host>cdh3</host>
            <port>2181</port>
        </node>
    </zookeeper-servers>

    <macros>
        <replica>cdh1</replica>
    </macros>

    <networks>
        <ip>::/0</ip>
    </networks>

    <clickhouse_compression>
        <case>
            <min_part_size>10000000000</min_part_size>
            <min_part_size_ratio>0.01</min_part_size_ratio>
            <method>lz4</method>
        </case>
    </clickhouse_compression>

</yandex>

Because the cluster nodes need to access each other's services, the ClickHouse service must listen on an externally reachable IP and port. On cdh1, cdh2, and cdh3, edit /etc/clickhouse-server/config.xml and uncomment the <listen_host> tag under <yandex> (around lines 69-70 of that file), configured as follows.

<yandex>

    <listen_host>::</listen_host>

    <timezone>Asia/Shanghai</timezone>

    <include_from>/etc/clickhouse-server/metrika.xml</include_from>

</yandex>

To keep ClickHouse from running wide open, let's now configure user authentication. Passwords can be configured in two ways: as plaintext, or as a digest (the sha256sum hash); the official recommendation is the hashed form. On cdh1, cdh2, and cdh3, edit /etc/clickhouse-server/users.xml. Users and passwords are configured mainly in the <users> tag. The file below defines two users. The first is the default user, default, used whenever no user is specified; its password is configured as a sha256 digest. The second user, ck, is a read-only user that can only query data (it cannot create tables, modify data, and so on); its password is configured directly as plaintext.


<yandex>

    <profiles>

        <default>
            <max_memory_usage>10000000000</max_memory_usage>
            <use_uncompressed_cache>0</use_uncompressed_cache>
            <load_balancing>random</load_balancing>
        </default>

        <readonly>
            <readonly>1</readonly>
        </readonly>
    </profiles>

    <users>
        <default>
            <password_sha256_hex>abb23878df2926d6863ca539f78f4758722966196e8f918cd74d8c11e95dc8ae</password_sha256_hex>
            <networks incl="networks" replace="replace">
                <ip>::/0</ip>
            </networks>

            <profile>default</profile>

            <quota>default</quota>

            <databases>
                <test>
                    <filtered_table1>
                        <filter>a = 1</filter>
                    </filtered_table1>

                    <filtered_table2>
                        <filter>a + b &lt; 1 or c - d &gt; 5</filter>
                    </filtered_table2>

                    <filtered_table3>
                        <filter>c = 1</filter>
                    </filtered_table3>
                </test>
            </databases>
        </default>

        <ck>
            <password>123456</password>
            <networks incl="networks" replace="replace">
                <ip>::/0</ip>
            </networks>
            <profile>readonly</profile>
            <quota>default</quota>
        </ck>
    </users>

    <quotas>

        <default>

            <interval>

                <duration>3600</duration>

                <queries>0</queries>
                <errors>0</errors>
                <result_rows>0</result_rows>
                <read_rows>0</read_rows>
                <execution_time>0</execution_time>
            </interval>
        </default>
    </quotas>
</yandex>

The sha256sum digest can be generated with the following command (first line). Pressing Enter prints two lines (the second and third below): the second line is the raw password and the third is the hashed digest. Use the third line's string in the configuration file, and the second line's password when logging in from a client.

PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha256sum | tr -d '-'
KavrqeN1
abb23878df2926d6863ca539f78f4758722966196e8f918cd74d8c11e95dc8ae

Since passwords are now configured, every node in the cluster must also know the credentials of the other nodes. So, in the shard definitions of the /etc/clickhouse-server/metrika.xml file above, add the user and password. For convenience, all three nodes use the same password here, but each node could use a different one.

		<perftest_3shards_1replicas>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>cdh1</host>
                    <port>9000</port>
                    <user>default</user>
                    <password>KavrqeN1</password>
                </replica>
            </shard>

            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>cdh2</host>
                    <port>9000</port>
                    <user>default</user>
                    <password>KavrqeN1</password>
                </replica>
            </shard>

            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>cdh3</host>
                    <port>9000</port>
                    <user>default</user>
                    <password>KavrqeN1</password>
                </replica>
            </shard>
        </perftest_3shards_1replicas>

Start the ClickHouse service on all three nodes with the following commands. Before starting, make sure port 9000 is free on every node (lsof -i:9000). If it is taken, change the port (9000) in /etc/clickhouse-server/config.xml, and remember to keep the port numbers in /etc/clickhouse-server/metrika.xml consistent with it.

# Start the service
systemctl start clickhouse-server

# Check the service status. If Active shows active and there are no errors in the output, the startup succeeded.
systemctl status clickhouse-server
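Once all three nodes are up, a quick sanity check is to query the system tables from any node (the first query's output should resemble the system.clusters listing shown in section 4.1 below):

-- All three shards should be listed with the hosts and ports configured above
SELECT cluster, shard_num, host_name, port FROM system.clusters;

-- Returns rows only if ClickHouse can reach the ZooKeeper ensemble
SELECT name FROM system.zookeeper WHERE path = '/';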

4 Client Tools

4.1 clickhouse-client

# 1 When no password is set
clickhouse-client

# 2 Specify the user name and password
#    -h           the server host to connect to
#    -u           the user; defaults to the default user
#    --port       the service port; defaults to 9000, and must be given if the port was changed
#    --password   the login user's password
#    --multiline  allow multi-line SQL; otherwise a newline (Enter) executes the statement immediately, though a trailing \ also continues a line
#    --help       show help for the options
clickhouse-client -h 127.0.0.1 -u default --password KavrqeN1
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1  --multiline
# Pass a SQL command directly
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1  --multiline -q "SELECT now()" 
-- Inspect the cluster information
cdh3 :) SELECT * FROM system.clusters;
SELECT *
FROM system.clusters
┌─cluster────────────────────┬─shard_num─┬─shard_weight─┬─replica_num─┬─host_name─┬─host_address─┬─port─┬─is_local─┬─user────┬─default_database─┬─errors_count─┬─estimated_recovery_time─┐
│ perftest_3shards_1replicas │         1 │            1 │           1 │ cdh1      │ 192.168.33.3 │ 9000 │        0 │ default │                  │            0 │                       0 │
│ perftest_3shards_1replicas │         2 │            1 │           1 │ cdh2      │ 192.168.33.6 │ 9000 │        0 │ default │                  │            0 │                       0 │
│ perftest_3shards_1replicas │         3 │            1 │           1 │ cdh3      │ 192.168.33.9 │ 9000 │        1 │ default │                  │            0 │                       0 │
└────────────────────────────┴───────────┴──────────────┴─────────────┴───────────┴──────────────┴──────┴──────────┴─────────┴──────────────────┴──────────────┴─────────────────────────┘
3 rows in set. Elapsed: 0.003 sec.

4.2 DBeaver

  • Create a new connection
  • Under All (or Analytical), select ClickHouse, then click Next
  • The default port is 8123; for the host, choose a ClickHouse server node (in a cluster, any ClickHouse service node will do). Enter the user name and password set up in the authentication section.
  • Test the connection; it will prompt to download the driver, which you can confirm.

To view the ClickHouse cluster information, run the following SQL in DBeaver. It shows the shards, shard numbers, host names, ports, users, and so on.
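This is the same query used with clickhouse-client in section 4.1:

SELECT * FROM system.clusters;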

4.3 JDBC

Add the dependency to your project:

<dependency>
    <groupId>ru.yandex.clickhouse</groupId>
    <artifactId>clickhouse-jdbc</artifactId>
    <version>0.2</version>
</dependency>

The Scala code is as follows:

package yore

import java.sql.{Connection, DriverManager, ResultSet, ResultSetMetaData, Statement}

/**
  * The dataset can be obtained by running this script:
  * https://github.com/Percona-Lab/ontime-airline-performance/blob/master/download.sh
  *
  * Created by yore on 2019/11/4 17:25
  */
object JdbcClient {
  private val address =  "jdbc:clickhouse://cdh3:8123/default"


  def main(args: Array[String]): Unit = {
    Class.forName("ru.yandex.clickhouse.ClickHouseDriver")

    var connection: Connection = null
    var statement: Statement = null
    var results: ResultSet = null

    var sql = "SELECT COUNT(*) FROM part "
    sql = "SELECT * FROM lineorder ORDER BY LO_COMMITDATE DESC LIMIT 10"
    sql =
      """
        |SELECT
        |   min(Year), max(Year), Carrier, count(*) AS cnt,
        |   sum(ArrDelayMinutes>30) AS flights_delayed,
        |   round(sum(ArrDelayMinutes>30)/count(*),2) AS rate
        |FROM ontime
        |WHERE
        |   DayOfWeek NOT IN (6,7) AND OriginState NOT IN ('AK', 'HI', 'PR', 'VI')
        |   AND DestState NOT IN ('AK', 'HI', 'PR', 'VI')
        |   AND FlightDate < '2010-01-01'
        |GROUP by Carrier
        |HAVING cnt>100000 and max(Year)>1990
        |ORDER by rate DESC
        |LIMIT 1000;
      """.stripMargin


    try{
      // The user name and password are the ones configured in /etc/clickhouse-server/users.xml earlier
      connection = DriverManager.getConnection(address, "ck", "123456")
      statement = connection.createStatement

      val begin = System.currentTimeMillis
      results = statement.executeQuery(sql)
      val end = System.currentTimeMillis

      val rsmd: ResultSetMetaData = results.getMetaData()
      for(i <- 1 to rsmd.getColumnCount){
        print(s"${rsmd.getColumnName(i)}\t")
      }
      println()

      while(results.next()){
        for(i <- 1 to rsmd.getColumnCount){
          print(s"${results.getString(rsmd.getColumnName(i))}\t")
        }
        println()
      }
      println(s"${"-"*30}\n执行${sql} 耗时${end-begin} ms")
    }catch {
      case e: Exception => e.printStackTrace()
    }
  }
}

Running the query above prints the following to the console:

min(Year)	max(Year)	Carrier	cnt	flights_delayed	rate	
2003	2009	EV	1454777	237698	0.16	
2003	2009	B6	683874	103677	0.15	
2003	2009	FL	1082489	158748	0.15	
2006	2009	YV	740608	110389	0.15	
2006	2009	XE	1016010	152431	0.15	
2003	2005	DH	501056	69833	0.14	
2001	2009	MQ	3238137	448037	0.14	
2003	2006	RU	1007248	126733	0.13	
2004	2009	OH	1195868	160071	0.13	
1987	2009	UA	9655762	1203503	0.12	
2003	2006	TZ	136735	16496	0.12	
1987	2009	CO	6092575	681750	0.11	
1987	2009	AA	10678564	1189672	0.11	
1987	2001	TW	2693587	283362	0.11	
1987	2009	AS	1511966	147972	0.1	
1987	2009	NW	7648247	729920	0.1	
1987	2009	DL	11948844	1163924	0.1	
1988	2009	US	10229664	986338	0.1	
2007	2009	9E	577244	59440	0.1	
2003	2009	OO	2654259	257069	0.1	
1987	1991	PA	218938	20497	0.09	
2005	2009	F9	307569	28679	0.09	
1987	2005	HP	2628455	236633	0.09	
1987	2009	WN	12726332	1108072	0.09	
------------------------------
Executed
SELECT
   min(Year), max(Year), Carrier, count(*) AS cnt,
   sum(ArrDelayMinutes>30) AS flights_delayed,
   round(sum(ArrDelayMinutes>30)/count(*),2) AS rate
FROM ontime
WHERE
   DayOfWeek NOT IN (6,7) AND OriginState NOT IN ('AK', 'HI', 'PR', 'VI')
   AND DestState NOT IN ('AK', 'HI', 'PR', 'VI')
   AND FlightDate < '2010-01-01'
GROUP by Carrier
HAVING cnt>100000 and max(Year)>1990
ORDER by rate DESC
LIMIT 1000;
       took 1801 ms

Process finished with exit code 0

5 Data Types

Fixed-length integers, signed and unsigned (a short example follows this list).

  • Signed integer ranges

    • Int8 → [-128 : 127]
    • Int16 → [-32768 : 32767]
    • Int32 → [-2147483648 : 2147483647]
    • Int64 → [-9223372036854775808 : 9223372036854775807]
  • Unsigned integer ranges

    • UInt8 → [0 : 255]
    • UInt16 → [0 : 65535]
    • UInt32 → [0 : 4294967295]
    • UInt64 → [0 : 18446744073709551615]
  • Floating-point types

    • Float32 → corresponds to float
    • Float64 → corresponds to double
  • Decimal types

    • Decimal(P, S) → P: precision, valid range [1:38], determines how many decimal digits the number can hold (including the fractional part). S: scale, valid range [0:P], determines how many digits the fractional part contains.
    • Decimal32(S) → equivalent to P from 1 to 9
    • Decimal64(S) → equivalent to P from 10 to 18
    • Decimal128(S) → equivalent to P from 19 to 38
  • Boolean. Not supported; use UInt8 with the values 1 and 0

  • String types

    • String → strings of arbitrary length. They can contain an arbitrary set of bytes, including null bytes. ClickHouse has no notion of encodings; strings are stored and emitted exactly as the bytes they contain.
    • FixedString(N) → a fixed-length string of N bytes (N must be a strictly positive natural number).
    • UUID → dedicated to storing UUID values, in the format 00000000-0000-0000-0000-000000000000; ClickHouse's built-in function generateUUIDv4() can generate one
  • Dates

    • Date → a date, stored in two bytes as the number of days since 1970-01-01 (unsigned). Allows storing values from just after the start of the Unix epoch up to an upper threshold defined as a constant at compile time (currently the year 2106, but the last fully supported year is 2105). The minimum value prints as 0000-00-00. No time-zone information is stored with a date.
    • DateTime → a timestamp, stored in four bytes as an unsigned Unix timestamp. Allows storing values in the same range as the Date type: [1970-01-01 00:00:00, 2105-12-31 23:59:59]. By default the client adopts the server's time zone when it connects; you can switch to the client's time zone with the command-line option --use_client_time_zone.
  • Others

    • Enum8 → described by 'String' = Int8 pairs, e.g. x Enum8('hello' = 1, 'world' = 2)
    • Enum16 → described by 'String' = Int16 pairs.
    • Array[T]
    • Tuple(T1, T2, …)
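A quick illustration touching several of these types (a hypothetical table, using the same old-style MergeTree parameters as the examples later in this article):

CREATE TABLE type_demo (
    EventDate Date,
    Id        UUID,                          -- can be filled with generateUUIDv4()
    Flag      UInt8,                         -- stands in for Boolean: 1 or 0
    Price     Decimal(18, 2),
    Level     Enum8('low' = 1, 'high' = 2),
    Tags      Array(String),
    Code      FixedString(5)
) ENGINE = MergeTree(EventDate, (EventDate, Id), 8192);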

6 Table Engines

  • MergeTree: the most universal and powerful table engines for high-load tasks. What these engines share is fast data insertion followed by background data processing. The MergeTree family supports data replication (via the Replicated* engine versions), partitioning, and other features the remaining engines lack.
  • Log: lightweight engines with minimum functionality. They are most effective when you need to quickly write many small tables (up to about one million rows) and read them back later as a whole. (A short sketch follows this list.)
  • Integration engines
    • HDFS: ENGINE=HDFS('hdfs://hdfs1:9000/other_storage', 'TSV')
    • MySQL
    • Kafka
    • JDBC
    • ODBC
  • Others
    • Distributed
    • MaterializedView
    • Dictionary
    • Merge
    • File
    • Null
    • Set
    • Join
    • URL
    • View
    • Memory
    • Buffer
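A minimal sketch contrasting two of the simpler engines above (the table names are hypothetical):

-- Log: many small write-once tables, read back as a whole later
CREATE TABLE import_staging (
    Line String
) ENGINE = Log;

-- Memory: kept in RAM only; the data is gone after a server restart
CREATE TABLE session_scratch (
    UserID UInt64,
    Url    String
) ENGINE = Memory;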

7 Database Engines

By default, ClickHouse uses its own database engine, but it also supports a MySQL database engine, which maps the tables of a remote MySQL server into ClickHouse and lets you run INSERT and SELECT queries against them, making it easy to exchange data between ClickHouse and MySQL. The MySQL database engine translates its queries into MySQL syntax and sends them to the MySQL server, so we can run operations such as SHOW TABLES or SHOW CREATE TABLE. However, through this engine you cannot perform ATTACH/DETACH, DROP, RENAME, CREATE TABLE, or ALTER.

Below we use ClickHouse's MySQL database engine to query a dataset stored in MySQL; one of the tables holds roughly 20 million rows. The demo gives a feel for ClickHouse's speed.

-- 1 Create a MySQL-engine database in ClickHouse that exchanges data with the MySQL server:
cdh3 :) CREATE DATABASE IF NOT EXISTS flink_test
:-] ENGINE = MySQL('cdh1:3306', 'flink_test', 'scm', 'scm');
CREATE DATABASE IF NOT EXISTS flink_test
ENGINE = MySQL('cdh1:3306', 'flink_test', 'scm', 'scm')
Ok.
0 rows in set. Elapsed: 0.004 sec.

-- 2 List the databases
cdh3 :) SHOW DATABASES;
SHOW DATABASES
┌─name───────┐
│ default    │
│ flink_test │
│ system     │
└────────────┘
3 rows in set. Elapsed: 0.006 sec.

-- 3 List the tables in flink_test. The MySQL tables are now visible in ClickHouse. Tables not used here are omitted.
cdh3 :) SHOW TABLES FROM flink_test;
SHOW TABLES FROM flink_test
┌─name───────────────────┐
│ vote_recordss_memory   │
│ w3                     │
└────────────────────────┘
17 rows in set. Elapsed: 0.009 sec.

-- 4 Select the database
cdh3 :) USE flink_test;
USE flink_test
Ok.
0 rows in set. Elapsed: 0.005 sec.

-- 5 Insert data (table names are case-sensitive)
cdh3 :) INSERT INTO w3 VALUES(3, 'Mercury');
INSERT INTO w3 VALUES
Ok.
1 rows in set. Elapsed: 0.022 sec.

-- 6 Query the data. Rows inserted through this engine cannot be deleted or updated.
cdh3 :) SELECT * FROM w3;
SELECT *
FROM w3
┌─id─┬─f1──────┐
│  3 │ Mercury │
│  5 │ success │
└────┴─────────┘
2 rows in set. Elapsed: 0.010 sec.

-- 7 Compare the aggregation capability of MySQL and ClickHouse
--  7.1 MySQL. Counting a table of nearly 20 million rows took 29.54 seconds in MySQL
mysql> SELECT COUNT(*) FROM vote_recordss_memory;
+----------+
| COUNT(*) |
+----------+
| 19999998 |
+----------+
1 row in set (29.54 sec)

--  7.2 The same COUNT in ClickHouse took 9.713 seconds
cdh3 :) SELECT COUNT(*) FROM vote_recordss_memory;
SELECT COUNT(*)
FROM vote_recordss_memory
┌──COUNT()─┐
│ 19999998 │
└──────────┘
1 rows in set. Elapsed: 9.713 sec. Processed 20.00 million rows, 20.00 MB (2.06 million rows/s., 2.06 MB/s.)

-- 7.3 Specify the MySQL connection, database, table, and login inline; equivalent to the SQL above.
cdh3 :) SELECT COUNT(*) FROM  mysql('cdh1:3306', 'flink_test', 'vote_recordss_memory', 'root', '123456');

-- 8 Use ClickHouse's MergeTree table engine
--  8.1 Switch to ClickHouse's default database
cdh1 :) USE default;
USE default
Ok.
0 rows in set. Elapsed: 0.007 sec.

--  8.2 Create a table with the MergeTree engine, load the MySQL data into it, and sort by the primary key
cdh1 :) CREATE TABLE vote_recordss
:-] ENGINE = MergeTree--(id, create_time)
:-] ORDER BY id AS
:-] SELECT * FROM mysql('cdh1:3306', 'flink_test', 'vote_recordss_memory', 'root', '123456');
CREATE TABLE vote_recordss
ENGINE = MergeTree
ORDER BY id AS
SELECT *
FROM mysql('cdh1:3306', 'flink_test', 'vote_recordss_memory', 'root', '123456')
Ok.
0 rows in set. Elapsed: 27.917 sec. Processed 20.00 million rows, 920.00 MB (716.40 thousand rows/s., 32.95 MB/s.)

--  8.3 Query. COUNT over a single column is roughly 2,950x faster than in MySQL
cdh1 :) SELECT COUNT(*) FROM vote_recordss;
SELECT COUNT(*)
FROM vote_recordss
┌──COUNT()─┐
│ 19999998 │
└──────────┘
1 rows in set. Elapsed: 0.009 sec. Processed 20.00 million rows, 80.00 MB (2.20 billion rows/s., 8.79 GB/s.)

--  8.4 DISTINCT. ClickHouse is about 94x faster than MySQL
mysql> SELECT DISTINCT group_id from vote_recordss_memory ;
+----------+
| group_id |
+----------+
|        1 |
|        2 |
|        0 |
+----------+
3 rows in set (12.79 sec)
--  Run in ClickHouse
cdh1 :) SELECT DISTINCT group_id from vote_recordss;
SELECT DISTINCT group_id
FROM vote_recordss
┌─group_id─┐
│        0 │
│        2 │
│        1 │
└──────────┘
3 rows in set. Elapsed: 0.136 sec. Processed 20.00 million rows, 80.00 MB (147.44 million rows/s., 589.76 MB/s.)

--  8.5 Group aggregation. ClickHouse is about 94x faster than MySQL
mysql> SELECT SUM(vote_num),group_id from vote_recordss_memory GROUP BY group_id;
+---------------+----------+
| SUM(vote_num) | group_id |
+---------------+----------+
|   33344339689 |        0 |
|   33315889226 |        1 |
|   33351509121 |        2 |
+---------------+----------+
3 rows in set (16.26 sec)
--  Run in ClickHouse
cdh1 :)  SELECT SUM(vote_num),group_id from vote_recordss GROUP BY group_id;
SELECT
    SUM(vote_num),
    group_id
FROM vote_recordss
GROUP BY group_id
┌─SUM(vote_num)─┬─group_id─┐
│   33344339689 │        0 │
│   33351509121 │        2 │
│   33315889226 │        1 │
└───────────────┴──────────┘
3 rows in set. Elapsed: 0.173 sec. Processed 20.00 million rows, 160.00 MB (115.56 million rows/s., 924.45 MB/s.)

--  8.6 Sorted TOP 10. ClickHouse is about 25x faster than MySQL
mysql> SELECT * FROM vote_recordss_memory ORDER BY create_time DESC,vote_num LIMIT 10;
+----------+----------------------+----------+----------+--------+---------------------+
| id       | user_id              | vote_num | group_id | status | create_time         |
+----------+----------------------+----------+----------+--------+---------------------+
| 19999993 | 4u6PJYvsDD4khghreFvm |     2388 |        0 |      1 | 2019-10-15 01:00:20 |
| 19999998 | shTrosZpT5zux3wiKH5a |     4991 |        2 |      1 | 2019-10-15 01:00:20 |
| 19999995 | xRwQuMgQeuBoXvsBusFO |     6737 |        2 |      1 | 2019-10-15 01:00:20 |
| 19999996 | 5QNgMYoQUSsuX7Aqarw8 |     7490 |        2 |      2 | 2019-10-15 01:00:20 |
| 19999997 | eY12Wq9iSm0MH1PUTChk |     7953 |        0 |      2 | 2019-10-15 01:00:20 |
| 19999994 | ZpS0dWRm1TdhzTxTHCSj |     9714 |        0 |      1 | 2019-10-15 01:00:20 |
| 19999946 | kf7FOTUHAICP5Mv2xodI |       32 |        2 |      2 | 2019-10-15 01:00:19 |
| 19999738 | ER90qVc4CJCKH5bxXYTo |       57 |        1 |      2 | 2019-10-15 01:00:19 |
| 19999810 | gJHbBkGf0bJViwy5BB2d |      190 |        1 |      2 | 2019-10-15 01:00:19 |
| 19999977 | Wq7bogXRiHubhFlAHBJH |      208 |        0 |      2 | 2019-10-15 01:00:19 |
+----------+----------------------+----------+----------+--------+---------------------+
10 rows in set (15.31 sec)
--  Run in ClickHouse
cdh1 :)  SELECT * FROM vote_recordss ORDER BY create_time DESC,vote_num LIMIT 10;
SELECT *
FROM vote_recordss
ORDER BY
    create_time DESC,
    vote_num ASC
LIMIT 10
┌───────id─┬─user_id──────────────┬─vote_num─┬─group_id─┬─status─┬─────────create_time─┐
│ 19999993 │ 4u6PJYvsDD4khghreFvm │     2388 │        0 │      1 │ 2019-10-15 01:00:20 │
│ 19999998 │ shTrosZpT5zux3wiKH5a │     4991 │        2 │      1 │ 2019-10-15 01:00:20 │
│ 19999995 │ xRwQuMgQeuBoXvsBusFO │     6737 │        2 │      1 │ 2019-10-15 01:00:20 │
│ 19999996 │ 5QNgMYoQUSsuX7Aqarw8 │     7490 │        2 │      2 │ 2019-10-15 01:00:20 │
│ 19999997 │ eY12Wq9iSm0MH1PUTChk │     7953 │        0 │      2 │ 2019-10-15 01:00:20 │
│ 19999994 │ ZpS0dWRm1TdhzTxTHCSj │     9714 │        0 │      1 │ 2019-10-15 01:00:20 │
│ 19999946 │ kf7FOTUHAICP5Mv2xodI │       32 │        2 │      2 │ 2019-10-15 01:00:19 │
│ 19999738 │ ER90qVc4CJCKH5bxXYTo │       57 │        1 │      2 │ 2019-10-15 01:00:19 │
│ 19999810 │ gJHbBkGf0bJViwy5BB2d │      190 │        1 │      2 │ 2019-10-15 01:00:19 │
│ 19999977 │ Wq7bogXRiHubhFlAHBJH │      208 │        0 │      2 │ 2019-10-15 01:00:19 │
└──────────┴──────────────────────┴──────────┴──────────┴────────┴─────────────────────┘
10 rows in set. Elapsed: 0.607 sec. Processed 20.00 million rows, 920.00 MB (32.93 million rows/s., 1.51 GB/s.)

Across all these query scenarios, ClickHouse is far faster than MySQL; every query finished within one second. It leaves a deep impression, and suggests a thought: perhaps let go of MySQL for analytics, and leave the specialized work to a specialized database.


8 Examples

In this section we analyze the official airline on-time dataset. It has 109 columns and records fairly detailed flight information from 1987 to 2018. From it we can derive, for example, the number of flights per day, the number of flights per weekday delayed by more than 10 minutes, the number of delays longer than 10 minutes per carrier, and so on.

8.1 Getting the Data

There are two ways to get the flight dataset. The first is to download the per-year, per-month CSV files and import them into ClickHouse; this means downloading about 374 zip archives, 6.6 GB in total. The second is to download pre-built partition data and unpack it into ClickHouse's data directory (/var/lib/clickhouse); after restarting the ClickHouse service the newly imported data can be queried. This way requires downloading about 16 GB of data.

With the first approach, some files may arrive corrupted because of network problems; if the import reports an error, re-download that year and month's file. The download takes quite a while. The second approach transfers more data but only a single file, so the download is relatively quick. To demonstrate the CSV import process up front, we take the first approach here and import the files into the database.

Run the following script to fetch the dataset files for 1987 through 2018 in a loop.

for s in `seq 1987 2018`
do
for m in `seq 1 12`
do
wget https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_${s}_${m}.zip
done
done

8.2 Inserting the Data

# 1 Log in with the ClickHouse client on cdh3, since the downloaded data files are on cdh3
#  Pass --multiline to allow multi-line SQL
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1  --multiline
-- 2 Create the table on cdh3; it is used for the data import
cdh3 :) CREATE TABLE `ontime` (
  `Year` UInt16,
  `Quarter` UInt8,
  `Month` UInt8,
  `DayofMonth` UInt8,
  `DayOfWeek` UInt8,
  `FlightDate` Date,
  `UniqueCarrier` FixedString(7),
  `AirlineID` Int32,
  `Carrier` FixedString(2),
  `TailNum` String,
  `FlightNum` String,
  `OriginAirportID` Int32,
  `OriginAirportSeqID` Int32,
  `OriginCityMarketID` Int32,
  `Origin` FixedString(5),
  `OriginCityName` String,
  `OriginState` FixedString(2),
  `OriginStateFips` String,
  `OriginStateName` String,
  `OriginWac` Int32,
  `DestAirportID` Int32,
  `DestAirportSeqID` Int32,
  `DestCityMarketID` Int32,
  `Dest` FixedString(5),
  `DestCityName` String,
  `DestState` FixedString(2),
  `DestStateFips` String,
  `DestStateName` String,
  `DestWac` Int32,
  `CRSDepTime` Int32,
  `DepTime` Int32,
  `DepDelay` Int32,
  `DepDelayMinutes` Int32,
  `DepDel15` Int32,
  `DepartureDelayGroups` String,
  `DepTimeBlk` String,
  `TaxiOut` Int32,
  `WheelsOff` Int32,
  `WheelsOn` Int32,
  `TaxiIn` Int32,
  `CRSArrTime` Int32,
  `ArrTime` Int32,
  `ArrDelay` Int32,
  `ArrDelayMinutes` Int32,
  `ArrDel15` Int32,
  `ArrivalDelayGroups` Int32,
  `ArrTimeBlk` String,
  `Cancelled` UInt8,
  `CancellationCode` FixedString(1),
  `Diverted` UInt8,
  `CRSElapsedTime` Int32,
  `ActualElapsedTime` Int32,
  `AirTime` Int32,
  `Flights` Int32,
  `Distance` Int32,
  `DistanceGroup` UInt8,
  `CarrierDelay` Int32,
  `WeatherDelay` Int32,
  `NASDelay` Int32,
  `SecurityDelay` Int32,
  `LateAircraftDelay` Int32,
  `FirstDepTime` String,
  `TotalAddGTime` String,
  `LongestAddGTime` String,
  `DivAirportLandings` String,
  `DivReachedDest` String,
  `DivActualElapsedTime` String,
  `DivArrDelay` String,
  `DivDistance` String,
  `Div1Airport` String,
  `Div1AirportID` Int32,
  `Div1AirportSeqID` Int32,
  `Div1WheelsOn` String,
  `Div1TotalGTime` String,
  `Div1LongestGTime` String,
  `Div1WheelsOff` String,
  `Div1TailNum` String,
  `Div2Airport` String,
  `Div2AirportID` Int32,
  `Div2AirportSeqID` Int32,
  `Div2WheelsOn` String,
  `Div2TotalGTime` String,
  `Div2LongestGTime` String,
  `Div2WheelsOff` String,
  `Div2TailNum` String,
  `Div3Airport` String,
  `Div3AirportID` Int32,
  `Div3AirportSeqID` Int32,
  `Div3WheelsOn` String,
  `Div3TotalGTime` String,
  `Div3LongestGTime` String,
  `Div3WheelsOff` String,
  `Div3TailNum` String,
  `Div4Airport` String,
  `Div4AirportID` Int32,
  `Div4AirportSeqID` Int32,
  `Div4WheelsOn` String,
  `Div4TotalGTime` String,
  `Div4LongestGTime` String,
  `Div4WheelsOff` String,
  `Div4TailNum` String,
  `Div5Airport` String,
  `Div5AirportID` Int32,
  `Div5AirportSeqID` Int32,
  `Div5WheelsOn` String,
  `Div5TotalGTime` String,
  `Div5LongestGTime` String,
  `Div5WheelsOff` String,
  `Div5TailNum` String
) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192);

-- 3 List the created tables
cdh3 :) SHOW TABLES;
SHOW TABLES
┌─name───┐
│ ontime │
└────────┘
1 rows in set. Elapsed: 0.003 sec.

-- 4 A created table can also be renamed
RENAME TABLE ontime TO ontime_local;

Now we import the data. To measure the total import time, I wrote a small script. Save it as load_to_ontime.sh in the directory on cdh3 that holds the downloaded data files, make it executable with chmod +x load_to_ontime.sh, and run ./load_to_ontime.sh. Because each archive is decompressed along the way, the import takes a while; here it took about 1,933,045 milliseconds for 184,694,329 rows in total.

# Function that captures the current time in milliseconds
getMsec(){
	current=`date "+%Y-%m-%d %H:%M:%S"`     # current time, e.g. 2019-03-25 11:09:17
	timeStamp=`date -d "$current" +%s`      # convert to a Unix timestamp in seconds
	currentTimeStamp=$((timeStamp*1000+`date "+%N"`/1000000)) # convert to a timestamp in milliseconds
}

getMsec
a=$currentTimeStamp
for i in *.zip; do echo $i; unzip -cq $i '*.csv' | sed 's/\.00//g' | clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1 --query="INSERT INTO ontime FORMAT CSVWithNames"; done

getMsec
b=$currentTimeStamp
spendtime=$((b-a))

echo -e '\n ---------------------------- '
echo "Total time: $spendtime ms"

After the import finishes, we can count the inserted rows on cdh3. Notice how fast the query is: back in section 7 (Database Engines), a COUNT over a 20-million-row table in MySQL took nearly half a minute.

cdh3 :) SELECT COUNT(1) FROM ontime;
SELECT COUNT(1)
FROM ontime
┌──COUNT(1)─┐
│ 184694329 │
└───────────┘
1 rows in set. Elapsed: 0.803 sec. Processed 184.69 million rows, 184.69 MB (230.03 million rows/s., 230.03 MB/s.)

8.3 Creating a Distributed Table and Loading Data

A Distributed table stores no data itself; it acts as a router. When creating one you specify the cluster name, database name, table name, and sharding key. Here we shard with the rand() function, i.e. random sharding. Querying the distributed table routes to the underlying data tables according to the cluster configuration and merges the results.
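Its general form looks like this (a sketch; the names are the ones used in the walkthrough below, and cluster_name must match a cluster declared in /etc/metrika.xml, here perftest_3shards_1replicas):

CREATE TABLE <distributed_table> AS <local_table>
ENGINE = Distributed(cluster_name, database, table[, sharding_key]);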

# 1 Create the sharded local table
#  On cdh1, cdh2, and cdh3, enter the clickhouse client as follows
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1  --multiline
-- Run the same CREATE TABLE as step 2 of section 8.2, but with the table named ontime_local
CREATE TABLE `ontime_local` (
  …… 
) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192);

-- 2 Create the distributed table.
--  ontime_all lives on the same node as ontime, which makes inserting data convenient.
cdh3 :) CREATE TABLE ontime_all AS ontime_local
ENGINE = Distributed(perftest_3shards_1replicas, default, ontime_local, rand());

-- 3 List the tables on cdh3.
cdh3 :) SHOW TABLES;
SHOW TABLES
┌─name─────────┐
│ ontime       │
│ ontime_all   │
│ ontime_local │
└──────────────┘
3 rows in set. Elapsed: 0.002 sec.

-- 4 Insert the data into the distributed table.
cdh3 :) INSERT INTO ontime_all SELECT * FROM ontime;
INSERT INTO ontime_all SELECT *
FROM ontime
Ok.
0 rows in set. Elapsed: 571.056 sec. Processed 184.69 million rows, 134.19 GB (323.43 thousand rows/s., 234.98 MB/s.)

-- 5 Check the shard data on each node.
--  The three shards' row counts sum to the total in ontime, i.e. the total in ontime_all
--  5.1 cdh1
cdh1 :) SELECT COUNT(*) FROM ontime_local;
SELECT COUNT(*)
FROM ontime_local
┌──COUNT()─┐
│ 61562643 │
└──────────┘
1 rows in set. Elapsed: 3.843 sec. Processed 61.56 million rows, 61.56 MB (16.02 million rows/s., 16.02 MB/s.)
--  5.2 cdh2
cdh2 :)  SELECT COUNT(*) FROM ontime_local;
SELECT COUNT(*)
FROM ontime_local
┌──COUNT()─┐
│ 61555036 │
└──────────┘
1 rows in set. Elapsed: 3.543 sec. Processed 61.56 million rows, 61.56 MB (17.37 million rows/s., 17.37 MB/s.)
--  5.3 cdh3
cdh3 :)  SELECT COUNT(*) FROM ontime_local;
SELECT COUNT(*)
FROM ontime_local
┌──COUNT()─┐
│ 61576650 │
└──────────┘
1 rows in set. Elapsed: 3.858 sec. Processed 61.58 million rows, 61.58 MB (15.96 million rows/s., 15.96 MB/s.)

-- 6 Count the total rows in the distributed table.
cdh3 :)  SELECT COUNT(*) FROM ontime_all;
SELECT COUNT(*)
FROM ontime_all
┌───COUNT()─┐
│ 184694329 │
└───────────┘
1 rows in set. Elapsed: 4.026 sec. Processed 184.69 million rows, 184.69 MB (45.88 million rows/s., 45.88 MB/s.)

8.4 Running SQL Queries

-- Q0. Average number of flight records per month
cdh3 :) SELECT avg(c1) FROM (
:-]     SELECT Year, Month, count(*) AS c1
:-]     FROM ontime_all
:-]     GROUP BY Year, Month
:-] );
SELECT avg(c1)
FROM
(
    SELECT
        Year,
        Month,
        count(*) AS c1
    FROM ontime_all
    GROUP BY
        Year,
        Month
)
┌────────────avg(c1)─┐
│ 493835.10427807487 │
└────────────────────┘
1 rows in set. Elapsed: 0.474 sec. Processed 184.69 million rows, 554.08 MB (389.56 million rows/s., 1.17 GB/s.)

-- Q1. Number of flights per day of the week, 2000-2008
cdh3 :) SELECT DayOfWeek, count(*) AS c
:-] FROM ontime_all
:-] WHERE Year>=2000 AND Year<=2008
:-] GROUP BY DayOfWeek
:-] ORDER BY c DESC;
SELECT
    DayOfWeek,
    count(*) AS c
FROM ontime_all
WHERE (Year >= 2000) AND (Year <= 2008)
GROUP BY DayOfWeek
ORDER BY c DESC
┌─DayOfWeek─┬───────c─┐
│         5 │ 8732422 │
│         1 │ 8730614 │
│         4 │ 8710843 │
│         3 │ 8685626 │
│         2 │ 8639632 │
│         7 │ 8274367 │
│         6 │ 7514194 │
└───────────┴─────────┘
7 rows in set. Elapsed: 0.130 sec. Processed 59.29 million rows, 177.86 MB (455.75 million rows/s., 1.37 GB/s.)

-- Q2. Number of flights per day of the week delayed by more than 10 minutes, 2000-2008.
cdh3 :) SELECT DayOfWeek, count(*) AS c
:-] FROM ontime_all
:-] WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
:-] GROUP BY DayOfWeek
:-] ORDER BY c DESC;
SELECT
    DayOfWeek,
    count(*) AS c
FROM ontime_all
WHERE (DepDelay > 10) AND (Year >= 2000) AND (Year <= 2008)
GROUP BY DayOfWeek
ORDER BY c DESC
┌─DayOfWeek─┬───────c─┐
│         5 │ 2088300 │
│         4 │ 1918325 │
│         1 │ 1795120 │
│         7 │ 1782292 │
│         3 │ 1640798 │
│         2 │ 1538291 │
│         6 │ 1391984 │
└───────────┴─────────┘
7 rows in set. Elapsed: 0.320 sec. Processed 59.29 million rows, 273.62 MB (185.52 million rows/s., 856.17 MB/s.)

-- Q3. Number of delays longer than 10 minutes per airport, 2000-2008
cdh3 :) SELECT Origin, count(*) AS c
:-] FROM ontime_all
:-] WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
:-] GROUP BY Origin
:-] ORDER BY c DESC
:-] LIMIT 10;
SELECT
    Origin,
    count(*) AS c
FROM ontime_all
WHERE (DepDelay > 10) AND (Year >= 2000) AND (Year <= 2008)
GROUP BY Origin
ORDER BY c DESC
LIMIT 10
┌─Origin─┬──────c─┐
│ ORD    │ 846692 │
│ ATL    │ 822955 │
│ DFW    │ 601318 │
│ LAX    │ 391247 │
│ PHX    │ 391191 │
│ LAS    │ 351713 │
│ DEN    │ 345108 │
│ EWR    │ 292916 │
│ DTW    │ 289233 │
│ IAH    │ 283861 │
└────────┴────────┘
10 rows in set. Elapsed: 0.293 sec. Processed 59.29 million rows, 322.25 MB (202.26 million rows/s., 1.10 GB/s.)

-- Q4. Number of delays longer than 10 minutes per carrier in 2007
cdh3 :) SELECT Carrier, count(*)
:-] FROM ontime_all
:-] WHERE DepDelay>10 AND Year=2007
:-] GROUP BY Carrier
:-] ORDER BY count(*) DESC;
SELECT
    Carrier,
    count(*)
FROM ontime_all
WHERE (DepDelay > 10) AND (Year = 2007)
GROUP BY Carrier
ORDER BY count(*) DESC
┌─Carrier─┬─count()─┐
│ WN      │  296293 │
│ AA      │  176203 │
│ MQ      │  145630 │
│ US      │  135987 │
│ UA      │  128174 │
│ OO      │  127426 │
│ EV      │  101796 │
│ XE      │   99915 │
│ DL      │   93675 │
│ NW      │   90429 │
│ CO      │   76662 │
│ YV      │   67905 │
│ FL      │   59460 │
│ OH      │   59034 │
│ B6      │   50740 │
│ 9E      │   46948 │
│ AS      │   42830 │
│ F9      │   23035 │
│ AQ      │    4299 │
│ HA      │    2746 │
└─────────┴─────────┘
20 rows in set. Elapsed: 0.066 sec. Processed 7.46 million rows, 37.14 MB (113.40 million rows/s., 564.90 MB/s.)

-- The old ANY INNER|RIGHT|FULL JOIN behavior is disabled by default and must be enabled before running the SQL below
cdh3 :) Set any_join_distinct_right_table_keys=1;
SET any_join_distinct_right_table_keys = 1
Ok.
0 rows in set. Elapsed: 0.001 sec.
-- Q5. Percentage of flights per carrier delayed by more than 10 minutes in 2007
cdh3 :) SELECT Carrier, c, c2, c*100/c2 as c3 FROM (
:-]     SELECT Carrier, count(*) AS c FROM ontime_all
:-]     WHERE DepDelay>10
:-]         AND Year=2007
:-]     GROUP BY Carrier
:-] )ANY INNER JOIN(
:-]     SELECT Carrier, count(*) AS c2
:-]     FROM ontime_all
:-]     WHERE Year=2007
:-]     GROUP BY Carrier
:-] ) USING Carrier
:-] ORDER BY c3 DESC;
SELECT
    Carrier,
    c,
    c2,
    (c * 100) / c2 AS c3
FROM
(
    SELECT
        Carrier,
        count(*) AS c
    FROM ontime_all
    WHERE (DepDelay > 10) AND (Year = 2007)
    GROUP BY Carrier
)
ANY INNER JOIN
(
    SELECT
        Carrier,
        count(*) AS c2
    FROM ontime_all
    WHERE Year = 2007
    GROUP BY Carrier
) USING (Carrier)
ORDER BY c3 DESC
┌─Carrier─┬──────c─┬──────c2─┬─────────────────c3─┐
│ EV      │ 101796 │  286234 │ 35.563909249075934 │
│ US      │ 135987 │  485447 │ 28.012738774778708 │
│ AA      │ 176203 │  633857 │ 27.798541311368336 │
│ MQ      │ 145630 │  540494 │ 26.943869867195566 │
│ AS      │  42830 │  160185 │  26.73783437899928 │
│ B6      │  50740 │  191450 │ 26.503003395142336 │
│ UA      │ 128174 │  490002 │  26.15785241692891 │
│ WN      │ 296293 │ 1168871 │ 25.348648396615197 │
│ OH      │  59034 │  236032 │ 25.011015455531453 │
│ CO      │  76662 │  323151 │  23.72327487768876 │
│ F9      │  23035 │   97760 │ 23.562806873977088 │
│ YV      │  67905 │  294362 │ 23.068534661403305 │
│ XE      │  99915 │  434773 │  22.98095787916913 │
│ FL      │  59460 │  263159 │  22.59470510223857 │
│ NW      │  90429 │  414526 │  21.81503693375084 │
│ OO      │ 127426 │  597880 │  21.31297250284338 │
│ DL      │  93675 │  475889 │  19.68421207466447 │
│ 9E      │  46948 │  258851 │  18.13707499681284 │
│ AQ      │   4299 │   46360 │  9.273080241587575 │
│ HA      │   2746 │   56175 │  4.888295505117935 │
└─────────┴────────┴─────────┴────────────────────┘
20 rows in set. Elapsed: 0.130 sec. Processed 14.91 million rows, 66.96 MB (114.91 million rows/s., 516.05 MB/s.)
-- Another version of the query
cdh3 :) SELECT Carrier, avg(DepDelay>10)*100 AS c3
:-] FROM ontime_all
:-] WHERE Year=2007
:-] GROUP BY Carrier
:-] ORDER BY Carrier;
SELECT
    Carrier,
    avg(DepDelay > 10) * 100 AS c3
FROM ontime_all
WHERE Year = 2007
GROUP BY Carrier
ORDER BY Carrier ASC
┌─Carrier─┬─────────────────c3─┐
│ 9E      │  18.13707499681284 │
│ AA      │ 27.798541311368336 │
│ AQ      │  9.273080241587575 │
│ AS      │  26.73783437899928 │
│ B6      │ 26.503003395142333 │
│ CO      │  23.72327487768876 │
│ DL      │  19.68421207466447 │
│ EV      │ 35.563909249075934 │
│ F9      │ 23.562806873977088 │
│ FL      │  22.59470510223857 │
│ HA      │  4.888295505117935 │
│ MQ      │ 26.943869867195563 │
│ NW      │  21.81503693375084 │
│ OH      │ 25.011015455531453 │
│ OO      │  21.31297250284338 │
│ UA      │  26.15785241692891 │
│ US      │  28.01273877477871 │
│ WN      │ 25.348648396615197 │
│ XE      │ 22.980957879169132 │
│ YV      │ 23.068534661403305 │
└─────────┴────────────────────┘
20 rows in set. Elapsed: 0.102 sec. Processed 7.46 million rows, 59.65 MB (73.40 million rows/s., 587.24 MB/s.)

-- Q6. Same as the previous query, but over the wider range 2000-2008
cdh3 :) SELECT Carrier, c, c2, c*100/c2 as c3 FROM (
:-]     SELECT Carrier, count(*) AS c
:-]     FROM ontime_all
:-]     WHERE DepDelay>10
:-]         AND Year>=2000 AND Year<=2008
:-]     GROUP BY Carrier
:-] )ANY INNER JOIN(
:-]     SELECT Carrier, count(*) AS c2
:-]     FROM ontime_all
:-]     WHERE Year>=2000 AND Year<=2008
:-]     GROUP BY Carrier
:-] ) USING Carrier
:-] ORDER BY c3 DESC;
SELECT
    Carrier,
    c,
    c2,
    (c * 100) / c2 AS c3
FROM
(
    SELECT
        Carrier,
        count(*) AS c
    FROM ontime_all
    WHERE (DepDelay > 10) AND (Year >= 2000) AND (Year <= 2008)
    GROUP BY Carrier
)
ANY INNER JOIN
(
    SELECT
        Carrier,
        count(*) AS c2
    FROM ontime_all
    WHERE (Year >= 2000) AND (Year <= 2008)
    GROUP BY Carrier
) USING (Carrier)
ORDER BY c3 DESC
┌─Carrier─┬───────c─┬──────c2─┬─────────────────c3─┐
│ EV      │  461050 │ 1697172 │ 27.165779308166762 │
│ AS      │  354145 │ 1427189 │ 24.814162665211125 │
│ B6      │  197249 │  811341 │ 24.311479390293353 │
│ FL      │  298916 │ 1265138 │ 23.627145813342104 │
│ WN      │ 2165483 │ 9280539 │  23.33359086147906 │
│ YV      │  198787 │  854056 │ 23.275640004870876 │
│ XE      │  233488 │ 1036015 │ 22.537125427720643 │
│ MQ      │  876799 │ 3954895 │ 22.169969114224273 │
│ UA      │ 1096646 │ 5094635 │ 21.525506734044736 │
│ F9      │   72150 │  336958 │ 21.412164127279958 │
│ DH      │  147041 │  693047 │ 21.216598585665906 │
│ OH      │  301681 │ 1466421 │ 20.572605002246966 │
│ HP      │  245293 │ 1208561 │ 20.296286244550338 │
│ AA      │ 1276555 │ 6318386 │ 20.203814708376473 │
│ US      │  909154 │ 4650400 │ 19.550017202821262 │
│ TW      │   94808 │  511852 │  18.52254167220212 │
│ OO      │  556247 │ 3090849 │  17.99657634520483 │
│ CO      │  522219 │ 2925290 │ 17.851871096540854 │
│ DL      │ 1050448 │ 5912486 │  17.76660443678006 │
│ 9E      │   89391 │  521059 │ 17.155638804818647 │
│ NW      │  725076 │ 4280049 │ 16.940834088581695 │
│ RU      │  216279 │ 1314294 │ 16.455907125802902 │
│ TZ      │   32998 │  208420 │ 15.832453699261107 │
│ AQ      │   17239 │  154381 │ 11.166529559984713 │
│ HA      │   15968 │  274265 │  5.822106356990502 │
└─────────┴─────────┴─────────┴────────────────────┘
25 rows in set. Elapsed: 1.657 sec. Processed 118.58 million rows, 522.95 MB (71.56 million rows/s., 315.58 MB/s.)
-- Another version of the query
cdh3 :) SELECT Carrier, avg(DepDelay>10)*100 AS c3
:-] FROM ontime_all
:-] WHERE Year>=2000 AND Year<=2008
:-] GROUP BY Carrier
:-] ORDER BY Carrier;
SELECT
    Carrier,
    avg(DepDelay > 10) * 100 AS c3
FROM ontime_all
WHERE (Year >= 2000) AND (Year <= 2008)
GROUP BY Carrier
ORDER BY Carrier ASC
┌─Carrier─┬─────────────────c3─┐
│ 9E      │ 17.155638804818647 │
│ AA      │ 20.203814708376473 │
│ AQ      │ 11.166529559984713 │
│ AS      │ 24.814162665211125 │
│ B6      │ 24.311479390293353 │
│ CO      │ 17.851871096540854 │
│ DH      │ 21.216598585665906 │
│ DL      │  17.76660443678006 │
│ EV      │  27.16577930816676 │
│ F9      │ 21.412164127279958 │
│ FL      │   23.6271458133421 │
│ HA      │  5.822106356990502 │
│ HP      │ 20.296286244550338 │
│ MQ      │ 22.169969114224273 │
│ NW      │ 16.940834088581695 │
│ OH      │  20.57260500224697 │
│ OO      │  17.99657634520483 │
│ RU      │ 16.455907125802902 │
│ TW      │  18.52254167220212 │
│ TZ      │ 15.832453699261107 │
│ UA      │ 21.525506734044736 │
│ US      │ 19.550017202821262 │
│ WN      │  23.33359086147906 │
│ XE      │ 22.537125427720643 │
│ YV      │ 23.275640004870876 │
└─────────┴────────────────────┘
25 rows in set. Elapsed: 0.220 sec. Processed 59.29 million rows, 474.31 MB (269.15 million rows/s., 2.15 GB/s.)

-- Q7. Percentage of flights delayed by more than 10 minutes, per year
cdh3 :) SELECT Year, c1/c2 FROM (
:-]     select Year, count(*)*100 as c1
:-]     from ontime_all
:-]     WHERE DepDelay>10
:-]     GROUP BY Year
:-] )ANY INNER JOIN(
:-]     select Year, count(*) as c2
:-]     from ontime_all
:-]     GROUP BY Year
:-] ) USING (Year)
:-] ORDER BY Year;
SELECT
    Year,
    c1 / c2
FROM
(
    SELECT
        Year,
        count(*) * 100 AS c1
    FROM ontime_all
    WHERE DepDelay > 10
    GROUP BY Year
)
ANY INNER JOIN
(
    SELECT
        Year,
        count(*) AS c2
    FROM ontime_all
    GROUP BY Year
) USING (Year)
ORDER BY Year ASC
┌─Year─┬─────divide(c1, c2)─┐
│ 1987 │  19.91704692543066 │
│ 1988 │  16.61709049583091 │
│ 1989 │ 19.950091248115527 │
│ 1990 │ 16.645130151570143 │
│ 1991 │ 14.721627756959183 │
│ 1992 │ 14.675431256341861 │
│ 1993 │ 15.424984631696157 │
│ 1994 │ 16.568031802021913 │
│ 1995 │ 19.221638572082064 │
│ 1996 │ 22.182805887088954 │
│ 1997 │  19.16513468701882 │
│ 1998 │  19.35637890988224 │
│ 1999 │ 20.087415003643347 │
│ 2000 │  23.17167181619297 │
│ 2001 │ 18.905807519714198 │
│ 2002 │ 16.237691267090707 │
│ 2003 │ 15.024550977569685 │
│ 2004 │ 19.248380268947592 │
│ 2005 │ 20.759289560703337 │
│ 2006 │ 23.155993582679844 │
│ 2007 │  24.53487096299114 │
│ 2008 │  21.99228614641999 │
│ 2009 │ 19.752600078911243 │
│ 2010 │  20.35832838381071 │
│ 2011 │ 20.293064527340643 │
│ 2012 │ 19.324733358461426 │
│ 2013 │ 22.717310450049155 │
│ 2014 │  23.99744596516966 │
│ 2015 │  21.30306187628661 │
│ 2016 │ 19.843393812866502 │
│ 2017 │ 20.725472238586505 │
│ 2018 │ 20.937579625604737 │
└──────┴────────────────────┘
32 rows in set. Elapsed: 0.761 sec. Processed 369.39 million rows, 1.18 GB (485.19 million rows/s., 1.55 GB/s.)
-- Another version of the query
cdh3 :) SELECT Year, avg(DepDelay>10)
:-] FROM ontime_all
:-] GROUP BY Year
:-] ORDER BY Year;
SELECT
    Year,
    avg(DepDelay > 10)
FROM ontime_all
GROUP BY Year
ORDER BY Year ASC
┌─Year─┬─avg(greater(DepDelay, 10))─┐
│ 1987 │         0.1991704692543066 │
│ 1988 │         0.1661709049583091 │
│ 1989 │        0.19950091248115528 │
│ 1990 │        0.16645130151570142 │
│ 1991 │        0.14721627756959182 │
│ 1992 │        0.14675431256341862 │
│ 1993 │         0.1542498463169616 │
│ 1994 │        0.16568031802021913 │
│ 1995 │        0.19221638572082064 │
│ 1996 │        0.22182805887088955 │
│ 1997 │        0.19165134687018823 │
│ 1998 │        0.19356378909882238 │
│ 1999 │        0.20087415003643347 │
│ 2000 │        0.23171671816192968 │
│ 2001 │          0.189058075197142 │
│ 2002 │        0.16237691267090706 │
│ 2003 │        0.15024550977569684 │
│ 2004 │        0.19248380268947593 │
│ 2005 │        0.20759289560703337 │
│ 2006 │        0.23155993582679846 │
│ 2007 │         0.2453487096299114 │
│ 2008 │         0.2199228614641999 │
│ 2009 │        0.19752600078911242 │
│ 2010 │        0.20358328383810712 │
│ 2011 │        0.20293064527340643 │
│ 2012 │        0.19324733358461427 │
│ 2013 │        0.22717310450049155 │
│ 2014 │         0.2399744596516966 │
│ 2015 │        0.21303061876286608 │
│ 2016 │          0.198433938128665 │
│ 2017 │        0.20725472238586506 │
│ 2018 │        0.20937579625604738 │
└──────┴────────────────────────────┘
32 rows in set. Elapsed: 0.426 sec. Processed 184.69 million rows, 1.11 GB (433.36 million rows/s., 2.60 GB/s.)

-- Q8. The most popular destinations over the years (most connected origin cities), 2000-2010
cdh3 :) SELECT DestCityName, uniqExact(OriginCityName) AS u
:-] FROM ontime_all
:-] WHERE Year >= 2000 and Year <= 2010
:-] GROUP BY DestCityName
:-] ORDER BY u DESC LIMIT 10;
SELECT
    DestCityName,
    uniqExact(OriginCityName) AS u
FROM ontime_all
WHERE (Year >= 2000) AND (Year <= 2010)
GROUP BY DestCityName
ORDER BY u DESC
LIMIT 10
┌─DestCityName──────────┬───u─┐
│ Atlanta, GA           │ 193 │
│ Chicago, IL           │ 167 │
│ Dallas/Fort Worth, TX │ 161 │
│ Cincinnati, OH        │ 138 │
│ Minneapolis, MN       │ 138 │
│ Detroit, MI           │ 130 │
│ Houston, TX           │ 129 │
│ Denver, CO            │ 127 │
│ Salt Lake City, UT    │ 119 │
│ New York, NY          │ 115 │
└───────────────────────┴─────┘
10 rows in set. Elapsed: 0.898 sec. Processed 72.19 million rows, 3.34 GB (80.37 million rows/s., 3.72 GB/s.)

-- Q9. The number of flights per year
cdh3 :) SELECT Year, count(*) AS c1
:-] FROM ontime_all
:-] GROUP BY Year;
SELECT
    Year,
    count(*) AS c1
FROM ontime_all
GROUP BY Year
┌─Year─┬──────c1─┐
│ 1987 │ 1311826 │
│ 1988 │ 5202096 │
│ 1989 │ 5041200 │
│ 1990 │ 5270893 │
│ 1991 │ 5076925 │
│ 1992 │ 5092157 │
│ 1993 │ 5070501 │
│ 1994 │ 5180048 │
│ 1995 │ 4888012 │
│ 1996 │ 5351983 │
│ 1997 │ 5411843 │
│ 1998 │ 5384721 │
│ 1999 │ 5527884 │
│ 2000 │ 5683047 │
│ 2001 │ 5967780 │
│ 2002 │ 5271359 │
│ 2003 │ 6488540 │
│ 2004 │ 7129270 │
│ 2005 │ 7140596 │
│ 2006 │ 7141922 │
│ 2007 │ 7455458 │
│ 2008 │ 7009726 │
│ 2009 │ 6450285 │
│ 2010 │ 6450117 │
│ 2011 │ 6085281 │
│ 2012 │ 6096762 │
│ 2013 │ 6369482 │
│ 2014 │ 5819811 │
│ 2015 │ 5819079 │
│ 2016 │ 5617658 │
│ 2017 │ 5674621 │
│ 2018 │ 7213446 │
└──────┴─────────┘
32 rows in set. Elapsed: 0.292 sec. Processed 184.69 million rows, 369.39 MB (632.92 million rows/s., 1.27 GB/s.)

8.5 Comparison with Single-Node Queries

-- Query the local table; the execution time is 1.466 seconds
cdh3 :) SELECT min(Year), max(Year), Carrier, count(*) AS cnt,
:-]    sum(ArrDelayMinutes>30) AS flights_delayed,
:-]    round(sum(ArrDelayMinutes>30)/count(*),2) AS rate
:-] FROM ontime
:-] WHERE
:-]    DayOfWeek NOT IN (6,7) AND OriginState NOT IN ('AK', 'HI', 'PR', 'VI')
:-]    AND DestState NOT IN ('AK', 'HI', 'PR', 'VI')
:-]    AND FlightDate < '2010-01-01'
:-] GROUP by Carrier
:-] HAVING cnt>100000 and max(Year)>1990
:-] ORDER by rate DESC
:-] LIMIT 1000;
SELECT
    min(Year),
    max(Year),
    Carrier,
    count(*) AS cnt,
    sum(ArrDelayMinutes > 30) AS flights_delayed,
    round(sum(ArrDelayMinutes > 30) / count(*), 2) AS rate
FROM ontime
WHERE (DayOfWeek NOT IN (6, 7)) AND (OriginState NOT IN ('AK', 'HI', 'PR', 'VI')) AND (DestState NOT IN ('AK', 'HI', 'PR', 'VI')) AND (FlightDate < '2010-01-01')
GROUP BY Carrier
HAVING (cnt > 100000) AND (max(Year) > 1990)
ORDER BY rate DESC
LIMIT 1000
┌─min(Year)─┬─max(Year)─┬─Carrier─┬──────cnt─┬─flights_delayed─┬─rate─┐
│      2003 │      2009 │ EV      │  1454777 │          237698 │ 0.16 │
│      2003 │      2009 │ B6      │   683874 │          103677 │ 0.15 │
│      2003 │      2009 │ FL      │  1082489 │          158748 │ 0.15 │
│      2006 │      2009 │ YV      │   740608 │          110389 │ 0.15 │
│      2006 │      2009 │ XE      │  1016010 │          152431 │ 0.15 │
│      2003 │      2005 │ DH      │   501056 │           69833 │ 0.14 │
│      2001 │      2009 │ MQ      │  3238137 │          448037 │ 0.14 │
│      2003 │      2006 │ RU      │  1007248 │          126733 │ 0.13 │
│      2004 │      2009 │ OH      │  1195868 │          160071 │ 0.13 │
│      1987 │      2009 │ UA      │  9655762 │         1203503 │ 0.12 │
│      2003 │      2006 │ TZ      │   136735 │           16496 │ 0.12 │
│      1987 │      2009 │ CO      │  6092575 │          681750 │ 0.11 │
│      1987 │      2009 │ AA      │ 10678564 │         1189672 │ 0.11 │
│      1987 │      2001 │ TW      │  2693587 │          283362 │ 0.11 │
│      1987 │      2009 │ AS      │  1511966 │          147972 │  0.1 │
│      1987 │      2009 │ NW      │  7648247 │          729920 │  0.1 │
│      1988 │      2009 │ US      │ 10229664 │          986338 │  0.1 │
│      1987 │      2009 │ DL      │ 11948844 │         1163924 │  0.1 │
│      2007 │      2009 │ 9E      │   577244 │           59440 │  0.1 │
│      2003 │      2009 │ OO      │  2654259 │          257069 │  0.1 │
│      1987 │      1991 │ PA      │   218938 │           20497 │ 0.09 │
│      2005 │      2009 │ F9      │   307569 │           28679 │ 0.09 │
│      1987 │      2005 │ HP      │  2628455 │          236633 │ 0.09 │
│      1987 │      2009 │ WN      │ 12726332 │         1108072 │ 0.09 │
└───────────┴───────────┴─────────┴──────────┴─────────────────┴──────┘
24 rows in set. Elapsed: 1.466 sec. Processed 129.55 million rows, 1.53 GB (88.38 million rows/s., 1.04 GB/s.)

-- Query the distributed table; the execution time is 4.624 seconds
cdh3 :) SELECT min(Year), max(Year), Carrier, count(*) AS cnt,
:-]    sum(ArrDelayMinutes>30) AS flights_delayed,
:-]    round(sum(ArrDelayMinutes>30)/count(*),2) AS rate
:-] FROM ontime_all
:-] WHERE
:-]    DayOfWeek NOT IN (6,7) AND OriginState NOT IN ('AK', 'HI', 'PR', 'VI')
:-]    AND DestState NOT IN ('AK', 'HI', 'PR', 'VI')
:-]    AND FlightDate < '2010-01-01'
:-] GROUP by Carrier
:-] HAVING cnt>100000 and max(Year)>1990
:-] ORDER by rate DESC
:-] LIMIT 1000;
SELECT
    min(Year),
    max(Year),
    Carrier,
    count(*) AS cnt,
    sum(ArrDelayMinutes > 30) AS flights_delayed,
    round(sum(ArrDelayMinutes > 30) / count(*), 2) AS rate
FROM ontime_all
WHERE (DayOfWeek NOT IN (6, 7)) AND (OriginState NOT IN ('AK', 'HI', 'PR', 'VI')) AND (DestState NOT IN ('AK', 'HI', 'PR', 'VI')) AND (FlightDate < '2010-01-01')
GROUP BY Carrier
HAVING (cnt > 100000) AND (max(Year) > 1990)
ORDER BY rate DESC
LIMIT 1000
┌─min(Year)─┬─max(Year)─┬─Carrier─┬──────cnt─┬─flights_delayed─┬─rate─┐
│      2003 │      2009 │ EV      │  1454777 │          237698 │ 0.16 │
│      2003 │      2009 │ B6      │   683874 │          103677 │ 0.15 │
│      2003 │      2009 │ FL      │  1082489 │          158748 │ 0.15 │
│      2006 │      2009 │ YV      │   740608 │          110389 │ 0.15 │
│      2006 │      2009 │ XE      │  1016010 │          152431 │ 0.15 │
│      2003 │      2005 │ DH      │   501056 │           69833 │ 0.14 │
│      2001 │      2009 │ MQ      │  3238137 │          448037 │ 0.14 │
│      2003 │      2006 │ RU      │  1007248 │          126733 │ 0.13 │
│      2004 │      2009 │ OH      │  1195868 │          160071 │ 0.13 │
│      1987 │      2009 │ UA      │  9655762 │         1203503 │ 0.12 │
│      2003 │      2006 │ TZ      │   136735 │           16496 │ 0.12 │
│      1987 │      2009 │ CO      │  6092575 │          681750 │ 0.11 │
│      1987 │      2009 │ AA      │ 10678564 │         1189672 │ 0.11 │
│      1987 │      2001 │ TW      │  2693587 │          283362 │ 0.11 │
│      1987 │      2009 │ AS      │  1511966 │          147972 │  0.1 │
│      1987 │      2009 │ NW      │  7648247 │          729920 │  0.1 │
│      1987 │      2009 │ DL      │ 11948844 │         1163924 │  0.1 │
│      1988 │      2009 │ US      │ 10229664 │          986338 │  0.1 │
│      2007 │      2009 │ 9E      │   577244 │           59440 │  0.1 │
│      2003 │      2009 │ OO      │  2654259 │          257069 │  0.1 │
│      1987 │      1991 │ PA      │   218938 │           20497 │ 0.09 │
│      2005 │      2009 │ F9      │   307569 │           28679 │ 0.09 │
│      1987 │      2005 │ HP      │  2628455 │          236633 │ 0.09 │
│      1987 │      2009 │ WN      │ 12726332 │         1108072 │ 0.09 │
└───────────┴───────────┴─────────┴──────────┴─────────────────┴──────┘
24 rows in set. Elapsed: 4.624 sec. Processed 129.55 million rows, 1.53 GB (28.01 million rows/s., 330.35 MB/s.)

As the queries above show, the first query against the distributed table is somewhat slower than against the single-node table, but repeated queries against the distributed table run fast, roughly twice the single-node speed or better. So if a single node can store and process the data, prefer a single-node table; when the data volume is large, consider a distributed table. One way to inspect the cluster behind a distributed table is shown below.
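
As an aside, you can see which shards a distributed table fans out to through the system.clusters system table. A minimal sketch; the cluster name here is only a placeholder, substitute the one used in the ENGINE = Distributed(...) definition of ontime_all:

SELECT cluster, shard_num, host_name, port, is_local
FROM system.clusters
WHERE cluster = 'perftest_3shards_1replicas';  -- placeholder cluster name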


9 Importing and Exporting Data

Importing and exporting data is an important database capability: when migrating a database or backing up its data, we need to export data to an external system or load it back in. ClickHouse supports many import and export formats; the full list is in the official documentation, Formats for Input and Output Data. If the target database supports binary data and performance matters, consider a compressed binary format. Here we introduce a few common formats that are supported for both import and export: CSV and JSON. As the data set we use benchmark test data.
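
As a small aside before the examples, clickhouse-client can also select the output format with its --format option instead of a FORMAT clause inside the SQL; the query below is only an illustration:

# The two commands produce the same JSONEachRow output
clickhouse-client --query "SELECT 1 AS x FORMAT JSONEachRow"
clickhouse-client --query "SELECT 1 AS x" --format JSONEachRow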

9.1 Obtaining the Benchmark Data

# 1 Fetch the benchmark data generator project and compile it
git clone https://github.com/vadimtk/ssb-dbgen.git
cd ssb-dbgen
make

# 2 Generate the data.
#  Since we only care about import and export here, we generate only the
#  supplier table: 2000 rows, roughly 188K. The -s parameter controls the
#  scale factor; the larger the value, the larger the generated data set
./dbgen -s 1 -T s


# The generator can also produce data for the other tables.
# For example, the customer table; at -s 1000 it has 30000000 rows, roughly 3.2G
./dbgen -s 1 -T c
# The lineorder table; at -s 1000 the data set exceeds 50GB
./dbgen -s 1 -T l
# The part table; the command below generates 200000 rows, roughly 20M of data
./dbgen -s 1 -T p
# The date table; the command below generates 2556 rows, roughly 272K of data
./dbgen -s 1 -T d

# See the README at https://github.com/vadimtk/ssb-dbgen.git for more details
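
Before loading anything it is worth sanity-checking the generated file. A quick sketch, assuming the -s 1 supplier run above:

# Confirm the expected 2000 rows and eyeball the delimiter/quoting
wc -l supplier.tbl
head -n 2 supplier.tbl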

9.2 Creating the Table and Loading Data

--  1 Create the supplier table
CREATE TABLE supplier(
 S_SUPPKEY       UInt32,
 S_NAME          String,
 S_ADDRESS       String,
 S_CITY          LowCardinality(String),
 S_NATION        LowCardinality(String),
 S_REGION        LowCardinality(String),
 S_PHONE         String
)ENGINE = MergeTree ORDER BY S_SUPPKEY;

--  2 Load the data into the supplier table
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1 --query "INSERT INTO supplier FORMAT CSV" < supplier.tbl
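
To verify the load, a simple count is enough; it should match the 2000 rows generated above:

clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1 --query "SELECT count() FROM supplier"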

9.3 Exporting Data

# 1 Exporting data
#  Note: in --query="SQL", the SQL takes the form SELECT * FROM <table> FORMAT <output format>

#  1.1 As CSV, exporting only selected columns
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1 --query "SELECT S_SUPPKEY, S_NAME, S_ADDRESS,S_CITY,S_NATION,S_REGION,S_PHONE FROM supplier  FORMAT CSV" > /opt/supplier.tb0.csv

#  1.2 As CSV, exporting all columns of the table
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1 --query "SELECT * FROM supplier  FORMAT CSV" > /opt/supplier.tb1.csv
#  Inspect the exported data file
head -n 5 /opt/supplier.tb1.csv
1,"Supplier#000000001","sdrGnXCDRcfriBvY0KL,i","PERU     0","PERU","AMERICA","27-989-741-2988"
2,"Supplier#000000002","TRMhVHz3XiFu","ETHIOPIA 1","ETHIOPIA","AFRICA","15-768-687-3665"
3,"Supplier#000000003","BZ0kXcHUcHjx62L7CjZS","ARGENTINA7","ARGENTINA","AMERICA","11-719-748-3364"
4,"Supplier#000000004","qGTQJXogS83a7MB","MOROCCO  4","MOROCCO","AFRICA","25-128-190-5944"
5,"Supplier#000000005","lONEYAh9sF","IRAQ     5","IRAQ","MIDDLE EAST","21-750-942-6364"

# 1.3 As CSV with a header row, exporting all columns of the table
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1 --query "SELECT * FROM supplier  FORMAT CSVWithNames" > /opt/supplier.tb2.csv
#  Inspect the exported data file
head -n 5 /opt/supplier.tb2.csv
"S_SUPPKEY","S_NAME","S_ADDRESS","S_CITY","S_NATION","S_REGION","S_PHONE"
1,"Supplier#000000001","sdrGnXCDRcfriBvY0KL,i","PERU     0","PERU","AMERICA","27-989-741-2988"
2,"Supplier#000000002","TRMhVHz3XiFu","ETHIOPIA 1","ETHIOPIA","AFRICA","15-768-687-3665"
3,"Supplier#000000003","BZ0kXcHUcHjx62L7CjZS","ARGENTINA7","ARGENTINA","AMERICA","11-719-748-3364"
4,"Supplier#000000004","qGTQJXogS83a7MB","MOROCCO  4","MOROCCO","AFRICA","25-128-190-5944"

#  1.4 Export the data as tab-separated values
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1 --query "SELECT * FROM supplier  FORMAT TabSeparated" > /opt/supplier.tb3.txt
#  Inspect the exported data file
head -n 5 /opt/supplier.tb3.txt
1       Supplier#000000001      sdrGnXCDRcfriBvY0KL,i   PERU     0      PERU    AMERICA 27-989-741-2988
2       Supplier#000000002      TRMhVHz3XiFu    ETHIOPIA 1      ETHIOPIA        AFRICA  15-768-687-3665
3       Supplier#000000003      BZ0kXcHUcHjx62L7CjZS    ARGENTINA7      ARGENTINA       AMERICA 11-719-748-3364
4       Supplier#000000004      qGTQJXogS83a7MB MOROCCO  4      MOROCCO AFRICA  25-128-190-5944
5       Supplier#000000005      lONEYAh9sF      IRAQ     5      IRAQ    MIDDLE EAST     21-750-942-6364

# 1.5 Export as tab-separated values with a header row. TabSeparatedWithNames is equivalent to TSVWithNames
# When parsing such a file, the first row is skipped entirely
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1 --query "SELECT * FROM supplier  FORMAT TSVWithNames" > /opt/supplier.tb4.txt
#  Inspect the exported data file
head -n 5 /opt/supplier.tb4.txt
S_SUPPKEY       S_NAME  S_ADDRESS       S_CITY  S_NATION        S_REGION        S_PHONE
1       Supplier#000000001      sdrGnXCDRcfriBvY0KL,i   PERU     0      PERU    AMERICA 27-989-741-2988
2       Supplier#000000002      TRMhVHz3XiFu    ETHIOPIA 1      ETHIOPIA        AFRICA  15-768-687-3665
3       Supplier#000000003      BZ0kXcHUcHjx62L7CjZS    ARGENTINA7      ARGENTINA       AMERICA 11-719-748-3364
4       Supplier#000000004      qGTQJXogS83a7MB MOROCCO  4      MOROCCO AFRICA  25-128-190-5944

# 1.6 Export as tab-separated values with header and type rows. TabSeparatedWithNamesAndTypes is equivalent to TSVWithNamesAndTypes
# When parsing such a file, the first two rows are skipped entirely
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1 --query "SELECT * FROM supplier  FORMAT TabSeparatedWithNamesAndTypes" > /opt/supplier.tb5.txt
#  Inspect the exported data file
head -n 5 /opt/supplier.tb5.txt
S_SUPPKEY       S_NAME  S_ADDRESS       S_CITY  S_NATION        S_REGION        S_PHONE
UInt32  String  String  LowCardinality(String)  LowCardinality(String)  LowCardinality(String)  String
1       Supplier#000000001      sdrGnXCDRcfriBvY0KL,i   PERU     0      PERU    AMERICA 27-989-741-2988
2       Supplier#000000002      TRMhVHz3XiFu    ETHIOPIA 1      ETHIOPIA        AFRICA  15-768-687-3665
3       Supplier#000000003      BZ0kXcHUcHjx62L7CjZS    ARGENTINA7      ARGENTINA       AMERICA 11-719-748-3364

# 1.7 Output each row as key-value pairs; similar to TabSeparated above, but in name=value form
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1 --query "SELECT * FROM supplier  FORMAT TSKV" > /opt/supplier.tb6.txt
#  Inspect the exported data file
head -n 5 /opt/supplier.tb6.txt
S_SUPPKEY=1     S_NAME=Supplier#000000001       S_ADDRESS=sdrGnXCDRcfriBvY0KL,i S_CITY=PERU     0       S_NATION=PERU   S_REGION=AMERICA        S_PHONE=27-989-741-2988
S_SUPPKEY=2     S_NAME=Supplier#000000002       S_ADDRESS=TRMhVHz3XiFu  S_CITY=ETHIOPIA 1       S_NATION=ETHIOPIA       S_REGION=AFRICA S_PHONE=15-768-687-3665
S_SUPPKEY=3     S_NAME=Supplier#000000003       S_ADDRESS=BZ0kXcHUcHjx62L7CjZS  S_CITY=ARGENTINA7       S_NATION=ARGENTINA      S_REGION=AMERICA        S_PHONE=11-719-748-3364
S_SUPPKEY=4     S_NAME=Supplier#000000004       S_ADDRESS=qGTQJXogS83a7MB       S_CITY=MOROCCO  4       S_NATION=MOROCCO        S_REGION=AFRICA S_PHONE=25-128-190-5944
S_SUPPKEY=5     S_NAME=Supplier#000000005       S_ADDRESS=lONEYAh9sF    S_CITY=IRAQ     5       S_NATION=IRAQ   S_REGION=MIDDLE EAST    S_PHONE=21-750-942-6364

# 1.8 Print each row as a parenthesized tuple, tuples separated by commas
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1 --query "SELECT * FROM supplier LIMIT 3  FORMAT Values" > /opt/supplier.tb7.txt
#  Inspect the exported data file
head /opt/supplier.tb7.txt
(1,'Supplier#000000001','sdrGnXCDRcfriBvY0KL,i','PERU     0','PERU','AMERICA','27-989-741-2988'),(2,'Supplier#000000002','TRMhVHz3XiFu','ETHIOPIA 1','ETHIOPIA','AFRICA','15-768-687-3665'),(3,'Supplier#000000003','BZ0kXcHUcHjx62L7CjZS','ARGENTINA7','ARGENTINA','AMERICA','11-719-748-3364')
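
Since the Values format is the same syntax as the VALUES clause of an INSERT statement, such a dump can be loaded straight back. A sketch reusing the file exported above; note that MergeTree does not deduplicate, so this appends the three rows again:

clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1 --query "INSERT INTO supplier FORMAT Values" < /opt/supplier.tb7.txt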

# 1.9 Print each row as a JSON object
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1 --query "SELECT * FROM supplier  FORMAT JSONEachRow" > /opt/supplier.tb8.json
#  Inspect the exported data file
head -n 5 /opt/supplier.tb8.json
{"S_SUPPKEY":1,"S_NAME":"Supplier#000000001","S_ADDRESS":"sdrGnXCDRcfriBvY0KL,i","S_CITY":"PERU     0","S_NATION":"PERU","S_REGION":"AMERICA","S_PHONE":"27-989-741-2988"}
{"S_SUPPKEY":2,"S_NAME":"Supplier#000000002","S_ADDRESS":"TRMhVHz3XiFu","S_CITY":"ETHIOPIA 1","S_NATION":"ETHIOPIA","S_REGION":"AFRICA","S_PHONE":"15-768-687-3665"}
{"S_SUPPKEY":3,"S_NAME":"Supplier#000000003","S_ADDRESS":"BZ0kXcHUcHjx62L7CjZS","S_CITY":"ARGENTINA7","S_NATION":"ARGENTINA","S_REGION":"AMERICA","S_PHONE":"11-719-748-3364"}
{"S_SUPPKEY":4,"S_NAME":"Supplier#000000004","S_ADDRESS":"qGTQJXogS83a7MB","S_CITY":"MOROCCO  4","S_NATION":"MOROCCO","S_REGION":"AFRICA","S_PHONE":"25-128-190-5944"}
{"S_SUPPKEY":5,"S_NAME":"Supplier#000000005","S_ADDRESS":"lONEYAh9sF","S_CITY":"IRAQ     5","S_NATION":"IRAQ","S_REGION":"MIDDLE EAST","S_PHONE":"21-750-942-6364"}

# 1.10 Format and parse data row by row in a binary format
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1 --query "SELECT * FROM supplier  FORMAT RowBinary" > /opt/supplier.tb9.dat
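
RowBinary is accepted as an input format too, so the binary dump round-trips. A sketch loading it back; the target table must have exactly the same column types and order:

clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1 --query "INSERT INTO supplier FORMAT RowBinary" < /opt/supplier.tb9.dat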

9.4 Importing Data

Optional housekeeping operations before importing:

-- 1 Rename the existing table
rename table supplier to supplier_bak;

-- 2 Drop the table
drop table supplier;

-- 3 Create the table
CREATE TABLE supplier(
 S_SUPPKEY       UInt32,
 S_NAME          String,
 S_ADDRESS       String,
 S_CITY          LowCardinality(String),
 S_NATION        LowCardinality(String),
 S_REGION        LowCardinality(String),
 S_PHONE         String
)ENGINE = MergeTree ORDER BY S_SUPPKEY;

-- 4 Empty the table
ALTER TABLE supplier DELETE WHERE 1=1;
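
Note that ALTER TABLE ... DELETE is executed as an asynchronous mutation. Assuming a reasonably recent ClickHouse version, TRUNCATE is the more direct way to empty a table:

-- Removes all data from the table in one step
TRUNCATE TABLE supplier;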

Start the import. The SQL takes the form INSERT INTO <table> FORMAT <input format>; the input formats mirror the output formats above. CSV and JSON are used as examples here, and the other formats work the same way.

# 1 As CSV
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1 --query "INSERT INTO supplier FORMAT CSV" < /opt/supplier.tb1.csv

# 2 As JSON (the table data can be cleared first)
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1 --query "INSERT INTO supplier FORMAT JSONEachRow" < /opt/supplier.tb8.json
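
To confirm the round trip, compare simple aggregates with the source data; clear the table between the CSV and JSON imports, otherwise the rows accumulate:

# Both the row count and the key sum should match the exported data
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1 --query "SELECT count(), sum(S_SUPPKEY) FROM supplier"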

10 Movie Data Analysis

We again use the Douban movie data obtained earlier: doubanMovie.csv and rankQuote.csv.

-- 1 Create a movie table
CREATE TABLE movie(
 id           UInt64  COMMENT 'movie id',
 movie_name   FixedString(256)  COMMENT 'movie name',
 director     FixedString(128)  COMMENT 'director',
 scriptwriter FixedString(256)  COMMENT 'scriptwriter',
 protagonist  String  COMMENT 'protagonists',
 kind         FixedString(64) COMMENT 'genre',
 country      FixedString(64) COMMENT 'region',
 language     String COMMENT 'language',
 release_date FixedString(128) COMMENT 'release date',
 mins         String COMMENT 'runtime',
 alternate_name FixedString(256) COMMENT 'alternate names',
 imdb         FixedString(128) COMMENT 'IMDb link',
 rating_num   Float32  COMMENT 'rating',
 rating_people UInt32  COMMENT 'number of raters',
 url          FixedString(256) COMMENT 'url'
)ENGINE = MergeTree ORDER BY id;

-- 2 Create the Douban movie rank and quote table
CREATE TABLE quote(
 id     UInt64 COMMENT 'movie id',
 rank   UInt32 COMMENT 'rank',
 quote  String COMMENT 'quote'
)ENGINE = MergeTree ORDER BY id;

# 3 Import the data
#  3.1 Import the movie data
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1 --query "INSERT INTO movie FORMAT CSV" < /home/doubanMovie.csv

#  3.2 Import the quote data
clickhouse-client -h 127.0.0.1 --port 9000 -u default --password KavrqeN1 --query "INSERT INTO quote FORMAT CSV" < /home/rankQuote.csv

Now fetch the quotes of the movies with the highest ratings and the most raters. ClickHouse supports INNER JOIN (or JOIN), LEFT JOIN (or LEFT OUTER JOIN), RIGHT JOIN (or RIGHT OUTER JOIN), FULL JOIN (or FULL OUTER JOIN) and CROSS JOIN, so the SQL used earlier can be copied over and executed without modification.

-- 4 Run the analysis SQL: fetch the quotes of the movies with the highest rating and the most raters
-- Supported JOIN syntax: https://clickhouse.yandex/docs/en/query_language/select/#select-join
cdh3 :) SELECT m.id,m.movie_name,m.rating_num,m.rating_people,q.rank,q.quote FROM movie m
:-] LEFT JOIN quote q
:-] ON q.id=m.id
:-] ORDER BY m.rating_num DESC,m.rating_people DESC LIMIT 10;
SELECT
    m.id,
    m.movie_name,
    m.rating_num,
    m.rating_people,
    q.rank,
    q.quote
FROM movie AS m
LEFT JOIN quote AS q ON q.id = m.id
ORDER BY
    m.rating_num DESC,
    m.rating_people DESC
LIMIT 10
┌──────id─┬─movie_name───────────────────────────────────┬─rating_num─┬─rating_people─┬─rank─┬─quote──────────────────────────┐
│ 1292052 │ 肖申克的救赎 The Shawshank Redemption (1994) │        9.7 │       1502851 │    1 │ 希望让人自由。                 │
│ 1291546 │ 霸王别姬 (1993)                              │        9.6 │       1112641 │    2 │ 风华绝代。                     │
│ 1296141 │ 控方证人 Witness for the Prosecution (1957)  │        9.6 │        195362 │   29 │ 比利·怀德满分作品。            │
│ 1292063 │ 美丽人生 La vita è bella (1997)              │        9.5 │        690618 │    5 │ 最美的谎言。                   │
│ 1295124 │ 辛德勒的名单 Schindler's List (1993)         │        9.5 │        613865 │    8 │ 拯救一个人,就是拯救整个世界。 │
│ 1295644 │ 这个杀手不太冷 Léon (1994)                   │        9.4 │       1363430 │    3 │ 怪蜀黍和小萝莉不得不说的故事。 │
│ 1292720 │ 阿甘正传 Forrest Gump (1994)                 │        9.4 │       1178003 │    4 │ 一部美国近现代史。             │
│ 1292722 │ 泰坦尼克号 Titanic (1997)                    │        9.4 │       1119405 │    7 │ 失去的才是永恒的。             │
│ 1293182 │ 十二怒汉 12 Angry Men (1957)                 │        9.4 │        253408 │   36 │ 1957年的理想主义。             │
│ 1291561 │ 千与千寻 千と千尋の神隠し (2001)             │        9.3 │       1205228 │    6 │ 最好的宫崎骏,最好的久石让。   │
└─────────┴──────────────────────────────────────────────┴────────────┴───────────────┴──────┴────────────────────────────────┘
10 rows in set. Elapsed: 0.006 sec.

-- 5 Update data
--  The quote for id 5908478 lacks a closing full stop; update the row to add one
cdh3 :) SELECT * FROM quote WHERE id=5908478;
┌──────id─┬─rank─┬─quote────────────────────────────────────────┐
│ 5908478 │  244 │ 你要相信,这世上真的有爱存在,不管在什么年纪 │
└─────────┴──────┴──────────────────────────────────────────────┘
-- Apply the update (string literals in ClickHouse use single quotes)
cdh3 :) ALTER TABLE quote UPDATE quote = '你要相信,这世上真的有爱存在,不管在什么年纪。' WHERE id=5908478;
ALTER TABLE quote
    UPDATE quote = '你要相信,这世上真的有爱存在,不管在什么年纪。' WHERE id = 5908478
Ok.
0 rows in set. Elapsed: 0.016 sec.
-- Check the data again
cdh3 :) SELECT * FROM quote WHERE id=5908478;
┌──────id─┬─rank─┬─quote──────────────────────────────────────────┐
│ 5908478 │  244 │ 你要相信,这世上真的有爱存在,不管在什么年纪。 │
└─────────┴──────┴────────────────────────────────────────────────┘
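
ALTER TABLE ... UPDATE also runs as an asynchronous mutation, which is why the client returns "0 rows in set" immediately while the rewrite happens in the background. Its progress can be checked in the system.mutations table, as in this sketch:

-- is_done becomes 1 once the mutation has been applied to all parts
SELECT mutation_id, command, is_done
FROM system.mutations
WHERE database = currentDatabase() AND table = 'quote';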


