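Process notes
This is a personal summary of my understanding; see the official site for details.
ClickHouse runs on Linux, FreeBSD, or Mac OS X with an x86_64, AArch64, or PowerPC64LE CPU architecture.
Check whether the CPU supports SSE 4.2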
[root@bigdata01 module]# grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"
SSE 4.2 supported
Remove the open-file limit on CentOS
Edit the following two files:
vim /etc/security/limits.conf
vim /etc/security/limits.d/20-nproc.conf
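Note: in some environments the file is not named 20-nproc.conf. Run ls /etc/security/limits.d first to see what it is called.
Add the following lines; the leading * is required.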
* soft nofile 65536
* hard nofile 65536
* soft nproc 131072
* hard nproc 131072
The change takes effect after the server is rebooted. Check the result with ulimit -n or ulimit -a.
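The check after the reboot can be scripted. The following is a minimal sketch, assuming a POSIX shell; limit_ok and report_limits are hypothetical helper names, and 65536 is the nofile value configured above.

```shell
# Return 0 if an actual limit satisfies a required minimum.
# The value "unlimited" always satisfies the requirement.
limit_ok() {
    actual=$1
    required=$2
    [ "$actual" = "unlimited" ] && return 0
    [ "$actual" -ge "$required" ]
}

# Compare the current shell's open-file limit with the configured value.
report_limits() {
    if limit_ok "$(ulimit -n)" 65536; then
        echo "nofile OK"
    else
        echo "nofile too low: $(ulimit -n)"
    fi
}

report_limits
```

The nproc value can be checked the same way in shells whose ulimit supports -u, such as bash.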
Install dependencies
yum install -y libtool
yum install -y *unixODBC*
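About downloading: you do not need to download anything by hand, because yum can install the packages directly. The following is for reference only.
Official site: https://clickhouse.yandex/
Download locations: http://repo.red-soft.biz/repos/clickhouse/stable/el7/
https://packagecloud.io/Altinity/clickhouse
This walkthrough uses a release from about half a year ago. ClickHouse releases frequently, so check what changed before upgrading.
Installed version: *-19.15.5.18-1.el7.x86_64.rpm
Packages:
- clickhouse-test-19.15.5.18-1.el7.x86_64.rpm (test module, optional)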
- clickhouse-server-common-19.15.5.18-1.el7.x86_64.rpm
- clickhouse-server-19.15.5.18-1.el7.x86_64.rpm
- clickhouse-debuginfo-19.15.5.18-1.el7.x86_64.rpm
- clickhouse-common-static-19.15.5.18-1.el7.x86_64.rpm
- clickhouse-client-19.15.5.18-1.el7.x86_64.rpm
Add the yum repository
curl -s https://packagecloud.io/install/repositories/Altinity/clickhouse/script.rpm.sh | sudo bash
Install with yum
The commands below install a specific version. To install the latest version, simply run sudo yum install -y clickhouse-server clickhouse-client
sudo yum install clickhouse-server-common-19.15.5.18-1.el7.x86_64
sudo yum install clickhouse-server-19.15.5.18-1.el7.x86_64   # note: this also installs clickhouse-common-static as a dependency
sudo yum install clickhouse-debuginfo-19.15.5.18-1.el7.x86_64
sudo yum install clickhouse-client-19.15.5.18-1.el7.x86_64
Check the installation:
sudo yum list installed 'clickhouse*'
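Because the packages are installed one at a time with pinned versions, it can be worth confirming that they all ended up on the same version. A minimal sketch, assuming each package is printed on a single line as name, version, repository; same_version is a hypothetical helper name.

```shell
# Read `yum list installed` output on stdin and succeed only if every
# clickhouse package reports the same version (second column).
same_version() {
    awk '$1 ~ /^clickhouse/ { seen[$2] = 1 }
         END { n = 0; for (v in seen) n++; exit (n == 1 ? 0 : 1) }'
}

# Hypothetical usage:
#   sudo yum list installed 'clickhouse*' | same_version && echo "versions consistent"
```

If yum wraps long package names onto a second line, this check would need adjusting.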
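File layout of the installed components
Open https://packagecloud.io/Altinity/clickhouse and click through to a package of the matching version to see its file list. The most commonly needed files and directories are listed here: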
/etc/clickhouse-client/config.xml
/usr/bin/clickhouse-client
/usr/bin/clickhouse-benchmark
/etc/clickhouse-server/users.xml
/etc/clickhouse-server/config.xml
/usr/bin/clickhouse-server
/etc/security/limits.d/clickhouse.conf
/etc/init.d/clickhouse-server
/etc/cron.d/clickhouse-server
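To confirm that these files are actually present after installation, a small loop is enough. A minimal sketch; missing_files is a hypothetical helper name, and the paths passed to it are taken from the list above.

```shell
# Print every path given as an argument that does not exist.
missing_files() {
    for path in "$@"; do
        [ -e "$path" ] || echo "$path"
    done
}

# Prints nothing when all four files are present.
missing_files /etc/clickhouse-server/config.xml \
              /etc/clickhouse-server/users.xml \
              /usr/bin/clickhouse-server \
              /usr/bin/clickhouse-client
```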
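Server configuration
Note: restart the service after changing the server configuration.
The configuration files are in the /etc/clickhouse-server directory.
users.xml: user settings. By default there is a single user, default, with no password.
To add a user, follow the way the default user is configured, that is, add a new element with the same child tags.
config.xml: server settings, such as the port numbers, the listen address, and security options.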
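As an illustration of adding a user by copying the default user's tags, the sketch below prints such an element with a hashed password. It assumes sha256sum from GNU coreutils is available; make_user_xml and the user name analyst are hypothetical, and the profile, quota, and network values simply mirror what the default user typically has, so compare them with your own users.xml before using them.

```shell
# Print a users.xml-style fragment for one user, modeled on the default user.
# $1 = user name, $2 = plain-text password (only its SHA-256 hash is printed).
make_user_xml() {
    name=$1
    hash=$(printf '%s' "$2" | sha256sum | cut -d' ' -f1)
    cat <<EOF
<yandex>
    <users>
        <$name>
            <password_sha256_hex>$hash</password_sha256_hex>
            <networks>
                <ip>::/0</ip>
            </networks>
            <profile>default</profile>
            <quota>default</quota>
        </$name>
    </users>
</yandex>
EOF
}

# Hypothetical usage: print the fragment, then copy the <analyst> element
# into the <users> section of /etc/clickhouse-server/users.xml.
make_user_xml analyst 'change-me'
```

Storing the hash in password_sha256_hex avoids keeping the plain-text password in the configuration file.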
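Client configuration
When you run clickhouse-client, it reads /etc/clickhouse-client/config.xml by default.
Use the -c option to point to a different config.xml, for example clickhouse-client -c /opt/software/config.xml
/etc/clickhouse-client/config.xml holds the settings used to connect to the server.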
service clickhouse-server start
service clickhouse-server status
[root@bigdata01 ~]# service clickhouse-server start
Start clickhouse-server service: Path to data directory in /etc/clickhouse-server/config.xml: /var/lib/clickhouse/
DONE
[root@bigdata01 ~]# service clickhouse-server status
clickhouse-server service is running
[root@bigdata01 ~]# clickhouse-client
ClickHouse client version 19.15.5.18.
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 19.15.5 revision 54426.
bigdata01 :) show databases;
SHOW DATABASES
┌─name────┐
│ default │
│ system │
└─────────┘
2 rows in set. Elapsed: 0.001 sec.
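To specify a port or server address, add the options --port 8080 --host 127.0.0.1
Cluster setup assumes every machine has already been installed following the single-node steps above.
On every machine, edit /etc/clickhouse-server/config.xml and set:
<listen_host>::</listen_host>
On every machine, create a metrika.xml file under /etc:
vim /etc/metrika.xml
Add the following content: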
<yandex>
    <clickhouse_remote_servers>
        <perftest_3shards_1replicas>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>bigdata01</host>
                    <port>19000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>bigdata02</host>
                    <port>19000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>bigdata03</host>
                    <port>19000</port>
                </replica>
            </shard>
        </perftest_3shards_1replicas>
    </clickhouse_remote_servers>
    <zookeeper-servers>
        <node index="1">
            <host>bigdata01</host>
            <port>32181</port>
        </node>
        <node index="2">
            <host>bigdata02</host>
            <port>32181</port>
        </node>
        <node index="3">
            <host>bigdata03</host>
            <port>32181</port>
        </node>
    </zookeeper-servers>
    <macros>
        <!-- set this to each machine's own host name -->
        <replica>bigdata01</replica>
    </macros>
    <networks>
        <ip>::/0</ip>
    </networks>
    <clickhouse_compression>
        <case>
            <min_part_size>10000000000</min_part_size>
            <min_part_size_ratio>0.01</min_part_size_ratio>
            <method>lz4</method>
        </case>
    </clickhouse_compression>
</yandex>
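The three shard entries differ only in the host name, so they can be generated rather than typed by hand. A minimal sketch; make_shards is a hypothetical helper name, and the port and host names are the ones used in the configuration above.

```shell
# Print one <shard> element per host, each holding a single replica.
# $1 = native TCP port, remaining arguments = host names.
make_shards() {
    port=$1
    shift
    for host in "$@"; do
        cat <<EOF
<shard>
    <internal_replication>true</internal_replication>
    <replica>
        <host>$host</host>
        <port>$port</port>
    </replica>
</shard>
EOF
    done
}

make_shards 19000 bigdata01 bigdata02 bigdata03
```

The output is meant to be pasted inside the cluster element (perftest_3shards_1replicas here). Port 19000 only works if tcp_port in each server's config.xml has been changed to match.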
Start every machine
Note: start ZooKeeper first.
List the installed modules
[root@bigdata01 ~]# yum list installed | grep clickhouse
clickhouse-client.x86_64 19.15.5.18-1.el7 @Altinity_clickhouse
clickhouse-common-static.x86_64 19.15.5.18-1.el7 @Altinity_clickhouse
clickhouse-debuginfo.x86_64 19.15.5.18-1.el7 @Altinity_clickhouse
clickhouse-server.x86_64 19.15.5.18-1.el7 @Altinity_clickhouse
clickhouse-server-common.x86_64 19.15.5.18-1.el7 @Altinity_clickhouse
Uninstall the modules
yum remove -y clickhouse-client.x86_64 clickhouse-common-static.x86_64 clickhouse-debuginfo.x86_64 clickhouse-server.x86_64 clickhouse-server-common.x86_64
Search the whole filesystem for leftover files and delete them
find / -name '*clickhouse*'
rm -rf <paths from the find output>
Force removal when the uninstall reports errors
# Remove the rpm package without running its uninstall scripts
sudo rpm -e clickhouse-server.x86_64 --noscripts
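The test uses the airline on-time flight data provided on the official site, which covers 1987 to 2017. Because storage is limited, only the data from 2000 to 2017 is used.
Test machine: Baidu Cloud server, 2 cores / 4 GB / 40 GB / compute-optimized c3, 1 Mbps
In these tests, the large dataset below did not push the machine to its performance limit.
All of the tests below are described on the official site.
Official data download reference: https://clickhouse.tech/docs/zh/getting_started/example_datasets/ontime/
Create the table (log in with clickhouse-client -m; without -m to enable multi-line input, the statement fails)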
CREATE TABLE `ontime` (
`Year` UInt16,
`Quarter` UInt8,
`Month` UInt8,
`DayofMonth` UInt8,
`DayOfWeek` UInt8,
`FlightDate` Date,
`UniqueCarrier` FixedString(7),
`AirlineID` Int32,
`Carrier` FixedString(2),
`TailNum` String,
`FlightNum` String,
`OriginAirportID` Int32,
`OriginAirportSeqID` Int32,
`OriginCityMarketID` Int32,
`Origin` FixedString(5),
`OriginCityName` String,
`OriginState` FixedString(2),
`OriginStateFips` String,
`OriginStateName` String,
`OriginWac` Int32,
`DestAirportID` Int32,
`DestAirportSeqID` Int32,
`DestCityMarketID` Int32,
`Dest` FixedString(5),
`DestCityName` String,
`DestState` FixedString(2),
`DestStateFips` String,
`DestStateName` String,
`DestWac` Int32,
`CRSDepTime` Int32,
`DepTime` Int32,
`DepDelay` Int32,
`DepDelayMinutes` Int32,
`DepDel15` Int32,
`DepartureDelayGroups` String,
`DepTimeBlk` String,
`TaxiOut` Int32,
`WheelsOff` Int32,
`WheelsOn` Int32,
`TaxiIn` Int32,
`CRSArrTime` Int32,
`ArrTime` Int32,
`ArrDelay` Int32,
`ArrDelayMinutes` Int32,
`ArrDel15` Int32,
`ArrivalDelayGroups` Int32,
`ArrTimeBlk` String,
`Cancelled` UInt8,
`CancellationCode` FixedString(1),
`Diverted` UInt8,
`CRSElapsedTime` Int32,
`ActualElapsedTime` Int32,
`AirTime` Int32,
`Flights` Int32,
`Distance` Int32,
`DistanceGroup` UInt8,
`CarrierDelay` Int32,
`WeatherDelay` Int32,
`NASDelay` Int32,
`SecurityDelay` Int32,
`LateAircraftDelay` Int32,
`FirstDepTime` String,
`TotalAddGTime` String,
`LongestAddGTime` String,
`DivAirportLandings` String,
`DivReachedDest` String,
`DivActualElapsedTime` String,
`DivArrDelay` String,
`DivDistance` String,
`Div1Airport` String,
`Div1AirportID` Int32,
`Div1AirportSeqID` Int32,
`Div1WheelsOn` String,
`Div1TotalGTime` String,
`Div1LongestGTime` String,
`Div1WheelsOff` String,
`Div1TailNum` String,
`Div2Airport` String,
`Div2AirportID` Int32,
`Div2AirportSeqID` Int32,
`Div2WheelsOn` String,
`Div2TotalGTime` String,
`Div2LongestGTime` String,
`Div2WheelsOff` String,
`Div2TailNum` String,
`Div3Airport` String,
`Div3AirportID` Int32,
`Div3AirportSeqID` Int32,
`Div3WheelsOn` String,
`Div3TotalGTime` String,
`Div3LongestGTime` String,
`Div3WheelsOff` String,
`Div3TailNum` String,
`Div4Airport` String,
`Div4AirportID` Int32,
`Div4AirportSeqID` Int32,
`Div4WheelsOn` String,
`Div4TotalGTime` String,
`Div4LongestGTime` String,
`Div4WheelsOff` String,
`Div4TailNum` String,
`Div5Airport` String,
`Div5AirportID` Int32,
`Div5AirportSeqID` Int32,
`Div5WheelsOn` String,
`Div5TotalGTime` String,
`Div5LongestGTime` String,
`Div5WheelsOff` String,
`Div5TailNum` String
) ENGINE = MergeTree
PARTITION BY Year
ORDER BY (Carrier, FlightDate)
SETTINGS index_granularity = 8192;
for s in `seq 1987 2017`
do
for m in `seq 1 12`
do
wget http://transtats.bts.gov/PREZIP/On_Time_On_Time_Performance_${s}_${m}.zip
done
done
for i in *.zip; do echo "$i"; unzip -cq "$i" '*.csv' | sed 's/\.00//g' | clickhouse-client --host=127.0.0.1 --port 19000 --query="INSERT INTO per_test.ontime FORMAT CSVWithNames"; done
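Since only part of the year range is needed, the download loop can take the range as parameters. A minimal sketch, assuming seq is available as in the loop above; ontime_urls is a hypothetical helper name, and the URL pattern is the one used above.

```shell
# Print the monthly archive URLs for every year from $1 to $2 inclusive.
ontime_urls() {
    for y in $(seq "$1" "$2"); do
        for m in $(seq 1 12); do
            echo "http://transtats.bts.gov/PREZIP/On_Time_On_Time_Performance_${y}_${m}.zip"
        done
    done
}

# Hypothetical usage: fetch only 2000 to 2017, resuming partial downloads.
#   ontime_urls 2000 2017 | xargs -n 1 wget -c
```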
Number of flights per day of the week, 2000 to 2008
SELECT
DayOfWeek,
count(*) AS c
FROM ontime
WHERE (Year >= 2000) AND (Year <= 2008)
GROUP BY DayOfWeek
ORDER BY c DESC
┌─DayOfWeek─┬───────c─┐
│ 1 │ 1024694 │
│ 3 │ 1019282 │
│ 2 │ 1015141 │
│ 5 │ 1014324 │
│ 4 │ 1013083 │
│ 7 │ 979170 │
│ 6 │ 908404 │
└───────────┴─────────┘
7 rows in set. Elapsed: 0.042 sec. Processed 6.97 million rows, 20.92 MB (167.64 million rows/s., 502.92 MB/s.)
Number of flights delayed by more than 10 minutes per day of the week, 2000 to 2008
SELECT
DayOfWeek,
count(*) AS c
FROM ontime
WHERE (DepDelay > 10) AND (Year >= 2000) AND (Year <= 2008)
GROUP BY DayOfWeek
ORDER BY c DESC
┌─DayOfWeek─┬──────c─┐
│ 5 │ 274999 │
│ 4 │ 254490 │
│ 7 │ 238941 │
│ 1 │ 209985 │
│ 3 │ 201997 │
│ 6 │ 183685 │
│ 2 │ 178767 │
└───────────┴────────┘
7 rows in set. Elapsed: 0.156 sec. Processed 6.97 million rows, 48.82 MB (44.71 million rows/s., 313.00 MB/s.)
Number of delays of more than 10 minutes per origin airport, 2000 to 2008
SELECT
Origin,
count(*) AS c
FROM ontime
WHERE (DepDelay > 10) AND (Year >= 2000) AND (Year <= 2008)
GROUP BY Origin
ORDER BY c DESC
LIMIT 10
┌─Origin─┬──────c─┐
│ ORD │ 105023 │
│ ATL │ 73496 │
│ DFW │ 67485 │
│ PHX │ 66968 │
│ LAX │ 66964 │
│ LAS │ 50462 │
│ STL │ 47812 │
│ DEN │ 46164 │
│ SFO │ 43537 │
│ DTW │ 43341 │
└────────┴────────┘
10 rows in set. Elapsed: 0.156 sec. Processed 6.97 million rows, 76.72 MB (44.59 million rows/s., 490.50 MB/s.)
Percentage of flights delayed by more than 10 minutes per carrier, 2000 to 2008
SELECT
Carrier,
c,
c2,
(c * 100) / c2 AS c3
FROM
(
SELECT
Carrier,
count(*) AS c
FROM ontime
WHERE (DepDelay > 10) AND (Year >= 2000) AND (Year <= 2008)
GROUP BY Carrier
)
INNER JOIN
(
SELECT
Carrier,
count(*) AS c2
FROM ontime
WHERE (Year >= 2000) AND (Year <= 2008)
GROUP BY Carrier
) USING (Carrier)
ORDER BY c3 DESC
┌─Carrier─┬──────c─┬──────c2─┬─────────────────c3─┐
│ UA │ 262451 │ 915911 │ 28.654640025067938 │
│ AS │ 51977 │ 188884 │ 27.51794752334766 │
│ WN │ 314159 │ 1148649 │ 27.350304575200955 │
│ HP │ 69859 │ 264180 │ 26.44371262018321 │
│ US │ 185689 │ 886115 │ 20.95540646530078 │
│ AA │ 181789 │ 896349 │ 20.281051242317446 │
│ TW │ 64220 │ 319764 │ 20.08356162669969 │
│ DL │ 199886 │ 1089116 │ 18.353049629240594 │
│ NW │ 115102 │ 667317 │ 17.24847411350228 │
│ CO │ 78593 │ 474145 │ 16.575731052737034 │
│ MQ │ 17229 │ 108410 │ 15.89244534637026 │
│ AQ │ 1910 │ 15258 │ 12.518023332022546 │
└─────────┴────────┴─────────┴────────────────────┘
12 rows in set. Elapsed: 0.186 sec. Processed 13.95 million rows, 83.69 MB (75.06 million rows/s., 450.37 MB/s.)
A better version of the same query
SELECT
Carrier,
avg(DepDelay > 10) * 100 AS c3
FROM ontime
WHERE (Year >= 2000) AND (Year <= 2008)
GROUP BY Carrier
ORDER BY c3 DESC
┌─Carrier─┬─────────────────c3─┐
│ UA │ 28.65464002506794 │
│ AS │ 27.517947523347665 │
│ WN │ 27.35030457520095 │
│ HP │ 26.443712620183206 │
│ US │ 20.95540646530078 │
│ AA │ 20.281051242317446 │
│ TW │ 20.08356162669969 │
│ DL │ 18.353049629240594 │
│ NW │ 17.24847411350228 │
│ CO │ 16.575731052737034 │
│ MQ │ 15.892445346370259 │
│ AQ │ 12.518023332022546 │
└─────────┴────────────────────┘
12 rows in set. Elapsed: 0.129 sec. Processed 6.97 million rows, 55.79 MB (53.97 million rows/s., 431.75 MB/s.)
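The two versions return the same percentages because the comparison DepDelay > 10 evaluates to 0 or 1, so its average is the number of delayed flights divided by the total number of flights. This also lets the query scan the table once instead of twice (6.97 million rows processed instead of 13.95 million). The arithmetic can be checked against the c and c2 columns of the JOIN result; pct is a hypothetical helper name.

```shell
# Percentage of delayed flights: delayed count * 100 / total count.
pct() {
    awk -v c="$1" -v t="$2" 'BEGIN { printf "%.4f\n", c * 100 / t }'
}

pct 262451 915911   # UA: prints 28.6546
pct 1910 15258      # AQ: prints 12.5180
```

Both values agree with the c3 column of either query to four decimal places.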
Percentage of flights delayed by more than 10 minutes, by year
SELECT
Year,
avg(DepDelay > 10) * 100
FROM ontime
GROUP BY Year
ORDER BY Year ASC
┌─Year─┬─multiply(avg(greater(DepDelay, 10)), 100)─┐
│ 2000 │ 23.17167181619297 │
│ 2001 │ 17.505660117222323 │
└──────┴───────────────────────────────────────────┘
2 rows in set. Elapsed: 0.084 sec. Processed 6.97 million rows, 41.84 MB (83.21 million rows/s., 499.26 MB/s.)
Most popular destinations, ranked by the number of distinct origin cities, 2000 to 2010
SELECT
DestCityName,
uniqExact(OriginCityName) AS u
FROM ontime
WHERE (Year >= 2000) AND (Year <= 2010)
GROUP BY DestCityName
ORDER BY u DESC
LIMIT 10
┌─DestCityName──────────┬───u─┐
│ Chicago, IL │ 117 │
│ Dallas/Fort Worth, TX │ 115 │
│ Atlanta, GA │ 100 │
│ Minneapolis, MN │ 88 │
│ Houston, TX │ 81 │
│ Detroit, MI │ 81 │
│ St. Louis, MO │ 76 │
│ Charlotte, NC │ 70 │
│ Pittsburgh, PA │ 69 │
│ Newark, NJ │ 67 │
└───────────────────────┴─────┘
10 rows in set. Elapsed: 0.559 sec. Processed 6.97 million rows, 322.81 MB (12.47 million rows/s., 577.07 MB/s.)
Q10
SELECT
min(Year),
max(Year),
Carrier,
count(*) AS cnt,
sum(ArrDelayMinutes > 30) AS flights_delayed,
round(sum(ArrDelayMinutes > 30) / count(*), 2) AS rate
FROM ontime
WHERE (DayOfWeek NOT IN (6, 7)) AND (OriginState NOT IN ('AK', 'HI', 'PR', 'VI')) AND (DestState NOT IN ('AK', 'HI', 'PR', 'VI')) AND (FlightDate < '2010-01-01')
GROUP BY Carrier
HAVING (cnt > 100000) AND (max(Year) > 1990)
ORDER BY rate DESC
LIMIT 1000
┌─min(Year)─┬─max(Year)─┬─Carrier─┬────cnt─┬─flights_delayed─┬─rate─┐
│ 2000 │ 2001 │ UA │ 649862 │ 121889 │ 0.19 │
│ 2000 │ 2001 │ HP │ 192068 │ 27480 │ 0.14 │
│ 2000 │ 2001 │ AA │ 615877 │ 79539 │ 0.13 │
│ 2000 │ 2001 │ US │ 638984 │ 79708 │ 0.12 │
│ 2000 │ 2001 │ TW │ 224711 │ 25778 │ 0.11 │
│ 2000 │ 2001 │ WN │ 855501 │ 97260 │ 0.11 │
│ 2000 │ 2001 │ NW │ 483807 │ 52931 │ 0.11 │
│ 2000 │ 2001 │ CO │ 349268 │ 38560 │ 0.11 │
│ 2000 │ 2001 │ DL │ 764713 │ 79954 │ 0.1 │
└───────────┴───────────┴─────────┴────────┴─────────────────┴──────┘
9 rows in set. Elapsed: 0.370 sec. Processed 6.97 million rows, 84.86 MB (18.86 million rows/s., 229.45 MB/s.)
Q multi-dimensional 1
SELECT
Year,
OriginCityName,
DepartureDelayGroups,
CancellationCode,
avg(DepDelay > 10) * 100
FROM ontime
GROUP BY
Year,
OriginCityName,
DepartureDelayGroups,
CancellationCode
ORDER BY Year ASC
LIMIT 1
┌─Year─┬─OriginCityName───┬─DepartureDelayGroups─┬─CancellationCode─┬─multiply(avg(greater(DepDelay, 10)), 100)─┐
│ 2000 │ Fayetteville, NC │ 7 │ │ 100 │
└──────┴──────────────────┴──────────────────────┴──────────────────┴───────────────────────────────────────────┘
1 rows in set. Elapsed: 0.613 sec. Processed 6.97 million rows, 275.48 MB (11.38 million rows/s., 449.40 MB/s.)
In summary, performance in these tests was excellent. ClickHouse is a powerful tool for OLAP analysis.