Hadoop provides a complete solution for big data storage and computation, but its MapReduce computing framework is only suited to offline batch processing and cannot meet the performance requirements of fast, interactive ad-hoc queries.

Hive uses MapReduce as its underlying computing framework and was designed for batch processing. As data volumes grow, even a simple Hive query can take minutes to hours, which clearly cannot satisfy the needs of interactive querying.

Facebook began developing Presto in the fall of 2012; at Facebook it queries data on the order of 1 PB per day, and Facebook reports that Presto is more than 10x faster than Hive. Facebook officially open-sourced Presto in 2013.
Presto is an open-source distributed SQL query engine for OLAP. It supports datasets ranging from GB to PB in size and can return query results at second-level latency.

Note that although Presto parses SQL, it is not a conventional database; it cannot replace relational databases such as MySQL, PostgreSQL, or Oracle, and it is not designed for OLTP workloads.

Presto executes queries in a distributed fashion in order to query massive datasets efficiently.

Presto can query massive data stored on HDFS, but it is not limited to HDFS: it was designed to query many other data sources as well, for example HDFS, Hive, Druid, Kafka, Kudu, MySQL, Redis, and more. The figure below shows the data sources supported by Presto 0.237.
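Each of these sources is wired in through a catalog properties file under etc/catalog (the Hive catalog is configured later in this walkthrough). As a hedged illustration of the pattern, a MySQL catalog might look like the sketch below; the host, port, user, and password are placeholders, not values from this cluster:

```
# etc/catalog/mysql.properties (illustrative values)
connector.name=mysql
connection-url=jdbc:mysql://node03:3306
connection-user=root
connection-password=secret
```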
Official website: https://prestodb.io/

GitHub: https://github.com/prestodb/presto
Presto cluster plan
| Hostname | Role |
| --- | --- |
| node01 | coordinator |
| node02 | worker |
| node03 | worker |
The Presto launcher script is implemented in Python (launcher.py), so first make sure Python is installed:

```
python -V
```
Download the Presto server tarball:

https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.237/presto-server-0.237.tar.gz

Then upload the tar.gz package to the /kkb/soft directory on node01.
```
cd /kkb/soft/
tar -xzvf presto-server-0.237.tar.gz -C /kkb/install/
```
Presto requires Java 8. Extract the JDK and distribute it to node02 and node03:

```
cd /kkb/soft/
tar -xzvf jdk-8u251-linux-x64.tar.gz -C /kkb/install/
cd /kkb/install/
scp -r jdk1.8.0_251/ node02:$PWD
scp -r jdk1.8.0_251/ node03:$PWD
```
Create a symlink for convenience:

```
cd /kkb/install/
ln -s presto-server-0.237/ presto
```
Point the launcher script at the JDK:

```
vim /kkb/install/presto/bin/launcher
```

Add the following content:

```
PATH=/kkb/install/jdk1.8.0_251/bin:$PATH
java -version
```

Note: this must be added before the line `exec "$(dirname "$0")/launcher.py" "$@"`.
Create the data and etc directories:

```
cd /kkb/install/presto
mkdir data
mkdir etc
```

Create the JVM config:

```
cd /kkb/install/presto/etc
vim jvm.config
```
```
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
```
Create the catalog directory and configure the Hive connector:

```
cd /kkb/install/presto-server-0.237/etc
mkdir catalog
cd catalog
vim hive.properties
```

hive.properties:

```
connector.name=hive-hadoop2
hive.metastore.uri=thrift://node03:9083
```
Distribute Presto to the other nodes:

```
cd /kkb/install/
scp -r presto node02:/kkb/install/
scp -r presto node03:/kkb/install/
```
On each of the three nodes, go to the /kkb/install/presto/etc directory and edit the node.properties file:

```
cd /kkb/install/presto/etc
vim node.properties
```

node01:

```
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-fffffffffff1
node.data-dir=/kkb/install/presto/data
```

node02:

```
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-fffffffffff2
node.data-dir=/kkb/install/presto/data
```

node03:

```
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-fffffffffff3
node.data-dir=/kkb/install/presto/data
```
Notes:

- node.environment: the name of the environment; it must be identical across all nodes of the Presto cluster
- node.id: the identifier of each Presto node; it must be unique (one way to generate one is sketched below)
- node.data-dir: the directory where Presto stores logs and other data
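The fixed IDs above work, but any unique string is acceptable. A minimal sketch, assuming the uuidgen utility (shipped with util-linux on most distributions) is available on each node:

```
# Run once per node and paste the result into node.properties as node.id
uuidgen
```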
The config.properties file specifies whether a server acts as a coordinator or a worker.

Although a Presto server can act as both coordinator and worker at the same time, dedicating each server to a single role gives better performance.

Presto has a master/slave architecture: the master is the coordinator and the slaves are the workers.

Here node01 is set up as the coordinator, and node02 and node03 as workers.
Configure the coordinator on node01:

```
cd /kkb/install/presto/etc
vim config.properties
```

```
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8880
query.max-memory=50GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://node01:8880
```
Notes:

- coordinator=true: allow this Presto instance to act as the coordinator
- node-scheduler.include-coordinator: whether the coordinator is also allowed to run worker tasks
- http-server.http.port: Presto uses HTTP for all internal and external communication; this sets the HTTP server port
- query.max-memory: the upper limit on the total distributed memory a single query may use
- query.max-memory-per-node: the upper limit on the memory a single query may use on any one Presto server
- discovery-server.enabled: Presto uses the discovery service to find all nodes in the cluster
- discovery.uri: the URI of the discovery service
Configure the workers on node02 and node03:

```
cd /kkb/install/presto/etc
vim config.properties
```

```
coordinator=false
http-server.http.port=8880
query.max-memory=50GB
discovery.uri=http://node01:8880
```
Since the Hive catalog points at node03, start the Hive metastore on node03 first:

```
nohup hive --service metastore > /dev/null 2>&1 &
```
Start Presto on every node:

```
cd /kkb/install/presto
# Foreground start; logs are printed to the console
bin/launcher run
# Or start Presto in the background
bin/launcher start
```
Logs are written to /kkb/install/presto/data/var/log.
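To sanity-check that the coordinator is up, you can hit its REST info endpoint; a hedged sketch, using the HTTP port configured above:

```
# Should return a small JSON document with the server version and uptime
curl http://node01:8880/v1/info
```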
Download presto-cli-0.237-executable.jar and upload it to the /kkb/soft directory on node01, then rename it and make it executable:

```
cd /kkb/soft
mv presto-cli-0.237-executable.jar prestocli
chmod u+x prestocli
```
Note: start HDFS first.
Check how the Presto CLI jar is used:

```
./prestocli --help
```

Connect to the cluster:

```
./prestocli --server node01:8880 --catalog hive --schema default
```
Note: the hive in --catalog hive refers to the file name of hive.properties under etc/catalog.
Alternatively, run the jar directly:

```
java -jar presto-cli-0.237-executable.jar --server node01:8880 --catalog hive --schema default
```

Exit the CLI with:

```
quit
```
Working in the Presto CLI is much like working in the Hive CLI. Every table must be prefixed with its schema, for example:

```
select * from schema.table limit 5;
```

Or switch to the target schema first and then query:

```
use myhive;
select * from score limit 3;
```
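The CLI can also run a single statement non-interactively via its --execute option, which is handy for scripting; a minimal sketch using the score table from above:

```
./prestocli --server node01:8880 --catalog hive --schema myhive \
  --execute "select * from score limit 3"
```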
yanagishima is a web UI for Presto. Upload yanagishima-18.0.zip to /kkb/soft and extract it:

```
cd /kkb/soft
unzip -d /kkb/install yanagishima-18.0.zip
# If you see "-bash: unzip: command not found", unzip is not installed;
# install it and then extract again
sudo yum -y install unzip zip
cd /kkb/install/yanagishima-18.0
```
Edit the configuration:

```
cd /kkb/install/yanagishima-18.0/conf
vim yanagishima.properties
```

```
jetty.port=7080
presto.datasources=kkb-presto
presto.coordinator.server.kkb-presto=http://node01:8880
catalog.kkb-presto=hive
schema.kkb-presto=default
sql.query.engines=presto
```
Start yanagishima:

```
nohup bin/yanagishima-start.sh > yanagishima.log 2>&1 &
```

A process named YanagishimaServer now appears on node01.
Open the web UI:

http://node01:7080

You can now run queries from this page.

If the UI renders slowly or not at all, try replacing node01 with the corresponding IP address.
Viewing table structures: there is a Tree View that lets you browse the structure of all tables, including schemas, tables, and columns. Each table has a copy button next to it; clicking it copies the fully qualified table name. Paste a SQL statement into the box above, then press Ctrl+Enter or click the Run button to execute it and display the results.
For example, for select * from hive.myhive.score, the word hive can be dropped, i.e. select * from myhive.score; hive is the catalog name configured above.
Note: do not end SQL statements with a semicolon here; otherwise an error is reported.
The following examples use the hive connector.

List schemas:

```
SHOW SCHEMAS;
```

List tables:

```
SHOW TABLES;
```
Create a schema. Syntax:

```
CREATE SCHEMA [ IF NOT EXISTS ] schema_name
```

Example:

```
CREATE SCHEMA testschema;
```
Drop a schema. Syntax:

```
DROP SCHEMA [ IF EXISTS ] schema_name
```

Example:

```
drop schema testschema;
```
Create a table. Syntax:

```
CREATE TABLE [ IF NOT EXISTS ] table_name (
  column_name data_type [ COMMENT comment ]
  [, ...]
)
```

Example:

```
create table stu4(id int, name varchar(20));
```
Create a table from a query (CTAS). Syntax:

```
CREATE TABLE [ IF NOT EXISTS ] table_name [ ( column_alias, ... ) ]
[ COMMENT table_comment ]
[ WITH ( property_name = expression [, ...] ) ]
AS query
[ WITH [ NO ] DATA ]
```

Example:

```
create table if not exists myhive.stu5 as select id, name from stu1;
```
Delete rows. Syntax:

```
DELETE FROM table_name [ WHERE condition ]
```

Note: the hive connector only supports deleting a complete partition at a time; it does not support deleting individual rows.

Example:

```
DELETE FROM order_partition where month='2019-03';
```
View a table's structure:

```
DESCRIBE hive.myhive.stu1;
```
Collect table statistics. Syntax:

```
ANALYZE table_name
```

Example:

```
ANALYZE hive.myhive.stu1;
```
Prepared statements. Syntax:

```
PREPARE statement_name FROM statement
```

Examples:

```
prepare my_select1 from select * from score;
execute my_select1;

prepare my_select2 from select * from score where s_score < 90 and s_score > 70;
execute my_select2;

prepare my_select3 from select * from score where s_score < ? and s_score > ?;
execute my_select3 using 90, 70;
```
EXPLAIN. Syntax:

```
EXPLAIN [ ( option [, ...] ) ] statement
```

where option can be one of:

```
FORMAT { TEXT | GRAPHVIZ | JSON }
TYPE { LOGICAL | DISTRIBUTED | VALIDATE | IO }
```

Show the logical plan:

```
explain select s_id, avg(s_score) from score group by s_id;
```

which is equivalent to:

```
explain (type logical) select s_id, avg(s_score) from score group by s_id;
```

Show the distributed execution plan:

```
explain (type distributed) select s_id, avg(s_score) from score group by s_id;
```

Validate a statement:

```
explain (type validate) select s_id, avg(s_score) from score group by s_id;
```

Show IO details as JSON:

```
explain (type io, format json) select s_id, avg(s_score) from score group by s_id;
```
SELECT. Syntax:

```
[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expr [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition ]
[ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ LIMIT [ count | ALL ] ]
```

where from_item is one of:

```
table_name [ [ AS ] alias [ ( column_alias [, ...] ) ] ]
from_item join_type from_item [ ON join_condition | USING ( join_column [, ...] ) ]
```

join_type is one of:

```
[ INNER ] JOIN
LEFT [ OUTER ] JOIN
RIGHT [ OUTER ] JOIN
FULL [ OUTER ] JOIN
CROSS JOIN
```

and grouping_element is one of:

```
()
expression
GROUPING SETS ( ( column [, ...] ) [, ...] )
CUBE ( column [, ...] )
ROLLUP ( column [, ...] )
```

WITH clause: used to simplify nested subqueries.

```
select a, b
from (select s_id as a, avg(s_score) as b from score group by s_id) as tbl1;
```

is equivalent to:

```
with tbl1 as (select s_id as a, avg(s_score) as b from score group by s_id)
select a, b from tbl1;
```

WITH also supports multiple subqueries:

```
WITH
  t1 AS (SELECT a, MAX(b) AS b FROM x GROUP BY a),
  t2 AS (SELECT a, AVG(d) AS d FROM y GROUP BY a)
SELECT t1.*, t2.*
FROM t1
JOIN t2 ON t1.a = t2.a;
```

Relations in a WITH clause can be chained:

```
WITH
  x AS (SELECT a FROM t),
  y AS (SELECT a AS b FROM x),
  z AS (SELECT b AS c FROM y)
SELECT c FROM z;
```

GROUP BY:

```
select s_id as a, avg(s_score) as b from score group by s_id;
```

is equivalent to:

```
select s_id as a, avg(s_score) as b from score group by 1;
```

where 1 refers to the first column of the query output, s_id. The grouped column does not have to appear in the output:

```
select count(*) as b from score group by s_id;
```
Partition sensibly

Like Hive, Presto reads partition data according to table metadata; sensible partitioning reduces the amount of data Presto reads and improves query performance.
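A hedged sketch of creating a partitioned ORC table through the hive connector; the table and column names are illustrative, while partitioned_by and format are standard hive-connector table properties:

```
CREATE TABLE hive.myhive.access_log (
  vtime   varchar,
  stu     varchar,
  address varchar,
  day     varchar
)
WITH (
  partitioned_by = ARRAY['day'],  -- partition columns must come last
  format = 'ORC'
);
```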
Use columnar storage

Presto has specific optimizations for reading ORC files, so when creating tables in Hive that Presto will query, ORC is the recommended storage format. Presto's ORC support is better than its Parquet support.
Use compression

Compression reduces the IO bandwidth pressure of transferring data between nodes. Ad-hoc queries need fast decompression, so Snappy compression is recommended.
Pre-sort the data

For data that is already sorted, the ORC format can skip reading unnecessary data during the filtering stage of a query. For example, columns that are frequently filtered on can be pre-sorted, as in the sketch below.
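A minimal sketch of pre-sorting on the Hive side, reusing the score table's columns; bucketing with SORTED BY is one standard way to produce sorted ORC files (the table name and bucket count are illustrative):

```
-- Run in Hive, not Presto: within each bucket the ORC file is sorted
-- by s_score, so readers can skip stripes when filtering on s_score
CREATE TABLE score_sorted (
  s_id    string,
  c_id    string,
  s_score int
)
CLUSTERED BY (s_id) SORTED BY (s_score) INTO 8 BUCKETS
STORED AS ORC;
```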
Column pruning

Select only the columns you need: because storage is columnar, selecting only the needed columns speeds up reading and reduces data volume. Avoid reading all columns with *.

```
[GOOD]: SELECT s_id, c_id FROM score
[BAD]:  SELECT * FROM score
```
Filter on partition columns

For partitioned tables, prefer partition columns in the WHERE clause. Here day is the partition column and vtime is the exact access time:

```
[GOOD]: SELECT vtime, stu, address FROM tbl WHERE day=20200501
[BAD]:  SELECT * FROM tbl WHERE vtime=20200501
```
Optimize GROUP BY

Ordering the columns in a GROUP BY sensibly gives some performance improvement: list them in descending order of distinct-value count. Reducing the number of columns after GROUP BY also reduces memory usage. Here uid has many distinct values and gender has few:

```
[GOOD]: SELECT ... FROM tbl GROUP BY uid, gender
[BAD]:  SELECT ... FROM tbl GROUP BY gender, uid
```
Use LIMIT with ORDER BY

ORDER BY has to sort the full result set, so cap it with LIMIT whenever possible:

```
[GOOD]: SELECT * FROM tbl ORDER BY time LIMIT 100
[BAD]:  SELECT * FROM tbl ORDER BY time
```
Use approximate aggregate functions

For aggregations that can tolerate a small error, approx_distinct() is much cheaper than COUNT(DISTINCT ...):

```
select approx_distinct(s_id) from score;
```
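For comparison, the exact form below gives a precise count at a higher cost; this pairing is a sketch over the same score table:

```
-- Exact, but more expensive: must track every distinct s_id
select count(distinct s_id) from score;

-- Approximate: maintains a fixed-size sketch instead of the full set
select approx_distinct(s_id) from score;
```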
Replace multiple LIKE clauses with regexp_like

```
SELECT ...
FROM access
WHERE method LIKE '%GET%'
   OR method LIKE '%POST%'
   OR method LIKE '%PUT%'
   OR method LIKE '%DELETE%'
```

Optimized:

```
SELECT ...
FROM access
WHERE regexp_like(method, 'GET|POST|PUT|DELETE')
```
Put the large table on the left side of a join

By default Presto builds the hash table from the right-side table of a join, so join the large table to the small one:

```
[GOOD]: SELECT ... FROM large_table l JOIN small_table s ON l.id = s.id
[BAD]:  SELECT ... FROM small_table s JOIN large_table l ON l.id = s.id
```
Use the rank() function instead of row_number() to fetch Top N; see the sketch below.
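A hedged sketch of a per-course Top 3 using rank() over the score table from earlier (column names as used above):

```
-- Top 3 scores per course; unlike row_number(), rank() keeps ties,
-- which is the behavior this tip recommends for Top-N queries
SELECT s_id, c_id, s_score
FROM (
  SELECT s_id, c_id, s_score,
         rank() OVER (PARTITION BY c_id ORDER BY s_score DESC) AS rnk
  FROM score
) t
WHERE rnk <= 3;
```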
Use UNION ALL instead of UNION when deduplication is not needed: UNION ALL skips the dedup step. A sketch follows below.
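A minimal sketch with two hypothetical yearly tables, score_2019 and score_2020:

```
-- UNION would deduplicate across both inputs (extra hashing and memory);
-- UNION ALL simply concatenates the two result sets
SELECT s_id FROM score_2019
UNION ALL
SELECT s_id FROM score_2020;
```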
Use WITH: when a query is very complex or has multiple levels of nested subqueries, try pulling the subqueries out into a WITH clause (see the WITH examples in the SELECT section above).
Avoid clashing with keywords: MySQL quotes identifiers with backticks (`); Presto quotes them with double quotes. If an identifier is not a keyword, the double quotes can be omitted.
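A small sketch, using a hypothetical table with a column named date (a reserved word):

```
/* MySQL */
SELECT `date` FROM tbl;

/* Presto */
SELECT "date" FROM tbl;
```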
Time comparisons need an explicit timestamp literal in Presto:

```
/* MySQL */
SELECT t FROM a WHERE t > '2020-05-01 00:00:00';

/* Presto */
SELECT t FROM a WHERE t > timestamp '2020-05-01 00:00:00';
```