Presto vs. MySQL: A Comparison

I. Overview

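Official site: https://prestodb.io/

China-hosted site (JD.com): https://prestodb.jd.com/

Presto is a distributed SQL query engine developed at Facebook, designed specifically for fast, real-time data analysis.

It was created to address two problems: Hive's MapReduce execution model is too slow, and data in HDFS cannot be presented directly in BI tools or dashboards.

Presto is purely a compute engine. It stores no data itself; it reads data from third-party storage services through Connectors.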

 

II. Details

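1. Architecture

  • Master-Slave architecture
  • Three modules
    • Coordinator, Discovery Service, Worker
  • Connector

Presto follows the common Master-Slave architecture: the Coordinator is the master and the Workers are the slaves. The Discovery Service keeps track of Worker node information, and the components talk to each other over HTTP. Connectors fetch metadata and raw data from third-party storage.

The Coordinator parses SQL statements, generates the execution plan, and distributes tasks to Worker nodes; the Workers execute the actual query tasks. On startup, each Worker registers with the Discovery Service, and the Coordinator asks the Discovery Service for the Workers that are currently healthy. If the Hive Connector is configured, a Hive MetaStore service must also be configured to supply Hive metadata to Presto, and the Workers read the data by talking to HDFS.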
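The registration handshake described above can be pictured with a small in-memory model. This is a toy sketch, not Presto's implementation: the class `DiscoveryService`, its method names, the worker URIs, and the 30-second heartbeat timeout are all assumptions made for illustration.

```python
import time


class DiscoveryService:
    """Toy registry: workers announce themselves, the coordinator lists live ones."""

    def __init__(self, heartbeat_timeout=30.0):
        # heartbeat_timeout is a hypothetical value, not a Presto default.
        self.heartbeat_timeout = heartbeat_timeout
        self._last_seen = {}  # worker URI -> time of its last announcement

    def announce(self, worker_uri, now=None):
        """Called by a worker at startup and periodically afterwards."""
        self._last_seen[worker_uri] = time.monotonic() if now is None else now

    def active_workers(self, now=None):
        """Called by the coordinator to find workers it can schedule tasks on."""
        now = time.monotonic() if now is None else now
        return sorted(
            uri for uri, seen in self._last_seen.items()
            if now - seen <= self.heartbeat_timeout
        )


discovery = DiscoveryService()
discovery.announce("http://worker-1:8080", now=0.0)
discovery.announce("http://worker-2:8080", now=0.0)
discovery.announce("http://worker-1:8080", now=25.0)  # worker-1 keeps announcing

# At t=40 only worker-1 has announced within the last 30 seconds.
print(discovery.active_workers(now=40.0))
```

The point of the sketch is the division of roles: workers only push announcements, and the coordinator only reads the list of workers that are still considered alive.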

 

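2. Query flow

  • The client sends a query request over HTTP.
  • The Coordinator finds the available Workers through the Discovery Service.
  • The Coordinator builds the query plan (Connector plugins supply the metadata).
  • The Coordinator sends tasks to the Workers.
  • The Workers read data through the Connector plugins.
  • The Workers execute their tasks in memory (a Worker is a purely in-memory compute engine).
  • The Workers return their results to the Coordinator, which then responds to the client.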
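The steps above amount to a scatter/gather pattern, which can be sketched for a simple count query. This is a simplified model under stated assumptions: threads stand in for Worker nodes, a Python list stands in for data read through a Connector, and the function names `plan_splits`, `worker_count_matching`, and `run_count_query` are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_splits(rows, worker_count):
    """Coordinator side: cut the input into one split per worker."""
    return [rows[i::worker_count] for i in range(worker_count)]


def worker_count_matching(split, predicate):
    """Worker side: scan the split in memory and return a partial count."""
    return sum(1 for row in split if predicate(row))


def run_count_query(rows, predicate, worker_count=3):
    """Coordinator side: send tasks to workers, then merge the partial results."""
    splits = plan_splits(rows, worker_count)
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        partials = list(
            pool.map(lambda split: worker_count_matching(split, predicate), splits)
        )
    return sum(partials)


# Hypothetical rows standing in for a table read through a Connector.
visits = [{"visit_type": t} for t in ["A", "B", "A", "C", "B", "A"]]
result = run_count_query(visits, lambda row: row["visit_type"] != "A")
print(result)  # 3
```

Each worker returns only a partial count, so the coordinator's final step is a cheap sum rather than a pass over the raw rows.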


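3. Pros and cons

Pros:

  • Built for ad-hoc queries, with expected query times from seconds to a few minutes
  • About 10x faster than Hive
  • Supports multiple data sources, such as Hive, Kafka, MySQL, MongoDB, Redis, and JMX; you can also implement your own Connector
  • Client protocol: HTTP + JSON, with support for various languages (Python, Ruby, PHP, Node.js, Java)
  • Supports JDBC/ODBC connections
  • ANSI SQL, including window functions, joins, aggregations, and complex queries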
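As an illustration of the HTTP + JSON client protocol mentioned in the list above, here is a minimal sketch using only the Python standard library. It assumes the prestodb REST conventions as I understand them: the query text is POSTed to `/v1/statement` with an `X-Presto-User` header, and each JSON response may carry `data` plus a `nextUri` to poll. Check these against the version you run. The function names and the coordinator host in the comment are hypothetical.

```python
import json
import urllib.request


def build_statement_request(coordinator_url, sql, user):
    """Build the initial POST that submits a query to the coordinator."""
    return urllib.request.Request(
        url=coordinator_url.rstrip("/") + "/v1/statement",
        data=sql.encode("utf-8"),
        headers={"X-Presto-User": user},
        method="POST",
    )


def http_fetch(request_or_url):
    """Send a request (or GET a URL) and decode the JSON response body."""
    with urllib.request.urlopen(request_or_url) as response:
        return json.loads(response.read().decode("utf-8"))


def collect_rows(first_page, fetch=http_fetch):
    """Follow nextUri links until the query finishes, gathering row data."""
    rows, page = [], first_page
    while True:
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        rows.extend(page.get("data", []))
        next_uri = page.get("nextUri")
        if next_uri is None:
            return rows
        page = fetch(next_uri)


# Usage against a real cluster (host name is a placeholder):
# request = build_statement_request("http://coordinator.example:8080", "SELECT 1", "analyst")
# rows = collect_rows(http_fetch(request))
```

Passing `fetch` as a parameter keeps the paging loop testable without a running coordinator.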

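Cons:

  • No fault tolerance. A query is distributed across multiple Workers; if any one Worker fails for whatever reason, the master notices and the whole query fails. Presto has no retry mechanism, so retries have to be implemented on the user side.
  • Memory limitations for aggregations and huge joins. A multi-table join, for example, needs a lot of memory. Because Presto computes purely in memory, it does not dump intermediate results to disk when memory runs out, so the query fails. The latest versions of Presto do support spilling to disk; that still needs testing and investigation.
  • MPP (Massively Parallel Processing) architecture. This is not really a drawback in itself, since MPP exists precisely to analyze large volumes of data, but it has an obvious weakness: when querying a Hive data source, if one Worker processes data slowly because of load, the whole query is affected, because downstream stages have to wait for upstream results.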
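Since retries are left to the user, a client-side wrapper is one way to cover the first point above. The following is a minimal sketch, not a recommended production policy: `QueryFailed`, `run_with_retry`, and the `run_query` callable are hypothetical names, and the exponential backoff values are arbitrary. It re-submits the whole query, which is only safe for read-only statements.

```python
import time


class QueryFailed(Exception):
    """Raised by the (hypothetical) run_query callable when a query fails."""


def run_with_retry(run_query, sql, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Re-submit the whole query when it fails, with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query(sql)
        except QueryFailed:
            if attempt == max_attempts:
                raise  # give up and surface the last failure
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for a real client call: fails twice, then succeeds.
attempts = []


def flaky_query(sql):
    attempts.append(sql)
    if len(attempts) < 3:
        raise QueryFailed("worker lost")
    return [[42]]


delays = []  # record the backoff delays instead of actually sleeping
rows = run_with_retry(flaky_query, "SELECT count(*) FROM t", sleep=delays.append)
print(rows, delays)  # [[42]] [1.0, 2.0]
```

Because a failed Presto query loses all of its progress, every retry pays the full query cost again, so a small `max_attempts` is usually the sensible choice.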

III. Comparing MySQL and Presto statements

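Each test below lists the row count, the MySQL statement and its execution time, then the Presto statement and its execution time. Presto times are written as compute time + read time. In the statements, 表, 字段, and 条件 are placeholders for the table name, a column name, and a filter condition.

Rows: 22131166 (indexed table)
MySQL statement: SELECT count(*) FROM 表
MySQL execution time: 3.226 s
Presto statement: SELECT count(*) FROM 表
Presto execution time (compute + read): 1.91 s + 3.705 s

Rows: 173655288 (table without an index)
MySQL statement: SELECT count(*) FROM 表
MySQL execution time: 1 m 39 s
Presto statement: SELECT count(*) FROM 表
Presto execution time (compute + read): 1.93 s + 2 m 13 s

Rows: 22131166
MySQL statement: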

SELECT a.s, a.c
, concat(round(if(a.c = 0, 0, a.s * 100 / a.c), 2), '%')
FROM (
SELECT SUM(visit_type NOT IN ('***', '***')) AS s, COUNT(*) AS c
FROM  表
) a

MySQL execution time: 10.715 s

Presto statement:

with a as (
select COUNT(*) AS num
FROM 表
),
b as (
select COUNT(*) AS num
FROM 表
where visit_type not in ('***', '***')
)
select a.num,b.num
from a,b

Presto execution time (compute + read): 1.151 s + 22.638 s
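The MySQL statement in this test sums a boolean expression, which relies on MySQL treating TRUE as 1, while the Presto statement counts the same rows in a separate CTE with a WHERE clause. As a sanity check that the two formulations count the same thing, here is a sketch using Python's built-in sqlite3 module (SQLite, like MySQL, evaluates such a condition to 0 or 1). The table `visits`, its rows, and the values 'x' and 'y' standing in for the masked '***' literals are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visit_type TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?)",
    [("x",), ("y",), ("a",), ("b",), ("a",)],
)

# MySQL-style: sum a boolean expression in a single pass over the table.
single_pass = conn.execute(
    "SELECT SUM(visit_type NOT IN ('x', 'y')), COUNT(*) FROM visits"
).fetchone()

# Presto-style, as in the test above: one CTE per count.
two_ctes = conn.execute(
    """
    WITH a AS (SELECT COUNT(*) AS num FROM visits),
         b AS (SELECT COUNT(*) AS num FROM visits
               WHERE visit_type NOT IN ('x', 'y'))
    SELECT b.num, a.num FROM a, b
    """
).fetchone()

print(single_pass, two_ctes)  # (3, 5) (3, 5)
```

The two-CTE form reads the table twice. Presto also provides a `count_if(...)` aggregate that could express the filtered count in a single pass; whether that would change the timings reported here was not measured.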

Rows: 591842
MySQL statement:

SELECT filter.name, filter.num
, concat(round(if(total.num = 0, 0, filter.num / total.num), 2), '%')
FROM (
SELECT COUNT(*) AS num
FROM 表
WHERE 条件
) total, (
SELECT image_modality AS name, COUNT(*) AS num
FROM 表
WHERE 条件
GROUP BY image_modality
ORDER BY COUNT(*) DESC
LIMIT 20
) filter

MySQL execution time: 876 ms

Presto statement:

WITH a AS (
SELECT COUNT(*) AS num
FROM mysql2."46300194x_normaldb_1.3.1_lung_clean".image_record
WHERE 条件
),
b AS (
SELECT image_modality AS name, COUNT(*) AS num
FROM 表
WHERE 条件
GROUP BY image_modality
ORDER BY COUNT(*) DESC
LIMIT 20
)
SELECT a.num, b.num, b.name
FROM a, b

Presto execution time (compute + read): 1.154 s + 321 ms

Rows: 629022
MySQL statement:

SELECT keyword_count.keyword, keyword_count.row_count
, concat(round(if(total_count.row_count = 0, 0, keyword_count.row_count * 100 / total_count.row_count), 2), '%')
, keyword_count.patient_count
, concat(round(if(total_count.patient_count = 0, 0, keyword_count.patient_count * 100 / total_count.patient_count), 2), '%')
FROM (
SELECT '毒物' AS keyword, COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 字段 LIKE '%毒物%'
UNION
SELECT '射线' AS keyword, COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 字段 LIKE '%射线%'
UNION
SELECT '吸烟' AS keyword, COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 字段 LIKE '%吸烟%'
UNION
SELECT '过敏' AS keyword, COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 字段 LIKE '%过敏%'
UNION
SELECT '手术' AS keyword, COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 字段 LIKE '%手术%'
UNION
SELECT '外伤' AS keyword, COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 字段 LIKE '%外伤%'
) keyword_count, (
SELECT COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 1 = 1
) total_count

MySQL execution time: 16.566 s

Presto statement:

WITH keyword_count AS (
SELECT '毒物' AS keyword, COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 字段 LIKE '%毒物%'
UNION
SELECT '射线' AS keyword, COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 字段 LIKE '%射线%'
UNION
SELECT '吸烟' AS keyword, COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 字段 LIKE '%吸烟%'
UNION
SELECT '过敏' AS keyword, COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 字段 LIKE '%过敏%'
UNION
SELECT '手术' AS keyword, COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 字段 LIKE '%手术%'
UNION
SELECT '外伤' AS keyword, COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 字段 LIKE '%外伤%'
),
total_count AS (
SELECT COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 1 = 1
)
SELECT keyword_count.keyword, keyword_count.row_count, keyword_count.patient_count
FROM keyword_count

Presto execution time (compute + read): 1.531 s + 2.852 s

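Summary:

  • Compared with MySQL, Presto is better suited to complex SQL computation.
  • Presto reads the data first and then computes on every execution, so simple computations are slower than running them directly in MySQL.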
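To make the comparison behind these two points concrete, the timings reported in this section can be totalled. This is a minimal sketch: the helper `to_seconds` and the `measurements` list are names introduced here, the numbers are copied from the tests in this section, and Presto's total is taken as compute time plus read time.

```python
import re

UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def to_seconds(text):
    """Convert a duration such as '1.93 s + 2 m 13 s' or '876 ms' to seconds."""
    total = 0.0
    # "ms" is listed before "s" and "m" so that it is matched as one unit.
    for value, unit in re.findall(r"(\d+(?:\.\d+)?)\s*(ms|s|m)", text):
        total += float(value) * UNIT_SECONDS[unit]
    return total


# (rows, MySQL time, Presto compute + read time), copied from the tests above.
measurements = [
    (22131166, "3.226 s", "1.91 s + 3.705 s"),
    (173655288, "1 m 39 s", "1.93 s + 2 m 13 s"),
    (22131166, "10.715 s", "1.151 s + 22.638 s"),
    (591842, "876 ms", "1.154 s + 321 ms"),
    (629022, "16.566 s", "1.531 s + 2.852 s"),
]

winners = []
for rows, mysql_time, presto_time in measurements:
    mysql_s, presto_s = to_seconds(mysql_time), to_seconds(presto_time)
    faster = "Presto" if presto_s < mysql_s else "MySQL"
    winners.append(faster)
    print(f"{rows:>10}  MySQL {mysql_s:8.3f} s  Presto {presto_s:8.3f} s  faster: {faster}")
```

On these numbers MySQL has the lower total in the first four tests, and Presto wins only the last one, the six-branch UNION query with COUNT(DISTINCT ...), which is consistent with the summary.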

IV. Presto SQL performance on MySQL vs. Hive data sources

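Each test lists the Presto statement, then the row count and elapsed time when the statement runs against the MySQL source and against the Hive source. Times keep the a + b form used in the previous section.

Presto statement: select count(*) from 表
MySQL source: 22131166 rows, 1.294 s + 3.414 s
Hive source: 163612711 rows, 1.88 s + 303 ms

Presto statement: SELECT count(*) FROM 表 where 条件
MySQL source: 173655288 rows, 12.907 s + 3 m 8 s
Hive source: 114279824 rows, 1.349 s + 3 ms

Presto statement: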
WITH keyword_count AS (
SELECT '毒物' AS keyword, COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 字段 LIKE '%毒物%'
UNION
SELECT '射线' AS keyword, COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 字段 LIKE '%射线%'
UNION
SELECT '吸烟' AS keyword, COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 字段 LIKE '%吸烟%'
UNION
SELECT '过敏' AS keyword, COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 字段 LIKE '%过敏%'
UNION
SELECT '手术' AS keyword, COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 字段 LIKE '%手术%'
UNION
SELECT '外伤' AS keyword, COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 字段 LIKE '%外伤%'
),
total_count AS (
SELECT COUNT(*) AS row_count, COUNT(DISTINCT patient_id) AS patient_count
FROM 表
WHERE 1 = 1
)
SELECT keyword_count.keyword, keyword_count.row_count, keyword_count.patient_count
FROM keyword_count

MySQL source: 629022 rows, 1.510 s + 2.978 s
Hive source: 2627048 rows, 1.141 s + 142 ms

 

Summary:

  • Through Presto, the Hive source is much faster than the MySQL source in both compute time and read time.

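V. Questions collected so far

1. Does Presto process data in batches, or read everything first and then process it?

Presto itself supports batch processing, but the MySQL connector does not. Against MySQL it therefore reads all the data first and then processes it, whereas against Hive it works in batches.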
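The difference between the two modes can be pictured with a small model. This is only an illustration of the explanation above, not the connectors' actual code: `read_rows` is a stand-in data source, and the two counting functions are hypothetical. Each returns the row count together with the largest number of rows held in memory at once.

```python
def read_rows(total):
    """Stand-in for a data source that yields rows one at a time."""
    for i in range(total):
        yield {"id": i}


def count_in_batches(source, batch_size):
    """Batch mode in this sketch: process each batch as soon as it is full."""
    count, peak_buffered, batch = 0, 0, []
    for row in source:
        batch.append(row)
        if len(batch) == batch_size:
            peak_buffered = max(peak_buffered, len(batch))
            count += len(batch)
            batch = []  # the processed batch can be released
    peak_buffered = max(peak_buffered, len(batch))
    count += len(batch)  # last, possibly partial, batch
    return count, peak_buffered


def count_after_full_read(source):
    """Read-all mode in this sketch: buffer every row, then process."""
    rows = list(source)
    return len(rows), len(rows)


print(count_in_batches(read_rows(10_000), batch_size=1_000))  # (10000, 1000)
print(count_after_full_read(read_rows(10_000)))               # (10000, 10000)
```

In batch mode processing can start before the read finishes and the buffer stays bounded by the batch size; in read-all mode nothing is processed until the last row has arrived.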

 

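2. Why is Presto so much faster querying Hive than querying MySQL?

With Hive, data is read and processed in batches; with MySQL, everything has to be read first and only then processed.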

 

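3. Why does one Presto SQL statement get split into several identical MySQL statements?

After Presto splits a query into subqueries, it does not merge the SQL statements that it pushes down, so each subquery reaches MySQL as its own statement.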
