Presto Memory Configuration Explained

Presto divides memory into two pools: GENERAL_POOL and RESERVED_POOL. For my workload I disable RESERVED_POOL, so it is left out of the allocation and sizing discussed here.

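GENERAL_POOL holds two kinds of memory: user memory and system memory. System memory is used for input/output/exchange buffers, which hold the data actually being read and written; user memory is used for computation such as hash joins and aggregations. The relevant settings are described below: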

Note: a single SQL statement can be broken down, via its syntax tree, into multiple Queries.

  • experimental.reserved-pool-enabled: set to false to disable RESERVED_POOL

  • query.max-memory-per-node: the maximum user memory a single Query may use on a single Worker

  • query.max-total-memory-per-node: the maximum user memory + system memory a single Query may use on a single Worker

  • memory.heap-headroom-per-node: defaults to 0.3 × Xmx; the sum of this value and query.max-total-memory-per-node must not exceed Xmx

// AVAILABLE_HEAP_MEMORY defaults to Runtime.getRuntime().maxMemory(), i.e. the JVM Xmx
private DataSize heapHeadroom = new DataSize(AVAILABLE_HEAP_MEMORY * 0.3, BYTE);


// availableMemory defaults to Runtime.getRuntime().maxMemory(), i.e. the JVM Xmx
static void validateHeapHeadroom(NodeMemoryConfig config, long availableMemory)
    {
        long maxQueryTotalMemoryPerNode = config.getMaxQueryTotalMemoryPerNode().toBytes();
        long heapHeadroom = config.getHeapHeadroom().toBytes();
        if (heapHeadroom < 0 || heapHeadroom + maxQueryTotalMemoryPerNode > availableMemory) {
            throw new IllegalArgumentException(
                    format("Invalid memory configuration. The sum of max total query memory per node (%s) and heap headroom (%s) cannot be larger than the available heap memory (%s)",
                            maxQueryTotalMemoryPerNode,
                            heapHeadroom,
                            availableMemory));
        }
    }
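
The headroom rule is easy to check by hand before starting a worker. Below is a minimal sketch of the same arithmetic; `HeadroomCheck` and its methods are hypothetical names for illustration and are not part of Presto.

```java
// Hypothetical helper (not a Presto class): mirrors the arithmetic of validateHeapHeadroom.
final class HeadroomCheck {
    static final long GB = 1L << 30;

    // Default headroom: 30% of the available heap (Xmx).
    static long defaultHeadroomBytes(long xmxBytes) {
        return (long) (xmxBytes * 0.3);
    }

    // Valid when the headroom is non-negative and
    // headroom + query.max-total-memory-per-node fits inside the heap.
    static boolean isValid(long xmxBytes, long headroomBytes, long maxTotalPerNodeBytes) {
        return headroomBytes >= 0 && headroomBytes + maxTotalPerNodeBytes <= xmxBytes;
    }

    public static void main(String[] args) {
        long xmx = 100 * GB;
        // Explicit 20GB headroom with a 50GB per-node total limit: 70GB <= 100GB.
        System.out.println(isValid(xmx, 20 * GB, 50 * GB));
        // Default headroom (about 30GB) with a 75GB per-node total limit exceeds 100GB.
        System.out.println(isValid(xmx, defaultHeadroomBytes(xmx), 75 * GB));
    }
}
```

With the default headroom, query.max-total-memory-per-node can be at most about 70% of Xmx; lowering the headroom explicitly is the only way to go above that.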
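  • query.max-memory: the maximum user memory a single Query may use across the whole cluster. A common rule of thumb is query.max-memory-per-node × the execution parallelism of one query × 0.8, where the 0.8 factor allows for data skew.

  • query.max-total-memory: the maximum user + system memory a single Query may use across the whole cluster. If it is not configured, it defaults to twice query.max-memory.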

//com.facebook.presto.memory.MemoryManagerConfig
    @NotNull
    public DataSize getMaxQueryMemory()
    {
        return maxQueryMemory;
    }

    @Config("query.max-memory")
    public MemoryManagerConfig setMaxQueryMemory(DataSize maxQueryMemory)
    {
        this.maxQueryMemory = maxQueryMemory;
        return this;
    }

    @NotNull
    public DataSize getMaxQueryTotalMemory()
    {
        if (maxQueryTotalMemory == null) {
            return succinctBytes(maxQueryMemory.toBytes() * 2); // if unset, defaults to 2 × maxQueryMemory
        }
        return maxQueryTotalMemory;
    }

    @Config("query.max-total-memory")
    public MemoryManagerConfig setMaxQueryTotalMemory(DataSize maxQueryTotalMemory)
    {
        this.maxQueryTotalMemory = maxQueryTotalMemory;
        return this;
    }
//com.facebook.presto.memory.ClusterMemoryManager
this.maxQueryMemory = config.getMaxQueryMemory(); // query.max-memory; defaults to 20GB if unset
this.maxQueryTotalMemory = config.getMaxQueryTotalMemory(); // query.max-total-memory; defaults to 2 × max-memory if unset

if (!resourceOvercommit) {
    // The session property query_max_memory can also set this limit; the smaller value wins
    long userMemoryLimit = min(maxQueryMemory.toBytes(), getQueryMaxMemory(query.getSession()).toBytes());
    if (userMemoryReservation > userMemoryLimit) {
        query.fail(exceededGlobalUserLimit(succinctBytes(userMemoryLimit)));
        queryKilled = true;
    }

    // The session property query_max_total_memory can also set this limit; the smaller value wins
    long totalMemoryLimit = min(maxQueryTotalMemory.toBytes(), getQueryMaxTotalMemory(query.getSession()).toBytes());
    if (totalMemoryReservation > totalMemoryLimit) {
        query.fail(exceededGlobalTotalLimit(succinctBytes(totalMemoryLimit)));
        queryKilled = true;
    }
}
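
To make the interaction between the cluster config and the session properties concrete, here is a small sketch of the limit calculation under the rules described above. `ClusterLimitSketch` and its method names are hypothetical; only the min() and 2× default logic come from the Presto snippets.

```java
// Hypothetical illustration (not a Presto class) of how the cluster-wide limits are derived.
final class ClusterLimitSketch {
    static final long GB = 1L << 30;

    // query.max-total-memory falls back to 2 x query.max-memory when unset (null here).
    static long effectiveMaxTotalMemory(long maxQueryMemoryBytes, Long configuredTotalBytes) {
        return configuredTotalBytes == null ? maxQueryMemoryBytes * 2 : configuredTotalBytes;
    }

    // The enforced limit is the smaller of the cluster config and the session property.
    static long enforcedLimit(long configLimitBytes, long sessionLimitBytes) {
        return Math.min(configLimitBytes, sessionLimitBytes);
    }

    // A query is failed once its reservation goes above the enforced limit.
    static boolean exceeds(long reservationBytes, long limitBytes) {
        return reservationBytes > limitBytes;
    }

    public static void main(String[] args) {
        long maxMemory = 900 * GB;
        long totalLimit = effectiveMaxTotalMemory(maxMemory, null);
        System.out.println(totalLimit / GB); // 1800

        // Assume the session sets query_max_memory to 500GB.
        long userLimit = enforcedLimit(maxMemory, 500 * GB);
        System.out.println(userLimit / GB); // 500

        // A 600GB user-memory reservation is over the 500GB session limit.
        System.out.println(exceeds(600 * GB, userLimit)); // true
    }
}
```

Because the smaller value wins, a session property can only tighten the cluster-wide limit, never raise it.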
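  • query.initial-hash-partitions: controls the execution parallelism of a single Query. The default is 100, and the effective value is also capped by the number of workers. It is best to derive max-memory from this setting. Many examples online use 8, but on a cluster with many nodes 8 is too small to make good use of the cluster's resources.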

//com.facebook.presto.sql.planner.SystemPartitioningHandle

public NodePartitionMap getNodePartitionMap(Session session, NodeScheduler nodeScheduler)
    {
        NodeSelector nodeSelector = nodeScheduler.createNodeSelector(session, null);
        List nodes;
        if (partitioning == SystemPartitioning.COORDINATOR_ONLY) {
            nodes = ImmutableList.of(nodeSelector.selectCurrentNode());
        }
        else if (partitioning == SystemPartitioning.SINGLE) {
            nodes = nodeSelector.selectRandomNodes(1);
        }
        else if (partitioning == SystemPartitioning.FIXED) {
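            // This is the relevant line: the node count is capped by the hash partition count and the max tasks per stage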
            nodes = nodeSelector.selectRandomNodes(min(getHashPartitionCount(session), getMaxTasksPerStage(session)));
        }
        else {
            throw new IllegalArgumentException("Unsupported plan distribution " + partitioning);
        }

        checkCondition(!nodes.isEmpty(), NO_NODES_AVAILABLE, "No worker nodes available");

        return new NodePartitionMap(nodes, split -> {
            throw new UnsupportedOperationException("System distribution does not support source splits");
        });
    }
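
Putting the rule of thumb for query.max-memory together with this parallelism cap gives a quick estimate. The sketch below is only an illustration: `MaxMemoryEstimate` is a hypothetical name, the worker-count cap reflects the note above that parallelism is limited by the number of workers, and passing Integer.MAX_VALUE for max tasks per stage is an assumption meaning "no extra cap".

```java
// Hypothetical estimator (not a Presto class) for the query.max-memory rule of thumb.
final class MaxMemoryEstimate {
    // Parallelism is bounded by the hash partition count, the max tasks per stage,
    // and the number of workers that actually exist.
    static int effectiveParallelism(int hashPartitionCount, int maxTasksPerStage, int workerCount) {
        return Math.min(Math.min(hashPartitionCount, maxTasksPerStage), workerCount);
    }

    // query.max-memory ~= query.max-memory-per-node x parallelism x skew factor.
    static long estimateMaxMemoryGb(long perNodeGb, int parallelism, double skewFactor) {
        return Math.round(perNodeGb * parallelism * skewFactor);
    }

    public static void main(String[] args) {
        // 100 hash partitions on a 40-worker cluster: parallelism is 40.
        int wide = effectiveParallelism(100, Integer.MAX_VALUE, 40);
        System.out.println(estimateMaxMemoryGb(40, wide, 0.8)); // 1280

        // 8 hash partitions on the same cluster: parallelism is only 8.
        int narrow = effectiveParallelism(8, Integer.MAX_VALUE, 40);
        System.out.println(estimateMaxMemoryGb(40, narrow, 0.8)); // 256
    }
}
```

The estimate is an upper bound to start tuning from, not a value Presto computes itself.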

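This leads to the configuration below. Treat it as a starting point: it still needs to be tuned and monitored against the actual workload.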

# Xmx 100G
# 40 workers
http-server.http.port=8070
experimental.reserved-pool-enabled=false
memory.heap-headroom-per-node=20GB
query.max-total-memory-per-node=50GB
query.max-memory-per-node=40GB
query.max-memory=900GB
query.low-memory-killer.policy=total-reservation-on-blocked-nodes
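
The per-node values in this file can be sanity-checked before deployment. The sketch below loads the properties text and applies two checks: the headroom rule from validateHeapHeadroom, and the expectation (implied by the definitions above, and treated here as an assumption rather than a documented Presto rule) that query.max-memory-per-node does not exceed query.max-total-memory-per-node. `SampleConfigCheck` is a hypothetical helper and only understands sizes written in GB.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical checker (not a Presto class) for the per-node memory settings.
final class SampleConfigCheck {
    // Loads key=value lines; lines starting with '#' are treated as comments.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Reads a size such as "50GB"; this sketch only handles the GB unit.
    static long gb(Properties props, String key) {
        String value = props.getProperty(key).trim();
        if (!value.endsWith("GB")) {
            throw new IllegalArgumentException(key + " must be given in GB: " + value);
        }
        return Long.parseLong(value.substring(0, value.length() - 2));
    }

    // user <= total, and headroom + total <= Xmx.
    static boolean perNodeLimitsFit(Properties props, long xmxGb) {
        long headroom = gb(props, "memory.heap-headroom-per-node");
        long total = gb(props, "query.max-total-memory-per-node");
        long user = gb(props, "query.max-memory-per-node");
        return user <= total && headroom + total <= xmxGb;
    }

    public static void main(String[] args) {
        String config = "# Xmx 100G\n"
                + "memory.heap-headroom-per-node=20GB\n"
                + "query.max-total-memory-per-node=50GB\n"
                + "query.max-memory-per-node=40GB\n";
        System.out.println(perNodeLimitsFit(parse(config), 100)); // true
    }
}
```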
