Why Is ClickHouse So Fast?

This article explains, at the level of underlying principles, why ClickHouse is so fast for single-table queries:
Why ClickHouse so Fast!

An excerpt follows:

Attention to Low-Level Details

But many other database management systems use similar techniques. What really makes ClickHouse stand out is attention to low-level details. Most programming languages provide implementations for most common algorithms and data structures, but they tend to be too generic to be effective. Every task can be considered as a landscape with various characteristics, instead of just throwing in random implementation. For example, if you need a hash table, here are some key questions to consider:

  • Which hash function to choose?
  • Collision resolution algorithm: open addressing vs chaining?
  • Memory layout: one array for keys and values or separate arrays? Will it store small or large values?
  • Fill factor: when and how to resize? How to move values around on resize?
  • Will values be removed and which algorithm will work better if they will?
  • Will we need fast probing with bitmaps, inline placement of string keys, support for non-movable values, prefetch, and batching?

Hash table is a key data structure for GROUP BY implementation and ClickHouse automatically chooses one of 30+ variations for each specific query.

The same goes for algorithms, for example, in sorting you might consider:

  • What will be sorted: an array of numbers, tuples, strings, or structures?
  • Is all data available completely in RAM?
  • Do we need a stable sort?
  • Do we need a full sort? Maybe partial sort or n-th element will suffice?
  • How to implement comparisons?
  • Are we sorting data that has already been partially sorted?
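
As a small illustration of the "full sort vs partial sort vs n-th element" question in the list above, here is a minimal sketch using only the C++ standard library. The helper names `smallestN` and `kthSmallest` are hypothetical and are not ClickHouse functions; the point is only that a query needing the first few rows, or a single order statistic, does not have to pay for a full sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not ClickHouse code): the `limit` smallest values,
// in ascending order.
std::vector<int> smallestN(std::vector<int> values, std::size_t limit)
{
    limit = std::min(limit, values.size());
    // Only the first `limit` positions end up sorted; the tail stays unordered.
    std::partial_sort(values.begin(), values.begin() + limit, values.end());
    values.resize(limit);
    return values;
}

// Hypothetical helper: the k-th smallest value (0-based) without sorting
// the whole array.
int kthSmallest(std::vector<int> values, std::size_t k)
{
    assert(k < values.size());
    std::nth_element(values.begin(), values.begin() + k, values.end());
    return values[k];
}
```

For example, `smallestN({5, 1, 4, 2, 3}, 2)` yields `{1, 2}` and `kthSmallest({5, 1, 4, 2, 3}, 2)` yields `3`.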

Algorithms that rely on characteristics of the data they are working with can often do better than their generic counterparts. If those characteristics are not known in advance, the system can try various implementations and choose the one that works best at runtime. For example, see an article on how LZ4 decompression is implemented in ClickHouse.

Last but not least, the ClickHouse team always monitors the Internet on people claiming that they came up with the best implementation, algorithm, or data structure to do something and tries it out. Those claims mostly appear to be false, but from time to time you’ll indeed find a gem.

Below, the Aggregator implementation (src/Interpreters/Aggregator.cpp) serves as an example of how ClickHouse executes in batches:



template <typename Key, typename Hash, typename TState = HashTableNoState>
struct HashTableCell
{
    using State = TState;

    using key_type = Key;
    using value_type = Key;
    using mapped_type = VoidMapped;

    Key key;
};

template <typename Key, typename Cell, typename Hash, typename Grower, typename Allocator>
class HashTable
{
public:
    using key_type = Key;
    using mapped_type = typename Cell::mapped_type;
    using value_type = typename Cell::value_type;
    using cell_type = Cell;
};

template <
    typename Key,
    typename Cell,
    typename Hash = DefaultHash<Key>,
    typename Grower = HashTableGrowerWithPrecalculation<>,
    typename Allocator = HashTableAllocator>
class HashMapTable : public HashTable<Key, Cell, Hash, Grower, Allocator>
{
public:
    using Self = HashMapTable;
    using Base = HashTable<Key, Cell, Hash, Grower, Allocator>;
    using LookupResult = typename Base::LookupResult;
    using Iterator = typename Base::iterator;

    using Base::Base;
    using Base::prefetch;
};


template <
    typename Key,
    typename Mapped,
    typename Hash = DefaultHash<Key>,
    typename Grower = HashTableGrowerWithPrecalculation<>,
    typename Allocator = HashTableAllocator>
using HashMap = HashMapTable<Key, HashMapCell<Key, Mapped, Hash>, Hash, Grower, Allocator>;



using AggregatedDataWithUInt64Key = HashMap<UInt64, AggregateDataPtr, HashCRC32<UInt64>>;

struct AggregatedDataVariants : private boost::noncopyable
{
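    std::unique_ptr<AggregationMethodOneNumber<UInt8, AggregatedDataWithUInt8Key, false>>           key8;
    std::unique_ptr<AggregationMethodOneNumber<UInt16, AggregatedDataWithUInt16Key, false>>         key16;

    std::unique_ptr<AggregationMethodOneNumber<UInt32, AggregatedDataWithUInt64Key>>         key32;
    std::unique_ptr<AggregationMethodOneNumber<UInt64, AggregatedDataWithUInt64Key>>         key64;
    std::unique_ptr<AggregationMethodStringNoCache<AggregatedDataWithShortStringKey>>               key_string;
    std::unique_ptr<AggregationMethodFixedStringNoCache<AggregatedDataWithShortStringKey>>          key_fixed_string;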

    ....
};


void Aggregator::executeImpl(
    AggregatedDataVariants & result,
    size_t row_begin,
    size_t row_end,
    ColumnRawPtrs & key_columns,
    AggregateFunctionInstruction * aggregate_instructions,
    bool no_more_keys,
    AggregateDataPtr overflow_row) const
{
    if (false) {} // NOLINT

    else if (result.type == AggregatedDataVariants::Type::key8)
            executeImpl(*result.key8, result.aggregates_pool, row_begin, row_end, key_columns, aggregate_instructions, no_more_keys, overflow_row);

    else if (result.type == AggregatedDataVariants::Type::key16)
            executeImpl(*result.key16, result.aggregates_pool, row_begin, row_end, key_columns, aggregate_instructions, no_more_keys, overflow_row);

    else if (result.type == AggregatedDataVariants::Type::key32)
            executeImpl(*result.key32, result.aggregates_pool, row_begin, row_end, key_columns, aggregate_instructions, no_more_keys, overflow_row);

    else if (result.type == AggregatedDataVariants::Type::key64)
            executeImpl(*result.key64, result.aggregates_pool, row_begin, row_end, key_columns, aggregate_instructions, no_more_keys, overflow_row);

    ....
}

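// The functions below use key64 as the example to show what the template parameters resolve to.
// With key64, the Method below is the type held by AggregatedDataVariants::key64,
// i.e.: std::unique_ptr<AggregationMethodOneNumber<UInt64, AggregatedDataWithUInt64Key>>         key64;

// The definition of AggregationMethodOneNumber is simple: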

/// For the case where there is one numeric key.
/// FieldType is UInt8/16/32/64 for any type with corresponding bit width.
template <typename FieldType, typename TData, bool consecutive_keys_optimization = true>
struct AggregationMethodOneNumber
{
    using Data = TData;
    using Key = typename Data::key_type;
    using Mapped = typename Data::mapped_type;

    Data data;

    AggregationMethodOneNumber() = default;

    explicit AggregationMethodOneNumber(size_t size_hint) : data(size_hint) { }

    template <typename Other>
    explicit AggregationMethodOneNumber(const Other & other) : data(other.data)
    {
    }

    /// To use one `Method` in different threads, use different `State`.
    using State = ColumnsHashing::HashMethodOneNumber<typename Data::value_type, Mapped, FieldType, consecutive_keys_optimization>;

    /// Use optimization for low cardinality.
    static const bool low_cardinality_optimization = false;

    /// Shuffle key columns before `insertKeyIntoColumns` call if needed.
    std::optional<Sizes> shuffleKeyColumns(std::vector<IColumn *> &, const Sizes &) { return {}; }

    // Insert the key from the hash table into columns.
    static void insertKeyIntoColumns(const Key & key, std::vector<IColumn *> & key_columns, const Sizes & /*key_sizes*/)
    {
        const auto * key_holder = reinterpret_cast<const char *>(&key);
        auto * column = static_cast<ColumnVectorHelper *>(key_columns[0]);
        column->insertRawData<sizeof(FieldType)>(key_holder);
    }
};



template <typename Method>
void NO_INLINE Aggregator::executeImpl(
    Method & method,
    Arena * aggregates_pool,
    size_t row_begin,
    size_t row_end,
    ColumnRawPtrs & key_columns,
    AggregateFunctionInstruction * aggregate_instructions,
    bool no_more_keys,
    AggregateDataPtr overflow_row) const
{
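    // Taking key64 as the example, the template parameters resolve to:
    //   Method        = AggregationMethodOneNumber<UInt64, AggregatedDataWithUInt64Key>
    //   Method::State = ColumnsHashing::HashMethodOneNumber<
    //                       AggregatedDataWithUInt64Key::value_type,
    //                       AggregatedDataWithUInt64Key::mapped_type,
    //                       UInt64, true>
    // For AggregatedDataWithUInt64Key::value_type see HashTable::value_type;
    // for AggregatedDataWithUInt64Key::mapped_type see HashTable::mapped_type.
    // These are simply the key-value types of the hash table.
    //
    //   method.data        : AggregatedDataWithUInt64Key, which is ultimately a HashTable
    //   method.data.data() : FixedHashTable::data(); used only in the int8/int16 cases,
    //                        HashTable has no such member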

}
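
The dispatch pattern in the listing above (one branch on `result.type` per batch of rows, then a call into a template specialized for the concrete method) can be sketched in a self-contained form. Everything below is a hypothetical stand-in rather than ClickHouse code: `MethodOneNumber`, `Variants`, `executeBatch`, `execute` and `countKey64` are invented names, `std::unordered_map` replaces the real hash tables, and the "aggregate" is just a row count per key.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an aggregation method with one numeric key.
template <typename KeyType>
struct MethodOneNumber
{
    using Key = KeyType;
    std::unordered_map<Key, std::size_t> data; // key -> number of rows seen
};

// Hypothetical stand-in for the variants struct: one pointer per key width.
struct Variants
{
    enum class Type { key8, key64 };
    Type type = Type::key64;
    std::unique_ptr<MethodOneNumber<std::uint8_t>> key8;
    std::unique_ptr<MethodOneNumber<std::uint64_t>> key64;
};

// Typed inner loop: instantiated once per Method, so the per-row code
// contains no branch on the key type.
template <typename Method>
void executeBatch(Method & method, const std::vector<std::uint64_t> & keys,
                  std::size_t row_begin, std::size_t row_end)
{
    for (std::size_t i = row_begin; i < row_end; ++i)
        ++method.data[static_cast<typename Method::Key>(keys[i])];
}

// Untyped entry point: the type check happens once per batch, not once per row.
void execute(Variants & result, const std::vector<std::uint64_t> & keys,
             std::size_t row_begin, std::size_t row_end)
{
    assert(row_begin <= row_end && row_end <= keys.size());
    if (result.type == Variants::Type::key8)
        executeBatch(*result.key8, keys, row_begin, row_end);
    else if (result.type == Variants::Type::key64)
        executeBatch(*result.key64, keys, row_begin, row_end);
}

// Hypothetical convenience wrapper used only to exercise the sketch.
std::size_t countKey64(const std::vector<std::uint64_t> & keys, std::uint64_t key)
{
    Variants v;
    v.type = Variants::Type::key64;
    v.key64 = std::make_unique<MethodOneNumber<std::uint64_t>>();
    execute(v, keys, 0, keys.size());
    return v.key64->data[key];
}
```

The design choice this illustrates is that the cost of choosing a variant is amortized over the whole row range `[row_begin, row_end)`, while the loop body is compiled separately for each key type.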

 

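One of ClickHouse's hash table implementations (Common/HashTable/HashTable.h) is shown below. None of the hash tables in this file resolve collisions with separate chaining (buckets); they use open addressing with linear probing instead. The benefit is that when the NDV (number of distinct values) is small, one level of indirection (from bucket to cell) is avoided.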

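Common strategies for resolving hash collisions:

* Open addressing

* Rehashing

* Separate chaining

* A common overflow area

An Analysis of Common Methods for Resolving Hash Collisions - Tencent Cloud Developer Community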

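    // buf is a buffer of size 2^N that stores Cell structures; each Cell holds the element value.
    // grower is a logical wrapper around buf: it records the size of buf and supports growing it to 2^(N+1).
    //
    // In an initially empty hash table, every element of buf is in the zero state.
    // When an element arrives, its hash value hash_value is computed first, and hash_value modulo 2^N
    // gives the element's slot, place_value.
    // Because hash collisions are possible, this single modulo step may not land on the matching
    // element; in that case the collision step (grower.next) is used to find the next candidate slot.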

    bool ALWAYS_INLINE has(const Key & x) const
    {
        if (Cell::isZero(x, *this))
            return this->hasZero();

        size_t hash_value = hash(x);
        size_t place_value = findCell(x, hash_value, grower.place(hash_value));
        return !buf[place_value].isZero(*this);
    }

    // The collision handling step is very simple; a few common implementations:
    struct HashTableFixedGrower
    {
        size_t next(size_t pos) const        { return pos + 1; }
    };

    class alignas(64) HashTableGrowerWithPrecalculation
    {
        /// The next cell in the collision resolution chain.
        size_t next(size_t pos) const { return (pos + 1) & precalculated_mask; }
    };

    struct HashTableGrower
    {
        /// The next cell in the collision resolution chain.
        size_t next(size_t pos) const        { ++pos; return pos & mask(); }
    };


    // The step that maps a hash value to a slot is just as simple:
    struct HashTableFixedGrower
    {
        size_t place(size_t x) const         { return x; }
    };

    struct HashTableGrower
    {
        /// From the hash value, get the cell number in the hash table.
        size_t place(size_t x) const         { return x & mask(); }
    };

    class alignas(64) HashTableGrowerWithPrecalculation
    {
        /// From the hash value, get the cell number in the hash table.
        size_t place(size_t x) const { return x & precalculated_mask; }
    };


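    // The core lookup function:
    // if the cell at place_value is a zero (empty) cell, or holds a key equal to x, return place_value;
    // if it holds a key different from x, a collision has occurred, so move on to the next candidate position.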
    /// Find a cell with the same key or an empty cell, starting from the specified position and further along the collision resolution chain.
    size_t ALWAYS_INLINE findCell(const Key & x, size_t hash_value, size_t place_value) const
    {
        while (!buf[place_value].isZero(*this) && !buf[place_value].keyEquals(x, hash_value, *this))
        {
            place_value = grower.next(place_value);
            ++collisions;
        }

        return place_value;
    }
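
To see `place`, `next` and `findCell` working together, here is a minimal self-contained sketch of open addressing with linear probing. It is a simplification under stated assumptions, not ClickHouse code: `LinearProbingSet` is a hypothetical name, keys are non-zero 64-bit integers (zero marks an empty cell, whereas the real table handles the zero key separately via `hasZero()`), the multiplicative hash is an arbitrary choice, and the table doubles once it would become more than half full.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified set of non-zero 64-bit keys.
class LinearProbingSet
{
public:
    explicit LinearProbingSet(std::size_t size_degree = 4)
        : buf(std::size_t{1} << size_degree, 0), mask(buf.size() - 1) {}

    /// Returns true if the key was newly inserted.
    bool insert(std::uint64_t key)
    {
        assert(key != 0);
        if ((count + 1) * 2 > buf.size())
            grow(); // keep the table at most half full so probing always terminates
        std::size_t pos = findCell(key);
        if (buf[pos] == key)
            return false;
        buf[pos] = key;
        ++count;
        return true;
    }

    bool has(std::uint64_t key) const
    {
        assert(key != 0);
        return buf[findCell(key)] == key;
    }

    std::size_t size() const { return count; }

private:
    // Arbitrary multiplicative hash, chosen for the sketch only.
    static std::size_t hash(std::uint64_t x)
    {
        return static_cast<std::size_t>((x * 0x9E3779B97F4A7C15ULL) >> 32);
    }

    // The table size is a power of two, so "modulo 2^N" is a bit mask.
    std::size_t place(std::size_t hash_value) const { return hash_value & mask; }
    std::size_t next(std::size_t pos) const { return (pos + 1) & mask; }

    // Same loop shape as findCell above: stop at an empty cell or a matching key.
    std::size_t findCell(std::uint64_t key) const
    {
        std::size_t pos = place(hash(key));
        while (buf[pos] != 0 && buf[pos] != key)
            pos = next(pos);
        return pos;
    }

    // Double the buffer and reinsert every key at its new position.
    void grow()
    {
        std::vector<std::uint64_t> old(buf.size() * 2, 0);
        old.swap(buf);
        mask = buf.size() - 1;
        for (std::uint64_t key : old)
            if (key != 0)
                buf[findCell(key)] = key;
    }

    std::vector<std::uint64_t> buf;
    std::size_t mask;
    std::size_t count = 0;
};
```

Because every cell lives directly in one contiguous array, a lookup that hits on the first probe touches a single cache line, which is the "one less level of indirection" that the text attributes to open addressing.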

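ClickHouse also has a bucket-based implementation, in TwoLevelHashTable.h. Note that its buckets are independent sub-tables selected by bits of the hash value, rather than per-slot collision chains.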
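
A minimal sketch of a two-level layout follows. It rests on assumptions that should be checked against TwoLevelHashTable.h: that there are 256 buckets and that the bucket index is taken from the top 8 bits of a 32-bit hash value. `TwoLevelSet`, `bucketOf` and `mergeBucket` are hypothetical names, the hash function is an arbitrary choice, and `std::unordered_set` stands in for the per-bucket hash table.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Hypothetical sketch: a set split into 256 independent sub-tables.
class TwoLevelSet
{
public:
    static constexpr std::size_t bits_for_bucket = 8;
    static constexpr std::size_t num_buckets = std::size_t{1} << bits_for_bucket;

    // Assumption: the bucket index is the top 8 bits of a 32-bit hash.
    static std::size_t bucketOf(std::uint32_t hash_value)
    {
        return hash_value >> (32 - bits_for_bucket);
    }

    // Arbitrary multiplicative hash, chosen for the sketch only.
    static std::uint32_t hash(std::uint64_t key)
    {
        return static_cast<std::uint32_t>((key * 0x9E3779B97F4A7C15ULL) >> 32);
    }

    bool insert(std::uint64_t key)
    {
        return buckets[bucketOf(hash(key))].insert(key).second;
    }

    bool has(std::uint64_t key) const
    {
        return buckets[bucketOf(hash(key))].count(key) != 0;
    }

    std::size_t size() const
    {
        std::size_t total = 0;
        for (const auto & bucket : buckets)
            total += bucket.size();
        return total;
    }

    // A key always maps to the same bucket index in every TwoLevelSet, so two
    // sets can be merged bucket by bucket, and different buckets could be
    // handed to different threads.
    void mergeBucket(const TwoLevelSet & other, std::size_t bucket)
    {
        assert(bucket < num_buckets);
        buckets[bucket].insert(other.buckets[bucket].begin(), other.buckets[bucket].end());
    }

private:
    std::array<std::unordered_set<std::uint64_t>, num_buckets> buckets;
};
```

The usual motivation for this layout is merging partial aggregation results: since buckets never share keys, each bucket can be merged independently of the others.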
