Block: the basic unit in which ClickHouse reads and writes data. Each Block instance contains not only the data itself, but also the metadata of every column.
Chunk: the unit that stores the actual data; the data area of a Block points to an instance of this type.
Row: one record, i.e. the values at the same index across the columns; a Chunk can be thought of as being made up of multiple Rows.
Column: one column of data, holding block-size rows of that column.
A Block object can simply be thought of as a table: every column has the same length, and every row has the same number of columns:
| Block/Chunk | Column1 | Column2 | … | ColumnM |
|---|---|---|---|---|
| Row1 | value | value | … | value |
| Row2 | value | value | … | value |
| … | … | … | … | … |
| RowN | value | value | … | value |
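To make the layout concrete, here is a minimal, self-contained sketch with toy types (ToyColumn/ToyBlock are stand-ins, not the real ClickHouse Column/Block/Chunk classes): data lives column by column, and a "row" is only a logical view across the columns.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Toy stand-ins for ClickHouse's Column / Block: the real classes live under
// src/Columns and src/Core, but the idea is the same: data is stored per column,
// and every column of a block has the same number of rows.
struct ToyColumn
{
    std::string name;             // column meta (the real Block also keeps the type)
    std::vector<int64_t> values;  // one value per row, stored contiguously
};

struct ToyBlock
{
    std::vector<ToyColumn> columns;   // M columns, each holding N rows

    size_t rows() const { return columns.empty() ? 0 : columns.front().values.size(); }

    // A "row" is only a logical view: the i-th element of every column.
    std::vector<int64_t> row(size_t i) const
    {
        std::vector<int64_t> r;
        for (const auto & col : columns)
            r.push_back(col.values[i]);
        return r;
    }
};

int main()
{
    ToyBlock block{{{"A", {1, 2, 3}}, {"B", {10, 20, 30}}, {"C", {7, 7, 7}}}};
    std::cout << "rows = " << block.rows() << ", columns = " << block.columns.size() << "\n";
    std::cout << "row 1: A=" << block.row(1)[0] << " B=" << block.row(1)[1] << " C=" << block.row(1)[2] << "\n";
}
```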
Take the following query as the running example; its aggregation goes through the stages listed below:

select uniq(B), uniq(A), uniq(C) from test_tbl
InputStream: reads data from the table and returns a series of Block objects so that they can be processed in memory.
Insert: on every column of every Block, the aggregate function is invoked and the raw values are added into an intermediate data structure, called a State in ClickHouse.
Merge: on every column of every Block, all the States are merged.
Serialize: serializes a State object.
Deserialize: deserializes a State object.
Final: produces the final result.
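The following is a conceptual, self-contained sketch of these stages. It is loosely modeled on the add()/merge()/serialize()/deserialize()/insertResultInto() methods of IAggregateFunction, but uses a toy exact-distinct state instead of ClickHouse's real uniq state:

```cpp
#include <cstdint>
#include <iostream>
#include <set>
#include <sstream>

// Toy exact-uniq state illustrating the Insert / Merge / Serialize / Deserialize /
// Final stages; ClickHouse's real uniq state is a probabilistic structure, not a set.
struct ToyUniqState
{
    std::set<int64_t> values;

    void add(int64_t v) { values.insert(v); }                        // Insert
    void merge(const ToyUniqState & rhs)                             // Merge
    {
        values.insert(rhs.values.begin(), rhs.values.end());
    }
    void serialize(std::ostream & out) const                         // Serialize
    {
        out << values.size();
        for (int64_t v : values) out << ' ' << v;
    }
    void deserialize(std::istream & in)                              // Deserialize
    {
        size_t n; in >> n;
        for (size_t i = 0; i < n; ++i) { int64_t v; in >> v; values.insert(v); }
    }
    uint64_t finalResult() const { return values.size(); }           // Final
};

int main()
{
    // Two "streams", each producing a partial state from its own blocks.
    ToyUniqState state_a, state_b;
    for (int64_t v : {1, 2, 2, 3}) state_a.add(v);
    for (int64_t v : {3, 4, 4, 5}) state_b.add(v);

    // Partial states can travel between nodes in serialized form.
    std::stringstream buf;
    state_b.serialize(buf);
    ToyUniqState state_b_copy;
    state_b_copy.deserialize(buf);

    state_a.merge(state_b_copy);
    std::cout << "uniq = " << state_a.finalResult() << "\n";          // prints 5
}
```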
Parallel insert: the number of read streams is controlled by max_streams, which defaults to max_threads and can additionally be adjusted with the max_streams_to_max_threads_ratio setting.

InputStreamA -> BlockA -> Insert -> Serialize -> StateA
InputStreamA -> BlockA2 -> Insert -> Serialize -> StateA2
InputStreamB -> BlockB -> Insert -> Serialize -> StateB
InputStreamB -> BlockB2 -> Insert -> Serialize -> StateB2

Parallel merge: degree of parallelism = min(max_threads, number of blocks that can be merged in parallel)

(StateA -> BlockA, StateB -> BlockB) -> StateC
(StateA2 -> BlockA2, StateB2 -> BlockB2) -> StateC2

Computing the final result: single-threaded

(StateC, StateC2) -> insertResultInto -> Final
Assuming test_tbl is a local table of the MergeTree family, an aggregation query first reads data from the part files on disk. The code snippets below show the call path of the read side of a SELECT statement, in order from first to last:
/// A Storage that allows reading from a single MergeTree data part.
/// During SQL interpretation, once execution reaches the FetchColumns stage,
/// InterpreterSelectQuery::executeFetchColumns(...) is called, which triggers the read() of this Storage instance.
class StorageFromMergeTreeDataPart final : public ext::shared_ptr_helper<StorageFromMergeTreeDataPart>, public IStorage
{
friend struct ext::shared_ptr_helper<StorageFromMergeTreeDataPart>;
public:
String getName() const override { return "FromMergeTreeDataPart"; }
Pipe read(
const Names & column_names,
const StorageMetadataPtr & metadata_snapshot,
SelectQueryInfo & query_info,
const Context & context,
QueryProcessingStage::Enum /*processed_stage*/,
size_t max_block_size,
unsigned num_streams) override
{
// Build a new QueryPlan here; doing so triggers loading the data from the files into memory,
// and then a Pipe is returned; a Pipe is a subset of the operations of the whole pipeline.
QueryPlan query_plan = std::move(*MergeTreeDataSelectExecutor(part->storage)
.readFromParts({part}, column_names, metadata_snapshot, query_info, context, max_block_size, num_streams));
return query_plan.convertToPipe();
}
};
Next, a MergeTreeDataSelectExecutor is created, the executor that reads a MergeTree table's data files; the read of the part files is ultimately completed through this method:
QueryPlanPtr MergeTreeDataSelectExecutor::readFromParts(
MergeTreeData::DataPartsVector parts,
const Names & column_names_to_return,
const StorageMetadataPtr & metadata_snapshot,
const SelectQueryInfo & query_info,
const Context & context,
const UInt64 max_block_size,
const unsigned num_streams,
const PartitionIdToMaxBlock * max_block_numbers_to_read) const
{}
Internally, MergeTreeDataSelectExecutor::readFromParts(...) calls MergeTreeIndexReader::read() to deserialize the data that has just been read.
/// This method only reads the data already buffered in the MergeTreeReaderStream, according to the configured granularity, and deserializes it.
/// "Just been read" refers to the fact that instantiating a MergeTreeIndexReader also constructs a MergeTreeReaderStream,
/// and initializing the MergeTreeReaderStream triggers the read of the file.
MergeTreeIndexGranulePtr MergeTreeIndexReader::read()
{
auto granule = index->createIndexGranule();
granule->deserializeBinary(*stream.data_buffer);
return granule;
}
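The pattern here is worth spelling out: the index creates an empty granule, and the granule deserializes itself from the stream's buffer. A toy, self-contained imitation of that pattern (ToyGranule/ToyIndex are illustrations, not the real MergeTree index classes):

```cpp
#include <istream>
#include <iostream>
#include <memory>
#include <sstream>

// Toy imitation of the create-then-deserialize pattern in MergeTreeIndexReader::read():
// the index knows how to create an empty granule, and the granule knows how to fill
// itself from an input stream positioned at the right place.
struct ToyGranule
{
    int64_t min = 0, max = 0;                        // e.g. a minmax-style index granule
    void deserializeBinary(std::istream & in) { in >> min >> max; }
};

struct ToyIndex
{
    std::unique_ptr<ToyGranule> createIndexGranule() const { return std::make_unique<ToyGranule>(); }
};

std::unique_ptr<ToyGranule> readGranule(const ToyIndex & index, std::istream & data_buffer)
{
    auto granule = index.createIndexGranule();
    granule->deserializeBinary(data_buffer);
    return granule;
}

int main()
{
    std::stringstream file("10 42");                 // pretend this is the index file content
    ToyIndex index;
    auto g = readGranule(index, file);
    std::cout << g->min << " .. " << g->max << "\n"; // 10 .. 42
}
```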
Creating an instance of the MergeTreeReaderStream class triggers the read of the on-disk file; once the instance has finished initializing, the data of the current part file has been loaded into memory.
/// Class for reading a single column (or index).
/// When an instance of this class is created, one column of data is read from the file into memory through a CompressedReadBufferFromFile object.
class MergeTreeReaderStream
{
public:
MergeTreeReaderStream(
DiskPtr disk_,
const String & path_prefix_,
const String & data_file_extension_,
size_t marks_count_,
const MarkRanges & all_mark_ranges,
const MergeTreeReaderSettings & settings_,
MarkCache * mark_cache,
UncompressedCache * uncompressed_cache,
size_t file_size,
const MergeTreeIndexGranularityInfo * index_granularity_info_,
const ReadBufferFromFileBase::ProfileCallback & profile_callback,
clockid_t clock_type);
void seekToMark(size_t index);
void seekToStart();
ReadBuffer * data_buffer;
private:
DiskPtr disk;
std::string path_prefix;
std::string data_file_extension;
size_t marks_count;
MarkCache * mark_cache;
bool save_marks_in_cache;
const MergeTreeIndexGranularityInfo * index_granularity_info;
std::unique_ptr<CachedCompressedReadBuffer> cached_buffer;
std::unique_ptr<CompressedReadBufferFromFile> non_cached_buffer;
MergeTreeMarksLoader marks_loader;
};
/// Unlike CompressedReadBuffer, it can do seek.
class CompressedReadBufferFromFile : public CompressedReadBufferBase, public BufferWithOwnMemory<ReadBuffer>
{
private:
/** At any time, one of two things is true:
* a) size_compressed = 0
* b)
* - `working_buffer` contains the entire block.
* - `file_in` points to the end of this block.
* - `size_compressed` contains the compressed size of this block.
*/
std::unique_ptr<ReadBufferFromFileBase> p_file_in;
ReadBufferFromFileBase & file_in;
size_t size_compressed = 0;
bool nextImpl() override;
public:
CompressedReadBufferFromFile(std::unique_ptr<ReadBufferFromFileBase> buf, bool allow_different_codecs_ = false);
CompressedReadBufferFromFile(
const std::string & path, size_t estimated_size, size_t aio_threshold, size_t mmap_threshold,
size_t buf_size = DBMS_DEFAULT_BUFFER_SIZE, bool allow_different_codecs_ = false);
void seek(size_t offset_in_compressed_file, size_t offset_in_decompressed_block);
size_t readBig(char * to, size_t n) override;
void setProfileCallback(const ReadBufferFromFileBase::ProfileCallback & profile_callback_, clockid_t clock_type_ = CLOCK_MONOTONIC_COARSE)
{
file_in.setProfileCallback(profile_callback_, clock_type_);
}
};
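Based only on the interface shown above, reading one compressed column file might look roughly like the sketch below. The path, offsets, and buffer size are made up for illustration; in MergeTreeReaderStream the path comes from path_prefix + data_file_extension and the offsets come from the marks loaded by MergeTreeMarksLoader:

```cpp
#include <Compression/CompressedReadBufferFromFile.h>
#include <iostream>
#include <vector>

using namespace DB;

int main()
{
    // Hypothetical path, purely for illustration.
    const std::string path = "/var/lib/clickhouse/data/default/test_tbl/all_1_1_0/A.bin";

    // Matches the constructor shown above; 0 means "no estimate" / thresholds disabled here.
    CompressedReadBufferFromFile in(path, /*estimated_size*/ 0, /*aio_threshold*/ 0, /*mmap_threshold*/ 0);

    // Position the buffer on a mark: first the compressed block that contains it,
    // then the offset inside the decompressed block.
    in.seek(/*offset_in_compressed_file*/ 0, /*offset_in_decompressed_block*/ 0);

    // Read a chunk of the decompressed column data.
    std::vector<char> buf(8192);
    size_t n = in.readBig(buf.data(), buf.size());
    std::cout << "read " << n << " decompressed bytes\n";
}
```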
The generate() method implemented in MergeTreeSequentialSource.h then calls the readRows() method implemented in MergeTreeReaderCompact.h to complete the read of data from file into memory, finally returning a Chunk object.
// MergeTreeSequentialSource is an implementation of IProcessor: the very first operator on the physical pipeline.
// When a query reads from local files, this operator, i.e. its generate() method, always executes first to produce the data.
Chunk MergeTreeSequentialSource::generate()
try
{
const auto & header = getPort().getHeader();
if (!isCancelled() && current_row < data_part->rows_count)
{
// Determine how many rows to read (the rows covered by the current mark)
size_t rows_to_read = data_part->index_granularity.getMarkRows(current_mark);
bool continue_reading = (current_mark != 0);
// If a Compact part is being read, a MergeTreeReaderCompact instance serves the read; `sample` describes which (and how many) columns to read
const auto & sample = reader->getColumns();
// Create an array of columns to hold each column's data; each column is stored contiguously
Columns columns(sample.size());
size_t rows_read = reader->readRows(current_mark, continue_reading, rows_to_read, columns);
// If any rows were read
if (rows_read)
{
current_row += rows_read;
current_mark += (rows_to_read == rows_read);
bool should_evaluate_missing_defaults = false;
reader->fillMissingColumns(columns, should_evaluate_missing_defaults, rows_read);
if (should_evaluate_missing_defaults)
{
reader->evaluateMissingDefaults({}, columns);
}
reader->performRequiredConversions(columns);
/// Reorder columns and fill result block.
size_t num_columns = sample.size();
Columns res_columns;
res_columns.reserve(num_columns);
// Filter out columns that are not needed, based on the column names requested in the SQL statement
auto it = sample.begin();
for (size_t i = 0; i < num_columns; ++i)
{
if (header.has(it->name))
res_columns.emplace_back(std::move(columns[i]));
++it;
}
// Create a Chunk instance holding the N rows and M columns that were read, and return it to the caller
return Chunk(std::move(res_columns), rows_read);
}
}
else
{
finish();
}
return {};
}
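generate() is not called directly by user code; the pipeline executor keeps pulling from the source until it returns an empty Chunk. The toy sketch below mirrors that loop with simplified stand-in types (the real types are Chunk and MergeTreeSequentialSource, and the real driving logic lives in the pipeline executor):

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Toy stand-ins for DB::Chunk and a MergeTree source.
struct ToyChunk
{
    std::vector<std::vector<int64_t>> columns;   // M columns of num_rows values each
    size_t num_rows = 0;
    explicit operator bool() const { return num_rows != 0; }   // empty chunk == finished
};

struct ToySource
{
    size_t current_row = 0, total_rows = 10, granule = 4;

    // Mirrors the shape of MergeTreeSequentialSource::generate(): at most one granule
    // worth of rows per call, and an empty chunk once the part is exhausted.
    ToyChunk generate()
    {
        if (current_row >= total_rows)
            return {};
        size_t rows = std::min(granule, total_rows - current_row);
        ToyChunk chunk;
        chunk.columns = {std::vector<int64_t>(rows, 1)};        // a single dummy column
        chunk.num_rows = rows;
        current_row += rows;
        return chunk;
    }
};

int main()
{
    ToySource source;
    size_t total = 0;
    // Roughly what the executor does: keep pulling until the source is drained.
    while (ToyChunk chunk = source.generate())
        total += chunk.num_rows;
    std::cout << "pulled " << total << " rows\n";               // 10
}
```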
For a query that defines aggregations, here the three aggregates uniq(A), uniq(B) and uniq(C), ClickHouse creates an operator called AggregatingTransform when building the physical plan tree. Its constructors are declared as follows:
class AggregatingTransform : public IProcessor
{
public:
AggregatingTransform(Block header, AggregatingTransformParamsPtr params_);
/// For Parallel aggregating.
AggregatingTransform(Block header, AggregatingTransformParamsPtr params_,
ManyAggregatedDataPtr many_data, size_t current_variant,
size_t max_threads, size_t temporary_data_merge_threads);
~AggregatingTransform() override;
};
As you can see, this operator takes the header of the data to aggregate, all the aggregate operations in params_, the maximum aggregation parallelism max_threads, and so on. Here params_ is a pointer to an AggregatingTransformParams struct, and AggregatingTransformParams holds the key object of the whole aggregation, the Aggregator. The struct is defined as follows:
struct AggregatingTransformParams
{
Aggregator::Params params;
// All the details of aggregation are defined in this class, so the later Insert and Merge steps are described starting from it
Aggregator aggregator;
bool final;
AggregatingTransformParams(const Aggregator::Params & params_, bool final_)
: params(params_), aggregator(params), final(final_) {}
Block getHeader() const { return aggregator.getHeader(final); }
Block getCustomHeader(bool final_) const { return aggregator.getHeader(final_); }
};
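As a hedged sketch of how these pieces are wired together: the helper below is hypothetical, and it assumes an Aggregator::Params named agg_params (with the grouping keys and the uniq(...) aggregate descriptions) has already been built elsewhere; only the constructors shown above are used.

```cpp
#include <Processors/Transforms/AggregatingTransform.h>

using namespace DB;

// Hypothetical helper: `header` is the structure of the incoming Blocks, `agg_params`
// an already-built Aggregator::Params; building Aggregator::Params is out of scope here.
ProcessorPtr makeAggregatingOperator(const Block & header, const Aggregator::Params & agg_params)
{
    auto transform_params = std::make_shared<AggregatingTransformParams>(agg_params, /*final_=*/false);
    // Single-threaded variant, matching the first constructor above.
    return std::make_shared<AggregatingTransform>(header, transform_params);
}
```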
The Aggregator constructor:
Aggregator::Aggregator(const Params & params_)
: params(params_),
isCancelled([]() { return false; })
{
/// Use query-level memory tracker
if (auto * memory_tracker_child = CurrentThread::getMemoryTracker())
if (auto * memory_tracker = memory_tracker_child->getParent())
memory_usage_before_aggregation = memory_tracker->get();
/// aggregate_functions: an array holding pointers to all the aggregate functions
aggregate_functions.resize(params.aggregates_size);
for (size_t i = 0; i < params.aggregates_size; ++i)
aggregate_functions[i] = params.aggregates[i].function.get();
/// Initialize sizes of aggregation states and its offsets.
/// Each aggregate function has a corresponding aggregate state instance. For more efficient storage and lookup,
/// these state instances are laid out one after another in an Arena memory pool, which can be thought of as one contiguous byte array.
offsets_of_aggregate_states.resize(params.aggregates_size);
total_size_of_aggregate_states = 0;
all_aggregates_has_trivial_destructor = true;
// aggregate_states will be aligned as below:
// |<-- state_1 -->|<-- pad_1 -->|<-- state_2 -->|<-- pad_2 -->| .....
//
// pad_N will be used to match alignment requirement for each next state.
// The address of state_1 is aligned based on maximum alignment requirements in states
for (size_t i = 0; i < params.aggregates_size; ++i)
{
offsets_of_aggregate_states[i] = total_size_of_aggregate_states;
// Accumulate the sizes of all aggregate state objects (only the size of their POD members)
total_size_of_aggregate_states += params.aggregates[i].function->sizeOfData();
// aggregate states are aligned based on maximum requirement
// Track the largest alignment requirement
align_aggregate_states = std::max(align_aggregate_states, params.aggregates[i].function->alignOfData());
// If not the last aggregate_state, we need pad it so that next aggregate_state will be aligned.
if (i + 1 < params.aggregates_size)
{
size_t alignment_of_next_state = params.aggregates[i + 1].function->alignOfData();
if ((alignment_of_next_state & (alignment_of_next_state - 1)) != 0)
throw Exception("Logical error: alignOfData is not 2^N", ErrorCodes::LOGICAL_ERROR);
/// Extend total_size to next alignment requirement
/// Add padding by rounding up 'total_size_of_aggregate_states' to be a multiplier of alignment_of_next_state.
total_size_of_aggregate_states = (total_size_of_aggregate_states + alignment_of_next_state - 1) / alignment_of_next_state * alignment_of_next_state;
}
if (!params.aggregates[i].function->hasTrivialDestructor())
all_aggregates_has_trivial_destructor = false;
}
method_chosen = chooseAggregationMethod();
HashMethodContext::Settings cache_settings;
cache_settings.max_threads = params.max_threads;
/// Choose an appropriate aggregation method based on the number and types of the grouping keys,
/// e.g. two-level (two-phase) aggregation or single-level (one-phase) aggregation
aggregation_state_cache = AggregatedDataVariants::createCache(method_chosen, cache_settings);
}
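The padding logic above is the usual "round up to the next multiple of the alignment" trick. The standalone snippet below replays the same offset computation with made-up state sizes and alignments, so the resulting layout is easy to see:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Reproduces the offset computation from the Aggregator constructor with made-up
// layouts: state_1 is 12 bytes aligned to 4, state_2 is 16 bytes aligned to 8,
// state_3 is 1 byte aligned to 1.
int main()
{
    struct StateLayout { size_t size, align; };
    std::vector<StateLayout> states = {{12, 4}, {16, 8}, {1, 1}};

    std::vector<size_t> offsets;
    size_t total = 0;
    for (size_t i = 0; i < states.size(); ++i)
    {
        offsets.push_back(total);
        total += states[i].size;
        if (i + 1 < states.size())
        {
            // Round `total` up to a multiple of the next state's alignment.
            size_t a = states[i + 1].align;
            total = (total + a - 1) / a * a;
        }
    }

    for (size_t i = 0; i < offsets.size(); ++i)
        std::cout << "state_" << i + 1 << " at offset " << offsets[i] << "\n";
    std::cout << "total size " << total << "\n";   // offsets: 0, 16, 32; total: 33
}
```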
After data has been read from the source files, Aggregator::execute(...) is called to aggregate every column of every source Block; within any single stream this process runs on a single thread.
void Aggregator::execute(const BlockInputStreamPtr & stream, AggregatedDataVariants & result)
{
if (isCancelled())
return;
ColumnRawPtrs key_columns(params.keys_size);
AggregateColumns aggregate_columns(params.aggregates_size);
/** Used if there is a limit on the maximum number of rows in the aggregation,
* and if group_by_overflow_mode == ANY.
* In this case, new keys are not added to the set, but aggregation is performed only by
* keys that have already managed to get into the set.
*/
bool no_more_keys = false;
LOG_TRACE(log, "Aggregating");
Stopwatch watch;
size_t src_rows = 0;
size_t src_bytes = 0;
/// Read all the data
while (Block block = stream->read())
{
if (isCancelled())
return;
src_rows += block.rows();
src_bytes += block.bytes();
// Iterate over all rows of the Block and invoke each column's aggregate function, inserting the values into the state objects
if (!executeOnBlock(block, result, key_columns, aggregate_columns, no_more_keys))
break;
}
/// If there was no data, and we aggregate without keys, and we must return single row with the result of empty aggregation.
/// To do this, we pass a block with zero rows to aggregate.
if (result.empty() && params.keys_size == 0 && !params.empty_result_for_aggregation_by_empty_set)
executeOnBlock(stream->getHeader(), result, key_columns, aggregate_columns, no_more_keys);
double elapsed_seconds = watch.elapsedSeconds();
size_t rows = result.sizeWithoutOverflowRow();
LOG_TRACE(log, "Aggregated. {} to {} rows (from {}) in {} sec. ({} rows/sec., {}/sec.)",
src_rows, rows, ReadableSize(src_bytes),
elapsed_seconds, src_rows / elapsed_seconds,
ReadableSize(src_bytes / elapsed_seconds));
}
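A hedged sketch of driving this Inserting phase, assuming the aggregator and the input stream have already been set up by the surrounding pipeline (runInsertingPhase is a hypothetical helper; only the signature shown above is used):

```cpp
#include <Interpreters/Aggregator.h>

using namespace DB;

// Hypothetical helper: `stream` is the stream of source Blocks read from the part files.
void runInsertingPhase(Aggregator & aggregator, const BlockInputStreamPtr & stream)
{
    AggregatedDataVariants result;        // holds the per-key aggregate states
    aggregator.execute(stream, result);   // consumes every Block, filling `result`
}
```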
The aggregation performed by Aggregator::execute(...) is called Inserting: every Block read from every stream produces a Block of intermediate states. To obtain the final result, all intermediate-state Blocks from all streams still have to be aggregated on some node, ultimately producing a single Block. This step is called Partially Merging, and its entry point is Aggregator::mergeStream(...), shown below:
void Aggregator::mergeStream(const BlockInputStreamPtr & stream, AggregatedDataVariants & result, size_t max_threads)
{
if (isCancelled())
return;
/** If the remote servers used a two-level aggregation method,
* then blocks will contain information about the number of the bucket.
* Then the calculations can be parallelized by buckets.
* We decompose the blocks to the bucket numbers indicated in them.
*/
BucketToBlocks bucket_to_blocks;
/// Read all the data.
LOG_TRACE(log, "Reading blocks of partially aggregated data.");
size_t total_input_rows = 0;
size_t total_input_blocks = 0;
while (Block block = stream->read())
{
if (isCancelled())
return;
total_input_rows += block.rows();
++total_input_blocks;
bucket_to_blocks[block.info.bucket_num].emplace_back(std::move(block));
}
LOG_TRACE(log, "Read {} blocks of partially aggregated data, total {} rows.", total_input_blocks, total_input_rows);
/// Merge all intermediate states produced by this stream; this can run in parallel, because the operations on each column are independent
mergeBlocks(bucket_to_blocks, result, max_threads);
}
/// bucket_to_blocks is a map. If the remote end used two-level aggregation, some blocks can be processed in parallel,
/// and we parallelize the work according to the configured max_threads; blocks that cannot be processed in parallel
/// are stored under key -1 and can only be processed on a single thread.
void Aggregator::mergeBlocks(BucketToBlocks bucket_to_blocks, AggregatedDataVariants & result, size_t max_threads)
{
if (bucket_to_blocks.empty())
return;
UInt64 total_input_rows = 0;
for (auto & bucket : bucket_to_blocks)
for (auto & block : bucket.second)
total_input_rows += block.rows();
/** `minus one` means the absence of information about the bucket
* - in the case of single-level aggregation, as well as for blocks with "overflowing" values.
* If there is at least one block with a bucket number greater or equal than zero, then there was a two-level aggregation.
*/
auto max_bucket = bucket_to_blocks.rbegin()->first;
bool has_two_level = max_bucket >= 0;
if (has_two_level)
{
#define M(NAME) \
if (method_chosen == AggregatedDataVariants::Type::NAME) \
method_chosen = AggregatedDataVariants::Type::NAME ## _two_level;
APPLY_FOR_VARIANTS_CONVERTIBLE_TO_TWO_LEVEL(M)
#undef M
}
if (isCancelled())
return;
/// result will destroy the states of aggregate functions in the destructor
result.aggregator = this;
result.init(method_chosen);
result.keys_size = params.keys_size;
result.key_sizes = key_sizes;
bool has_blocks_with_unknown_bucket = bucket_to_blocks.count(-1);
/// First, parallel the merge for the individual buckets. Then we continue merge the data not allocated to the buckets.
if (has_two_level)
{
/** In this case, no_more_keys is not supported due to the fact that
* from different threads it is difficult to update the general state for "other" keys (overflows).
* That is, the keys in the end can be significantly larger than max_rows_to_group_by.
*/
LOG_TRACE(log, "Merging partially aggregated two-level data.");
auto merge_bucket = [&bucket_to_blocks, &result, this](Int32 bucket, Arena * aggregates_pool, ThreadGroupStatusPtr thread_group)
{
if (thread_group)
CurrentThread::attachToIfDetached(thread_group);
for (Block & block : bucket_to_blocks[bucket])
{
if (isCancelled())
return;
#define M(NAME) \
else if (result.type == AggregatedDataVariants::Type::NAME) \
mergeStreamsImpl(block, aggregates_pool, *result.NAME, result.NAME->data.impls[bucket], nullptr, false);
if (false) {} // NOLINT
APPLY_FOR_VARIANTS_TWO_LEVEL(M)
#undef M
else
throw Exception("Unknown aggregated data variant.", ErrorCodes::UNKNOWN_AGGREGATED_DATA_VARIANT);
}
};
std::unique_ptr<ThreadPool> thread_pool;
if (max_threads > 1 && total_input_rows > 100000) /// TODO Make a custom threshold.
thread_pool = std::make_unique<ThreadPool>(max_threads);
for (const auto & bucket_blocks : bucket_to_blocks)
{
const auto bucket = bucket_blocks.first;
if (bucket == -1)
continue;
// Try to process the blocks of this bucket in parallel; each bucket gets its own Arena memory pool to hold the merged results
result.aggregates_pools.push_back(std::make_shared<Arena>());
Arena * aggregates_pool = result.aggregates_pools.back().get();
auto task = [group = CurrentThread::getGroup(), bucket, &merge_bucket, aggregates_pool]{ return merge_bucket(bucket, aggregates_pool, group); };
if (thread_pool)
thread_pool->scheduleOrThrowOnError(task);
else
task();
}
if (thread_pool)
thread_pool->wait();
LOG_TRACE(log, "Merged partially aggregated two-level data.");
}
if (isCancelled())
{
result.invalidate();
return;
}
// Blocks that cannot be processed in parallel are merged the single-level way, on a single thread.
if (has_blocks_with_unknown_bucket)
{
LOG_TRACE(log, "Merging partially aggregated single-level data.");
bool no_more_keys = false;
BlocksList & blocks = bucket_to_blocks[-1];
for (Block & block : blocks)
{
if (isCancelled())
{
result.invalidate();
return;
}
if (!checkLimits(result.sizeWithoutOverflowRow(), no_more_keys))
break;
if (result.type == AggregatedDataVariants::Type::without_key || block.info.is_overflows)
mergeWithoutKeyStreamsImpl(block, result);
#define M(NAME, IS_TWO_LEVEL) \
else if (result.type == AggregatedDataVariants::Type::NAME) \
mergeStreamsImpl(block, result.aggregates_pool, *result.NAME, result.NAME->data, result.without_key, no_more_keys);
APPLY_FOR_AGGREGATED_VARIANTS(M)
#undef M
else if (result.type != AggregatedDataVariants::Type::without_key)
throw Exception("Unknown aggregated data variant.", ErrorCodes::UNKNOWN_AGGREGATED_DATA_VARIANT);
}
LOG_TRACE(log, "Merged partially aggregated single-level data.");
}
}
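The scheduling idea in mergeBlocks() can be shown with a small self-contained example: group partially aggregated blocks by bucket number, merge buckets >= 0 in parallel, and merge bucket -1 serially. The sketch below uses toy states and std::thread instead of ClickHouse's ThreadPool:

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <set>
#include <thread>
#include <vector>

// Toy version of the scheduling in mergeBlocks(): partially aggregated "blocks"
// (here just sets of keys) are grouped by bucket number; buckets >= 0 are merged
// in parallel, while bucket -1 (single-level / overflow data) is merged serially.
using ToyState = std::set<int64_t>;
using BucketToBlocks = std::map<int32_t, std::vector<ToyState>>;

int main()
{
    BucketToBlocks bucket_to_blocks = {
        {-1, {{100}, {101}}},        // blocks without bucket info: serial merge only
        {0, {{1, 2}, {2, 3}}},
        {1, {{10}, {11, 12}}},
    };

    // Pre-create one result slot per bucket (roughly "one arena per bucket").
    std::map<int32_t, ToyState> merged;
    for (const auto & entry : bucket_to_blocks)
        merged[entry.first] = ToyState{};

    // Merge every bucket >= 0 on its own thread; each thread touches only its own slot.
    std::vector<std::thread> workers;
    for (auto & entry : bucket_to_blocks)
    {
        if (entry.first == -1)
            continue;
        const std::vector<ToyState> & blocks = entry.second;
        ToyState & dst = merged[entry.first];
        workers.emplace_back([&blocks, &dst]
        {
            for (const auto & block : blocks)
                dst.insert(block.begin(), block.end());
        });
    }
    for (auto & w : workers)
        w.join();

    // Finally merge the blocks with an unknown bucket on the current thread.
    for (const auto & block : bucket_to_blocks[-1])
        merged[-1].insert(block.begin(), block.end());

    size_t total_keys = 0;
    for (const auto & entry : merged)
        total_keys += entry.second.size();
    std::cout << "merged keys across buckets: " << total_keys << "\n";   // 2 + 3 + 3 = 8
}
```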
Finally, once Partially Merging is complete, the method below is called to finish aggregating all the intermediate data, ultimately returning Block(s) that contain only the final result.
/** Convert the aggregation data structure into a block.
* If overflow_row = true, then aggregates for rows that are not included in max_rows_to_group_by are put in the first block.
*
* If final = false, then ColumnAggregateFunction is created as the aggregation columns with the state of the calculations,
* which can then be combined with other states (for distributed query processing).
* If final = true, then columns with ready values are created as aggregate columns.
*/
BlocksList Aggregator::convertToBlocks(AggregatedDataVariants & data_variants, bool final, size_t max_threads) const
{
if (isCancelled())
return BlocksList();
LOG_TRACE(log, "Converting aggregated data to blocks");
Stopwatch watch;
BlocksList blocks;
/// In what data structure is the data aggregated?
if (data_variants.empty())
return blocks;
std::unique_ptr<ThreadPool> thread_pool;
if (max_threads > 1 && data_variants.sizeWithoutOverflowRow() > 100000 /// TODO Make a custom threshold.
&& data_variants.isTwoLevel()) /// TODO Use the shared thread pool with the `merge` function.
thread_pool = std::make_unique<ThreadPool>(max_threads);
if (isCancelled())
return BlocksList();
if (data_variants.without_key)
blocks.emplace_back(prepareBlockAndFillWithoutKey(
data_variants, final, data_variants.type != AggregatedDataVariants::Type::without_key));
if (isCancelled())
return BlocksList();
if (data_variants.type != AggregatedDataVariants::Type::without_key)
{
if (!data_variants.isTwoLevel())
blocks.emplace_back(prepareBlockAndFillSingleLevel(data_variants, final));
else
blocks.splice(blocks.end(), prepareBlocksAndFillTwoLevel(data_variants, final, thread_pool.get()));
}
if (!final)
{
/// data_variants will not destroy the states of aggregate functions in the destructor.
/// Now ColumnAggregateFunction owns the states.
data_variants.aggregator = nullptr;
}
if (isCancelled())
return BlocksList();
size_t rows = 0;
size_t bytes = 0;
for (const auto & block : blocks)
{
rows += block.rows();
bytes += block.bytes();
}
double elapsed_seconds = watch.elapsedSeconds();
LOG_TRACE(log,
"Converted aggregated data to blocks. {} rows, {} in {} sec. ({} rows/sec., {}/sec.)",
rows, ReadableSize(bytes),
elapsed_seconds, rows / elapsed_seconds,
ReadableSize(bytes / elapsed_seconds));
return blocks;
}
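A hedged sketch of this Final phase, using only the signature shown above (finalizeAggregation is a hypothetical helper): once all intermediate states have been merged into an AggregatedDataVariants, convertToBlocks with final = true produces columns holding ready values (e.g. the uniq counts), whereas final = false would keep the states for further distributed merging.

```cpp
#include <Interpreters/Aggregator.h>

using namespace DB;

// Hypothetical helper for the Final stage of the pipeline described in this article.
BlocksList finalizeAggregation(Aggregator & aggregator, AggregatedDataVariants & data_variants, size_t max_threads)
{
    return aggregator.convertToBlocks(data_variants, /*final=*/true, max_threads);
}
```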