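The size predictor collects statistics on the size distribution of the data so that it can predict quantities such as bytes per row, bytes per row of the largest column, and the filter ratio. From these it computes the best number of rows to read next.
In the current version parts of this class look redundant: variables such as block_size_bytes and block_size_rows are reset to 0 at the start of every read and then updated again once the read completes, which seems pointless.
Three parameters effectively take part in the prediction before each read: bytes_per_row_current, max_size_per_row_dynamic, and filtered_rows_ratio.
After every read, update refreshes all three so that the next prediction is better.
Each column's bytes_per_row is refreshed on every update from the local_bytes_per_row of the latest read, weighted by alpha, and then serves as the basis for max_size_per_row_dynamic.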
double alpha = std::pow(1. - decay, diff_rows);
double local_bytes_per_row = static_cast<double>(diff_size) / diff_rows;
info.bytes_per_row = alpha * info.bytes_per_row + (1. - alpha) * local_bytes_per_row;
max_size_per_row_dynamic = std::max(max_size_per_row_dynamic, info.bytes_per_row);
bytes_per_row_current is the sum of the updated info.bytes_per_row values: info.bytes_per_row is the per-column estimate, and bytes_per_row_current is the total over all columns.
bytes_per_row_current += info.bytes_per_row;
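The per-column update described above can be sketched as a small standalone program. This is a minimal sketch, not the ClickHouse class: ColumnInfoSketch, PredictorSketch, and the cumulative-counter interface are hypothetical names introduced for illustration, and fixed-size columns are left out as a simplifying assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-column statistics; not the real ClickHouse type.
struct ColumnInfoSketch
{
    double bytes_per_row = 0.0; // smoothed estimate carried between reads
    size_t size_bytes = 0;      // cumulative bytes seen for this column
};

// Hypothetical, simplified predictor that only tracks variable-size columns.
struct PredictorSketch
{
    std::vector<ColumnInfoSketch> columns;
    size_t rows_seen = 0;
    double bytes_per_row_current = 0.0;
    double max_size_per_row_dynamic = 0.0;

    // total_rows and total_bytes[i] are cumulative counters after the latest read.
    void update(size_t total_rows, const std::vector<size_t> & total_bytes, double decay)
    {
        assert(total_bytes.size() == columns.size());
        if (total_rows <= rows_seen)
            return; // nothing new was read

        const size_t diff_rows = total_rows - rows_seen;
        // Weight kept by the old estimate after diff_rows newly read rows.
        const double alpha = std::pow(1.0 - decay, static_cast<double>(diff_rows));

        bytes_per_row_current = 0.0;
        max_size_per_row_dynamic = 0.0;
        for (size_t i = 0; i < columns.size(); ++i)
        {
            auto & info = columns[i];
            const size_t diff_size = total_bytes[i] - info.size_bytes;
            const double local_bytes_per_row = static_cast<double>(diff_size) / diff_rows;

            info.bytes_per_row = alpha * info.bytes_per_row + (1.0 - alpha) * local_bytes_per_row;
            info.size_bytes = total_bytes[i];

            bytes_per_row_current += info.bytes_per_row; // sum over all columns
            max_size_per_row_dynamic = std::max(max_size_per_row_dynamic, info.bytes_per_row);
        }
        rows_seen = total_rows;
    }
};
```

With decay set to 1 the old estimate is discarded entirely, so after one read of 100 rows each column's bytes_per_row equals its local value; with a small decay the estimate moves only part of the way towards it.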
filtered_rows_ratio is updated from the current_ration of each read, again weighted by alpha:
double alpha = std::pow(1. - decay, rows_was_read);
double current_ration = rows_was_filtered / std::max(1.0, static_cast<double>(rows_was_read));
filtered_rows_ratio = current_ration < filtered_rows_ratio
? current_ration
: alpha * filtered_rows_ratio + (1.0 - alpha) * current_ration;
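The asymmetry in this update is easier to see in isolation: a lower ratio is adopted immediately, while a higher one is only blended in gradually. A minimal sketch follows; the free function updateFilteredRowsRatio is a hypothetical wrapper around the three lines quoted above, not a ClickHouse API.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical helper: returns the new smoothed ratio of filtered rows.
double updateFilteredRowsRatio(double filtered_rows_ratio, size_t rows_was_read,
                               size_t rows_was_filtered, double decay)
{
    const double alpha = std::pow(1.0 - decay, static_cast<double>(rows_was_read));
    // The identifier current_ration is spelled as in the quoted source.
    const double current_ration = rows_was_filtered / std::max(1.0, static_cast<double>(rows_was_read));

    // A drop in the ratio is taken over at once; a rise is smoothed.
    return current_ration < filtered_rows_ratio
        ? current_ration
        : alpha * filtered_rows_ratio + (1.0 - alpha) * current_ration;
}
```

For example, starting from a ratio of 0.5, a read that filtered 10 of 100 rows moves the estimate straight to about 0.1, whereas starting from 0.1, a read that filtered 50 of 100 rows only moves it somewhere between 0.1 and 0.5.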
Both statistics (written as data here) are updated with the same formula:
data = alpha * data + (1.0 - alpha) * current_data
The core is therefore alpha: (1 minus the decay factor) raised to the power of the number of rows read this time.
double alpha = std::pow(1. - decay, diff_rows);
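One way to see why the exponent is the row count: suppose the estimate were updated once per row with decay factor $d$ towards the value $v_{\text{target}}$ observed in this read. Unrolling that per-row recursion over $n$ rows gives exactly the batched formula, with $\alpha = (1-d)^n$. The symbols $v_i$, $d$, and $n$ are notation introduced here for illustration.

```latex
\begin{aligned}
v_{i+1} &= (1-d)\,v_i + d\,v_{\text{target}} \\
v_n &= (1-d)^n v_0 + d\,v_{\text{target}} \sum_{k=0}^{n-1} (1-d)^k \\
    &= (1-d)^n v_0 + \bigl(1-(1-d)^n\bigr)\,v_{\text{target}} \\
    &= \alpha\,v_0 + (1-\alpha)\,v_{\text{target}}, \qquad \alpha = (1-d)^n
\end{aligned}
```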
The decay factor:
/// Aggressiveness of bytes_per_row updates. See update() implementation.
/// After n=NUM_UPDATES_TO_TARGET_WEIGHT updates v_{n} = (1 - TARGET_WEIGHT) * v_{0} + TARGET_WEIGHT * v_{target}
static constexpr double TARGET_WEIGHT = 0.5;
static constexpr size_t NUM_UPDATES_TO_TARGET_WEIGHT = 8192;
static double DECAY() { return 1. - std::pow(TARGET_WEIGHT, 1. / NUM_UPDATES_TO_TARGET_WEIGHT); } /// 0.000084609113387
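The constants above can be checked numerically: with this decay, alpha falls to TARGET_WEIGHT (0.5) after exactly NUM_UPDATES_TO_TARGET_WEIGHT (8192) rows, so the old estimate and the newly observed value then carry equal weight. A minimal sketch; decayFactor and oldEstimateWeight are hypothetical helper names.

```cpp
#include <cmath>
#include <cstddef>

constexpr double TARGET_WEIGHT = 0.5;
constexpr size_t NUM_UPDATES_TO_TARGET_WEIGHT = 8192;

// Same expression as DECAY() in the quoted source.
double decayFactor()
{
    return 1.0 - std::pow(TARGET_WEIGHT, 1.0 / NUM_UPDATES_TO_TARGET_WEIGHT);
}

// alpha: the weight the old estimate keeps after `rows` newly read rows.
double oldEstimateWeight(size_t rows)
{
    return std::pow(1.0 - decayFactor(), static_cast<double>(rows));
}
```

Reading 8192 rows leaves the old estimate with a weight of about 0.5, and reading 16384 rows leaves it with about 0.25, so the half-life of an estimate is 8192 rows regardless of how those rows are split across reads.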
Related settings:
M(SettingUInt64, max_block_size, DEFAULT_BLOCK_SIZE, "Maximum block size for reading") \
M(SettingUInt64, preferred_block_size_bytes, 1000000, "") \
M(SettingUInt64, preferred_max_column_in_block_size_bytes, 0, "Limit on max column size in block while reading. Helps to decrease cache misses count. Should be close to L2 cache size.") \
/** Which blocks by default read the data (by number of rows).
* Smaller values give better cache locality, less consumption of RAM, but more overhead to process the query.
*/
#define DEFAULT_BLOCK_SIZE 65536
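max_block_size: maximum number of rows in a block, 65536 by default
preferred_block_size_bytes: preferred block size in bytes, 1,000,000 by default (about 1 MB)
preferred_max_column_in_block_size_bytes: preferred size in bytes of the largest column in a block, 0 by default
When bytes_per_row_current, max_size_per_row_dynamic, and filtered_rows_ratio take effect:
bytes_per_row_current takes effect when preferred_block_size_bytes is set.
max_size_per_row_dynamic and filtered_rows_ratio both take effect when preferred_max_column_in_block_size_bytes is set.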
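The idea behind these byte-based settings is that a byte budget divided by an estimated bytes-per-row figure gives a row count. The sketch below is a deliberately simplified illustration of that division; rowsForByteBudget is a hypothetical helper, and the real estimateNumRows and estimateNumRowsForMaxSizeColumn methods may account for additional state that is not modeled here.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: how many rows fit into byte_budget at the estimated row size.
size_t rowsForByteBudget(size_t byte_budget, double bytes_per_row)
{
    // Guard against a zero or tiny estimate before any data has been read.
    const double per_row = std::max(1.0, bytes_per_row);
    return static_cast<size_t>(byte_budget / per_row);
}
```

Under this simplification, a preferred_block_size_bytes of 1000000 with bytes_per_row_current at 100 yields 10000 rows, well below the max_block_size default of 65536.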
auto estimateNumRows = [current_preferred_block_size_bytes, current_max_block_size_rows,
&index_granularity, current_preferred_max_column_in_block_size_bytes, min_filtration_ratio](
MergeTreeReadTask & current_task, MergeTreeRangeReader & current_reader)
{
if (!current_task.size_predictor)
return static_cast<size_t>(current_max_block_size_rows);
/// Calculates number of rows will be read using preferred_block_size_bytes.
/// Can't be less than avg_index_granularity.
size_t rows_to_read = current_task.size_predictor->estimateNumRows(current_preferred_block_size_bytes); /// bytes_per_row_current takes effect here
if (!rows_to_read)
return rows_to_read;
auto total_row_in_current_granule = current_reader.numRowsInCurrentGranule();
rows_to_read = std::max(total_row_in_current_granule, rows_to_read);
if (current_preferred_max_column_in_block_size_bytes) /// max_size_per_row_dynamic and filtered_rows_ratio take effect here
{
/// Calculates number of rows will be read using preferred_max_column_in_block_size_bytes.
auto rows_to_read_for_max_size_column
= current_task.size_predictor->estimateNumRowsForMaxSizeColumn(current_preferred_max_column_in_block_size_bytes);
double filtration_ratio = std::max(min_filtration_ratio, 1.0 - current_task.size_predictor->filtered_rows_ratio);
auto rows_to_read_for_max_size_column_with_filtration
= static_cast<size_t>(rows_to_read_for_max_size_column / filtration_ratio);
/// If preferred_max_column_in_block_size_bytes is used, number of rows to read can be less than current_index_granularity.
rows_to_read = std::min(rows_to_read, rows_to_read_for_max_size_column_with_filtration);
}
auto unread_rows_in_current_granule = current_reader.numPendingRowsInCurrentGranule();
if (unread_rows_in_current_granule >= rows_to_read)
return rows_to_read;
return index_granularity.countMarksForRows(current_reader.currentMark(), rows_to_read, current_reader.numReadRowsInCurrentGranule());
};
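The way the lambda combines its estimates can be isolated from the reader and task objects. The sketch below takes the individual estimates as plain numbers; combineEstimates and its parameter names are hypothetical, and the final conversion to whole granules via countMarksForRows is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper mirroring the min/max logic of the lambda above.
size_t combineEstimates(size_t rows_by_block_bytes,   // from preferred_block_size_bytes
                        size_t rows_in_granule,       // rows in the current granule
                        size_t rows_by_max_column,    // from preferred_max_column_in_block_size_bytes
                        bool has_max_column_limit,    // true when that setting is non-zero
                        double filtered_rows_ratio,
                        double min_filtration_ratio)
{
    assert(min_filtration_ratio > 0.0);
    if (rows_by_block_bytes == 0)
        return 0;

    // Never plan for less than one granule at this stage.
    size_t rows_to_read = std::max(rows_in_granule, rows_by_block_bytes);

    if (has_max_column_limit)
    {
        // Only the surviving fraction of rows occupies memory after filtering,
        // so more rows may be read when many of them are filtered out.
        const double filtration_ratio = std::max(min_filtration_ratio, 1.0 - filtered_rows_ratio);
        const auto with_filtration = static_cast<size_t>(rows_by_max_column / filtration_ratio);
        // This limit can push the result below one granule.
        rows_to_read = std::min(rows_to_read, with_filtration);
    }
    return rows_to_read;
}
```

For instance, with 10000 rows allowed by the block budget, a granule of 8192 rows, 2000 rows allowed by the column budget, and half of the rows filtered out, the result is 4000: the column limit wins, but it is relaxed by the filter ratio.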