Read-Path Differences Across Three Columnar Storage Formats: A Source Code Analysis
tablename : t
Columns: A, B, C, all of type string, each column holding the same amount of data
Table size: 300 GB in total, 100 GB per column
We analyze how much data each storage format has to read for two kinds of SQL queries:
Q1:select count(distinct A) from t;
Q2:select * from t where A='1';
1. PARQUET
1.1 Structure diagram
Note: the smallest row-level unit is the row group.
1.2 Parquet read path: code walkthrough
- Reading data requires passing in schema information, which carries the column list (a complete sketch follows these excerpts):
ParquetFileReader parquetFileReader = new ParquetFileReader(configuration, parquetFilePath,
readFooter.getBlocks(), schema.getColumns());
PageReadStore pages = null;
while (null != (pages = parquetFileReader.readNextRowGroup())) {
- ParquetFileReader.readNextRowGroup shows that the column chunks to load are assembled purely from that schema information; no index is consulted:
public PageReadStore readNextRowGroup() throws IOException {
  if (currentBlock == blocks.size()) {
    return null;
  }
  BlockMetaData block = blocks.get(currentBlock);
  if (block.getRowCount() == 0L) {
    throw new RuntimeException("Illegal row group of 0 rows");
  }
  ColumnChunkPageReadStore columnChunkPageReadStore =
      new ColumnChunkPageReadStore(block.getRowCount());
  // group the chunks of the requested columns into runs of consecutive chunks
  List<ConsecutiveChunkList> allChunks = new ArrayList<>();
  ConsecutiveChunkList currentChunks = null;
  for (ColumnChunkMetaData mc : block.getColumns()) {
    // only columns present in the requested schema have a descriptor; other chunks are skipped
    ColumnDescriptor columnDescriptor = paths.get(mc.getPath());
    if (columnDescriptor != null) {
      long startingPos = mc.getStartingPos();
      if (currentChunks == null || currentChunks.endPos() != startingPos) {
        currentChunks = new ConsecutiveChunkList(startingPos);
        allChunks.add(currentChunks);
      }
      currentChunks.addChunk(
          new ChunkDescriptor(columnDescriptor, mc, startingPos, (int) mc.getTotalSize()));
    }
  }
  // read every selected run of chunks and register its pages, column by column
  for (ConsecutiveChunkList consecutiveChunks : allChunks) {
    List<Chunk> chunks = consecutiveChunks.readAll(f);
    for (Chunk chunk : chunks) {
      columnChunkPageReadStore.addColumn(chunk.descriptor.col, chunk.readAllPages());
    }
  }
  ++currentBlock;
  return columnChunkPageReadStore;
}
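Putting the two excerpts together, the sketch below shows a projected read that touches only column A. It is a minimal sketch under assumptions: it relies on the same parquet-mr-era API as the excerpts above, and the file path and the column name "A" (taken from the table definition at the top) are placeholders. Because only A's ColumnDescriptors are handed to the reader, readNextRowGroup() skips the chunks of B and C entirely, which is why Q1 touches about 100 GB.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.schema.MessageType;

public class ParquetProjectedReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path file = new Path(args[0]);  // one data file of table t (placeholder)

    // read the footer once: it holds the row-group metadata and the full file schema
    ParquetMetadata footer =
        ParquetFileReader.readFooter(conf, file, ParquetMetadataConverter.NO_FILTER);
    MessageType fileSchema = footer.getFileMetaData().getSchema();

    // projection schema that keeps only column A
    MessageType projection = new MessageType("projection", fileSchema.getType("A"));

    // only A's column descriptors are passed in, so only A's chunks are loaded
    ParquetFileReader reader = new ParquetFileReader(
        conf, file, footer.getBlocks(), projection.getColumns());
    try {
      long rows = 0;
      PageReadStore rowGroup;
      while ((rowGroup = reader.readNextRowGroup()) != null) {
        rows += rowGroup.getRowCount();  // pages of B and C are never read
      }
      System.out.println("rows scanned = " + rows);
    } finally {
      reader.close();
    }
  }
}
```

For Q2 the projection would contain all three columns, so no chunk can be skipped and the whole file is read, which is where the 300 GB in the next table comes from.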
1.3 Amount of data read
parquet | data read |
---|---|
Q1 | 100 GB |
Q2 | 300 GB |
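The per-query figures can be estimated directly from the footer metadata. Below is a small sketch under the same API assumptions as above (the path is again a placeholder) that prints the compressed size of every column chunk in every row group; summing the entries for A approximates Q1, summing everything approximates Q2.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class ParquetChunkSizes {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path file = new Path(args[0]);  // one data file of table t (placeholder)
    ParquetMetadata footer =
        ParquetFileReader.readFooter(conf, file, ParquetMetadataConverter.NO_FILTER);
    int rowGroupNo = 0;
    for (BlockMetaData rowGroup : footer.getBlocks()) {
      for (ColumnChunkMetaData chunk : rowGroup.getColumns()) {
        // bytes that a scan of this column in this row group has to read
        System.out.println("row group " + rowGroupNo + " " + chunk.getPath().toDotString()
            + " : " + chunk.getTotalSize() + " bytes");
      }
      rowGroupNo++;
    }
  }
}
```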
2. ORCFILE
2.1 Structure diagram
Note: the smallest row-level unit is the stripe.
2.2 ORC read path: code walkthrough
1. orcReader: build Reader.Options with a SearchArgument (predicate pushdown)
SearchArgument sArg = SearchArgumentFactory
.newBuilder()
.equals("x", PredicateLeaf.Type.LONG, 9996L)
.build();
String[] sCols = new String[]{"x", null };
VectorizedRowBatch batch = readSchema.createRowBatch();
Reader.Options options = reader.options().schema(readSchema);
if (sArg != null && sCols != null) {
options.searchArgument(sArg, sCols);
}
RecordReader rowIterator = reader.rows(options);
2. RecordReaderImpl.java: readStripe()
private void readStripe() throws IOException {
StripeInformation stripe = beginReadStripe();
planner.parseStripe(stripe, fileIncluded);
includedRowGroups = pickRowGroups();
// move forward to the first unskipped row
if (includedRowGroups != null) {
while (rowInStripe < rowCountInStripe &&
!includedRowGroups[(int) (rowInStripe / rowIndexStride)]) {
rowInStripe = Math.min(rowCountInStripe, rowInStripe + rowIndexStride);
}
}
// if we haven't skipped the whole stripe, read the data
if (rowInStripe < rowCountInStripe) {
planner.readData(indexes, includedRowGroups, false);
reader.startStripe(planner);
// if we skipped the first row group, move the pointers forward
if (rowInStripe != 0) {
seekToRowEntry(reader, (int) (rowInStripe / rowIndexStride));
}
}
}
- pickRowGroups ("Pick the row groups that we need to load from the current stripe. Return an array with a boolean for each row group, or null if all of the row groups must be read.") In the code below, sargApp is the filter condition; the remaining arguments describe the current stripe and its index information, and together they decide which parts of the current stripe actually need to be read.
protected boolean[] pickRowGroups() throws IOException {
// if we don't have a sarg or indexes, we read everything
if (sargApp == null) {
return null;
}
readRowIndex(currentStripe, fileIncluded, sargApp.sargColumns);
return sargApp.pickRowGroups(stripes.get(currentStripe),
indexes.getRowGroupIndex(),
indexes.getBloomFilterKinds(), stripeFooter.getColumnsList(),
indexes.getBloomFilterIndex(), false);
}
Tracing the code shows that the read is built in two stages: the indexes (three kinds of index-based filtering) are applied first to prune the stripe's data, and only then is the schema used to materialize the column values. Compared with Parquet there is one extra stage: index-based filtering.
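To see the statistics this pruning relies on, here is a small sketch, assuming the org.apache.orc core API and a file with a single bigint column x (path and layout are placeholders), that dumps the stripe-level min/max. The row-group-level index consulted by pickRowGroups() is analogous, with one entry per orc.row.index.stride rows (10,000 by default).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.ColumnStatistics;
import org.apache.orc.IntegerColumnStatistics;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.StripeStatistics;

public class OrcStripeStatsSketch {
  public static void main(String[] args) throws Exception {
    Reader reader = OrcFile.createReader(new Path(args[0]),  // placeholder path
        OrcFile.readerOptions(new Configuration()));
    int stripeNo = 0;
    for (StripeStatistics stripe : reader.getStripeStatistics()) {
      ColumnStatistics[] cols = stripe.getColumnStatistics();
      // column 0 is the struct root; column 1 is the first field, here the bigint x
      if (cols[1] instanceof IntegerColumnStatistics) {
        IntegerColumnStatistics stats = (IntegerColumnStatistics) cols[1];
        System.out.println("stripe " + stripeNo + ": min=" + stats.getMinimum()
            + " max=" + stats.getMaximum());
      }
      stripeNo++;
    }
  }
}
```

A SearchArgument leaf that falls outside a stripe's (or a row group's) min/max range lets the reader skip that unit without touching its data, which is the x GB term in the next table.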
2.3 Amount of data read
Assume the indexes (min/max, bloom filter, etc.) prune x GB of data.
orcfile | data read |
---|---|
Q1 | 100 GB |
Q2 | 300 GB - x |
2.4 Hive DDL example
CREATE TABLE lxw1234_orc2 STORED AS ORC
TBLPROPERTIES (
  'orc.compress'='SNAPPY',
  'orc.create.index'='true',
  'orc.bloom.filter.columns'='pcid',
  'orc.bloom.filter.fpp'='0.05',
  'orc.stripe.size'='10485760',
  'orc.row.index.stride'='10000')
AS
SELECT CAST(siteid AS INT) AS id,
       pcid
FROM lxw1234_text
DISTRIBUTE BY id SORT BY id;
2.5 Example 1: data written in sorted order
Each stripe is made up of row groups; by default a row group covers 10,000 rows (orc.row.index.stride).
Write-side settings
Read-side settings
Filter result
Using the ORC core Writer I wrote a file with 100,000 rows, with x ranging from 1 to 100,000 in order, and then built a SearchArgument for x = 99961. The filter result shows that the first 90,000 rows are skipped: rowInStripe for the current stripe is advanced directly to 90,000, so only the last row group is actually read. A sketch reproducing this experiment follows.
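The following is a minimal sketch of that experiment, assuming the org.apache.orc core Writer/Reader API (the output path is a placeholder): write 100,000 ordered values of a bigint column x with a 10,000-row index stride, then read them back with a SearchArgument for x = 99961 and count how many rows the reader actually returns.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcRowGroupSkipDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/tmp/skip_demo.orc");  // placeholder output path
    TypeDescription schema = TypeDescription.fromString("struct<x:bigint>");

    // write x = 1..100000 in order, with a row-group index entry every 10,000 rows
    Writer writer = OrcFile.createWriter(path, OrcFile.writerOptions(conf)
        .setSchema(schema)
        .rowIndexStride(10000));
    VectorizedRowBatch batch = schema.createRowBatch();
    LongColumnVector xCol = (LongColumnVector) batch.cols[0];
    for (long v = 1; v <= 100_000; v++) {
      xCol.vector[batch.size++] = v;
      if (batch.size == batch.getMaxSize()) {
        writer.addRowBatch(batch);
        batch.reset();
      }
    }
    if (batch.size > 0) {
      writer.addRowBatch(batch);
    }
    writer.close();

    // read back with a SearchArgument on x; only the row group covering
    // 90,001..100,000 survives the min/max check, so rowInStripe jumps to 90,000
    Reader reader = OrcFile.createReader(path, OrcFile.readerOptions(conf));
    SearchArgument sArg = SearchArgumentFactory.newBuilder()
        .startAnd().equals("x", PredicateLeaf.Type.LONG, 99961L).end().build();
    RecordReader rows = reader.rows(reader.options()
        .schema(schema)
        .searchArgument(sArg, new String[]{"x"}));

    long returned = 0;
    VectorizedRowBatch readBatch = schema.createRowBatch();
    while (rows.nextBatch(readBatch)) {
      returned += readBatch.size;
    }
    rows.close();
    // roughly 10,000 instead of 100,000: the skipped row groups are never decoded
    System.out.println("rows returned by the reader = " + returned);
  }
}
```

Shuffling the values before writing reproduces the next example: every row group's min/max range then covers 99961, so nothing can be skipped.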
2.6 Example 2: column values written in random order
Using the core Writer I again wrote a file with 100,000 rows and x in the range 1-100,000, but this time in random order rather than sorted. The filter result shows that no data is skipped, so the whole stripe has to be read.
3. CARBONDATA
3.1 Structure diagram
Note: the smallest row-level unit is the blocklet.
3.2 CarbonData read path: code walkthrough
public BlockletScannedResult scanBlocklet(RawBlockletColumnChunks rawBlockletColumnChunks)
throws IOException, FilterUnsupportedException {
if (blockExecutionInfo.isDirectVectorFill()) {
return executeFilterForPages(rawBlockletColumnChunks);
} else {
return executeFilter(rawBlockletColumnChunks);
}
}
Tracing a SELECT statement shows that, when a filter condition is present, execution reaches BlockletFilterScanner.scanBlocklet; isDirectVectorFill controls whether the result is handed directly to Spark as column vectors. Following executeFilter, the method's Javadoc reads:
This method will process the data in below order
1. first apply min max on the filter tree and check whether any of the filter
is fall on the range of min max, if not then return empty result
2. If filter falls on min max range then apply filter on actual
data and get the filtered row index
3. if row index is empty then return the empty result
4. if row indexes is not empty then read only those blocks(measure or dimension)
which was present in the query but not present in the filter, as while applying filter
some of the blocks where already read and present in chunk holder so not need to
read those blocks again, this is to avoid reading of same blocks which was already read
5. Set the blocks and filter indexes to result
To summarize the logic:
- Min/max values are applied to the pages first; if a page may match, the actual data is filtered and the matching row indexes are collected.
- If the row indexes are empty, an empty result is returned.
- Otherwise the remaining columns are fetched (blocks already read while applying the filter are not read again).
- The blocks and the row indexes are set on the result.
So CarbonData first uses the filter column's values to locate the matching row indexes, and only then uses those row indexes to fetch the other columns. The cost is roughly: the full size of the filter column, minus the data pruned by the indexes, plus the data read for the other columns (whose smallest read unit is a page).
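As a toy illustration of this cost model (plain arithmetic, not CarbonData code; the values of x and y are made up), using the section's 100 GB-per-column figures:

```java
public class CarbonScanCostSketch {
  // Q2 cost = filter column size - data pruned on it + pages read from the other columns
  static double q2CostGb(double filterColumnGb, double prunedGb, double otherColumnPagesGb) {
    return filterColumnGb - prunedGb + otherColumnPagesGb;
  }

  public static void main(String[] args) {
    double x = 60.0;  // hypothetical GB pruned on column A by min/max and bloom filters
    double y = 0.5;   // hypothetical GB of B/C pages fetched for the matching row indexes
    System.out.println("Q2 reads about " + q2CostGb(100.0, x, y) + " GB");
  }
}
```

This is the 100 GB - x + y entry in the tables that follow: the other two columns only contribute the pages that contain matching rows.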
3.3 Amount of data read
As above, assume the indexes (min/max, bloom filter, etc.) prune x GB on the filter column, and assume the pages of the other columns that actually have to be read total y GB.
carbondata | data read |
---|---|
Q1 | 100 GB |
Q2 | 100 GB - x + y |
4. CONCLUSION
4.1 Comparison
item / storage format | parquet | orcfile | carbondata |
---|---|---|---|
Q1 | 100 GB | 100 GB | 100 GB |
Q2 | 300 GB | 300 GB - x | 100 GB - x + y |
Index | none | column min/max, bloom filter | column min/max, bloom filter, inverted index (takes extra space) |
Smallest row-level unit | row group | stripe | blocklet |