Learning the Parquet File Format

Contents

  • Learning goals
  • Parquet file storage structure

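Learning goals

  • The storage structure of Parquet as a columnar file format
  • The main read/write flow for Parquet files and the APIs involved
  • Spark's optimizations for reading and writing Parquet files
  • How Spark implements vectorized data reads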

Parquet file storage structure

[Figure 1: Parquet file storage structure]

For example, here is the metadata of a real Parquet file:

parquet-tools meta --debug part-00000-95a6898f-c2aa-4e89-86a6-4f17a2a8fe26.c000.snappy.parquet

creator:               parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)
extra:                 org.apache.spark.version = 3.0.0
extra:                 org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"biz_id","type":"string","nullable":true,"metadata":{"comment":"营销/问答业务统一 trace id"}},{"name":"scene_id","type":"integer","nullable":true,"metadata":{"comment":"场景"}},{"name":"store_id","type":"string","n [more]...

file schema:           spark_schema
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
biz_id:                OPTIONAL BINARY O:UTF8 R:0 D:1
scene_id:              OPTIONAL INT32 R:0 D:1
store_id:              OPTIONAL BINARY O:UTF8 R:0 D:1
store_name:            OPTIONAL BINARY O:UTF8 R:0 D:1
buyer_nick:            OPTIONAL BINARY O:UTF8 R:0 D:1
trigger_time_in_ms:    OPTIONAL INT64 R:0 D:1
dispatch_time_in_ms:   OPTIONAL INT64 R:0 D:1
arrived_time_in_ms:    OPTIONAL INT64 R:0 D:1
assistant_nick:        OPTIONAL BINARY O:UTF8 R:0 D:1
trade_id:              OPTIONAL BINARY O:UTF8 R:0 D:1
paid_time_in_ms:       OPTIONAL INT64 R:0 D:1
order_fee:             OPTIONAL INT64 O:DECIMAL R:0 D:1
order_number:          OPTIONAL INT32 R:0 D:1
indirect_order_fee:    OPTIONAL INT64 O:DECIMAL R:0 D:1
indirect_order_number: OPTIONAL INT32 R:0 D:1
is_arrived:            REQUIRED BOOLEAN R:0 D:0
result:                REQUIRED BINARY O:UTF8 R:0 D:0
sentence:              OPTIONAL F:1
.list:                 REPEATED F:1
..element:             REQUIRED BINARY O:UTF8 R:1 D:2
task:                  REQUIRED F:5
.task_id:              OPTIONAL INT64 R:0 D:1
.round_id:             OPTIONAL INT64 R:0 D:1
.round:                OPTIONAL INT32 R:0 D:1
.entry_name:           OPTIONAL BINARY O:UTF8 R:0 D:1
.strategy:             OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1:           RC:350242 TS:114447905 // the test file has only one row group; with several, this block repeats for each one
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
biz_id:                 BINARY SNAPPY DO:0 FPO:4 SZ:12781311/14711901/1.15 VC:350242 ENC:BIT_PACKED,RLE,PLAIN
scene_id:               INT32 SNAPPY DO:0 FPO:12781315 SZ:141/135/0.96 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
store_id:               BINARY SNAPPY DO:0 FPO:12781456 SZ:625362/651136/1.04 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
store_name:             BINARY SNAPPY DO:0 FPO:13406818 SZ:661669/720686/1.09 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
buyer_nick:             BINARY SNAPPY DO:0 FPO:14068487 SZ:4926923/6019612/1.22 VC:350242 ENC:BIT_PACKED,RLE,PLAIN
trigger_time_in_ms:     INT64 SNAPPY DO:0 FPO:18995410 SZ:1141626/1301269/1.14 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
dispatch_time_in_ms:    INT64 SNAPPY DO:0 FPO:20137036 SZ:2021695/2802163/1.39 VC:350242 ENC:BIT_PACKED,RLE,PLAIN
arrived_time_in_ms:     INT64 SNAPPY DO:0 FPO:22158731 SZ:1866923/2578762/1.38 VC:350242 ENC:BIT_PACKED,RLE,PLAIN
assistant_nick:         BINARY SNAPPY DO:0 FPO:24025654 SZ:944211/1151429/1.22 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
trade_id:               BINARY SNAPPY DO:0 FPO:24969865 SZ:5252882/8012876/1.53 VC:350242 ENC:BIT_PACKED,RLE,PLAIN
paid_time_in_ms:        INT64 SNAPPY DO:0 FPO:30222747 SZ:469431/566161/1.21 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
order_fee:              INT64 SNAPPY DO:0 FPO:30692178 SZ:194907/233987/1.20 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
order_number:           INT32 SNAPPY DO:0 FPO:30887085 SZ:70615/87743/1.24 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
indirect_order_fee:     INT64 SNAPPY DO:0 FPO:30957700 SZ:241548/282655/1.17 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
indirect_order_number:  INT32 SNAPPY DO:0 FPO:31199248 SZ:79962/98381/1.23 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
is_arrived:             BOOLEAN SNAPPY DO:0 FPO:31279210 SZ:5913/43819/7.41 VC:350242 ENC:BIT_PACKED,PLAIN
result:                 BINARY SNAPPY DO:0 FPO:31285123 SZ:86680/88435/1.02 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY
sentence:
.list:
..element:              BINARY SNAPPY DO:0 FPO:31371803 SZ:32418858/73409092/2.26 VC:377833 ENC:RLE,PLAIN
task:
.task_id:               INT64 SNAPPY DO:0 FPO:63790661 SZ:666110/714412/1.07 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
.round_id:              INT64 SNAPPY DO:0 FPO:64456771 SZ:664298/711780/1.07 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
.round:                 INT32 SNAPPY DO:0 FPO:65121069 SZ:129151/131726/1.02 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
.entry_name:            BINARY SNAPPY DO:0 FPO:65250220 SZ:43694/64940/1.49 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
.strategy:              BINARY SNAPPY DO:0 FPO:65293914 SZ:43637/64805/1.49 VC:350242 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
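
In the file schema part of the dump above, R and D are the maximum repetition level and maximum definition level of each leaf column. The sketch below shows how they follow from the path to the leaf: every non-REQUIRED field on the path adds one to D, and every REPEATED field adds one to R. The function name `max_levels` and the `PATHS` table are hypothetical and used only for illustration; the repetition types are copied from the schema above.

```python
def max_levels(path):
    """Return (max repetition level, max definition level) for a leaf column.

    `path` lists the repetition type of every field from the schema root
    down to the leaf, e.g. ["OPTIONAL", "REPEATED", "REQUIRED"].
    """
    # Each OPTIONAL or REPEATED ancestor (or the leaf itself) can be absent,
    # so each one needs its own definition level.
    max_def = sum(1 for rep in path if rep != "REQUIRED")
    # Only REPEATED fields start new list entries, so only they add
    # repetition levels.
    max_rep = sum(1 for rep in path if rep == "REPEATED")
    return max_rep, max_def


# Paths copied from the schema above (hypothetical table for illustration).
PATHS = {
    "biz_id": ["OPTIONAL"],
    "is_arrived": ["REQUIRED"],
    "sentence.list.element": ["OPTIONAL", "REPEATED", "REQUIRED"],
    "task.task_id": ["REQUIRED", "OPTIONAL"],
}

for name, path in PATHS.items():
    rep, definition = max_levels(path)
    print(f"{name}: R:{rep} D:{definition}")
```

The printed values can be compared with the R and D columns of the schema section. The repeated sentence.list.element column is also the only one whose VC (377833) differs from the row count RC (350242), since a list can hold more or fewer than one value per row.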

Abbreviated field names:

  • RC = Record Count, TS = Total Byte Size
  • DO = DictionaryPageOffset
  • FPO = FirstDataPageOffset
  • SZ: first value = TotalSize (compressed), second value = TotalUncompressedSize, third value = ratio = TotalUncompressedSize / TotalSize
  • VC = ValueCount
  • ENC = Encodings
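
As a sanity check on these definitions, the following sketch recomputes a few relationships from the numbers in the dump. The `CHUNKS` table and the helper names are hypothetical; the values are copied from the "row group 1" section. Two of the checks are observations about this particular file rather than guarantees: TS equals the sum of the uncompressed column sizes, and each column chunk starts where the previous one ends. The first offset of 4 matches the 4-byte "PAR1" magic at the start of every Parquet file.

```python
# (name, FPO, TotalSize, TotalUncompressedSize, printed ratio)
CHUNKS = [
    ("biz_id", 4, 12781311, 14711901, 1.15),
    ("scene_id", 12781315, 141, 135, 0.96),
    ("store_id", 12781456, 625362, 651136, 1.04),
    ("store_name", 13406818, 661669, 720686, 1.09),
    ("buyer_nick", 14068487, 4926923, 6019612, 1.22),
    ("trigger_time_in_ms", 18995410, 1141626, 1301269, 1.14),
    ("dispatch_time_in_ms", 20137036, 2021695, 2802163, 1.39),
    ("arrived_time_in_ms", 22158731, 1866923, 2578762, 1.38),
    ("assistant_nick", 24025654, 944211, 1151429, 1.22),
    ("trade_id", 24969865, 5252882, 8012876, 1.53),
    ("paid_time_in_ms", 30222747, 469431, 566161, 1.21),
    ("order_fee", 30692178, 194907, 233987, 1.20),
    ("order_number", 30887085, 70615, 87743, 1.24),
    ("indirect_order_fee", 30957700, 241548, 282655, 1.17),
    ("indirect_order_number", 31199248, 79962, 98381, 1.23),
    ("is_arrived", 31279210, 5913, 43819, 7.41),
    ("result", 31285123, 86680, 88435, 1.02),
    ("sentence.list.element", 31371803, 32418858, 73409092, 2.26),
    ("task.task_id", 63790661, 666110, 714412, 1.07),
    ("task.round_id", 64456771, 664298, 711780, 1.07),
    ("task.round", 65121069, 129151, 131726, 1.02),
    ("task.entry_name", 65250220, 43694, 64940, 1.49),
    ("task.strategy", 65293914, 43637, 64805, 1.49),
]
ROW_GROUP_TS = 114447905  # TS printed on the "row group 1" line


def ratios_match(chunks):
    """Check ratio == TotalUncompressedSize / TotalSize to two decimals."""
    return all(
        f"{raw / size:.2f}" == f"{ratio:.2f}"
        for _, _, size, raw, ratio in chunks
    )


def total_uncompressed(chunks):
    """Sum the second SZ value over all column chunks."""
    return sum(raw for _, _, _, raw, _ in chunks)


def is_contiguous(chunks):
    """Check that each chunk begins right after the previous one ends."""
    return all(
        prev[1] + prev[2] == nxt[1]
        for prev, nxt in zip(chunks, chunks[1:])
    )


print("ratios match:", ratios_match(CHUNKS))
print("sum of uncompressed sizes == TS:",
      total_uncompressed(CHUNKS) == ROW_GROUP_TS)
print("chunks contiguous:", is_contiguous(CHUNKS))
print("first chunk starts after magic:", CHUNKS[0][1] == len(b"PAR1"))
```

Note that scene_id has a ratio below 1 (0.96): for a column this small, Snappy output is slightly larger than the input.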
