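Iceberg is one of the most popular data lake components, and this series of articles takes a closer look at it.
We start with Iceberg's underlying storage layout.
1. Start a local Spark session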
./bin/spark-sql \
--packages org.apache.iceberg:iceberg-spark3-runtime:0.12.1 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive \
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.local.type=hadoop \
--conf spark.sql.catalog.local.warehouse=$PWD/warehouse
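Create a table in each of the two formats, v1 and v2.
Create the table named table with format-version 1 (the statement below sets no format-version, and the resulting metadata shows version 1):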
CREATE TABLE local.db.table (id bigint, data string) USING iceberg;
Open the table directory; its structure is as follows:
(base) ➜ table ll -R
total 0
drwxr-xr-x 6 liliwei staff 192B Jan 2 21:22 metadata
./metadata:
total 16
-rw-r--r--@ 1 liliwei staff 1.2K Jan 2 21:22 v1.metadata.json
-rw-r--r--@ 1 liliwei staff 1B Jan 2 21:22 version-hint.text
(base) ➜ table
The contents of v1.metadata.json are as follows:
{
"format-version" : 1,
"table-uuid" : "0dc08d49-ed4d-49bb-8ddf-006e37c65372",
"location" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/table",
"last-updated-ms" : 1641129739691,
"last-column-id" : 2,
"schema" : {
"type" : "struct",
"schema-id" : 0,
"fields" : [ {
"id" : 1,
"name" : "id",
"required" : false,
"type" : "long"
}, {
"id" : 2,
"name" : "data",
"required" : false,
"type" : "string"
} ]
},
"current-schema-id" : 0,
"schemas" : [ {
"type" : "struct",
"schema-id" : 0,
"fields" : [ {
"id" : 1,
"name" : "id",
"required" : false,
"type" : "long"
}, {
"id" : 2,
"name" : "data",
"required" : false,
"type" : "string"
} ]
} ],
"partition-spec" : [ ],
"default-spec-id" : 0,
"partition-specs" : [ {
"spec-id" : 0,
"fields" : [ ]
} ],
"last-partition-id" : 999,
"default-sort-order-id" : 0,
"sort-orders" : [ {
"order-id" : 0,
"fields" : [ ]
} ],
"properties" : {
"owner" : "liliwei"
},
"current-snapshot-id" : -1,
"snapshots" : [ ],
"snapshot-log" : [ ],
"metadata-log" : [ ]
}
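The relationship between "current-schema-id" and the "schemas" list in the file above can be sketched in a few lines of Python. This is only an illustrative reading of the JSON, not Iceberg's own code: the function names are hypothetical, the sample dict is a trimmed copy of the listing, and treating -1 as "no snapshot yet" is an assumption based on what this freshly created, empty table shows.

```python
from typing import Any, Dict, List

# Trimmed copy of the v1.metadata.json listing above (only the keys used here).
SAMPLE_METADATA: Dict[str, Any] = {
    "format-version": 1,
    "current-schema-id": 0,
    "schemas": [
        {
            "type": "struct",
            "schema-id": 0,
            "fields": [
                {"id": 1, "name": "id", "required": False, "type": "long"},
                {"id": 2, "name": "data", "required": False, "type": "string"},
            ],
        }
    ],
    "current-snapshot-id": -1,
    "snapshots": [],
}


def current_schema_fields(metadata: Dict[str, Any]) -> List[str]:
    """Return 'name: type' strings for the schema selected by current-schema-id."""
    schema_id = metadata["current-schema-id"]
    for schema in metadata["schemas"]:
        if schema["schema-id"] == schema_id:
            return [f'{field["name"]}: {field["type"]}' for field in schema["fields"]]
    raise KeyError(f"schema-id {schema_id} not found in 'schemas'")


def has_snapshot(metadata: Dict[str, Any]) -> bool:
    """Assumption: -1 means the table has no snapshot yet, as in the listing."""
    return metadata.get("current-snapshot-id", -1) != -1
```

For the table created above, current_schema_fields returns the two columns from the CREATE TABLE statement, and has_snapshot is False because no data has been written yet.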
The contents of version-hint.text are as follows:
1
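The two files seen so far suggest how the current metadata file can be located: version-hint.text holds a number, and the metadata file is named v<number>.metadata.json. The sketch below follows that naming pattern as observed in the listing; it is an assumption about the layout of this Hadoop-type catalog, not Iceberg's actual lookup logic, and load_current_metadata is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Any, Dict, Tuple, Union


def load_current_metadata(table_dir: Union[str, Path]) -> Tuple[int, Dict[str, Any]]:
    """Read version-hint.text, then load the matching v<N>.metadata.json.

    Assumes the layout shown above:
      <table_dir>/metadata/version-hint.text
      <table_dir>/metadata/v<N>.metadata.json
    """
    metadata_dir = Path(table_dir) / "metadata"
    # The hint file contains just the version number, e.g. "1".
    version = int((metadata_dir / "version-hint.text").read_text().strip())
    metadata_file = metadata_dir / f"v{version}.metadata.json"
    with metadata_file.open(encoding="utf-8") as handle:
        return version, json.load(handle)
```

Called with the table directory from the listing (for example warehouse/db/table under the Spark installation), it should return version 1 together with the parsed JSON shown above.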
Create the table tableV2 with format-version 2:
CREATE TABLE local.db.tableV2 (id bigint, data string)
USING iceberg
TBLPROPERTIES ('format-version'='2');
The directory structure of tableV2 is as follows:
(base) ➜ tableV2 cd metadata
(base) ➜ metadata ll
total 16
-rw-r--r-- 1 liliwei staff 936B Jan 2 21:38 v1.metadata.json
-rw-r--r-- 1 liliwei staff 1B Jan 2 21:38 version-hint.text
(base) ➜ metadata
The contents of v1.metadata.json are as follows:
{
"format-version" : 2,
"table-uuid" : "67b54789-070c-4600-b2ff-3b9a0a774e4a",
"location" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/tableV2",
"last-sequence-number" : 0,
"last-updated-ms" : 1641130714999,
"last-column-id" : 2,
"current-schema-id" : 0,
"schemas" : [ {
"type" : "struct",
"schema-id" : 0,
"fields" : [ {
"id" : 1,
"name" : "id",
"required" : false,
"type" : "long"
}, {
"id" : 2,
"name" : "data",
"required" : false,
"type" : "string"
} ]
} ],
"default-spec-id" : 0,
"partition-specs" : [ {
"spec-id" : 0,
"fields" : [ ]
} ],
"last-partition-id" : 999,
"default-sort-order-id" : 0,
"sort-orders" : [ {
"order-id" : 0,
"fields" : [ ]
} ],
"properties" : {
"owner" : "liliwei"
},
"current-snapshot-id" : -1,
"snapshots" : [ ],
"snapshot-log" : [ ],
"metadata-log" : [ ]
}
The contents of version-hint.text are as follows:
1
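Placing the two metadata files side by side, the differences are easiest to see at the level of top-level keys. The following Python sketch compares the key sets copied from the two listings above; diff_top_level_keys is a hypothetical helper, and the result only describes these two files as written by this Iceberg version, not the format specification in general.

```python
from typing import Iterable, List, Tuple

# Top-level keys copied from the format-version 1 listing.
V1_KEYS = [
    "format-version", "table-uuid", "location", "last-updated-ms",
    "last-column-id", "schema", "current-schema-id", "schemas",
    "partition-spec", "default-spec-id", "partition-specs",
    "last-partition-id", "default-sort-order-id", "sort-orders",
    "properties", "current-snapshot-id", "snapshots", "snapshot-log",
    "metadata-log",
]

# Top-level keys copied from the format-version 2 listing.
V2_KEYS = [
    "format-version", "table-uuid", "location", "last-sequence-number",
    "last-updated-ms", "last-column-id", "current-schema-id", "schemas",
    "default-spec-id", "partition-specs", "last-partition-id",
    "default-sort-order-id", "sort-orders", "properties",
    "current-snapshot-id", "snapshots", "snapshot-log", "metadata-log",
]


def diff_top_level_keys(
    v1_keys: Iterable[str], v2_keys: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (keys only in v1, keys only in v2), each sorted alphabetically."""
    v1, v2 = set(v1_keys), set(v2_keys)
    return sorted(v1 - v2), sorted(v2 - v1)


only_v1, only_v2 = diff_top_level_keys(V1_KEYS, V2_KEYS)
print("only in v1:", only_v1)  # the singular "schema" and "partition-spec" entries
print("only in v2:", only_v2)  # "last-sequence-number"
```

In these two files, the v1 metadata carries the singular "schema" and "partition-spec" entries alongside the "schemas" and "partition-specs" lists, while the v2 metadata drops them and adds "last-sequence-number". The same function works on parsed JSON dicts, since iterating a dict yields its keys.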
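Next, we will insert data into the tables and see how these files change.
Continue with: Iceberg series (1): storage in detail, a first look, part 2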