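Semi-structured data sits between structured and unstructured data. Compared with plain text, it has a certain degree of structure; compared with structured data, its structure varies in complex ways, so it cannot be described conveniently in a fixed, structured form.
Semi-structured data usually contains both the data itself and a description of the data's structure. Common examples are JSON and XML, which carry both the data and its description (metadata). The specific semi-structured characteristics are as follows: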
CREATE TABLE `httplogs` (
`@timestamp` int(11) NULL COMMENT "",
`clientip` varchar(20) NULL COMMENT "",
`request` text NULL COMMENT "",
...
) ENGINE=OLAP
DUPLICATE KEY(`@timestamp`,`clientip`)
PARTITION BY RANGE(`@timestamp`)()
DISTRIBUTED BY HASH(`clientip`) BUCKETS 12
{"@timestamp":1676012713,"clientip":"192.168.1.1","request":"test"}
{
"@timestamp":1676012713,
"clientip":"192.168.1.1",
"request":"test",
"uuid":1
}
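When the data above is written to Doris, Doris detects that the JSON data contains one more column than before, uuid, whose data type is int. Doris then automatically adds a column named uuid of type int to the table schema it maintains. The table structure therefore becomes: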
CREATE TABLE `httplogs` (
`@timestamp` int(11) NULL COMMENT "",
`clientip` varchar(20) NULL COMMENT "",
`request` text NULL COMMENT "",
`uuid` int COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`@timestamp`,`clientip`)
PARTITION BY RANGE(`@timestamp`)()
DISTRIBUTED BY HASH(`clientip`) BUCKETS 12
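The column detection described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Doris implements it: the helper names `infer_column_type` and `detect_new_columns`, the in-memory `schema` dictionary, and the type mapping are all assumptions made for this example, and only top-level keys are considered.

```python
import json


def infer_column_type(value):
    """Map a JSON value to a column type (hypothetical mapping, for illustration only)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int in Python
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    return "text"


def detect_new_columns(schema, row):
    """Return {column_name: column_type} for top-level keys of row missing from schema."""
    return {
        key: infer_column_type(value)
        for key, value in row.items()
        if key not in schema
    }


# Columns of the original httplogs table, kept as a plain dict for this sketch.
schema = {"@timestamp": "int", "clientip": "varchar(20)", "request": "text"}

row = json.loads(
    '{"@timestamp":1676012713,"clientip":"192.168.1.1","request":"test","uuid":1}'
)

new_columns = detect_new_columns(schema, row)
schema.update(new_columns)  # the schema now also tracks uuid
print(new_columns)  # {'uuid': 'int'}
```

Running the detection a second time on the same row returns an empty dictionary, because uuid is already part of the tracked schema.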
{
"@timestamp":1676012713,
"clientip":"192.168.1.1",
"request":"test",
"uuid":"2",
"response":{
"status":0,
"msg":"",
"data":{
"apraise":"0",
"favorite":"0",
"comments":"2",
"pv":202
}
}
}
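When the data above is written, SelectDB automatically expands the response column and the JSON nested inside it, and maintains the corresponding metadata. The modified table structure is: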
CREATE TABLE `httplogs` (
`@timestamp` int(11) NULL,
`clientip` varchar(20) NULL,
`request` text NULL,
`uuid` int(11) NULL COMMENT 'auto change 2023-02-14T15:53:10+08:00[Asia/Shanghai]',
`response.status` int(11) NULL COMMENT 'auto change 2023-02-14T15:56:41+08:00[Asia/Shanghai]',
`response.msg` text NULL COMMENT 'auto change 2023-02-14T15:56:41+08:00[Asia/Shanghai]',
`response.data.apraise` text NULL COMMENT 'auto change 2023-02-14T15:56:41+08:00[Asia/Shanghai]',
`response.data.favorite` text NULL COMMENT 'auto change 2023-02-14T15:56:41+08:00[Asia/Shanghai]',
`response.data.comments` text NULL COMMENT 'auto change 2023-02-14T15:56:41+08:00[Asia/Shanghai]',
`response.data.pv` int(11) NULL COMMENT 'auto change 2023-02-14T15:56:41+08:00[Asia/Shanghai]'
) ENGINE=OLAP
DUPLICATE KEY(`@timestamp`, `clientip`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`clientip`) BUCKETS 12
PROPERTIES (
"persistent" = "false"
);
-- A nested field is queried through its flattened, dotted column name.
SELECT * FROM httplogs WHERE `response.data.pv` = 202;
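The way nested JSON keys turn into dotted column names such as `response.data.pv` can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual expansion logic used by the database: the `flatten` helper is hypothetical, and it only descends into nested objects (arrays are left as plain values).

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict whose keys are dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted path.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


row = {
    "@timestamp": 1676012713,
    "clientip": "192.168.1.1",
    "request": "test",
    "uuid": "2",
    "response": {
        "status": 0,
        "msg": "",
        "data": {"apraise": "0", "favorite": "0", "comments": "2", "pv": 202},
    },
}

flat = flatten(row)
print(flat["response.data.pv"])  # 202
```

Each key of `flat` corresponds to one column of the expanded table, which is why the nested pv value can be filtered on directly through the `response.data.pv` column.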