RegexSerDe

The official example is at:

https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ApacheWeblogData

Apache Weblog Data

The format of the Apache weblog is customizable, but most webmasters use the default.
For the default Apache weblog format, we can create a table with the following command.

More about RegexSerDe can be found here: http://issues.apache.org/jira/browse/HIVE-662

add jar ../build/contrib/hive_contrib.jar;

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
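To see how a pattern of this kind splits a log line into the nine columns, here is a minimal Java sketch. It assumes the form of the regex in which the first three fields are `[^ ]*` and the time field is `\[[^\]]*\]`, and it assumes a row must match the pattern as a whole. The class name, the `parse` helper, and the sample log line are made up for illustration and are not part of Hive.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ApacheLogRegexDemo {
    // Same escaping as in the HiveQL string literal: "\\[" becomes the regex \[
    static final String INPUT_REGEX =
        "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\")"
        + " (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?";

    // Returns one string per capture group, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = Pattern.compile(INPUT_REGEX).matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] cols = new String[m.groupCount()];
        for (int i = 0; i < cols.length; i++) {
            cols[i] = m.group(i + 1); // groups are 1-based; optional groups may be null
        }
        return cols;
    }

    public static void main(String[] args) {
        // Hypothetical sample line in the common log format (no referer/agent)
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        String[] cols = parse(line);
        String[] names = {"host", "identity", "user", "time", "request",
                          "status", "size", "referer", "agent"};
        for (int i = 0; i < cols.length; i++) {
            System.out.println(names[i] + " = " + cols[i]);
        }
    }
}
```

The last two groups sit inside an optional non-capturing group, so a line without referer and agent still matches and those two columns come back as null.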

 

The official issue is https://issues.apache.org/jira/browse/HIVE-167

The official unit test is in the file contrib/src/test/queries/clientnegative/serde_regex.q.

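RegexSerDe parses each record (row) with a regular expression, using Java's Pattern. input.regex is the pattern applied to the row, and each capture group maps to one column, in order. output.format.string describes how a record is serialized, using Java's String.format: String.format(outputFormatString, outputFields);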

outputFormatString = tbl.getProperty("output.format.string");
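The serialization side can be illustrated with a small Java sketch of how positional specifiers such as `%1$s` pick fields out of an argument array. This is not the RegexSerDe source; the class name, the `serializeRow` helper, and the sample values are assumptions made for the example.

```java
class OutputFormatDemo {
    // Hypothetical helper: formats the column values with a format string
    // of the same style as output.format.string.
    static String serializeRow(String outputFormatString, Object... outputFields) {
        return String.format(outputFormatString, outputFields);
    }

    public static void main(String[] args) {
        // %N$s refers to the N-th argument, so fields can be reordered or repeated.
        String inOrder = serializeRow("%1$s %2$s %3$s", "127.0.0.1", "-", "frank");
        String reordered = serializeRow("%3$s@%1$s", "127.0.0.1", "-", "frank");
        System.out.println(inOrder);   // 127.0.0.1 - frank
        System.out.println(reordered); // frank@127.0.0.1
    }
}
```

Because the specifiers are positional, the format string in the table definition (`%1$s` through `%9$s`) writes the nine columns back in their declared order, separated by single spaces.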

 

 

 
