
1. How Hive reads files

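1. Hive uses an InputFormat object to read the file (for plain text tables this is TextInputFormat, the class extended later in this post). It returns the data one line at a time.
2. Hive then uses a SerDe class (LazySimpleSerDe for delimited text tables) to split each line into fields, which map onto the columns of the table.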
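
The two stages can be imitated in plain Java, with no Hive or Hadoop dependency. This is only a rough sketch of the idea: `ReadPipelineSketch`, `readLines`, and `splitFields` are hypothetical names made up for illustration, and the real work is done inside Hive's InputFormat and SerDe classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for Hive's read path; this is not Hive's actual code.
class ReadPipelineSketch {

    // Stage 1 (the InputFormat's role): hand back the file content line by line.
    static List<String> readLines(String fileContent) {
        // \R matches any line break; trailing empty pieces are dropped by split.
        return Arrays.asList(fileContent.split("\\R"));
    }

    // Stage 2 (the SerDe's role): cut one line into fields on a single character.
    static List<String> splitFields(String line, char delimiter) {
        // Pattern.quote keeps characters such as '|' from being read as regex syntax.
        String regex = Pattern.quote(String.valueOf(delimiter));
        return Arrays.asList(line.split(regex, -1));
    }

    public static void main(String[] args) {
        for (String line : readLines("01|zhangsan\n02|lisi\n")) {
            System.out.println(splitFields(line, '|'));
        }
    }
}
```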

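2. Problem: by default the SerDe only supports splitting on a single-character delimiter. When the delimiter is made up of multiple characters, the data can be handled in either of the two ways described in the rest of this post.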
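
To see why this matters, here is a small plain-Java sketch of what happens when a line delimited by `||` is cut on the single character `|`. It assumes the SerDe effectively honors only one delimiter character; `SingleCharSplitProblem` and `splitOnChar` are hypothetical names, not part of Hive.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; it mimics single-character splitting, not Hive itself.
class SingleCharSplitProblem {

    // Splits on exactly one character and keeps empty pieces.
    static List<String> splitOnChar(String line, char delimiter) {
        return Arrays.asList(line.split(Pattern.quote(String.valueOf(delimiter)), -1));
    }

    public static void main(String[] args) {
        // The file uses "||", but only the single character '|' is honored.
        List<String> fields = splitOnChar("01||zhangsan", '|');
        System.out.println(fields.size());              // 3 pieces instead of the intended 2
        System.out.println("[" + fields.get(1) + "]");  // the piece a second column would get is empty
    }
}
```

With a two-column table, the second column would receive the empty piece between the two `|` characters rather than the name.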

1. Use RegexSerDe to extract the fields with a regular expression

  create table t_bi_reg(id string,name string)
  row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
  with serdeproperties(
    'input.regex'='(.*)\\|\\|(.*)',
    'output.format.string'='%1$s %2$s'
  )
  stored as textfile;

  hive>load data local inpath '/root/hivedata/bi.dat' into table t_bi_reg;

  hive>select * from t_bi_reg;
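
The pattern itself can be tried out with `java.util.regex` before it goes into a table definition. The sketch below assumes the HiveQL literal `'(.*)\\|\\|(.*)'` reaches the SerDe as the regex `(.*)\|\|(.*)`; `RegexExtractSketch` and `extract` are hypothetical names, and returning `null` for a non-matching line is a choice made for this sketch only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for experimenting with the pattern; not part of Hive.
class RegexExtractSketch {

    // Assumed to be the same regex as 'input.regex': two capture groups around a literal "||".
    static final Pattern LINE = Pattern.compile("(.*)\\|\\|(.*)");

    // Returns {id, name} when the whole line matches, or null when it does not.
    static String[] extract(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] row = extract("01||zhangsan");
        System.out.println(row[0] + " " + row[1]);
    }
}
```

Because `(.*)` is greedy, a line containing more than one `||` is cut at the last occurrence, so each capture group corresponds to exactly one column and the pattern needs one group per column.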

2. Handle it with a custom InputFormat class.

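How it works: while the InputFormat reads the data, it converts the multi-character delimiter in each line into a single character, so that the usual single-character splitting can then be applied.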
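
Stripped of all Hadoop plumbing, the idea comes down to one string replacement followed by an ordinary split. The following is a minimal sketch under that reading; `DelimiterRewriteSketch`, `normalize`, and `fields` are hypothetical names used only here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the rewrite-then-split idea; no Hadoop classes involved.
class DelimiterRewriteSketch {

    // Collapse the two-character delimiter "||" into the single character '|'.
    static String normalize(String line) {
        return line.replace("||", "|"); // literal replacement, no regex involved
    }

    // After normalizing, a plain single-character split is enough.
    static List<String> fields(String line) {
        return Arrays.asList(normalize(line).split("\\|", -1));
    }

    public static void main(String[] args) {
        System.out.println(fields("01||zhangsan"));
        // Caveat: a '|' inside the data is indistinguishable from the delimiter afterwards.
        System.out.println(fields("01||zhang|san"));
    }
}
```

The trade-off of this approach is that it only works when the single character chosen as the replacement never appears in the field values themselves.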

The custom class:

  // Package taken from the class name used in the CREATE TABLE statement later in this post.
  package cn.itcast.bigdata.hive.inputformat;

  import java.io.IOException;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileSplit;
  import org.apache.hadoop.mapred.InputSplit;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.LineRecordReader;
  import org.apache.hadoop.mapred.RecordReader;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.hadoop.mapred.TextInputFormat;

  public class BiDelimiterInputFormat extends TextInputFormat {
      @Override
      public RecordReader<LongWritable, Text> getRecordReader(
              InputSplit genericSplit, JobConf job, Reporter reporter) throws IOException {
          reporter.setStatus(genericSplit.toString());
          // Wrap the standard line reader so every line can be rewritten before Hive sees it.
          return new MyDemoRecordReader(new LineRecordReader(job, (FileSplit) genericSplit));
      }

      public static class MyDemoRecordReader implements RecordReader<LongWritable, Text> {
          private final LineRecordReader reader;
          // Buffer that receives the raw line from the wrapped reader.
          private final Text text;

          public MyDemoRecordReader(LineRecordReader reader) {
              this.reader = reader;
              this.text = reader.createValue();
          }

          @Override
          public void close() throws IOException {
              reader.close();
          }

          @Override
          public LongWritable createKey() {
              return reader.createKey();
          }

          @Override
          public Text createValue() {
              return new Text();
          }

          @Override
          public long getPos() throws IOException {
              return reader.getPos();
          }

          @Override
          public float getProgress() throws IOException {
              return reader.getProgress();
          }

          @Override
          public boolean next(LongWritable key, Text value) throws IOException {
              if (!reader.next(key, text)) {
                  return false; // no more lines in this split
              }
              // Same as the TextInputFormat reader, plus one replacement step: "||" becomes "|".
              // Note: toLowerCase() also lowercases the data itself; remove it to preserve case.
              String replaced = text.toString().toLowerCase().replaceAll("\\|\\|", "|");
              value.set(replaced);
              return true;
          }
      }
  }

3. Package this class into a jar and put it in the lib folder under the Hive installation directory.
```
  hive>add jar /root/apps/hive/lib/myinput.jar;
```

4. Usage:

Create the table with the following statement:

  hive> create table t_bi(id string,name string)
      > row format delimited
      > fields terminated by '|'
      > stored as inputformat 'cn.itcast.bigdata.hive.inputformat.BiDelimiterInputFormat'
      > outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

  hive> load data local inpath '/root/hivedata/bi.dat' into table t_bi;

  hive> select * from t_bi;
  OK
  01 zhangsan
  02 lisi
