Hive Notes Series, Part 5: An Overview of SerDe in Hive 0.5

I. Background

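1. When processes communicate remotely, they can exchange data of any type, but whatever the type, the data travels over the network as a sequence of bytes. The sender must convert an object into a byte sequence before it can be transmitted, which is called object serialization; the receiver must restore the byte sequence into an object, which is called object deserialization.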
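The round trip described above can be sketched with plain JDK streams. This is only an illustration of the general idea, not how Hive or Hadoop encode rows; the `AccessRecord` class and its fields are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record type, used only for this illustration.
final class AccessRecord {
    final long time;
    final int userId;
    final String url;

    AccessRecord(long time, int userId, String url) {
        this.time = time;
        this.userId = userId;
        this.url = url;
    }

    // Serialization: write the fields, in a fixed order, into a byte sequence.
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeLong(time);
            out.writeInt(userId);
            out.writeUTF(url);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Deserialization: read the fields back in the same order to rebuild the object.
    static AccessRecord fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return new AccessRecord(in.readLong(), in.readInt(), in.readUTF());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        AccessRecord sent = new AccessRecord(1234567891012L, 123456, "http://wiki.apache.org/hadoop/Hive");
        byte[] wire = sent.toBytes();                      // what would travel over the network
        AccessRecord received = AccessRecord.fromBytes(wire);
        System.out.println(received.time + " " + received.userId + " " + received.url);
    }
}
```

The sender and the receiver have to agree on the field order and encoding; that agreement is the role a SerDe plays between the stored bytes and the table's columns.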

2. In Hive, deserialization turns a key/value record into the values of the individual columns of a Hive table.

3. Hive can load data into a table without transforming it first, which saves a great deal of time when processing massive amounts of data.

II. Technical details

1. SerDe is short for Serializer/Deserializer; its purpose is serialization and deserialization.

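2. When creating a table, users can specify either a custom SerDe or one of the SerDes that ship with Hive. A SerDe can define the table's columns and supply the corresponding data for each column.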

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name [(col_name data_type [COMMENT col_comment], ...)] [COMMENT table_comment] [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] [ROW FORMAT row_format] [STORED AS file_format] [LOCATION hdfs_path]

To create a table with a specific SerDe, use the ROW FORMAT row_format clause. For example:

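a. Add the jar. In the Hive client, enter:
hive> add jar /run/serde_test.jar;
Alternatively, run the following from a Linux shell:
${HIVE_HOME}/bin/hive --auxpath /run/serde_test.jar

b. Create the table, passing the fully qualified name of the SerDe class:
create table serde_table row format serde 'hive.connect.TestDeserializer';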

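3. Write the deserializer class TestDeserializer. It implements the three methods of the Deserializer interface:

a) Initialization: initialize(Configuration conf, Properties tbl).

b) Deserialize a Writable and return an Object: deserialize(Writable blob).

c) Get the ObjectInspector for the Object returned by deserialize(Writable blob): getObjectInspector().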


package org.apache.hadoop.hive.serde2;


import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.io.Writable;

public interface Deserializer {
	/**
	 * Initialize the HiveDeserializer.
	 *
	 * @param conf System properties
	 * @param tbl table properties
	 * @throws SerDeException
	 */
	public void initialize(Configuration conf, Properties tbl)
			throws SerDeException;

	/**
	 * Deserialize an object out of a Writable blob. In most cases, the
	 * return value of this function will be constant since the function
	 * will reuse the returned object. If the client wants to keep a copy
	 * of the object, the client needs to clone the returned value by
	 * calling ObjectInspectorUtils.getStandardObject().
	 *
	 * @param blob The Writable object containing a serialized object
	 * @return A Java object representing the contents in the blob.
	 */
	public Object deserialize(Writable blob) throws SerDeException;

	/**
	 * Get the object inspector that can be used to navigate through the
	 * internal structure of the Object returned from deserialize(...).
	 */
	public ObjectInspector getObjectInspector() throws SerDeException;
}
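The Javadoc on deserialize(...) warns that the returned object may be reused between calls, so a caller that keeps a reference to an earlier row can see it change. The sketch below shows that pitfall with plain JDK classes only. `ReusingRowParser` is a hypothetical stand-in, not Hive's Deserializer, and the list copy stands in for the cloning step that the Javadoc attributes to ObjectInspectorUtils.getStandardObject().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for a deserializer that reuses its row object.
// It is NOT Hive's Deserializer interface; all names here are hypothetical.
final class ReusingRowParser {
    // One row object, allocated once and overwritten by every call.
    private final List<Object> row = new ArrayList<>();

    List<Object> deserialize(String line) {
        row.clear();
        row.addAll(Arrays.asList(line.split("\t")));
        return row; // the same instance is returned every time
    }

    public static void main(String[] args) {
        ReusingRowParser parser = new ReusingRowParser();

        List<Object> first = parser.deserialize("1\talice");
        List<Object> firstCopy = new ArrayList<>(first); // copy taken before the next call
        List<Object> second = parser.deserialize("2\tbob");

        System.out.println(first == second); // both references point to one object
        System.out.println(first);           // now shows the contents of the second row
        System.out.println(firstCopy);       // still holds the first row
    }
}
```

Reusing one object per row avoids an allocation for every record, which matters on large tables; the cost is that any caller that buffers rows has to copy them first.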

The following deserializer class splits one line of data into the four fields of the Hive table: time, userid, host, and path. For example:

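// Package chosen to match the class name used in the CREATE TABLE statement
// ('hive.connect.TestDeserializer').
package hive.connect;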

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.Deserializer;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.ObjectInspectorOptions;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class TestDeserializer implements Deserializer {
	private static List<String> FieldNames = new ArrayList<String>();
	private static List<ObjectInspector> FieldNamesObjectInspectors = new ArrayList<ObjectInspector>();
	static {
		FieldNames.add("time");
		FieldNamesObjectInspectors.add(ObjectInspectorFactory
				.getReflectionObjectInspector(Long.class,
						ObjectInspectorOptions.JAVA));
		FieldNames.add("userid");
		FieldNamesObjectInspectors.add(ObjectInspectorFactory
				.getReflectionObjectInspector(Integer.class,
						ObjectInspectorOptions.JAVA));
		FieldNames.add("host");
		FieldNamesObjectInspectors.add(ObjectInspectorFactory
				.getReflectionObjectInspector(String.class,
						ObjectInspectorOptions.JAVA));
		FieldNames.add("path");
		FieldNamesObjectInspectors.add(ObjectInspectorFactory
				.getReflectionObjectInspector(String.class,
						ObjectInspectorOptions.JAVA));
	}

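	@Override
	public Object deserialize(Writable blob) {
		// Only Text rows are supported; anything else yields a null row.
		if (!(blob instanceof Text)) {
			return null;
		}
		String line = ((Text) blob).toString();
		// Fields are tab-separated: time, userid, url.
		String[] field = line.split("\t");
		if (field.length != 3) {
			return null;
		}
		try {
			URL url = new URL(field[2]);
			Long time = Long.valueOf(field[0]);
			Integer userid = Integer.valueOf(field[1]);
			// Column order must match FieldNames: time, userid, host, path.
			List<Object> result = new ArrayList<Object>();
			result.add(time);
			result.add(userid);
			result.add(url.getHost());
			result.add(url.getPath());
			return result;
		} catch (MalformedURLException e) {
			// Malformed URL: report it and skip the row.
			e.printStackTrace();
		} catch (NumberFormatException e) {
			// Non-numeric time or userid: report it and skip the row.
			e.printStackTrace();
		}
		return null;
	}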

	@Override
	public ObjectInspector getObjectInspector() throws SerDeException {
		return ObjectInspectorFactory.getStandardStructObjectInspector(
				FieldNames, FieldNamesObjectInspectors);
	}

	@Override
	public void initialize(Configuration arg0, Properties arg1)
			throws SerDeException {
	}
}
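The field-splitting logic of deserialize(...) can be checked without a Hive installation. The sketch below repeats that logic in a hypothetical helper class, `AccessLogLineParser`, which depends only on the JDK; it assumes the input fields are separated by tab characters.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that mirrors the field-splitting logic, with no Hive dependency.
final class AccessLogLineParser {
    // Returns [time, userid, host, path], or null when the line is malformed.
    static List<Object> parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 3) {
            return null;
        }
        try {
            URL url = new URL(fields[2]);
            List<Object> row = new ArrayList<>();
            row.add(Long.valueOf(fields[0]));
            row.add(Integer.valueOf(fields[1]));
            row.add(url.getHost());
            row.add(url.getPath());
            return row;
        } catch (MalformedURLException | NumberFormatException e) {
            // Bad URL or non-numeric time/userid: treat the line as unusable.
            return null;
        }
    }

    public static void main(String[] args) {
        String line = "1234567891012\t123456\thttp://wiki.apache.org/hadoop/Hive/LanguageManual/UDF";
        System.out.println(parse(line));
        System.out.println(parse("not a valid record"));
    }
}
```

For the sample line, the host comes out as wiki.apache.org and the path as /hadoop/Hive/LanguageManual/UDF; a line without exactly three tab-separated fields yields null.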


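Test against the Hive table data on HDFS. The following is one test record (its three fields are separated by tab characters):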

1234567891012 123456 http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF

hive> add jar /run/jar/merg_hua.jar;
Added /run/jar/merg_hua.jar to class path
hive> create table serde_table row format serde 'hive.connect.TestDeserializer';
Found class for hive.connect.TestDeserializer
OK
Time taken: 0.028 seconds
hive> describe serde_table;
OK
time bigint from deserializer
userid int from deserializer
host string from deserializer
path string from deserializer
Time taken: 0.042 seconds
hive> select * from serde_table;
OK
1234567891012 123456 wiki.apache.org /hadoop/Hive/LanguageManual/UDF
Time taken: 0.039 seconds

III. Summary

1. To use a custom SerDe when creating a Hive table, write a class that implements Deserializer and name it in the ROW FORMAT clause of the CREATE TABLE command.

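2. When processing massive amounts of data, if the data format matches the table structure, Hive's deserialization can read the data as it is, with no conversion step, which saves a great deal of time.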


