Recently I have been testing importing Hive data into Solr. There is a related project on GitHub:
https://github.com/chimpler/hive-solr
It works by implementing an InputFormat and an OutputFormat, using the mapred API to read and write the data.
The test table structure:

```sql
show create table table_in_solr1;

CREATE EXTERNAL TABLE table_in_solr1(
  id string COMMENT 'from deserializer',
  cookie_id_s string COMMENT 'from deserializer',
  first_url_s string COMMENT 'from deserializer',
  warehouse_s string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'com.chimpler.hive.solr.SolrSerDe'
STORED BY
  'com.chimpler.hive.solr.SolrStorageHandler'
WITH SERDEPROPERTIES (
  'serialization.format'='1',
  'solr.column.mapping'='id,cookie_id_s,first_url_s,warehouse_s')
LOCATION
  'hdfs://xxxxxx:9000/bip/hive_warehouse/table_in_solr1'
TBLPROPERTIES (
  'solr.url'='http://xxxxxxx:8888/solr/userinfo',
  'transient_lastDdlTime'='1401357066',
  'solr.buffer.input.rows'='10000',
  'solr.buffer.output.rows'='10000')
```
After loading the data, I ran a count test:

```
select count(1) from table_in_solr1;

Ended Job = job_1401419652664_0010 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1401419652664_0010_m_000000 (and more) from job job_1401419652664_0010

Task with the most failures(4):
-----
Task ID:
  task_1401419652664_0010_m_000000
URL:
  http://nfzm:8088/taskdetails.jsp?jobid=job_1401419652664_0010&tipid=task_1401419652664_0010_m_000000
-----
Diagnostic Messages for this Task:
Error: java.io.IOException: java.lang.NumberFormatException: For input string: ""
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:243)
    at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:526)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:166)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:407)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:160)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:155)
Caused by: java.lang.NumberFormatException: For input string: ""
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Integer.parseInt(Integer.java:470)
    at java.lang.Integer.parseInt(Integer.java:499)
    at com.chimpler.hive.solr.SolrInputFormat.getReadColumnIDs(SolrInputFormat.java:38)
    at com.chimpler.hive.solr.SolrInputFormat.getRecordReader(SolrInputFormat.java:50)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:240)
    ... 9 more

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 2  Reduce: 1   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
```
However, count(<column>) works fine.

Looking at the explain output, count(<column>) adds a Select Operator compared with count(1):

```
Select Operator
  expressions:
        expr: id
        type: string
  outputColumnNames: id
  Group By Operator
    aggregations:
          expr: count(id)
```
The job's error message shows the failure is caused by a number conversion error, and the stack trace points to the offending code:

```
at com.chimpler.hive.solr.SolrInputFormat.getReadColumnIDs(SolrInputFormat.java:38)
```

That is:

```java
public static final String READ_COLUMN_IDS_CONF_STR = "hive.io.file.readcolumn.ids";
public static final String READ_COLUMN_NAMES_CONF_STR = "hive.io.file.readcolumn.names";
private static final String READ_COLUMN_IDS_CONF_STR_DEFAULT = "";
private static final String READ_ALL_COLUMNS = "hive.io.file.read.all.columns";
private static final boolean READ_ALL_COLUMNS_DEFAULT = true;

/**
 * Returns an array of column ids(start from zero) which is set in the given
 * parameter <tt>conf</tt>.
 */
public static List<Integer> getReadColumnIDs(JobConf conf) {
    String skips = conf.get(READ_COLUMN_IDS_CONF_STR, READ_COLUMN_IDS_CONF_STR_DEFAULT);
    String[] list = skips.split(",");
    List<Integer> result = new ArrayList<Integer>(list.length);
    for (String element : list) {
        // it may contain duplicates, remove duplicates
        Integer toAdd = Integer.parseInt(element);
        if (!result.contains(toAdd)) {
            result.add(toAdd);
        }
    }
    return result;
}
```
In other words, during Hive's map phase, when SolrInputFormat reads data, hive.io.file.readcolumn.ids defaults to the empty string. A query like count(1) projects no columns, so the property stays empty; "".split(",") then yields an array containing a single empty string, and Integer.parseInt("") throws the NumberFormatException. With count(id), the projected column index is written into the property, so the parse succeeds.
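A minimal defensive fix would be to treat the empty default as "no columns to read". The sketch below is my own illustration, not a patch from the hive-solr project; it takes the raw property value as a plain string (instead of looking it up on a JobConf) so it runs without Hadoop on the classpath:

```java
import java.util.ArrayList;
import java.util.List;

public class SafeColumnIds {
    // Hypothetical patched variant of SolrInputFormat.getReadColumnIDs.
    // The caller would pass conf.get(READ_COLUMN_IDS_CONF_STR, "").
    public static List<Integer> getReadColumnIDs(String skips) {
        List<Integer> result = new ArrayList<Integer>();
        // Guard: "".split(",") yields {""}, and Integer.parseInt("")
        // throws NumberFormatException -- the crash seen in the count(1) job.
        if (skips == null || skips.trim().isEmpty()) {
            return result;
        }
        for (String element : skips.split(",")) {
            Integer toAdd = Integer.valueOf(element.trim());
            if (!result.contains(toAdd)) { // the list may contain duplicates
                result.add(toAdd);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(getReadColumnIDs(""));        // []
        System.out.println(getReadColumnIDs("0,2,2,3")); // [0, 2, 3]
    }
}
```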
This can be verified with the following code (I replaced the original System.out.println(list), which only prints the array's identity hash, with Arrays.toString, and dropped an unused java.io.IOException import):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ttt {
    public static void main(String[] args) {
        String skips = "";
        String[] list = skips.split(",");
        // prints [""] -- one element, the empty string
        System.out.println(Arrays.toString(list));
        List<Integer> result = new ArrayList<Integer>(list.length);
        for (String element : list) {
            try {
                Integer toAdd = Integer.parseInt(element);
                if (!result.contains(toAdd)) {
                    result.add(toAdd);
                }
            } catch (Exception e) {
                // prints: java.lang.NumberFormatException: For input string: ""
                System.out.println(e);
            }
        }
    }
}
```
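For completeness, the edge cases of String.split are worth spelling out, since they are the root of the crash: splitting the empty string does not return an empty array but a one-element array holding "", while a string consisting only of delimiters does return an empty array (trailing empty tokens are removed by default). A quick demonstration:

```java
import java.util.Arrays;

public class SplitEdgeCases {
    public static void main(String[] args) {
        // No match found: split returns an array containing the original (empty) string.
        System.out.println("".split(",").length);        // 1
        System.out.println("".split(",")[0].isEmpty());  // true
        // Only delimiters: trailing empty tokens are removed, leaving an empty array.
        System.out.println(",".split(",").length);       // 0
        // Normal case.
        System.out.println(Arrays.toString("0,2,3".split(","))); // [0, 2, 3]
    }
}
```

This is why the guard in getReadColumnIDs has to check for the empty string explicitly rather than relying on split to produce zero elements.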