First, look at the Hadoop log format. Each log entry occupies one line, and an entry consists of, in order: date, time, level, the related class, and the message. For example:
2014-01-07 00:31:25,393 INFO org.apache.hadoop.mapred.JobTracker: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down JobTracker at hadoop1/192.168.91.101
************************************************************/
2014-01-07 00:33:42,425 INFO org.apache.hadoop.mapred.JobTracker: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting JobTracker
STARTUP_MSG: host = hadoop1/192.168.91.101
STARTUP_MSG: args = []
STARTUP_MSG: version = 1.1.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782; compiled by 'hortonfo' on Thu Jan 31 02:03:24 UTC 2013
************************************************************/
2014-01-07 00:33:43,305 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2014-01-07 00:33:43,358 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered.
2014-01-07 00:33:43,359 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2014-01-07 00:33:43,359 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: JobTracker metrics system started
2014-01-07 00:33:43,562 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source QueueMetrics,q=default registered.
2014-01-07 00:33:44,118 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered.
2014-01-07 00:33:44,118 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
2014-01-07 00:33:44,119 INFO org.apache.hadoop.mapred.JobTracker: Scheduler configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT, limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)
2014-01-07 00:33:44,120 INFO org.apache.hadoop.util.HostsFileReader: Refreshing hosts (include/exclude) list
2014-01-07 00:33:44,125 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s)
2014-01-07 00:33:44,125 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
2014-01-07 00:33:44,126 INFO org.apache.hadoop.mapred.JobTracker: Starting jobtracker with owner as root
2014-01-07 00:33:44,187 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source RpcDetailedActivityForPort9001 registered.
2014-01-07 00:33:44,187 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source RpcActivityForPort9001 registered.
2014-01-07 00:33:44,188 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
2014-01-07 00:33:44,490 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2014-01-07 00:33:44,805 INFO org.apache.hadoop.http.HttpServer: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
2014-01-07 00:33:44,825 INFO org.apache.hadoop.http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener
This is only an excerpt of the log.
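To make the format concrete, here is a minimal sketch of how one such line could be split into the five fields described above. The class name LogLineParser and the regular expression are illustrative only; they are not part of the program developed below. The sample line is taken from the excerpt.

package com.wzl.hive;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {
    // date  time  level  class:  message
    private static final Pattern LINE = Pattern.compile(
            "^(\\d{4}-\\d{2}-\\d{2}) (\\d{2}:\\d{2}:\\d{2},\\d{3}) (\\w+) (\\S+): (.*)$");

    public static void main(String[] args) {
        String line = "2014-01-07 00:33:43,305 INFO "
                + "org.apache.hadoop.metrics2.impl.MetricsConfig: "
                + "loaded properties from hadoop-metrics2.properties";
        Matcher m = LINE.matcher(line);
        if (m.matches()) {
            System.out.println("date  = " + m.group(1)); // 2014-01-07
            System.out.println("time  = " + m.group(2)); // 00:33:43,305
            System.out.println("level = " + m.group(3)); // INFO
            System.out.println("class = " + m.group(4)); // org.apache...MetricsConfig
            System.out.println("msg   = " + m.group(5)); // loaded properties from ...
        }
    }
}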
2). Program design
The program was developed in Eclipse on a personal machine; it connects to the Hadoop cluster and stores the processed results on a MySQL server.
The SQL that creates the table "hadooplog", which stores the extracted information in MySQL, is as follows:
drop table if exists hadooplog;
create table hadooplog(
    id int(11) not null auto_increment,
    rdate varchar(50) null,
    time varchar(50) default null,
    type varchar(50) default null,
    relateclass tinytext default null,
    information longtext default null,
    primary key (id)
) engine=innodb default charset=utf8;
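If you prefer to create the table from code rather than from the mysql client, a minimal sketch like the following would work, assuming MySQL Connector/J (com.mysql.jdbc.Driver) is on the classpath and the same connection parameters used later in DBHelper. The class name CreateHadoopLogTable is hypothetical.

package com.wzl.hive;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateHadoopLogTable {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://192.168.91.101:3306/hive?useUnicode=true&characterEncoding=UTF8",
                "root", "root");
        Statement stmt = conn.createStatement();
        // Recreate the hadooplog table using the DDL above
        stmt.executeUpdate("drop table if exists hadooplog");
        stmt.executeUpdate("create table hadooplog("
                + "id int(11) not null auto_increment, "
                + "rdate varchar(50) null, "
                + "time varchar(50) default null, "
                + "type varchar(50) default null, "
                + "relateclass tinytext default null, "
                + "information longtext default null, "
                + "primary key (id)) engine=innodb default charset=utf8");
        stmt.close();
        conn.close();
    }
}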
3). Program code
package com.wzl.hive;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/**
 * This class is responsible for establishing the connections to Hive and MySQL.
 * Because each connection is expensive to set up, the class follows the
 * singleton pattern.
 */
class DBHelper {
    private static Connection connToHive = null;
    private static Connection connToMySQL = null;

    private DBHelper() {
    }

    // Get the Hive connection; if it is already initialized, return it directly
    public static Connection getHiveConn() throws SQLException {
        if (connToHive == null) {
            try {
                Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            } catch (ClassNotFoundException err) {
                err.printStackTrace();
                System.exit(1);
            }
            connToHive = DriverManager.getConnection(
                    "jdbc:hive://192.168.91.101:10000/default", "hive", "");
        }
        return connToHive;
    }

    // Get the MySQL connection
    public static Connection getMySQLConn() throws SQLException {
        if (connToMySQL == null) {
            try {
                Class.forName("com.mysql.jdbc.Driver");
            } catch (ClassNotFoundException err) {
                err.printStackTrace();
                System.exit(1);
            }
            // Note: the encoding must be written as UTF8, not UTF-8
            connToMySQL = DriverManager.getConnection(
                    "jdbc:mysql://192.168.91.101:3306/hive?useUnicode=true&characterEncoding=UTF8",
                    "root", "root");
        }
        return connToMySQL;
    }

    public static void closeHiveConn() throws SQLException {
        if (connToHive != null) {
            connToHive.close();
        }
    }

    public static void closeMySQLConn() throws SQLException {
        if (connToMySQL != null) {
            connToMySQL.close();
        }
    }

    public static void main(String[] args) throws SQLException {
        System.out.println(getMySQLConn());
        closeMySQLConn();
    }
}
package com.wzl.hive;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/**
 * Utility class for working with Hive.
 */
class HiveUtil {
    // Create a table
    public static void createTable(String sql) throws SQLException {
        Connection conn = DBHelper.getHiveConn();
        Statement stmt = conn.createStatement();
        // DDL returns no result set, so use execute() rather than executeQuery()
        stmt.execute(sql);
    }

    // Query data by the given condition
    public static ResultSet queryData(String sql) throws SQLException {
        Connection conn = DBHelper.getHiveConn();
        Statement stmt = conn.createStatement();
        return stmt.executeQuery(sql);
    }

    // Load data into the table
    public static void loadData(String sql) throws SQLException {
        Connection conn = DBHelper.getHiveConn();
        Statement stmt = conn.createStatement();
        stmt.execute(sql);
    }

    // Store the Hive query results into MySQL
    public static void hiveToMySQL(ResultSet res) throws SQLException {
        Connection conn = DBHelper.getMySQLConn();
        // A PreparedStatement keeps quotes inside the log text from breaking the insert
        PreparedStatement pstmt = conn.prepareStatement(
                "insert into hadooplog values(0, ?, ?, ?, ?, ?)");
        while (res.next()) {
            String rdate = res.getString(1);
            String time = res.getString(2);
            String type = res.getString(3);
            String relateclass = res.getString(4);
            // Join the three message words with spaces
            String information = res.getString(5) + " " + res.getString(6) + " " + res.getString(7);
            pstmt.setString(1, rdate);
            pstmt.setString(2, time);
            pstmt.setString(3, type);
            pstmt.setString(4, relateclass);
            pstmt.setString(5, information);
            System.out.println(rdate + " " + time + " " + type + " " + relateclass + " " + information);
            pstmt.executeUpdate();
        }
    }
}
package com.wzl.hive;

import java.sql.ResultSet;
import java.sql.SQLException;

public class AnalyszeHadoopLog {
    public static void main(String[] args) throws SQLException {
        StringBuffer sql = new StringBuffer();

        // Step 1: create the table in Hive
        sql.append("create table if not exists loginfo( ");
        sql.append("rdate string, ");
        sql.append("time array<string>, ");
        sql.append("type string, ");
        sql.append("relateclass string, ");
        sql.append("information1 string, ");
        sql.append("information2 string, ");
        sql.append("information3 string) ");
        sql.append("row format delimited fields terminated by ' ' ");
        sql.append("collection items terminated by ',' ");
        sql.append("map keys terminated by ':'");
        System.out.println(sql);
        HiveUtil.createTable(sql.toString());

        // Step 2: load the Hadoop log file
        sql.delete(0, sql.length());
        sql.append("load data local inpath ");
        sql.append("'/usr/local/hadoop/logs/hadoop-root-jobtracker-hadoop1.log'");
        sql.append(" overwrite into table loginfo");
        System.out.println(sql);
        HiveUtil.loadData(sql.toString());

        // Step 3: query the useful information
        sql.delete(0, sql.length());
        sql.append("select rdate,time[0],type,relateclass,");
        sql.append("information1,information2,information3 ");
        sql.append("from loginfo where type='INFO'");
        System.out.println(sql);
        ResultSet res = HiveUtil.queryData(sql.toString());

        // Step 4: transform the query results and save them to MySQL
        HiveUtil.hiveToMySQL(res);

        // Step 5: close the Hive connection
        DBHelper.closeHiveConn();

        // Step 6: close the MySQL connection
        DBHelper.closeMySQLConn();
    }
}
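To see how the loginfo schema carves up a log line, take this line from the excerpt:

2014-01-07 00:33:43,305 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties

With fields terminated by ' ' and collection items terminated by ',', Hive loads it as rdate='2014-01-07', time=['00:33:43','305'] (so time[0] in the query drops the milliseconds), type='INFO', relateclass='org.apache.hadoop.metrics2.impl.MetricsConfig:', and information1/2/3='loaded', 'properties', 'from'. Only the first three words of the message are captured; any tokens beyond the declared columns are discarded by Hive.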
4). Notes before running
- The Hive remote service port must be open before the program runs. Start the service with: nohup hive --service hiveserver &
  If this command has not been run, the program typically fails with: Could not establish connection to 192.168.91.101:10000/default: java.net.ConnectException: Connection refused: connect
- The hadooplog table must already exist in MySQL.
- MySQL must accept connections from this machine. Run: grant all privileges on *.* to root@'%' identified by 'root';
  This allows connections to the MySQL database from any IP. If this command has not been run, connecting from Java to MySQL on Linux typically fails with: Access denied for user 'root'@'192.168.91.1' (using password: YES)
- After a run, the results can be checked from the mysql client:
mysql> use hive;
mysql> show tables;
mysql> select * from hadooplog;
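The same check can also be scripted in Java by reusing DBHelper. A minimal sketch (the class name VerifyHadoopLog and the limit of 10 rows are illustrative):

package com.wzl.hive;

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class VerifyHadoopLog {
    public static void main(String[] args) throws SQLException {
        Connection conn = DBHelper.getMySQLConn();
        Statement stmt = conn.createStatement();
        // Print a few of the rows written by HiveUtil.hiveToMySQL()
        ResultSet res = stmt.executeQuery(
                "select id, rdate, time, type, relateclass from hadooplog limit 10");
        while (res.next()) {
            System.out.println(res.getInt(1) + " | " + res.getString(2) + " | "
                    + res.getString(3) + " | " + res.getString(4) + " | "
                    + res.getString(5));
        }
        DBHelper.closeMySQLConn();
    }
}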
5). Lessons learned
This example operates on both the Hive data warehouse and a MySQL database. Although both are accessed through the JDBC interface, they differ in a number of places, and this hands-on example illustrates the similarities and differences between Hive and a relational database quite well.
If we implemented this directly with MapReduce, it would run more efficiently than going through Hive, since Hive ultimately executes MapReduce jobs underneath, but the program's complexity and the amount of code would grow considerably; for developers unfamiliar with MapReduce programming, that is a real obstacle. Hive strikes a balance between the two approaches: processing efficiency stays reasonably high, and the implementation is comparatively simple, which makes life easier for developers with a traditional relational-database background. This is why Hive is currently adopted by so many commercial organizations.