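Installing and Using Hive 2.3.2 on a CentOS 7.3.x + Hadoop v2.9.0 Cluster

Preface

Apache Hive requires an existing Hadoop cluster. Hive only needs to be installed on the Hadoop NameNode host; it does not have to be installed on the DataNode machines.

Note also that although editing the configuration files does not require Hadoop to be running, this article uses Hadoop's hdfs commands, and those only work while Hadoop is running. Starting Hive also requires a healthy, running Hadoop, so it is best to start the Hadoop cluster first.

The software versions used in this installation are:

  1. CentOS v7.3.x
  2. Hadoop v2.9.0 cluster
  3. JDK 8
  4. Hive 2.3.2

  For how to install a Hadoop cluster on CentOS 7.3.x, see my blog post: CentOS 7.3.x + Hadoop 2.9.0 Cluster Setup in Practice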

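1. Download Apache Hive

Download page: http://hive.apache.org/downloads.html

Pick one of the mirror links listed on that page, preferring a mirror in China. This installation uses version 2.3.2, downloaded from: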

http://ftp.cuhk.edu.hk/pub/packages/apache.org/hive/hive-2.3.2/apache-hive-2.3.2-bin.tar.gz

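2. Install Apache Hive

2.1. Upload and extract

 Copy apache-hive-2.3.2-bin.tar.gz to the Hadoop NameNode host and extract it under /opt.

# cp apache-hive-2.3.2-bin.tar.gz /opt
# cd /opt ; tar zxvf apache-hive-2.3.2-bin.tar.gz

2.2. Configure environment variables

# vim /etc/profile
#Append the following two lines to the end of the file:
export HIVE_HOME=/opt/apache-hive-2.3.2-bin/
export PATH=$PATH:$HIVE_HOME/bin
#Reload the profile so the variables take effect in the current shell:
# . /etc/profile

2.3. Configure Hive for Hadoop HDFS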

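2.3.1 hive-site.xml configuration

Go to $HIVE_HOME/conf and copy hive-default.xml.template to a new file named hive-site.xml:

# cd $HIVE_HOME/conf ; cp hive-default.xml.template hive-site.xml

  Set the following properties in hive-site.xml. The directories and addresses are the ones used in this environment; change them to suit your own.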


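  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/data/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/data/hive/tmp</value>
    <description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/&lt;username&gt; is created, with ${hive.scratch.dir.permission}.</description>
  </property>
  <property>
    <name>hive.druid.broker.address.default</name>
    <value>10.70.27.12:8082</value>
    <description>
      Address of the Druid broker. If we are querying Druid from Hive, this address needs to be
      declared
    </description>
  </property>
  <property>
    <name>hive.druid.coordinator.address.default</name>
    <value>10.70.27.8:8081</value>
    <description>Address of the Druid coordinator. It is used to check the load status of newly created segments</description>
  </property>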


Use the hdfs command to create the /data/hive/warehouse directory named in the configuration above:
#Create the directory /data/hive/warehouse
# $HADOOP_HOME/bin/hdfs dfs -mkdir -p /data/hive/warehouse
#Grant read/write permission on the new directory
# $HADOOP_HOME/bin/hdfs dfs -chmod 777 /data/hive/warehouse
#Check the resulting permissions
# $HADOOP_HOME/bin/hdfs dfs -ls /data/hive
Found 1 items
drwxrwxrwx   - root supergroup          0 2018-03-19 20:25 /data/hive/warehouse

#Create the directory /data/hive/tmp in the same way
# $HADOOP_HOME/bin/hdfs dfs -mkdir -p /data/hive/tmp
#Grant read/write permission on /data/hive/tmp
# $HADOOP_HOME/bin/hdfs dfs -chmod 777 /data/hive/tmp
#Check the directories just created
# $HADOOP_HOME/bin/hdfs dfs -ls /data/hive/
Found 2 items
drwxrwxrwx   - root supergroup          0 2018-03-19 20:32 /data/hive/tmp
drwxrwxrwx   - root supergroup          0 2018-03-19 20:25 /data/hive/warehouse
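2.3.2 Change the temporary directories in $HIVE_HOME/conf/hive-site.xml

Edit $HIVE_HOME/conf/hive-site.xml as follows:
    1. Replace every ${system:java.io.tmpdir} in the file with /opt/apache-hive-2.3.2-bin/tmp
    2. Replace every ${system:user.name} in the file with root
Then create the tmp directory that the first replacement points to:
[root@apollo conf]# cd $HIVE_HOME
[root@apollo hive]# mkdir tmp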
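
The two replacements in this step can also be scripted instead of done by hand in an editor. The following is only a sketch: the function name fix_hive_placeholders is hypothetical (it is not part of Hive), and it assumes the replacement values contain no '#' or '&' characters, since those are special to sed here.

```shell
# Hypothetical helper: replace the two placeholders in a hive-site.xml file.
# Arguments: <file> <tmp dir> <user name>
fix_hive_placeholders() {
  file="$1"; tmpdir="$2"; user="$3"
  # '#' is the sed delimiter because the tmp dir contains '/'.
  # \$ and \. make sed match a literal dollar sign and literal dots.
  sed -e 's#\${system:java\.io\.tmpdir}#'"$tmpdir"'#g' \
      -e 's#\${system:user\.name}#'"$user"'#g' \
      "$file" > "$file.new" && mv "$file.new" "$file"
}

# Example call with the values used in this article (not executed here):
# fix_hive_placeholders "$HIVE_HOME/conf/hive-site.xml" /opt/apache-hive-2.3.2-bin/tmp root
```

Writing to a temporary file and moving it into place avoids relying on the in-place flag of sed, whose syntax differs between implementations.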

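2.4 Install and configure MySQL

2.4.1. Install MySQL

    For installing MySQL on CentOS 7, see: Installing MySQL 5.7 from RPM Packages on CentOS 7. The steps are not repeated here.

2.4.2. Copy the MySQL driver into Hive's lib directory:

Download MySQL Connector/J from the official site:

https://dev.mysql.com/downloads/connector/j/

This article uses mysql-connector-java-5.1.46.tar.gz. Copy the driver jar into Hive as follows: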

# tar zxvf  mysql-connector-java-5.1.46.tar.gz
# cd mysql-connector-java-5.1.46; cp mysql-connector-java-5.1.46.jar $HIVE_HOME/lib
2.4.3 Edit the database settings in hive-site.xml

Edit $HIVE_HOME/conf/hive-site.xml as follows.

Search for javax.jdo.option.ConnectionURL and set its value to the MySQL address:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://10.70.27.12:3306/hive?createDatabaseIfNotExist=true</value>
      <description>
        JDBC connect string for a JDBC metastore.
        To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
        For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
      </description>
    </property>

Search for javax.jdo.option.ConnectionDriverName and set its value to the MySQL driver class name:

    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
      <description>Driver class name for a JDBC metastore</description>
    </property>

Search for javax.jdo.option.ConnectionUserName and set its value to the MySQL login name:

    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive</value>
      <description>Username to use against metastore database</description>
    </property>

Search for javax.jdo.option.ConnectionPassword and set its value to the MySQL login password:

    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hive888</value>
      <description>password to use against metastore database</description>
    </property>

Search for hive.metastore.schema.verification and set its value to false:

    <property>
      <name>hive.metastore.schema.verification</name>
      <value>false</value>
      <description>
        Enforce metastore schema version consistency.
        True: Verify that version information stored in is compatible with one from Hive jars.  Also disable automatic
              schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures
              proper metastore schema migration. (Default)
        False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
      </description>
    </property>
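
hive-site.xml is a very large file, so after the five edits it can help to read the values back and confirm each one landed in the right property. The following is a rough sketch, not an XML parser: the function name get_hive_property is hypothetical, and it assumes each <name> and <value> element sits on its own line, as in the file copied from hive-default.xml.template.

```shell
# Hypothetical helper: print the <value> that follows a given <name>.
# Arguments: <file> <property name>
get_hive_property() {
  awk -v key="<name>$2</name>" '
    # Remember that the wanted <name> line has been seen.
    index($0, key) { found = 1; next }
    # The first <value> line after it holds the setting; strip the tags.
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }
  ' "$1"
}

# Example calls (not executed here):
# get_hive_property "$HIVE_HOME/conf/hive-site.xml" javax.jdo.option.ConnectionURL
# get_hive_property "$HIVE_HOME/conf/hive-site.xml" hive.metastore.schema.verification
```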
2.4.4 Create hive-env.sh in $HIVE_HOME/conf

# cd $HIVE_HOME/conf
#Copy hive-env.sh.template to hive-env.sh
# cp hive-env.sh.template hive-env.sh
#Open hive-env.sh and add the following lines
# vim hive-env.sh
export HADOOP_HOME=/opt/hadoop-2.9.0
export HIVE_CONF_DIR=/opt/apache-hive-2.3.2-bin/conf
export HIVE_AUX_JARS_PATH=/opt/apache-hive-2.3.2-bin/lib

3. Start and test

3.1. Initialize the MySQL database

First log in to MySQL as root to create the user, the database, and the grants. After logging in, run the following statements:
create user 'hive'@'%' identified by 'hive888';
create database hive DEFAULT CHARSET utf8 COLLATE utf8_general_ci;
GRANT ALL ON hive.* TO 'hive'@'%';
flush privileges;
quit;

Then go to $HIVE_HOME/bin
# cd $HIVE_HOME/bin
#Initialize the metastore schema:
# schematool -initSchema -dbType mysql

Output:
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/opt/apache-hive-2.3.2-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/opt/hadoop-2.9.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    Metastore connection URL:        jdbc:mysql://10.70.27.12:3306/hive?createDatabaseIfNotExist=true
    Metastore Connection Driver :    com.mysql.jdbc.Driver
    Metastore connection User:       root
    Starting metastore schema initialization to 2.3.0
    Initialization script hive-schema-2.3.0.mysql.sql
    Initialization script completed
    schemaTool completed 

After it succeeds, check the MySQL database:
# mysql -uroot -p
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| BASEDB             |
| PMDB               |
| REGISTRY_DB        |
| REGISTRY_LOCAL1    |
| REGISTRY_LOCAL2    |
| WSO2_USER_DB       |
| baseservice        |
| devicecmd          |
| druid              |
| hive               |
| mysql              |
| performance_schema |
| report             |
| rule               |
| test               |
+--------------------+

mysql> use hive;
Database changed

mysql> show tables;


3.2. Start Hive

 # $HIVE_HOME/bin/hive

.....

hive>

3.3. Test

3.3.1. List the functions:

hive> show functions;
OK
...
hive> describe database bigtreetrial;
OK
bigtreetrial    hdfs://hadoopServer3:9000/data/hive/warehouse/bigtreetrial.db   root    USER
Time taken: 0.02 seconds, Fetched: 1 row(s)
hive> 
3.3.2. Show details of the sum function:
hive> desc function sum;
OK
sum(x) - Returns the sum of a set of numbers
Time taken: 0.008 seconds, Fetched: 1 row(s)
3.3.3. Create a test database and table
#Create the database
hive> create database bigtreeTrial;
#Create the table
hive> use bigtreeTrial;
hive> create table student(id int, name string) row format delimited fields terminated by '\t';
hive> desc student;
OK
id                      int
name                    string
Time taken: 0.114 seconds, Fetched: 2 row(s)
hive> select * from student;
OK
Time taken: 1.089 seconds
3.3.4. Load a file into the table
3.3.4.1. Create a file under $HIVE_HOME
# cd $HIVE_HOME
Create the file student.dat
# touch student.dat
Add the following content to the file
[root@apollo hive]# vim student.dat
001     daniel
002     bill
003     bruce
004     xin
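Note: id and name must be separated by a TAB character, not spaces, because the table was created with terminated by '\t' (if copy and paste turns the TAB into spaces, type the TAB by hand). The file must not contain blank lines, otherwise the load below stores NULL rows in the table. The file must also use Unix line endings: if you edit it with a text editor on Windows and then upload it to the server, convert it from Windows format to Unix format first, for example with Notepad++.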
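
To sidestep the copy and paste problem entirely, the file can be generated and checked from the shell. This is a minimal sketch; the two function names are hypothetical, and the check only covers the three rules in the note (TAB separator, no blank lines, no Windows carriage returns).

```shell
# Hypothetical helper: write the four sample rows with a real TAB
# between id and name. printf reuses the format for each pair of arguments.
write_student_file() {
  printf '%s\t%s\n' 001 daniel 002 bill 003 bruce 004 xin > "$1"
}

# Hypothetical helper: succeed only if every line has exactly two
# TAB-separated fields and contains no carriage return.
# A blank line has zero fields, so it fails the check as well.
check_student_file() {
  awk -F'\t' 'NF != 2 || /\r/ { bad = 1 } END { exit bad }' "$1"
}

# Example calls (not executed here):
# write_student_file "$HIVE_HOME/student.dat"
# check_student_file "$HIVE_HOME/student.dat" && echo "student.dat looks loadable"
```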
3.3.4.2. Load the data
hive> load data local inpath '/opt/apache-hive-2.3.2-bin/student.dat' into table bigtreeTrial.student;
Loading data to table bigtreetrial.student
OK
Time taken: 4.844 seconds
3.3.4.3 Verify that the data was loaded
hive> use bigtreeTrial;
OK
Time taken: 0.022 seconds
hive> select * from student;
OK
1       daniel
2       bill
3       bruce
4       xin
Time taken: 1.143 seconds, Fetched: 4 row(s)
3.3.4.4 View the data in HDFS
    # $HADOOP_HOME/bin/hdfs dfs -ls  /data/hive/warehouse
    Found 1 items
    drwxrwxrwx   - root supergroup          0 2018-03-20 11:40 /data/hive/warehouse/bigtreetrial.db


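3.4. View the newly written HDFS data in the web UI

    Open the following link (the Hadoop NameNode) in a browser to see Hive's data in HDFS.

 http://10.70.27.3:50070/explorer.html#/data/hive/warehouse

   Note: open http://10.70.27.3:50070 first, choose Utilities -> Browse File System from the rightmost menu, enter / and click Go. From there you can browse HDFS step by step.

3.5. View the table in the MySQL hive database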

mysql> select * from hive.TBLS;
+--------+-------------+-------+------------------+-------+-----------+-------+----------+---------------+--------------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE      | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | IS_REWRITE_ENABLED |
+--------+-------------+-------+------------------+-------+-----------+-------+----------+---------------+--------------------+--------------------+--------------------+
|      1 |  1521517202 |     6 |                0 | root  |         0 |     1 | student  | MANAGED_TABLE | NULL               | NULL               |                    |
+--------+-------------+-------+------------------+-------+-----------+-------+----------+---------------+--------------------+--------------------+--------------------+

1 row in set (0.00 sec)


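4. Build and patch (optional)

This step is unrelated to installation and configuration.
After this installation, Hive turned out to have a problem talking to Druid, and fixing it requires patching Hive. No official Hive release contained the patch at the time, so it had to be applied by hand.
The symptom was as follows:

    Druid broker log
    ==============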
    2018-03-23T03:20:00,992 INFO [qtp2119918107-144] io.druid.java.util.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2018-03-23T03:20:00.992Z","service":"druid/broker","host":"10.70.27.12 :8082","version":"0.12.0","metric":"query/bytes","value":389,"context":"{\"queryId\":\"ae955617-7f55-4db6-a239-16a8acc85316\"}","dataSource":"druid_metrics","duration":"PT9223372036854775.807S","hasFilters":"false","id":"ae955617-7f55-4db6-a239-16a8acc85316","interval":["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"],"remoteAddress":"10.70.27.3","success":"true","type":"segmentMetadata"}]

    2018-03-23T03:17:24,459 ERROR [qtp2119918107-147] com.sun.jersey.spi.container.ContainerResponse - The RuntimeException could not be mapped to a response, re-throwing to the HTTP container
    java.lang.IllegalArgumentException:  Invalid format: "1900-01-01T08:05:43.000 08:05:43" is malformed at " 08:05:43"
            at org.joda.time.format.DateTimeFormatter.parseDateTime(DateTimeFormatter.java:945) ~[joda-time-2.9.9.jar:2.9.9]
            at org.joda.time.convert.StringConverter.setInto(StringConverter.java:212) ~[joda-time-2.9.9.jar:2.9.9]
            at org.joda.time.base.BaseInterval.<init>(BaseInterval.java:200) ~[joda-time-2.9.9.jar:2.9.9]
            at org.joda.time.Interval.<init>(Interval.java:289) ~[joda-time-2.9.9.jar:2.9.9]
            at io.druid.java.util.common.Intervals.of(Intervals.java:38) ~[java-util-0.12.0.jar:0.12.0]
            at io.druid.server.ClientInfoResource.getQueryTargets(ClientInfoResource.java:303) ~[druid-server-0.12.0.jar:0.12.0]

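Fix:
    https://issues.apache.org/jira/browse/HIVE-16576

    The fix only ships in Hive 3.0.0, so for now the patch has to be applied by hand, as follows.

4.1 Download the Hive source code
http://www.apache.org/dyn/closer.cgi/hive/
This installation uses apache-hive-2.3.2-src.tar.gz. Put the downloaded source package on a CentOS Linux host and extract it:
# tar zxvf apache-hive-2.3.2-src.tar.gz

The patch needed here is attached to:
https://issues.apache.org/jira/browse/HIVE-16576
Edit the source code as the patch describes and save the changes, then move on to the build step.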

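4.2 Build from source

 Note: JDK 8 and Maven must be installed on this machine.

1. Optionally add the following mirror to Maven's /usr/share/maven/conf/settings.xml to speed up the build.

  <mirrors>
    <mirror>
      <id>alimaven</id>
      <name>aliyun maven</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
  </mirrors>

2. Run the build:

# cd apache-hive-2.3.2-src; mvn clean package -Pdist -DskipTests

The build takes quite a while. When it finishes:

# cd ./packaging/target/

This directory now contains the newly built apache-hive-2.3.2-bin.tar.gz.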


5. (Optional) Integrating Hive with Druid (0.9.x and later)
  5.1 Integration JIRA: https://issues.apache.org/jira/browse/HIVE-14217
  5.2 Official integration page: https://cwiki.apache.org/confluence/display/Hive/Druid+Integration

Step 1: Configure and start the Tranquility server

 1) Download tranquility-distribution-0.8.2.tar to /opt/

 2) # cd /opt ; tar xvf tranquility-distribution-0.8.2.tar

 3) # cd /opt/tranquility-distribution-0.8.2/conf

      vi server.json

{
  "dataSources" : {
    "pageviews" : {
      "spec" : {
        "dataSchema" : {
          "dataSource" : "pageviews",
          "parser" : {
            "type" : "string",
            "parseSpec" : {
              "timestampSpec" : {
                "format": "auto",
                "column": "time"
              },

              "dimensionsSpec" : {
               "dimensions": ["url", "user"]
              },

              "format" : "json"
            }
          },
          "granularitySpec" : {
            "type" : "uniform",
            "segmentGranularity" : "hour",
            "queryGranularity" : "none"
          },
          "metricsSpec" : [
                          {"name": "views", "type": "count"},
                          {"name": "latencyMs", "type": "doubleSum", "fieldName": "latencyMs"}
          ]
        },
        "ioConfig" : {
          "type" : "realtime"
        },
        "tuningConfig" : {
          "type" : "realtime",
          "maxRowsInMemory" : "100000",
          "intermediatePersistPeriod" : "PT10M",
          "windowPeriod" : "PT10M"
        }
      },
      "properties" : {
        "task.partitions" : "1",
        "task.replicants" : "1"
      }
    }
  },
  "properties" : {
    "zookeeper.connect" : "10.70.27.8:2181,10.70.27.10:2181,10.70.27.12:2181",
    "druid.discovery.curator.path" : "/druid/discovery",
    "druid.selectors.indexing.serviceName" : "druid/overlord",
    "http.port" : "8200",
    "http.threads" : "8"
  }
}

Start the Tranquility server

#  cd /opt/tranquility-distribution-0.8.2 ; ./bin/tranquility server conf/server.json

....

2018-03-28 02:00:24,210 [main] INFO  o.e.jetty.server.ServerConnector - Started ServerConnector@406ca9fc{HTTP/1.1}{0.0.0.0:8200}
2018-03-28 02:00:24,210 [main] INFO  org.eclipse.jetty.server.Server - Started @3868ms


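Step 2: Send data to the Tranquility server

POST http://10.70.27.8:8200/v1/post/pageviews

 // 10.70.27.8 is the host running the Tranquility server; pageviews is the data source name from the configuration file above.

Request body (raw, Content-Type text/plain):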

{"time": "2018-03-27T12:42:49Z", "url": "/foo/bar", "user": "billhongbin", "latencyMs": 45}
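
The same request can be sent from the command line with curl instead of a REST client. This is only a sketch under a few assumptions: the function names make_pageview and post_pageview are hypothetical, curl is assumed to be installed, and the event is stamped with the current UTC time because, as far as I understand, events whose timestamp falls outside the windowPeriod (PT10M in server.json above) are likely to be dropped.

```shell
# Hypothetical helper: print one pageview event as JSON,
# stamped with the current UTC time.
# Arguments: <url> <user> <latencyMs>
make_pageview() {
  printf '{"time": "%s", "url": "%s", "user": "%s", "latencyMs": %s}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3"
}

# Hypothetical helper: POST one event to the Tranquility server.
# Arguments: <endpoint> <url> <user> <latencyMs>
# The body is sent as text/plain, matching the request shown above.
post_pageview() {
  make_pageview "$2" "$3" "$4" |
    curl -s -X POST -H 'Content-Type: text/plain' --data-binary @- "$1"
}

# Example call with the address used in this article (not executed here):
# post_pageview http://10.70.27.8:8200/v1/post/pageviews /foo/bar billhongbin 45
```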


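Step 3: Check the Druid task

http://<overlord server IP>:8090/console.html

The console should show the task as completed.

Step 4: Download Hive and configure its Druid settings (the hive.druid.* properties in hive-site.xml, see section 2.3.1)


Step 5: Query the data from Hive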

# /opt/apache-hive-2.3.2-bin/bin/hive

hive> show databases;
OK
bigtreetrial
default
Time taken: 3.255 seconds, Fetched: 2 row(s)

hive> use bigtreetrial;

hive> CREATE EXTERNAL TABLE bill_druid_table
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "pageviews");

hive> describe formatted bill_druid_table;

OK
# col_name              data_type               comment             
                 
__time                  timestamp               from deserializer   
latencyms               string                  from deserializer   
url                     string                  from deserializer   
user                    string                  from deserializer   
views                   bigint                  from deserializer   
                 
# Detailed Table Information             
Database:               bigtreetrial             
Owner:                  root                     
CreateTime:             Tue Mar 27 20:48:43 CST 2018     
LastAccessTime:         UNKNOWN                  
Retention:              0                        
Location:               hdfs://hadoopServer3:9000/data/hive/warehouse/bigtreetrial.db/bill_druid_table   

Table Type:             EXTERNAL_TABLE           

Table Parameters:                
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        EXTERNAL                TRUE                
        druid.datasource        pageviews           
        numFiles                0                   
        numRows                 0                   
        rawDataSize             0                   
        storage_handler         org.apache.hadoop.hive.druid.DruidStorageHandler
        totalSize               0                   

        transient_lastDdlTime   1522154923 

# Storage Information            
SerDe Library:          org.apache.hadoop.hive.druid.serde.DruidSerDe    
InputFormat:            null                     
OutputFormat:           null                     
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        serialization.format    1                   

Time taken: 0.046 seconds, Fetched: 37 row(s)

hive> select * from bill_druid_table;
OK
2018-03-28 11:37:04     NULL    /datang/machine billtang        1
2018-03-28 11:37:04     NULL    /datang/machine tiger   1
2018-03-28 12:42:15     NULL    /datang/machine billtang        1
2018-03-28 12:48:15     NULL    /datang/machine billtang        1
2018-03-28 12:48:15     NULL    /sina/machine   bigtree 1
Time taken: 2.037 seconds, Fetched: 5 row(s)

