Big Data Cluster Environment Setup

Contents

      • Big Data Environment Preparation
      • MySQL Installation and Deployment
        • 1. Check the system's bundled MySQL RPM packages
        • 2. Install the MySQL database
      • Hadoop Installation and Deployment
        • 1. Hadoop download links
        • 2. Hadoop installation and deployment
      • YARN Installation and Deployment
        • 1.1 Configure parameters as follows
        • 1.2 Start ResourceManager daemon and NodeManager daemon
        • 1.3 Check the ResourceManager process
        • 2. Run an example job to test YARN
          • 2.1 Locate the example jar for testing YARN
          • 2.2 Check the command help to confirm how to run a jar
          • 2.3 Run the jar:
          • 2.4 The command prompts for arguments; we use the wordcount example, so run it again
          • 2.5 Create an input file
          • 2.6 Run again
          • 2.7 Check the results
        • 3. Change the hostname:
          • 3.1 Check the current hostname
          • 3.2 Check the command help:
          • 3.3 The command to use:
      • Hive Installation and Deployment
        • Viewing Hive Logs
        • Hive Commands:
      • Sqoop Installation and Deployment
        • RDBMS TO HDFS(Import)
          • Full-table import
          • Import specific columns
          • Query-based import
        • Filtered import
        • Incremental import
        • RDBMS To Hive(Import)
        • HDFS To RDBMS(Export)
        • Hive To RDBMS(Export)

Big Data Environment Preparation

The following sets up a single-node test environment, intended for learning and testing only.

  • 1. Uninstall the OpenJDK bundled with the Linux system (see the sketch below):

    Run rpm -qa | grep java to list the bundled JDK packages, then run rpm -e --nodeps *** to remove each of them. Note: packages whose names contain noarch do not need to be uninstalled.
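
    A minimal removal sketch; the noarch filter follows the note above, but always review the list that rpm -qa prints before deleting anything:

    rpm -qa | grep java                                               # list the bundled JDK packages
    rpm -qa | grep java | grep -v noarch | xargs -r rpm -e --nodeps   # remove all except the noarch ones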

  • 2. Transfer files to and from the Linux machine:

    Install lrzsz: yum -y install lrzsz

  • 3. Enable remote root login on CentOS 7

As root, run vi /etc/ssh/sshd_config and remove the "#" from the line
#PermitRootLogin yes
so that it reads:
PermitRootLogin yes
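
For the change to take effect, restart the SSH service (CentOS 7 uses systemd):

systemctl restart sshd
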
  • 4. Configure a static IP address

Use ifconfig to find the name of the current network interface; here it is ens33, so the config file is ifcfg-ens33.
The IP configuration files are under /etc/sysconfig/network-scripts.

vim ifcfg-ens33

The main settings to modify:

BOOTPROTO="static"         # use a static IP address (the default is dhcp)
IPADDR="192.168.52.100"    # the static IP address
NETMASK="255.255.255.0"    # subnet mask
GATEWAY="192.168.52.10"    # gateway address
DNS1="192.168.52.10"       # DNS server

The full configuration file:

TYPE="Ethernet"
PROXY_METHOD="none"
BROWSER_ONLY="no"
BOOTPROTO="static"         # use a static IP address (the default is dhcp)
IPADDR="192.168.52.50"     # the static IP address
NETMASK="255.255.255.0"    # subnet mask
GATEWAY="192.168.52.10"    # gateway address
DNS1="192.168.52.10"       # DNS server
DEFROUTE="yes"
IPV4_FAILURE_FATAL="no"
IPV6INIT="yes"
IPV6_AUTOCONF="yes"
IPV6_DEFROUTE="yes"
IPV6_FAILURE_FATAL="no"
IPV6_ADDR_GEN_MODE="stable-privacy"
NAME="ens33"
UUID="95b614cd-79b0-4755-b08d-99f1cca7271b"
DEVICE="ens33"
ONBOOT="yes"             # bring the interface up at boot
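
After saving the file, restart the network service so the static address takes effect (a quick sketch):

systemctl restart network
ip addr show ens33        # verify the new address
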
  • 5. Disable the firewall on CentOS 7

Check whether the firewall is running: firewall-cmd --state

Stop the firewall: systemctl stop firewalld.service

Disable the firewall at boot: systemctl disable firewalld.service

  • 6. Disable SELinux on CentOS 7
vim /etc/selinux/config

Change SELINUX=enforcing to SELINUX=disabled
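
The change in /etc/selinux/config only takes effect after a reboot; to switch SELinux off for the current session as well, a quick sketch:

setenforce 0        # switch to permissive mode immediately
getenforce          # should now report Permissive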

  • 7. Install the JDK
    Install the JDK under /usr/java

    mkdir /usr/java
    # extract the tarball
    tar -xzvf jdk-8u45-linux-x64.tar.gz -C /usr/java/
    # remember to fix the owner and group
    chown -R root:root /usr/java/jdk1.8.0_45
    

    Configure the environment variables:
    vim /etc/profile

    export JAVA_HOME=/usr/java/jdk1.8.0_45
    export PATH=$JAVA_HOME/bin:$PATH
    
    source /etc/profile
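
    Verify that the new JDK is picked up:

    which java        # should print /usr/java/jdk1.8.0_45/bin/java
    java -version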
    

MySQL Installation and Deployment

1. Check the system's bundled MySQL RPM packages

rpm -qa | grep mysql
rpm -e mysql-libs-5.1.73-8.el6_8.x86_64 --nodeps

2. Install the MySQL database

Step 1: install the MySQL packages online

yum  install  mysql  mysql-server  mysql-devel

Step 2: start the MySQL service

/etc/init.d/mysqld start

Step 3: run the bundled setup script

/usr/bin/mysql_secure_installation

Step 4: enter the MySQL client and grant privileges

 grant all privileges on *.* to 'root'@'%' identified by '123456' with grant option;
 flush privileges;

Step 5: log in to MySQL and verify that it works
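
A quick check, assuming the root password configured above (123456):

mysql -uroot -p123456 -e "show databases;"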

Hadoop Installation and Deployment

1. Hadoop download links

Apache Hadoop: https://hadoop.apache.org/docs/r2.10.0/hadoop-project-dist/hadoop-common/SingleCluster.html
Cloudera Hadoop (CDH): http://archive.cloudera.com/cdh5/cdh/5/
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.16.2.tar.gz

If you run into a problem with the current Hadoop version, check the changes.log to see whether a later release already fixes it:
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.16.2-changes.log

2. Hadoop installation and deployment

2.1 Create the hadoop user

useradd hadoop
[root@bigdata01 ~]# id hadoop
uid=1001(hadoop) gid=1002(hadoop) groups=1002(hadoop)

2.2 Switch to the hadoop user

[root@bigdata01 ~]# su - hadoop

2.3 Create directories

[hadoop@bigdata01 ~]$ mkdir app software sourcecode log tmp data lib
[hadoop@bigdata01 ~]$ ll
total 0
drwxrwxr-x 3 hadoop hadoop 50   1 22:17 app
drwxrwxr-x 2 hadoop hadoop  6   1 22:10 data
drwxrwxr-x 2 hadoop hadoop  6   1 22:10 lib
drwxrwxr-x 2 hadoop hadoop  6   1 22:10 log
drwxrwxr-x 2 hadoop hadoop 43   1 22:14 software
drwxrwxr-x 2 hadoop hadoop  6   1 22:10 sourcecode
drwxrwxr-x 2 hadoop hadoop 22   2 12:36 tmp
[hadoop@bigdata01 ~]$ 

2.4 Upload the tarball to the software directory and extract it into the app directory

[hadoop@bigdata01 software]$ tar -xzvf hadoop-2.6.0-cdh5.16.2.tar.gz -C ../app/
[hadoop@bigdata01 app]$ ll
drwxr-xr-x 14 hadoop hadoop 241 Jun  3 19:11 hadoop-2.6.0-cdh5.16.2

2.5 Create a symlink

[hadoop@bigdata01 app]$ ln -s hadoop-2.6.0-cdh5.16.2/ hadoop
[hadoop@bigdata01 app]$ ll
lrwxrwxrwx  1 hadoop hadoop  23 Dec  1 22:17 hadoop -> hadoop-2.6.0-cdh5.16.2/
drwxr-xr-x 14 hadoop hadoop 241 Jun  3 19:11 hadoop-2.6.0-cdh5.16.2

2.6 Check the JDK

[hadoop@bigdata01 app]$ which java
/usr/java/jdk1.8.0_121/bin/java

2.7 Configure environment variables

[hadoop@bigdata01 ~]$ vim .bashrc 
export HADOOP_HOME=/home/hadoop/app/hadoop
export PATH=${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:$PATH

2.8 Verify the environment variables

[hadoop@bigdata01 ~]$ source .bashrc 
[hadoop@bigdata01 ~]$ which hadoop
~/app/hadoop/bin/hadoop
[hadoop@bigdata01 ~]$ echo $HADOOP_HOME
/home/hadoop/app/hadoop

2.9 View the hadoop command help

[hadoop@bigdata01 ~]$ hadoop
Usage: hadoop [--config confdir] COMMAND
       where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  credential           interact with credential providers
  daemonlog            get/set the log level for each daemon
  s3guard              manage data on S3
  trace                view and modify Hadoop tracing settings
 or
  CLASSNAME            run the class named CLASSNAME

Apache Hadoop documentation: https://hadoop.apache.org/docs/r2.10.0/hadoop-project-dist/hadoop-common/SingleCluster.html

Cloudera Hadoop documentation: http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.16.2/hadoop-project-dist/hadoop-common/SingleCluster.html

2.10 Set up passwordless SSH

[hadoop@bigdata01 ~]$ cd ~
Run ssh-keygen and press Enter three times; this creates the .ssh directory
$ ssh-keygen
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
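
A quick test that key-based login works (the first connection asks you to confirm the host key, which is what the pitfall below refers to):

$ ssh bigdata01 date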

Pitfall:
The "yes" you type the first time you ssh to a host is recorded in known_hosts, which stores the hosts' SSH keys. If you run into SSH problems, delete the corresponding entry from that file.

[hadoop@bigdata01 ~]$ cd .ssh/
[hadoop@bigdata01 .ssh]$ cat known_hosts 
bigdata01 ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBOwZK88+GuH93o6h17DEP19Ly+m79cw1rpjXTcmqlBOviTG0d8mXGmJoBDpPf/pQA49tWqgeVFcsDfBr9YdCK5w=
。。。。。

2.11 Format HDFS

[hadoop@bigdata01 ~]$ hdfs namenode -format
When the output contains "... has been successfully formatted", the format succeeded.
[hadoop@bigdata01 ~]$ start-dfs.sh 
20/07/13 14:28:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [bigdata01]
bigdata01: starting namenode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-namenode-bigdata01.out
bigdata01: starting datanode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-datanode-bigdata01.out
Starting secondary namenodes [bigdata01]
bigdata01: starting secondarynamenode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-secondarynamenode-bigdata01.out
20/07/13 14:28:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@bigdata01 ~]$ 
[hadoop@bigdata01 ~]$ jps
11536 Jps
11416 SecondaryNameNode
11258 DataNode
11131 NameNode

Configure the DataNode and SecondaryNameNode to also start on bigdata01

Which host the NameNode starts on is controlled by fs.defaultFS in core-site.xml

Which hosts the DataNodes start on is controlled by the hostnames in the slaves file

Which host the SecondaryNameNode starts on is controlled by hdfs-site.xml (a sketch of the matching core-site.xml and slaves entries follows the snippet below):

<property>
	<name>dfs.namenode.secondary.http-address</name>
	<value>bigdata01:50090</value>
</property>
<property>
	<name>dfs.namenode.secondary.https-address</name>
	<value>bigdata01:50091</value>
</property>
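
For reference, a sketch of the corresponding core-site.xml property and slaves file; the 8020 port is an assumption based on the warehouse path shown later in the Hive section, so adjust it to your setup:

core-site.xml:
<property>
	<name>fs.defaultFS</name>
	<value>hdfs://bigdata01:8020</value>
</property>

etc/hadoop/slaves:
bigdata01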

Test

[hadoop@bigdata01 hadoop]$ hadoop fs -mkdir /hadooptest
[hadoop@bigdata01 hadoop]$ hadoop fs -ls /
drwxr-xr-x   - hadoop supergroup          0 2020-07-14 12:35 /hadooptest
[hadoop@bigdata01 tmp]$ vim test.txt
[hadoop@bigdata01 tmp]$ hadoop fs -put test.txt /hadooptest
[hadoop@bigdata01 tmp]$ hadoop fs -ls /hadooptest
-rw-r--r--   1 hadoop supergroup         42 2020-07-14 12:36 /hadooptest/test.txt
[hadoop@bigdata01 tmp]$ hadoop fs -cat /hadooptest/test.txt
hadoop hive spark flink impala kudu flume

YARN Installation and Deployment

YARN single-node deployment documentation: https://hadoop.apache.org/docs/r2.10.0/hadoop-project-dist/hadoop-common/SingleCluster.html

1.1 Configure parameters as follows

etc/hadoop/mapred-site.xml:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

etc/hadoop/yarn-site.xml:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

1.2 Start ResourceManager daemon and NodeManager daemon

[hadoop@bigdata01 ~]$ start-yarn.sh 
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/yarn-hadoop-resourcemanager-bigdata01.out
bigdata01: starting nodemanager, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/yarn-hadoop-nodemanager-bigdata01.out
[hadoop@bigdata01 ~]$ jps
11857 SecondaryNameNode
11570 NameNode
11698 DataNode
12002 ResourceManager
12391 Jps
12105 NodeManager

1.3 Check the ResourceManager process

[hadoop@bigdata01 ~]$ netstat -nlp |grep 12002
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp6       0      0 :::8088                 :::*                    LISTEN      12002/java   

2. Run an example job to test YARN

2.1 Locate the example jar for testing YARN
[root@bigdata01 ~]# find / -name '*example*.jar'
/home/hadoop/app/hadoop-2.6.0-cdh5.16.2/share/hadoop/mapreduce1/hadoop-examples-2.6.0-mr1-cdh5.16.2.jar
/home/hadoop/app/hadoop-2.6.0-cdh5.16.2/share/hadoop/mapreduce2/sources/hadoop-mapreduce-examples-2.6.0-cdh5.16.2-test-sources.jar
/home/hadoop/app/hadoop-2.6.0-cdh5.16.2/share/hadoop/mapreduce2/sources/hadoop-mapreduce-examples-2.6.0-cdh5.16.2-sources.jar
/home/hadoop/app/hadoop-2.6.0-cdh5.16.2/share/hadoop/mapreduce2/hadoop-mapreduce-examples-2.6.0-cdh5.16.2.jar
The last one is the example jar we need.
2.2 Check the command help to confirm how to run a jar
[hadoop@bigdata01 ~]$ hadoop
Usage: hadoop [--config confdir] COMMAND
       where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
  。。。。

Most commands print help when invoked w/o parameters.
2.3 Run the jar:
[hadoop@bigdata01 ~]$ hadoop jar /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/share/hadoop/mapreduce2/hadoop-mapreduce-examples-2.6.0-cdh5.16.2.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
[hadoop@bigdata01 ~]$ 
2.4 The command prompts for arguments; we use the wordcount example, so run it again
[hadoop@bigdata01 ~]$ hadoop jar /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/share/hadoop/mapreduce2/hadoop-mapreduce-examples-2.6.0-cdh5.16.2.jar wordcount
Usage: wordcount <in> [<in>...] <out>

It asks for input and output paths.

2.5 Create an input file
[hadoop@bigdata01 ~]$ hadoop fs -cat /wordcount/test/test1.txt
20/07/13 21:45:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hadoop hadoop hadoop spark flume spark flink hive hue 
flink hbase kafka kafka spark hadoop hive 
2.6 Run again
[hadoop@bigdata01 ~]$ hadoop jar /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/share/hadoop/mapreduce2/hadoop-mapreduce-examples-2.6.0-cdh5.16.2.jar wordcount /wordcount/test /wordcount/output
20/07/13 21:46:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/07/13 21:46:45 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
20/07/13 21:46:46 INFO input.FileInputFormat: Total input paths to process : 1
20/07/13 21:46:46 INFO mapreduce.JobSubmitter: number of splits:1
20/07/13 21:46:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1575293526101_0001
20/07/13 21:46:46 INFO impl.YarnClientImpl: Submitted application application_1575293526101_0001
20/07/13 21:46:47 INFO mapreduce.Job: The url to track the job: http://bigdata01:8088/proxy/application_1575293526101_0001/
20/07/13 21:46:47 INFO mapreduce.Job: Running job: job_1575293526101_0001
20/07/13 21:46:57 INFO mapreduce.Job: Job job_1575293526101_0001 running in uber mode : false
20/07/13 21:46:57 INFO mapreduce.Job:  map 0% reduce 0%
20/07/13 21:47:03 INFO mapreduce.Job:  map 100% reduce 0%
20/07/13 21:47:10 INFO mapreduce.Job:  map 100% reduce 100%
20/07/13 21:47:10 INFO mapreduce.Job: Job job_1575293526101_0001 completed successfully
20/07/13 21:47:10 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=100
		FILE: Number of bytes written=286249
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=209
		HDFS: Number of bytes written=62
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=4554
		Total time spent by all reduces in occupied slots (ms)=2929
		Total time spent by all map tasks (ms)=4554
		Total time spent by all reduce tasks (ms)=2929
		Total vcore-milliseconds taken by all map tasks=4554
		Total vcore-milliseconds taken by all reduce tasks=2929
		Total megabyte-milliseconds taken by all map tasks=4663296
		Total megabyte-milliseconds taken by all reduce tasks=2999296
	Map-Reduce Framework
		Map input records=2
		Map output records=16
		Map output bytes=160
		Map output materialized bytes=100
		Input split bytes=111
		Combine input records=16
		Combine output records=8
		Reduce input groups=8
		Reduce shuffle bytes=100
		Reduce input records=8
		Reduce output records=8
		Spilled Records=16
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=108
		CPU time spent (ms)=1520
		Physical memory (bytes) snapshot=329445376
		Virtual memory (bytes) snapshot=5455265792
		Total committed heap usage (bytes)=226627584
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=98
	File Output Format Counters 
		Bytes Written=6
2.7 Check the results
[hadoop@bigdata01 ~]$ hadoop fs -ls /wordcount
20/07/13 21:48:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2020-07-14 21:47 /wordcount/output
drwxr-xr-x   - hadoop supergroup          0 2020-07-14 21:45 /wordcount/test
[hadoop@bigdata01 ~]$ hadoop  -ls /wordcount/output
Error: No command named `-ls' was found. Perhaps you meant `hadoop ls'
[hadoop@bigdata01 ~]$ hadoop fs -ls /wordcount/output
20/07/13 21:48:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2020-07-14 21:47 /wordcount/output/_SUCCESS
-rw-r--r--   1 hadoop supergroup         62 2020-07-14 21:47 /wordcount/output/part-r-00000
[hadoop@bigdata01 ~]$ hadoop fs -cat /wordcount/output/part-r-00000
20/07/13 21:49:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
flink	2
flume	1
hadoop	4
hbase	1
hive	2
hue	1
kafka	2
spark	3
[hadoop@bigdata01 ~]$

3. Change the hostname:

3.1 Check the current hostname
[hadoop@bigdata01 ~]$ hostnamectl
   Static hostname: bigdata01
         Icon name: computer-vm
           Chassis: vm
        Machine ID: 928fc74e61be492eb9a51cc408995739
           Boot ID: 32e41529ec49471dba619ba744be31b1
    Virtualization: vmware
  Operating System: CentOS Linux 7 (Core)
       CPE OS Name: cpe:/o:centos:centos:7
            Kernel: Linux 3.10.0-957.el7.x86_64
      Architecture: x86-64
[hadoop@bigdata01 ~]$ 
3.2 Check the command help:
[hadoop@bigdata01 ~]$ hostnamectl --help
hostnamectl [OPTIONS...] COMMAND ...

Query or change system hostname.

  -h --help              Show this help
     --version           Show package version
     --no-ask-password   Do not prompt for password
  -H --host=[USER@]HOST  Operate on remote host
  -M --machine=CONTAINER Operate on local container
     --transient         Only set transient hostname
     --static            Only set static hostname
     --pretty            Only set pretty hostname

Commands:
  status                 Show current hostname settings
  set-hostname NAME      Set system hostname
  set-icon-name NAME     Set icon name for host
  set-chassis NAME       Set chassis type for host
  set-deployment NAME    Set deployment environment for host
  set-location NAME      Set location for host
3.3 The command to use:
set-hostname NAME      Set system hostname
[hadoop@bigdata01 ~]$ hostnamectl set-hostname bigdata01

[hadoop@bigdata01 ~]$ cat /etc/hostname 
bigdata01

After changing the hostname, update the corresponding IP-to-hostname mapping in /etc/hosts, as sketched below.
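
A sketch of the mapping, using the static IP configured earlier:

# add to /etc/hosts
192.168.52.50   bigdata01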

Hive Installation and Deployment

Official reference: https://cwiki.apache.org/confluence/display/Hive/GettingStarted

1. Prerequisites

  • JDK8
  • Hadoop2.x
  • Linux
  • Mysql

2. Download the tarball
wget http://archive.cloudera.com/cdh5/cdh/5/hive-1.1.0-cdh5.16.2.tar.gz

3. Extract: tar -zxvf hive-1.1.0-cdh5.16.2.tar.gz -C ~/app/

4. Fix the owner and group

chown -R hadoop:hadoop /home/hadoop/app/hive/* /home/hadoop/app/hive-1.1.0-cdh5.16.2/*

5. Create a symlink

ln -s hive-1.1.0-cdh5.16.2/ hive

6. Configure environment variables

[hadoop@bigdata01 app]$ cd ~
[hadoop@bigdata01 ~]$ vim .bashrc
export HIVE_HOME=/home/hadoop/app/hive
export PATH=$HIVE_HOME/bin:$PATH

[hadoop@bigdata01 ~]$ source .bashrc
[hadoop@bigdata01 ~]$ which hive
~/app/hive/bin/hive

7. Copy the MySQL driver jar into $HIVE_HOME/lib/
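
A sketch, assuming the same connector jar that the Sqoop section uses later:

cp mysql-connector-java-5.1.27-bin.jar $HIVE_HOME/lib/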

8. Hive configuration file

Hive does not ship with a ready-made hive-site.xml; create one yourself:




<configuration>
	<property>
	  <name>javax.jdo.option.ConnectionURL</name>
	  <value>jdbc:mysql://bigdata01:3306/bigdata_hive?createDatabaseIfNotExist=true</value>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionDriverName</name>
	  <value>com.mysql.jdbc.Driver</value>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionUserName</name>
	  <value>root</value>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionPassword</name>
	  <value>123456</value>
	</property>

	<property>
	  <name>hive.cli.print.current.db</name>
	  <value>true</value>
	</property>

	<property>
	  <name>hive.cli.print.header</name>
	  <value>true</value>
	</property>
</configuration>

Viewing Hive Logs

Entering the Hive client and running show databases; fails with:

FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

Go to /tmp/hadoop and check hive.log for the full log:

Unable to open a test connection to the given database. JDBC url = jdbc:mysql://192.168.52.50:3306/bigdata_hive?createDatabaseIfNotExist=true, username = root. Terminating connection pool (set lazyInit to true
 if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Access denied for user 'root'@'bigdata01' (using password: YES)
	at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1078)
	at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4237)
	at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4169)
	at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:928)
	at com.mysql.jdbc.MysqlIO.proceedHandshakeWithPluggableAuthentication(MysqlIO.java:1750)
	at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1290)
	at com.mysql.jdbc.ConnectionImpl.coreConnect(ConnectionImpl.java:2493)
	at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2526)
	at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2311)
	at com.mysql.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:834)
	at com.mysql.jdbc.JDBC4Connection.<init>(JDBC4Connection.java:47)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
	at com.mysql.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:416)
	at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:347)
	at java.sql.DriverManager.getConnection(DriverManager.java:664)
	at java.sql.DriverManager.getConnection(DriverManager.java:208)
	at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361)
	at com.jolbox.bonecp.BoneCP.<init>(BoneCP.java:416)
	at com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:120)
	at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:501)
	at org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:298)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
	at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
	at org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
	at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
	at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
	at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
	at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
	at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
	at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
	at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
	at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:420)
	at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:449)
	at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:344)
	at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:300)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:60)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:69)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:685)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:663)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:712)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:511)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:78)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84)
	at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6517)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:207)
	at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1660)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:68)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:83)
	at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3412)
	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3431)
	at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3656)
	at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:232)
	at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:216)
	at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:339)
	at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:300)
	at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:275)
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.createHiveDB(BaseSemanticAnalyzer.java:201)
	at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.<init>(DDLSemanticAnalyzer.java:222)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzerFactory.get(SemanticAnalyzerFactory.java:265)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:546)
	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1358)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1475)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1287)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1277)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:226)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:175)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:389)
	at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:699)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:634)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:141)

Error analysis: Hive cannot instantiate the metastore client, which usually means the MySQL connection failed. Check that the MySQL service is running, that the connection settings in hive-site.xml are correct, and that the MySQL driver jar is in place. Here the problem is the MySQL username/password configuration.

Cause: the MySQL username and password configured in hive-site.xml were wrong.
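
A quick way to confirm the credentials before retrying Hive is to test the same connection from the shell, using the values from hive-site.xml:

mysql -h bigdata01 -uroot -p123456 -e "show databases;"
# if this fails, re-run the grant from the MySQL section:
#   grant all privileges on *.* to 'root'@'%' identified by '123456' with grant option;
#   flush privileges;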

Hive Commands:

  • !clear; clear the screen
  • exit; quit the CLI
  • use dbname; switch to the database dbname
  • show tables; list all tables in the current database

Create a table: create table stu(id int, name string, age int);

View the table structure:

desc stu;
desc extended stu;
desc formatted stu;
show create table stu;

Insert data: insert into stu values(1,'tom',30);

Query data: select * from stu;

By default the stu table is stored in HDFS at
hdfs://bigdata01:8020/user/hive/warehouse/stu
hive.metastore.warehouse.dir: /user/hive/warehouse
stu: the table name
The full path of a table is ${hive.metastore.warehouse.dir}/tablename
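
A quick check of the table's HDFS location:

hadoop fs -ls /user/hive/warehouse/stu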

Hive's full execution log:
	cd $HIVE_HOME/conf
	cp hive-log4j.properties.template hive-log4j.properties
		hive.log.dir=${java.io.tmpdir}/${user.name}
		hive.log.file=hive.log
		
		${java.io.tmpdir}/${user.name}/${hive.log.file}
		/tmp/hadoop/hive.log

cat hive-log4j.properties

[hadoop@bigdata01 conf]$ cat hive-log4j.properties
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Define some default values that can be overridden by system properties
hive.log.threshold=ALL
hive.root.logger=WARN,DRFA
hive.log.dir=${java.io.tmpdir}/${user.name}
hive.log.file=hive.log

# Define the root logger to the system property "hadoop.root.logger".
log4j.rootLogger=${hive.root.logger}, EventCounter

# Logging Threshold
log4j.threshold=${hive.log.threshold}

#
# Daily Rolling File Appender
#
# Use the PidDailyerRollingFileAppend class instead if you want to use separate log files
# for different CLI session.
#
# log4j.appender.DRFA=org.apache.hadoop.hive.ql.log.PidDailyRollingFileAppender

log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender

log4j.appender.DRFA.File=${hive.log.dir}/${hive.log.file}

# Rollver at midnight
log4j.appender.DRFA.DatePattern=.yyyy-MM-dd

# 30-day backup
#log4j.appender.DRFA.MaxBackupIndex=30
log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout

# Pattern format: Date LogLevel LoggerName LogMessage
#log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
# Debugging Pattern format
log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %-5p [%t]: %c{2} (%F:%M(%L)) - %m%n


#
# console
# Add "console" to rootlogger above if you want to use this
#

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} [%t]: %p %c{2}: %m%n
log4j.appender.console.encoding=UTF-8

#custom logging levels
#log4j.logger.xxx=DEBUG

#
# Event Counter Appender
# Sends counts of logging messages at different severity levels to Hadoop Metrics.
#
log4j.appender.EventCounter=org.apache.hadoop.hive.shims.HiveEventCounter


log4j.category.DataNucleus=ERROR,DRFA
log4j.category.Datastore=ERROR,DRFA
log4j.category.Datastore.Schema=ERROR,DRFA
log4j.category.JPOX.Datastore=ERROR,DRFA
log4j.category.JPOX.Plugin=ERROR,DRFA
log4j.category.JPOX.MetaData=ERROR,DRFA
log4j.category.JPOX.Query=ERROR,DRFA
log4j.category.JPOX.General=ERROR,DRFA
log4j.category.JPOX.Enhancer=ERROR,DRFA


# Silence useless ZK logs
log4j.logger.org.apache.zookeeper.server.NIOServerCnxn=WARN,DRFA
log4j.logger.org.apache.zookeeper.ClientCnxnSocketNIO=WARN,DRFA

#custom logging levels
log4j.logger.org.apache.hadoop.hive.ql.parse.SemanticAnalyzer=INFO
log4j.logger.org.apache.hadoop.hive.ql.Driver=INFO
log4j.logger.org.apache.hadoop.hive.ql.exec.mr.ExecDriver=INFO
log4j.logger.org.apache.hadoop.hive.ql.exec.mr.MapRedTask=INFO
log4j.logger.org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask=INFO
log4j.logger.org.apache.hadoop.hive.ql.exec.Task=INFO
log4j.logger.org.apache.hadoop.hive.ql.session.SessionState=INFO
[hadoop@bigdata01 conf]$ 

hive.log.dir=${java.io.tmpdir}/${user.name} is Hive's log directory
${java.io.tmpdir} is the /tmp directory
hive.log.file=hive.log is the name of Hive's log file, so the log can be followed as shown below.
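
For example:

tail -f /tmp/hadoop/hive.log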

Sqoop Installation and Deployment

Download (CDH 5):
wget http://archive.cloudera.com/cdh5/cdh/5/sqoop-1.4.6-cdh5.16.2.tar.gz

Extract:
tar -zxvf sqoop-1.4.6-cdh5.16.2.tar.gz -C ~/app/

Configure the environment variables and source the profile:
export SQOOP_HOME=/home/hadoop/app/sqoop-1.4.6-cdh5.16.2
export PATH=$SQOOP_HOME/bin:$PATH
source ~/.bashrc    # or whichever profile file the variables were added to

Configuration file:

 cd $SQOOP_HOME/conf/
 cp sqoop-env-template.sh sqoop-env.sh
 # add the following to sqoop-env.sh:
 export HADOOP_COMMON_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2
 export HADOOP_MAPRED_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2
 export HIVE_HOME=/home/hadoop/app/hive-1.1.0-cdh5.16.2

Driver jar:

cp mysql-connector-java-5.1.27-bin.jar $SQOOP_HOME/lib/

Test that it works:

List all databases:

sqoop list-databases \
--connect jdbc:mysql://bigdata01:3306 \
--password 123456 \
--username root

List all tables:

sqoop list-tables \
--connect jdbc:mysql://bigdata01:3306/sqoop \
--password 123456 \
--username root

RDBMS TO HDFS(Import)

Full-table import
sqoop import \
--connect jdbc:mysql://bigdata01:3306/sqoop \
--password bigdata --username root \
--table emp  \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"

Parameters:

  • --table: the table to import
  • --target-dir: the HDFS target directory
  • --delete-target-dir: delete the target directory if it already exists
  • --fields-terminated-by: the field delimiter between columns
  • --columns: the columns to import
  • -m: the number of map tasks
A quick check of the imported data is sketched below.
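
After the job finishes, the output can be inspected in HDFS; with a single map task, Sqoop writes one part-m-00000 file:

hadoop fs -ls /user/company
hadoop fs -cat /user/company/part-m-00000 | head
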
Import specific columns
sqoop import \
--connect jdbc:mysql://bigdata01:3306/sqoop \
--password bigdata --username root \
--table emp  \
--columns "id,name" \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"

Parameters:

  • --columns: the columns to import, separated by commas
Query-based import
sqoop import \
--connect jdbc:mysql://bigdata01:3306/sqoop \
--password bigdata --username root \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--query 'select id,name from student where id <=1 and $CONDITIONS'

Parameters:

  • --query: the query SQL; it cannot be combined with --table, and the WHERE clause must contain $CONDITIONS
    Note: the query "must contain '$CONDITIONS' in WHERE clause".

If the query is wrapped in double quotes, $CONDITIONS must be escaped as \$CONDITIONS so the shell does not expand it as a variable.

Filtered import

sqoop import \
--connect jdbc:mysql://bigdata01:3306/sqoop \
--password bigdata --username root \
--table emp  \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--where "id > 400"

Incremental import

sqoop import \
--connect jdbc:mysql://bigdata01:3306/sqoop \
--password bigdata --username root \
--table emp  \
--target-dir /user/company \
--null-string "" \
--null-non-string "0" \
--check-column "id" \
--incremental append \
--fields-terminated-by '\t' \
--last-value 0 \
-m 1

Parameters:

  • --null-string: how to handle NULL in string columns
  • --null-non-string: how to handle NULL in non-string columns
  • --check-column: the column used to decide which rows are new
  • --last-value: the value where the previous import stopped; only rows with larger values are imported
A follow-up run is sketched below.
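
A hedged sketch of the next incremental run, assuming the first run imported ids up to 400 (Sqoop reports the new high-water mark at the end of an incremental job):

sqoop import \
--connect jdbc:mysql://bigdata01:3306/sqoop \
--password bigdata --username root \
--table emp \
--target-dir /user/company \
--check-column "id" \
--incremental append \
--fields-terminated-by '\t' \
--last-value 400 \
-m 1
# 400 is a hypothetical example value; use the last imported id from your previous run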

RDBMS To Hive(Import)

sqoop import \
--connect jdbc:mysql://bigdata01:3306/sqoop \
--password bigdata --username root \
--table emp  \
--hive-overwrite \
--delete-target-dir \
--null-string "" \
--null-non-string "0" \
--hive-import \
--hive-database default \
--hive-table staff \
--fields-terminated-by '\t' \
--num-mappers 1

Parameters:

  • --hive-import: import the data from the RDBMS into a Hive table
  • --hive-overwrite: overwrite any existing data in the Hive table
  • --hive-table: the target Hive table; defaults to the MySQL table name. For a partitioned table, also specify the partition key and value (see the sketch after this list):
  • --hive-partition-key key \
  • --hive-partition-value value \
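
A sketch of a partitioned Hive import; the partition key dt and value 2020-07-14 are hypothetical examples:

sqoop import \
--connect jdbc:mysql://bigdata01:3306/sqoop \
--password bigdata --username root \
--table emp \
--hive-import \
--hive-database default \
--hive-table staff \
--hive-partition-key dt \
--hive-partition-value 2020-07-14 \
--fields-terminated-by '\t' \
--num-mappers 1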

HDFS To RDBMS(Export)

First make sure MySQL has a table with the same structure as the Hive table to receive the data.
The table structure and the field delimiter must both match; a hypothetical DDL sketch follows.
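
A hypothetical DDL for the target table, matching the exported columns; the column types are assumptions, so adjust them to your data:

mysql -uroot -p123456 sqoop -e "CREATE TABLE staff (id INT, name VARCHAR(100));"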

sqoop export \
-Dsqoop.export.records.per.statement=10 \
--connect jdbc:mysql://bigdata01:3306/sqoop \
--password bigdata --username root \
--table staff \
--export-dir /user/company/ \
--null-string "" \
--null-non-string "0" \
--columns "id,name" \
--fields-terminated-by '\t' \
-m 1

Parameters:

  • -Dsqoop.export.records.per.statement: how many records are batched into each INSERT statement (here 10)
  • --export-dir: the HDFS directory to export from
  • --table: the target MySQL table
  • --columns: the columns to export

Note: Sqoop export does not create the target table; it must already exist in MySQL.

Hive To RDBMS(Export)

sqoop export \
--connect jdbc:mysql://bigdata01:3306/sqoop \
--password bigdata --username root \
--table emp  \
--num-mappers 1 \
--export-dir /user/hive/warehouse/staff \
--input-fields-terminated-by "\t"

Parameters:

  • --export-dir: the directory to export from
  • --input-fields-terminated-by: the delimiter of the exported input data (note: different from the import-side --fields-terminated-by)
