jdk1.8 + Hadoop2.7.3 + Spark2.2.0 + Scala2.11.8
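Hadoop tar.gz packages from 2.7 onward are 64-bit builds
1 Before cloning
1.1 Install VMware and CentOS 7
Set the network connection to host-only
For CentOS 7, choose the Infrastructure Server software selection
1.2 Set the hostname and network configuration (after cloning, repeat this on each clone)
On the host machine, run ifconfig (/sbin/ifconfig) and find the address (inet addr) of the vmnet1 virtual adapter, which backs VMware's host-only mode. The guests use it as their gateway; here it is 192.168.176.1
Plan: one master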
192.168.176.100 master
and two slaves
192.168.176.101 slave1
192.168.176.102 slave2
hostnamectl set-hostname master
systemctl stop firewalld
systemctl disable firewalld
vi /etc/sysconfig/network-scripts/ifcfg-ens33 // the interface name ("ens33" here) varies by machine
TYPE=Ethernet
IPADDR=192.168.176.100
NETMASK=255.255.255.0
GATEWAY=192.168.176.1
PEERDNS=no
vi /etc/sysconfig/network
NETWORKING=yes
GATEWAY=192.168.176.1
vi /etc/resolv.conf
nameserver 192.168.1.1
service network restart
The guest can now ping the host. Upload the installation files with sftp and manage master, slave1 and slave2 over ssh
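Since each clone needs the same interface file with a different IPADDR, the settings above can be generated rather than retyped. This is a minimal sketch: render_ifcfg is a hypothetical helper, and it prints only the keys listed in these notes, so any other keys your existing ifcfg file needs must be kept by hand.

```shell
# render_ifcfg IP GATEWAY
# Hypothetical helper: prints the ifcfg lines used in these notes for one machine.
render_ifcfg() {
  local ip=$1 gateway=$2
  cat <<EOF
TYPE=Ethernet
IPADDR=$ip
NETMASK=255.255.255.0
GATEWAY=$gateway
PEERDNS=no
EOF
}
```

For example, render_ifcfg 192.168.176.101 192.168.176.1 prints the block for slave1, which can then be compared against the interface file on that machine.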
1.3 Edit hosts
vi /etc/hosts
192.168.176.100 master
192.168.176.101 slave1
192.168.176.102 slave2
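The same three entries have to end up on every machine, so it may help to add them with a small script instead of editing by hand. A sketch only: add_host_entries is a hypothetical helper that appends an entry when the hostname does not yet appear in the file.

```shell
# add_host_entries HOSTS_FILE IP NAME [IP NAME ...]
# Hypothetical helper: appends "IP NAME" lines, skipping names already in the file.
add_host_entries() {
  local hosts_file=$1; shift
  local ip name
  while [ "$#" -ge 2 ]; do
    ip=$1; name=$2; shift 2
    # -w matches whole words only, so "slave1" does not match "slave10"
    if ! grep -qw -- "$name" "$hosts_file" 2>/dev/null; then
      printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
    fi
  done
}
```

Run as root, a call such as add_host_entries /etc/hosts 192.168.176.100 master 192.168.176.101 slave1 192.168.176.102 slave2 can be repeated safely, because existing names are skipped.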
1.4 Unpack the JDK, Hadoop, Spark, Scala, ...
cd /usr/local
tar -zxvf ... // the archive name depends on what you downloaded
Edit profile
vim /etc/profile
JAVA_HOME=/usr/java/jdk1.8.0_144
JRE_HOME=$JAVA_HOME/jre
DERBY_HOME=$JAVA_HOME/db
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
CLASSPATH=:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
export JAVA_HOME JRE_HOME DERBY_HOME PATH CLASSPATH
export HADOOP_HOME=/usr/local/hadoop-2.7.3
export SCALA_HOME=/usr/local/scala-2.11.8
export SPARK_HOME=/usr/local/spark-2.2.0-bin-hadoop2.7
export HIVE_HOME=/usr/local/apache-hive-2.3.0-bin
export HBASE_HOME=/usr/local/hbase-2.0.0-alpha-1
export ZOOKEEPER_HOME=/usr/local/zookeeper-3.4.10
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin:$SPARK_HOME/bin:$HIVE_HOME/bin:$HBASE_HOME/bin:$ZOOKEEPER_HOME/bin
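1.5 Configure Hadoop
mkdir tmp hdfs hdfs/data hdfs/name // run inside /usr/local/hadoop-2.7.3, matching the paths configured below
Edit hadoop-env.sh, core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml and slaves in turn
The default values are documented in:
core-default.xml
hdfs-default.xml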
mapred-default.xml
yarn-default.xml
cd /usr/local/hadoop-2.7.3/etc/hadoop
vi hadoop-env.sh
// set export JAVA_HOME=/usr/java/jdk1.8.0_144
vi core-site.xml
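<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop-2.7.3/tmp</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>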
vi hdfs-site.xml
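<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop-2.7.3/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop-2.7.3/hdfs/data</value>
  </property>
</configuration>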
vi yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
// cp mapred-site.xml.template mapred-site.xml
vi mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
vi slaves
slave1
slave2
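Each *-site.xml file is a list of name/value properties inside a configuration element, so the files can also be generated from a script, which is convenient when rebuilding the VM. The following is a sketch under that assumption: write_site_xml is a hypothetical helper, not a Hadoop tool, and it does no XML escaping, so it only suits simple values like the ones in this section.

```shell
# write_site_xml FILE name=value [name=value ...]
# Hypothetical helper: writes a Hadoop *-site.xml containing the given properties.
# Values are written verbatim (no XML escaping).
write_site_xml() {
  local file=$1; shift
  local pair
  {
    printf '<?xml version="1.0"?>\n<configuration>\n'
    for pair in "$@"; do
      # split at the first "=": the name is before it, the value is everything after
      printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' \
        "${pair%%=*}" "${pair#*=}"
    done
    printf '</configuration>\n'
  } > "$file"
}
```

For example, write_site_xml mapred-site.xml mapreduce.framework.name=yarn produces a file with the single property used in these notes.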
1.6 Create a non-root user named hadoop
useradd hadoop
passwd hadoop
// give the hadoop user ownership of the Hadoop files
chown -R hadoop:hadoop ./hadoop-2.7.3
1.7 Switch the time zone from CST to UTC:
cp -af /usr/share/zoneinfo/UTC /etc/localtime
date
1.8 Configure Spark
cd /usr/local/spark-2.2.0-bin-hadoop2.7/conf
vi slaves
slave1
slave2
vi spark-env.sh
# spark setting
export JAVA_HOME=/usr/java/jdk1.8.0_144
export SCALA_HOME=/usr/local/scala-2.11.8
export SPARK_MASTER_HOST=master # SPARK_MASTER_IP is the older, deprecated name in Spark 2.x
export SPARK_WORKER_MEMORY=8g
export SPARK_WORKER_CORES=4
export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop
2 After cloning
Cloning yields slave1 and slave2; change the hostname and network configuration on each
2.1 Passwordless ssh between master, slave1 and slave2 for the root (or hadoop) user
// cd ~
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
As root (or hadoop), append each machine's id_rsa.pub to authorized_keys on the other two machines, then test the connections in every direction: ssh slave1, ssh slave2, ssh master
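Copying the public key to the other machines can be done with ssh-copy-id, which appends the key to the remote authorized_keys. A minimal sketch, assuming ssh-copy-id is installed and the key is ~/.ssh/id_rsa.pub: distribute_key is a hypothetical helper that only prints the commands unless RUN=1 is set, so the list can be checked first.

```shell
# distribute_key USER HOST [HOST ...]
# Hypothetical helper: prints one ssh-copy-id command per host.
# Set RUN=1 in the environment to execute the commands instead of printing them.
distribute_key() {
  local user=$1; shift
  local host
  for host in "$@"; do
    if [ "${RUN:-0}" = "1" ]; then
      ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "$user@$host"
    else
      echo "ssh-copy-id -i ~/.ssh/id_rsa.pub $user@$host"
    fi
  done
}
```

On master, distribute_key hadoop slave1 slave2 lists the two commands; repeat on slave1 and slave2 with the other two hostnames.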
3 Miscellaneous
3.1 Three ways to give VMware guests network access, for Linux and Windows hosts
http://linuxme.blog.51cto.com/1850814/389691
3.2 Common commands
jps
start-dfs.sh
start-yarn.sh
start-all.sh
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
netstat -ntlp
hdfs dfsadmin -report | more
hadoop
// web ui
http://192.168.176.100:50070
3.3 Hadoop FileSystem
http://hadoop.apache.org/docs/r1.0.4/cn/hdfs_shell.html
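hadoop fs -cat URI [URI …]
hadoop fs -cp URI [URI …] <dest>
hadoop fs -copyFromLocal <localsrc> URI // like put, except that the source must be a local file
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst> // like get, except that the destination must be a local file
hadoop fs -du URI [URI …]
hadoop fs -dus <args> // deprecated in Hadoop 2.x, use -du -s
hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>
hadoop fs -put <localsrc> ... <dst>
hadoop fs -ls <args>
hadoop fs -lsr <args> // recursive ls; deprecated in Hadoop 2.x, use -ls -R
hadoop fs -mkdir <paths> // in Hadoop 2.x, parent directories must already exist unless -p is given
hadoop fs -mv URI [URI …] <dest> // move files from source to destination
hadoop fs -rm URI [URI …]
hadoop fs -rmr URI [URI …] // recursive rm; deprecated in Hadoop 2.x, use -rm -r
Differences between hadoop dfs, hadoop fs and hdfs dfs
hadoop fs: the most general form; works with any file system Hadoop supports.
hadoop dfs and hdfs dfs: HDFS only (including transfers to and from the local FS). The former is deprecated; use the latter.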
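The command list above follows the Hadoop 1.0.4 shell guide, where -lsr, -rmr and -dus are still current; on Hadoop 2.x they are deprecated in favour of -ls -R, -rm -r and -du -s. As a sketch, the hypothetical helper below rewrites an old-style argument list into the equivalent hdfs dfs command line and prints it without running anything.

```shell
# modernize_fs_args OLD_FLAG [ARGS ...]
# Hypothetical helper: prints the "hdfs dfs" form of an old-style "hadoop fs" call.
modernize_fs_args() {
  case $1 in
    -lsr) shift; set -- -ls -R "$@" ;;
    -rmr) shift; set -- -rm -r "$@" ;;
    -dus) shift; set -- -du -s "$@" ;;
  esac
  echo "hdfs dfs $*"
}
```

Flags that have no deprecated form, such as -cat or -put, are passed through unchanged.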
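4. Hive deployment
Before installing Hive, install MySQL to serve as the metastore database. Hive's default metastore is an embedded Derby database, but it allows only one session at a time, so MySQL is used instead. Both MySQL and the Hive server are installed on the master node
Metadata, also rendered as intermediary data or relay data, is data that describes data (data about data). It mainly records properties of the data and supports functions such as indicating storage location, history, resource lookup and file records. Metadata works like an electronic catalog: to build the catalog it must describe and record the content or characteristics of the data, which in turn makes the data retrievable.
4.1 Hive environment variables (see 1.4)
4.2 Hive configuration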
http://www.jianshu.com/p/978a77a1d6a2
Rename two files under $HIVE_HOME/conf/:
mv hive-default.xml.template hive-site.xml
mv hive-env.sh.template hive-env.sh
vim hive-env.sh
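Set HADOOP_HOME in it and remove the leading # from that line
vim hive-site.xml (compiled from various references)
hive.metastore.schema.verification // set to false
// create a tmp folder under the Hive directory
replace ${system:java.io.tmpdir} with the path of that tmp directory
replace ${system:user.name} with the user name, root here
Update the JDBC settings for MySQL (javax.jdo.option.ConnectionURL, javax.jdo.option.ConnectionDriverName, javax.jdo.option.ConnectionUserName, javax.jdo.option.ConnectionPassword)
Start the Hive Metastore Server process
nohup keeps the command running after the session ends; its output goes to a nohup.out log file in the current directory, where you can check what happened
& runs the command in the background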
hive --service metastore &
// starting with nohup is recommended, so the process does not stop when the session ends
nohup hive --service metastore &
Hive needs its metastore schema initialized before first use (*)
schematool -dbType mysql -initSchema
4.3 Installing MySQL (the first method is recommended)
4.3.1 Linux-Generic
Download MySQL Community Server from the official site, choosing Linux - Generic as the operating system
https://www.bilibili.com/video/av6147498/?from=search&seid=673467972510968006
http://blog.csdn.net/u013980127/article/details/52261400
This is the method I used
- Install
Check whether MySQL packages are already installed; remove any that are.
rpm -qa | grep mysql
Download Linux - Generic (glibc 2.12) (x86, 64-bit), Compressed TAR Archive from the official site
https://dev.mysql.com/downloads/mysql/
Just unpack it
tar -xvf mysql-5.7.19-linux-glibc2.12-x86_64.tar.gz
- Check whether the mysql group and user exist; if not, create mysql:mysql
cat /etc/group | grep mysql
cat /etc/passwd | grep mysql
groupadd mysql
useradd -r -g mysql mysql
- Edit the resource limits file
sudo vim /etc/security/limits.conf
mysql hard nofile 65535
mysql soft nofile 65535
- Initialize and start an instance
vim /etc/my.cnf
[mysqld]
port=3306
socket=/tmp/mysql.sock
user=mysql
datadir=...
...
Initialize the instance; adjust the paths in the command to your layout
cd to the top-level MySQL directory
bin/mysql_install_db --user=mysql --basedir=/usr/local/mysql/ --datadir=/usr/local/mysql/data/
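- Set the root password to 12345
On first start, log in with the generated initial password
cat /root/.mysql_secret
Connect with the mysql client and enter the password from /root/.mysql_secret
mysql -uroot -p // assumes mysql has been added to PATH
Add mysql to PATH
export PATH=$PATH:/usr/local/mysql/bin
Once logged in, change the password
SET PASSWORD = PASSWORD('12345');
flush privileges;
From then on, log in with the new password
mysql -uroot -p // password: 12345
- Next, allow remote access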
use mysql;
update user set host = '%' where user = 'root';
Restart the service for the change to take effect
/etc/init.d/mysqld restart
- Create a hive user for master with password 12345, used by Hive to connect
mysql> CREATE USER 'hive'@'master' IDENTIFIED BY '12345';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hive'@'master' WITH GRANT OPTION;
mysql> flush privileges;
Connect with:
mysql -h master -uhive -p
- Enable start on boot
sudo chkconfig mysqld on // the service name must match the init script, /etc/init.d/mysqld above
4.3.2 Yum Repository
wget http://repo.mysql.com/mysql57-community-release-el7-11.noarch.rpm
// or download the rpm from https://dev.mysql.com/downloads/repo/yum/
rpm -ivh mysql57-community-release-el7-11.noarch.rpm
yum install mysql-server
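4.4 Hive support in Spark SQL
According to the official docs, it is enough to copy hive-site.xml from $HIVE_HOME/conf and core-site.xml from $HADOOP_HOME/etc/hadoop into $SPARK_HOME/conf, then scp them to the slave machines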
Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.
Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.
When working with Hive, one must instantiate SparkSession with Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions. Users who do not have an existing Hive deployment can still enable Hive support. When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse. You may need to grant write privilege to the user who starts the Spark application.
4.5 Hive operations
Four ways to load data into Hive
http://blog.csdn.net/lifuxiangcaohui/article/details/40588929
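Partitions: in Hive, each partition of a table maps to a subdirectory under the table's directory, and all data of that partition is stored there. For example, if table wyp is partitioned by dt and city, the partition dt=20131218, city=BJ maps to /user/hive/warehouse/wyp/dt=20131218/city=BJ, and every row belonging to that partition lives in this directory.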
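The mapping from partition values to a directory is plain path concatenation, one key=value component per partition column in declaration order. A small sketch, assuming the table directory is the warehouse directory plus the table name (/user/hive/warehouse/wyp here); partition_path is a hypothetical helper for illustration.

```shell
# partition_path TABLE_DIR key=value [key=value ...]
# Hypothetical helper: prints the HDFS directory for one partition of a table.
partition_path() {
  local path=${1%/}; shift   # drop a trailing slash from the table directory
  local spec
  for spec in "$@"; do
    path="$path/$spec"
  done
  printf '%s\n' "$path"
}
```

The result can be passed to hdfs dfs -ls to inspect the files of a single partition.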
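UDF (User-Defined Function): a user-written function for processing data. A UDF can be used directly in a select statement to format query results before they are output. A custom UDF extends org.apache.hadoop.hive.ql.exec.UDF and implements an evaluate method; evaluate can be overloaded.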
http://blog.csdn.net/dajuezhao/article/details/5753001
Spark UDAF: org.apache.spark.sql.expressions.UserDefinedAggregateFunction
http://spark.apache.org/docs/latest/sql-programming-guide.html#untyped-user-defined-aggregate-functions
Create a test table named dual in Hive
create table dual (dummy string);
// exit hive, back in bash (run in /home/hadoop)
echo 'X' > dual.txt
// back in hive
load data local inpath '/home/hadoop/dual.txt' overwrite into table dual;
Hive regular expressions
http://blog.csdn.net/bitcarmanlee/article/details/51106726
Handling JSON data in Hive
http://www.cnblogs.com/casicyuan/p/4375080.html
5. Common problems
After formatting Hadoop several times, the DataNode fails to start
http://blog.csdn.net/longzilong216/article/details/20648387
MapReduce job hangs at "Running job"
http://blog.csdn.net/yang398835/article/details/52205487