This is an installation guide for a Spark environment. The guides I found online always felt a bit off: some put configuration into spark-env.sh as environment variables, others configure YARN and then also start the Spark standalone service. I can't claim my approach is the most standard one, but I think it is at least reasonable.
Installation references
- Spark on YARN installation: http://wuchong.me/blog/2015/04/04/spark-on-yarn-cluster-deploy/
- Hive installation: http://dblab.xmu.edu.cn/blog/install-hive/
Download and extract
- Download the tarballs for Java, Scala, Hadoop, Spark, Hive, Kafka, and Python 3.x (needed for pyspark)
- Extract them:
tar -zxf xxx.tar.gz
tar -zxf xxx.tgz
- Add each {xxx}_HOME to the global environment variables by appending an export {xxx}_HOME= line to /etc/profile
- Run source /etc/profile to make the environment variables take effect
- Appendix: my configuration (a quick version check follows the listing)
export JAVA_HOME=/opt/jdk1.8.0_161
export SCALA_HOME=/opt/scala-2.11.11
export HADOOP_HOME=/opt/hadoop-2.7.6
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export SPARK_YARN_USER_ENV=${HADOOP_CONF_DIR}
export SPARK_HOME=/opt/spark-2.3.0-bin-hadoop2.7
export HIVE_HOME=/opt/hive-2.3.3-bin
export HIVE_CONF_DIR=${HIVE_HOME}/conf
export PYTHON_HOME=/usr/local/python-3.6.5
export PATH=${JAVA_HOME}/bin:${SCALA_HOME}/bin:${HADOOP_HOME}/bin:${SPARK_HOME}/bin:${HIVE_HOME}/bin:${PYTHON_HOME}/bin:$PATH
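After sourcing /etc/profile, a quick way to confirm the PATH entries above resolve is to print each tool's version. This is only a sanity-check sketch and assumes the binaries carry their usual names:
java -version
scala -version
hadoop version
hive --version
spark-submit --version
python3 --version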
Hadoop installation
Configure passwordless SSH between the nodes
- Log in to each node
- Generate an RSA key pair
$ mkdir ~/.ssh
$ chmod 700 ~/.ssh
$ cd ~/.ssh
$ ssh-keygen -t rsa # press Enter at every prompt
- Collect the public keys on one node (ssh host command connects to host and runs the given command there)
$ ssh node-1 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
...
$ ssh node-N cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
- Distribute the merged authorized_keys file (scp copies files between hosts)
$ scp ~/.ssh/authorized_keys node-1:~/.ssh/
...
$ scp ~/.ssh/authorized_keys node-N:~/.ssh/
- Test: if the date command runs on each node without a password prompt, the setup succeeded
ssh node-1 date
...
ssh node-N date
- Note: this is required even on a single-node setup, otherwise you will be prompted for the password over and over.
Test by ssh-ing to the machine's own address; if no password is required, the setup is correct.
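As a small convenience, the per-node test above can be wrapped in a loop. The host names node-1 … node-N are the placeholders used in this section; substitute your own:
# list every node in the cluster here
for h in node-1 node-N; do
    ssh "$h" date   # should print the date without asking for a password
done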
Configure core-site.xml
(The host part of the addresses is left blank in these listings; fill in your own host names or IPs.)
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://:9000/</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>file:/data/hdfs/tmp</value>
</property>
<property>
    <name>fs.trash.interval</name>
    <value>1440</value>
    <description>Number of minutes between trash checkpoints. If zero, the trash feature is disabled.</description>
</property>
Configure hdfs-site.xml
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/hdfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/hdfs/data</value>
</property>
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>:9001</value>
</property>
Configure mapred-site.xml
<property>
    <name>mapred.job.tracker</name>
    <value>:9001</value>
</property>
Configure yarn-site.xml
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
    <name>yarn.resourcemanager.address</name>
    <value>:8032</value>
</property>
<property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>:8030</value>
</property>
<property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>:8035</value>
</property>
<property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>:8033</value>
</property>
<property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>:8088</value>
</property>
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>3036</value>
    <description>Amount of physical memory, in MB, that can be allocated for containers.</description>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
    <description>The minimum allocation for every container request at the RM, in MBs. Memory requests lower than this won't take effect, and the specified value will get allocated at minimum.</description>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2560</value>
    <description>The maximum allocation for every container request at the RM, in MBs. Memory requests higher than this won't take effect, and will get capped to this value.</description>
</property>
Configure yarn-env.sh / hadoop-env.sh
Add JAVA_HOME at the end of both yarn-env.sh and hadoop-env.sh:
export JAVA_HOME=/opt/jdk1.8.0_161
This is needed even though JAVA_HOME is already exported in /etc/profile, most likely because the start scripts launch the daemons over non-interactive SSH sessions that do not source /etc/profile.
Configure the slaves file
Note: use the internal (private network) IPs.
slave1
...
slaveN
Initialize (format) the namenode
hadoop namenode -format
Start the services
${hadoop_home}/sbin/start-yarn.sh
${hadoop_home}/sbin/start-dfs.sh
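Once both scripts have run, a quick check that the daemons are up (a sketch; which processes appear depends on the node you run it on):
jps                      # should list NameNode/DataNode/ResourceManager/NodeManager as appropriate
hdfs dfsadmin -report    # all datanodes should be listed as live
yarn node -list          # all nodemanagers should be RUNNING
The ResourceManager web UI listens on port 8088 as configured above; the NameNode web UI defaults to port 50070 in Hadoop 2.7.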
Hive installation
Install MySQL
See the MySQL installation document.
Download the MySQL JDBC driver and copy it into ${hive_home}/lib.
Make sure the driver version matches the installed MySQL version.
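The hive-site.xml below connects as user mysqladmin to a database named hive (created on demand via createDatabaseIfNotExist=true). A minimal sketch of creating that MySQL user and granting it access; the password and the '%' host scope are placeholders to tighten for your environment:
mysql -u root -p -e "
CREATE USER 'mysqladmin'@'%' IDENTIFIED BY 'choose-a-password';
GRANT ALL PRIVILEGES ON hive.* TO 'mysqladmin'@'%';
FLUSH PRIVILEGES;"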
Configure hive-site.xml
- Copy hive-default.xml.template to hive-site.xml (note that the file name changes)
- Add the following properties:
<property>
    <name>datanucleus.fixedDatastore</name>
    <value>false</value>
</property>
<property>
    <name>datanucleus.autoCreateSchema</name>
    <value>true</value>
</property>
<property>
    <name>datanucleus.autoCreateTables</name>
    <value>true</value>
</property>
<property>
    <name>datanucleus.autoCreateColumns</name>
    <value>true</value>
</property>
<property>
    <name>system:java.io.tmpdir</name>
    <value>/tmp</value>
</property>
<property>
    <name>system:user.name</name>
    <value>localadmin</value>
</property>
- Modify the following properties:
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value></value>
    <description>password to use against metastore database</description>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>mysqladmin</value>
    <description>Username to use against metastore database</description>
</property>
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore. To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL. For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.</description>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
</property>
<property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
    <description>Enforce metastore schema version consistency. True: Verify that version information stored in metastore is compatible with one from Hive jars. Also disable automatic schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures proper metastore schema migration. (Default) False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.</description>
</property>
<property>
    <name>hive.default.fileformat</name>
    <value>Orc</value>
    <description>Expects one of [textfile, sequencefile, rcfile, orc]. Default file format for CREATE TABLE statement. Users can explicitly override it by CREATE TABLE ... STORED AS [FORMAT]</description>
</property>
<property>
    <name>hive.merge.mapredfiles</name>
    <value>true</value>
    <description>Merge small files at the end of a map-reduce job</description>
</property>
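The datanucleus.autoCreate* settings above let Hive build the metastore tables on first use. An alternative, and the route the Hive 2.x documentation generally recommends, is to initialize the schema once with schematool; if you go that way, hive.metastore.schema.verification can stay at its default of true:
${HIVE_HOME}/bin/schematool -dbType mysql -initSchema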
Start the Hive metastore
hive --service metastore &
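A minimal smoke test to confirm the metastore is listening on the Thrift port configured above (9083) and that the Hive CLI can reach it:
ss -lntp | grep 9083            # or: netstat -lntp | grep 9083
hive -e "show databases;"       # should at least list the default database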
Spark installation
Configuration reference: http://spark.apache.org/docs/2.3.0/configuration.html
Add the following to spark-defaults.conf and adjust as needed:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://:9000/eventLogs
spark.eventLog.compress true
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.master yarn
spark.driver.cores 1
spark.driver.memory 800m
spark.executor.cores 1
spark.executor.memory 1000m
spark.executor.instances 1
spark.sql.warehouse.dir hdfs://:9000/user/hive/warehouse
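spark.eventLog.dir points at an HDFS path that Spark will not create for you, so create it up front; pre-creating the warehouse directory does no harm either (both paths match the values above). Also note the memory fit: spark.executor.memory 1000m plus Spark's default YARN overhead (the larger of 384 MB and 10% of executor memory) means each executor container requests roughly 1.4 GB, which stays under the yarn.scheduler.maximum-allocation-mb of 2560 MB configured earlier.
hdfs dfs -mkdir -p /eventLogs
hdfs dfs -mkdir -p /user/hive/warehouse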
Configure spark-env.sh
# pyspark requires Python 3.x; skip this if you don't use Python
export PYSPARK_PYTHON=/usr/local/python-3.6.5/bin/python
export PYSPARK_DRIVER_PYTHON=python
Configure Spark to read/write Hive
Copy ${hive_home}/conf/hive-site.xml into ${spark_home}/conf/.
Otherwise the Hive warehouse that Spark reads and writes is independent of the one Hive itself uses.
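Concretely, using the environment variables exported in /etc/profile earlier:
cp ${HIVE_HOME}/conf/hive-site.xml ${SPARK_HOME}/conf/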
Start spark-shell to test whether the installation succeeded
${spark_home}/bin/spark-shell
If the spark and sc objects are created normally after startup, the configuration works.
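Beyond spark-shell, a useful end-to-end check is to submit the bundled SparkPi example to YARN. This is a sketch: the example jar name below is what the 2.3.0 binary distribution ships with, so adjust it if your version differs.
${SPARK_HOME}/bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    ${SPARK_HOME}/examples/jars/spark-examples_2.11-2.3.0.jar 100
If the application finishes with final status SUCCEEDED in the YARN UI (port 8088), Spark on YARN is working. Inside spark-shell, spark.sql("show databases").show() should list the same databases Hive sees, provided the hive-site.xml copy step above was done.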
Kafka installation (skip this section if you don't use Kafka)
Configure config/server.properties
- log.dirs=/tmp/kafka-logs : where Kafka persists topics, messages, and other state; moving this out of /tmp is recommended
Configure config/zookeeper.properties
- dataDir=/tmp/zookeeper : the directory where the snapshot is stored
Multi-broker setup
https://kafka.apache.org/quickstart#quickstart_multibroker
Start ZooKeeper and Kafka
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
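A quick produce/consume round-trip to check the broker. The commands below use the ZooKeeper-based form of kafka-topics.sh from Kafka releases of this era; newer releases take --bootstrap-server instead:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test   # type a few messages, then Ctrl-C
# in another terminal:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning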
Python 3 installation
See the Python 3 installation document.
Summary
- spark-env.sh and spark-defaults.conf have many equivalent settings, but the options documented on the Spark website are mostly given in spark-defaults style, so when changing configuration I try to touch only spark-defaults.conf.
- I wrote this down mainly for the next time I have to set up an environment; if it also helps someone else, even better.
- If you run into any problems, leave a comment; I'll fix mistakes as they're pointed out and answer what I can.