Installing and Configuring Spark 2.3 on YARN

This is an installation guide for a Spark environment. For some reason, the install steps I found online all felt slightly off: some put environment variables into spark-env.sh, some configure YARN and then also start the Spark standalone services. I can't guarantee my approach is the most standard one, but I do think it is at least reasonable.

References

  • Spark on YARN installation: http://wuchong.me/blog/2015/04/04/spark-on-yarn-cluster-deploy/
  • Hive installation: http://dblab.xmu.edu.cn/blog/install-hive/

Download and Unpack

  1. Download the java, scala, hadoop, spark, hive, kafka, and python3.x (for pyspark) tarballs
  2. Unpack them: tar -zxf xxx.tar.gz / tar -zxf xxx.tgz
  3. Add each {xxx}_HOME to the global environment variables, i.e. add export {xxx}_HOME=... to /etc/profile
  4. Run source /etc/profile to make the variables take effect
  5. For reference, my configuration (a quick verification sketch follows it):
export JAVA_HOME=/opt/jdk1.8.0_161
export SCALA_HOME=/opt/scala-2.11.11
export HADOOP_HOME=/opt/hadoop-2.7.6
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export SPARK_YARN_USER_ENV=${HADOOP_CONF_DIR}
export SPARK_HOME=/opt/spark-2.3.0-bin-hadoop2.7
export HIVE_HOME=/opt/hive-2.3.3-bin
export HIVE_CONF_DIR=${HIVE_HOME}/conf
export PYTHON_HOME=/usr/local/python-3.6.5

export PATH=${JAVA_HOME}/bin:${SCALA_HOME}/bin:${HADOOP_HOME}/bin:${SPARK_HOME}/bin:${HIVE_HOME}/bin:${PYTHON_HOME}/bin:$PATH
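
To confirm the variables took effect, a quick check (output will vary with your installed versions):

source /etc/profile
java -version
scala -version
hadoop version
spark-submit --version
python3 --version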


Hadoop Installation

Setting up passwordless SSH between nodes

  1. Log in to each node
  2. Generate an RSA key pair
    $ mkdir ~/.ssh
    $ chmod 700 ~/.ssh
    $ cd ~/.ssh
    $ ssh-keygen -t rsa  # press Enter through all the prompts
  3. Collect the public keys (`ssh host command` logs in to host and runs command there)
    $ ssh node1 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    ...
    $ ssh nodeN cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    $ chmod 600 ~/.ssh/authorized_keys
  4. Distribute the merged key file (`scp` copies files between hosts; a loop version of steps 3 and 4 is sketched after this list)
    $ scp ~/.ssh/authorized_keys node1:~/.ssh/
    ...
    $ scp ~/.ssh/authorized_keys nodeN:~/.ssh/
  5. Test: if the date command runs without a password prompt, it worked
    ssh node1 date
    ...
    ssh nodeN date
  6. Note: this is needed even on a single-node cluster, otherwise you will be typing the password over and over.
    Test by ssh-ing to the machine's own address; if no password is asked for, the setup is correct.
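
On a cluster with more than a couple of nodes, steps 3 and 4 collapse into a loop. A minimal sketch, assuming your hosts are named node1, node2, node3 (adjust the list) and that password login still works for this first pass:

for host in node1 node2 node3; do    # list all of your nodes here
    ssh $host cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
done
chmod 600 ~/.ssh/authorized_keys
for host in node1 node2 node3; do
    scp ~/.ssh/authorized_keys $host:~/.ssh/
done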

Configuring core-site.xml

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://:9000/</value>
        </property>
        <property>
            <name>hadoop.tmp.dir</name>
            <value>file:/data/hdfs/tmp</value>
        </property>
        <property>
            <name>fs.trash.interval</name>
            <value>1440</value>
            <description>Number of minutes between trash checkpoints.
            If zero, the trash feature is disabled.</description>
        </property>
    </configuration>

Configuring hdfs-site.xml

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>2</value>
        </property>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>/data/hdfs/name</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>/data/hdfs/data</value>
        </property>
        <property>
            <name>dfs.namenode.secondary.http-address</name>
            <value>:9001</value>
        </property>
    </configuration>

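The directories referenced above are not guaranteed to exist. Creating them up front on every node (owned by the user that runs Hadoop) avoids permission surprises at format and start time:

mkdir -p /data/hdfs/name /data/hdfs/data /data/hdfs/tmp
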
Configuring mapred-site.xml

    <configuration>
        <property>
            <name>mapred.job.tracker</name>
            <value>:9001</value>
        </property>
    </configuration>

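Note that mapred.job.tracker is a Hadoop 1.x (JobTracker) property. On Hadoop 2.x, the setting that actually routes MapReduce jobs to YARN is mapreduce.framework.name, so you may also want:

    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
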
Configuring yarn-site.xml

    <configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
            <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
            <name>yarn.resourcemanager.address</name>
            <value>:8032</value>
        </property>
        <property>
            <name>yarn.resourcemanager.scheduler.address</name>
            <value>:8030</value>
        </property>
        <property>
            <name>yarn.resourcemanager.resource-tracker.address</name>
            <value>:8035</value>
        </property>
        <property>
            <name>yarn.resourcemanager.admin.address</name>
            <value>:8033</value>
        </property>
        <property>
            <name>yarn.resourcemanager.webapp.address</name>
            <value>:8088</value>
        </property>
        <property>
            <name>yarn.nodemanager.pmem-check-enabled</name>
            <value>false</value>
        </property>
        <property>
            <name>yarn.nodemanager.vmem-check-enabled</name>
            <value>false</value>
        </property>

        <property>
            <description>Amount of physical memory, in MB, that can be allocated for containers.</description>
            <name>yarn.nodemanager.resource.memory-mb</name>
            <value>3036</value>
        </property>
        <property>
            <description>The minimum allocation for every container request at the RM,
                in MBs. Memory requests lower than this won't take effect,
                and the specified value will get allocated at minimum.</description>
            <name>yarn.scheduler.minimum-allocation-mb</name>
            <value>128</value>
        </property>
        <property>
            <description>The maximum allocation for every container request at the RM,
                in MBs. Memory requests higher than this won't take effect,
                and will get capped to this value.</description>
            <name>yarn.scheduler.maximum-allocation-mb</name>
            <value>2560</value>
        </property>
    </configuration>
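
A quick sanity check of these numbers against the Spark settings used later in this guide: a 1000m executor asks YARN for 1000 + max(384, 0.10 × 1000) = 1384 MB (384 MB is Spark's default floor for spark.executor.memoryOverhead), which fits under both the 2560 MB maximum allocation and the 3036 MB available per NodeManager.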

Configuring yarn-env.sh / hadoop-env.sh

Set JAVA_HOME at the end of both yarn-env.sh and hadoop-env.sh:
export JAVA_HOME=/opt/jdk1.8.0_161
This is required even though JAVA_HOME is already exported in /etc/profile: the start scripts launch daemons over non-interactive SSH sessions, which do not source /etc/profile.

Configuring the slaves file

Note: use the internal (private network) IPs, or hostnames that resolve to them

    slave1
    ...
    slaveN

Formatting the NameNode

hdfs namenode -format    # "hadoop namenode -format" also works but is deprecated in Hadoop 2.x

Startup

${HADOOP_HOME}/sbin/start-dfs.sh    # start HDFS first
${HADOOP_HOME}/sbin/start-yarn.sh   # then YARN
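
To confirm the daemons came up:

jps                      # master should show NameNode, SecondaryNameNode, ResourceManager; workers show DataNode, NodeManager
hdfs dfsadmin -report    # should list all live DataNodes

The YARN web UI should also be reachable on port 8088 of the ResourceManager host.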


Hive Installation

Installing MySQL

See the MySQL installation guide.

Download the MySQL JDBC driver and copy it into ${HIVE_HOME}/lib

Make sure the driver matches the version of MySQL you installed
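
For example (the jar name below is hypothetical; use the connector version that matches your server):

cp mysql-connector-java-5.1.46.jar ${HIVE_HOME}/lib/    # hypothetical version number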

Configuring hive-site.xml

  1. Copy hive-default.xml.template to hive-site.xml — note that the file name changes
  2. Add the following properties:
    <property>
        <name>datanucleus.fixedDatastore</name>
        <value>false</value>
    </property>
    <property>
        <name>datanucleus.autoCreateSchema</name>
        <value>true</value>
    </property>
    <property>
        <name>datanucleus.autoCreateTables</name>
        <value>true</value>
    </property>
    <property>
        <name>datanucleus.autoCreateColumns</name>
        <value>true</value>
    </property>
    <property>
        <name>system:java.io.tmpdir</name>
        <value>/tmp</value>
    </property>
    <property>
        <name>system:user.name</name>
        <value>localadmin</value>
    </property>
  3. Modify the following properties:
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://:9083</value>
        <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value></value>
        <description>password to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>mysqladmin</value>
        <description>Username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://:3306/hive?createDatabaseIfNotExist=true</value>
        <description>
          JDBC connect string for a JDBC metastore.
          To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
          For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
        </description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
        <description>
          Enforce metastore schema version consistency.
          True: Verify that the version information stored in the metastore is compatible with the Hive jars. Also disable automatic
                schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures
                proper metastore schema migration. (Default)
          False: Warn if the version information stored in the metastore doesn't match the Hive jars.
        </description>
    </property>
    <property>
        <name>hive.default.fileformat</name>
        <value>Orc</value>
        <description>
          Expects one of [textfile, sequencefile, rcfile, orc].
          Default file format for CREATE TABLE statement. Users can explicitly override it by CREATE TABLE ... STORED AS [FORMAT]
        </description>
    </property>
    <property>
        <name>hive.merge.mapredfiles</name>
        <value>true</value>
        <description>Merge small files at the end of a map-reduce job</description>
    </property>

Starting the Hive metastore

hive --service metastore &
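
With the datanucleus.autoCreate* settings above, the metastore tables are created on first use; Hive's schematool can also initialize them explicitly, and a one-line query makes a reasonable smoke test:

${HIVE_HOME}/bin/schematool -dbType mysql -initSchema    # optional, instead of relying on auto-create
hive -e "SHOW DATABASES;"                                # should print at least: default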


Spark Installation

Configuration reference: http://spark.apache.org/docs/2.3.0/configuration.html

Add the following to spark-defaults.conf (note the file name) and adjust as needed

spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://:9000/eventLogs
spark.eventLog.compress          true

spark.serializer                 org.apache.spark.serializer.KryoSerializer

spark.master                    yarn 
spark.driver.cores              1
spark.driver.memory             800m 
spark.executor.cores            1
spark.executor.memory           1000m
spark.executor.instances        1

spark.sql.warehouse.dir         hdfs://:9000/user/hive/warehouse
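
Note that Spark does not create spark.eventLog.dir for you; if the HDFS directory is missing, spark-shell and spark-submit fail at startup. Create it (and the warehouse path) to match the values above:

hdfs dfs -mkdir -p /eventLogs
hdfs dfs -mkdir -p /user/hive/warehouse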

Configuring spark-env.sh

# pyspark requires Python 3.x; skip these lines if you don't use Python
# note: the driver and worker Python versions must match, so pointing both at the same interpreter is safest
export PYSPARK_PYTHON=/usr/local/python-3.6.5/bin/python
export PYSPARK_DRIVER_PYTHON=python

Configuring Spark to read and write Hive

Copy ${HIVE_HOME}/conf/hive-site.xml into ${SPARK_HOME}/conf/ (the command is shown below).
Otherwise Spark keeps its own warehouse, completely separate from the one Hive reads and writes.
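
The copy itself, using the *_HOME variables set earlier:

cp ${HIVE_HOME}/conf/hive-site.xml ${SPARK_HOME}/conf/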

Start spark-shell to verify the installation

${SPARK_HOME}/bin/spark-shell
If the spark and sc objects come up normally after startup, the configuration works.
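
A slightly stronger smoke test is to query the Hive metastore through Spark's SQL CLI (this assumes the metastore from the Hive section is running):

${SPARK_HOME}/bin/spark-sql -e "SHOW DATABASES;"    # should list the same databases hive sees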


Kafka Installation (skip this if you don't use Kafka)

Configuring config/server.properties

  • log.dirs=/tmp/kafka-logs — where Kafka persists topics, messages, and related state; moving this out of /tmp is recommended

Configuring config/zookeeper.properties

  • dataDir=/tmp/zookeeper — the directory where the ZooKeeper snapshot is stored; the same advice about /tmp applies

Multi-broker setup

https://kafka.apache.org/quickstart#quickstart_multibroker

Starting Kafka

bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
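
A quick round-trip test. The commands below match the Kafka 1.x CLI of this guide's era; newer Kafka releases replace --zookeeper/--broker-list with --bootstrap-server throughout:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test    # type a few lines, Ctrl-C to quit
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning    # run in another terminal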


Python 3 Installation

See the Python 3 installation guide.

Summary

  1. spark-env.sh and spark-defaults.conf have many equivalent settings; since the official Spark docs mostly present options in the spark-defaults style, I try to change only spark-defaults.conf.
  2. I wrote this down so I can reuse it the next time I set up an environment; if it also helps someone else, all the better.
  3. If you run into problems, leave a comment: I'll fix any mistakes and share whatever I know.
