PySpark Learning: Building a Distributed Hadoop + Spark Cluster Environment

Environment Setup

Hadoop + Spark

Prerequisites

  1. Set up passwordless SSH login
    Generate a key pair: ssh-keygen -t rsa
    Append the public key: cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    Copy the key to the other nodes with scp: scp <file> user@hostname:<path>
    e.g.: scp ~/.ssh/id_rsa.pub root@slave01:~
  2. Configure the hosts file: vim /etc/hosts (an example is shown after this list)
  3. Install JDK 1.8; the offline install command is rpm -ivh java-1.8.0-openjdk-devel-1.8.0.161-2.b14.el7.x86_64.rpm
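
For reference, a minimal /etc/hosts sketch for a three-node cluster. Only the master address (192.168.1.104) comes from the spark-env.sh later in this guide; the worker addresses are placeholder assumptions and must match your own network. The same entries should exist on every node.

192.168.1.104 master
192.168.1.105 slave01    # assumed address, replace with the real worker IP
192.168.1.106 slave02    # assumed address, replace with the real worker IP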

Installation Steps

Hadoop

  1. Download the Hadoop package and upload it to the master server
  2. On the server, extract the package into /usr/local/
    Command: tar -zxf /root/Download/hadoop.tar.gz -C /usr/local
  3. Rename the extracted directory. Command: cd /usr/local && mv ./hadoop2.7/ ./hadoop
  4. Edit ~/.bashrc

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Run source ~/.bashrc to make the configuration take effect.
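
As a quick check that the new PATH took effect, the hadoop command should now resolve from any directory:

hadoop version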
5. Go into /usr/local/hadoop/etc/hadoop and edit the configuration files
Note: in the configurations below, master is the hostname of the master node
slaves: list the worker hostnames, one per line (see the example below)
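
A sketch of the slaves file, assuming the two worker hostnames used for scp later in this guide:

slave01
slave02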

core-site.xml

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
</configuration>

hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
</configuration>

mapred-site.xml (if only mapred-site.xml.template exists, copy it to mapred-site.xml first)

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
</configuration>
  6. Send the configured Hadoop directory to the worker nodes

cd /usr/local/
rm -rf ./hadoop/tmp       # remove temporary files
rm -rf ./hadoop/logs/*    # remove log files
tar -zcf ~/hadoop.master.tar.gz ./hadoop
cd ~
scp ./hadoop.master.tar.gz slave01:/home/hadoop
scp ./hadoop.master.tar.gz slave02:/home/hadoop
  7. On each worker node, run:
sudo rm -rf /usr/local/hadoop/
sudo tar -zxf ~/hadoop.master.tar.gz -C /usr/local
  8. Start the Hadoop cluster
cd /usr/local/hadoop
bin/hdfs namenode -format    # only needed on first setup
sbin/start-all.sh
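
To confirm the daemons started (a quick check, assuming start-all.sh brought up both HDFS and YARN), jps on the master should list NameNode, SecondaryNameNode and ResourceManager, and each worker should show DataNode and NodeManager:

jps                    # run on the master and on each worker
hdfs dfsadmin -report  # should report the live DataNodes on slave01 and slave02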

Spark

  1. Download the Spark package
  2. Extract Spark into /usr/local/
  3. Rename the directory to spark
  4. Configure environment variables in ~/.bashrc

export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
export PYSPARK_PYTHON=python3

source ~/.bashrc
5. Edit the configuration files

slaves

cd /usr/local/spark/
cp ./conf/slaves.template ./conf/slaves
# then list the worker hostnames (slave01, slave02) in ./conf/slaves, one per line

spark-env.sh (create it from ./conf/spark-env.sh.template if it does not exist)


export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export SPARK_MASTER_IP=192.168.1.104   # the master node's IP address; adjust to your network

6. Copy the Spark directory to the worker nodes

cd /usr/local/
tar -zcf ~/spark.master.tar.gz ./spark
cd ~
scp ./spark.master.tar.gz slave01:/home/hadoop
scp ./spark.master.tar.gz slave02:/home/hadoop

7. On each worker node, run:

sudo rm -rf /usr/local/spark/
sudo tar -zxf ~/spark.master.tar.gz -C /usr/local

8. Start the cluster
Start Hadoop first, then the Spark master, and finally the Spark workers.

master

cd /usr/local/spark/
sbin/start-master.sh

worker (run on each worker node)

cd /usr/local/spark/
sbin/start-slave.sh spark://master:7077
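
Once the master and workers are up (the standalone master's web UI is normally at http://master:8080), a short PySpark job can confirm that the cluster accepts work. This is a minimal sketch: the file name verify_cluster.py and the application name are arbitrary, and it assumes the PYTHONPATH and PYSPARK_PYTHON settings from ~/.bashrc above.

# verify_cluster.py - hypothetical smoke test for the standalone cluster
from pyspark import SparkConf, SparkContext

# connect to the standalone master configured in this guide
conf = SparkConf().setAppName("cluster-smoke-test").setMaster("spark://master:7077")
sc = SparkContext(conf=conf)

# distribute a small range across the workers and sum it; expected output: 4950
print(sc.parallelize(range(100), numSlices=4).sum())
sc.stop()

Run it with spark-submit verify_cluster.py (or python3 verify_cluster.py, since PYTHONPATH is already exported); the application should then appear under Running/Completed Applications in the master web UI.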
