Installing and Configuring Spark 2.2.1 + Hadoop 2.6.1 and Running WordCount Successfully

    For Hadoop 2.6.1 installation and configuration, see: Building a Distributed Environment with RedHat 7 + Hadoop 2.6.1 + JDK 1.8 and Running the WordCount Example Successfully

Installing and Configuring Scala

1. Download the Scala package

2. Create a new directory, copy the Scala package into it, and extract it with the following command:

tar -zxvf scala-2.12.4.tgz

3. Configure environment variables

vi /etc/profile

Append the following at the end of the file:

export SCALA_HOME=/home/spark/scala-2.12.4
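
Note that this only sets SCALA_HOME; the scala command itself is on the PATH only after $SCALA_HOME/bin is appended to it, which the Spark section below does as part of its PATH line. If you want scala usable right away, you could also append it here (a minimal addition, same path as above):

export PATH=$PATH:$SCALA_HOME/bin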

Run the following command to make the changes take effect:

source /etc/profile

4. Send the extracted package to the slave nodes

scp -r /home/spark/scala-2.12.4 root@slave1:/home/spark/scala-2.12.4

scp -r /home/spark/scala-2.12.4 root@slave2:/home/spark/scala-2.12.4

5. Remember to configure /etc/profile on the slave nodes as well

6. Run the scala command to check whether Scala installed successfully; output like the following indicates success:

[root@master scala-2.12.4]# scala

Welcome to Scala 2.12.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131).

Type in expressions for evaluation. Or try :help.
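
As an extra, non-interactive sanity check (assuming $SCALA_HOME/bin is already on the PATH), the installed version can also be printed directly:

scala -version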


Installing and Configuring Spark

1. Download the Spark package that matches your Hadoop version

The following configuration is done on the master node; once finished, it is copied to the slave nodes.

2. Create a new directory, copy the Spark package into it, and extract it with the following command:

tar -zxvf spark-2.2.1-bin-hadoop2.6.tgz

3. Configure environment variables

vi /etc/profile

Append the following at the end of the file:

export SPARK_HOME=/home/spark/spark-2.2.1-bin-hadoop2.6/

export SCALA_HOME=/home/spark/scala-2.12.4

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin:$SPARK_HOME/bin:$SCALA_HOME/bin

CLASSPATH=:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib/dt.jar

export JAVA_HOME JRE_HOME PATH CLASSPATH

Run the following command to make the changes take effect:

source /etc/profile
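
A quick way to confirm the new variables took effect in the current shell (purely a sanity check):

echo $SPARK_HOME

which spark-shell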

4. Go into the conf directory under SPARK_HOME and configure spark-env.sh. A fresh Spark installation ships a template file; copy it and then edit it:

cp spark-env.sh.template spark-env.sh

vi spark-env.sh

Add the following to the file:

export JAVA_HOME=/usr/java/jdk1.8.0_131

export SCALA_HOME=/home/spark/scala-2.12.4

export HADOOP_CONF_DIR=/home/hadoop/hadoop-2.6.1/etc/hadoop

export SPARK_MASTER_IP=172.19.0.189

export SPARK_MASTER_HOST=172.19.0.189

export SPARK_LOCAL_IP=172.19.0.189

export SPARK_WORKER_MEMORY=1g

export SPARK_WORKER_CORES=2

export SPARK_HOME=/home/spark/spark-2.2.1-bin-hadoop2.6
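
Two notes on these settings (general Spark 2.x behavior, not specific to this cluster): SPARK_MASTER_HOST supersedes the older SPARK_MASTER_IP, so keeping both is harmless but only the former is required; and SPARK_LOCAL_IP is the address the local node binds to, so if this file is later copied to the slaves it should normally be changed on each slave to that node's own address, for example:

export SPARK_LOCAL_IP=192.168.0.101   # placeholder only; substitute each node's own IP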

5. Configure the slaves file (also under conf) according to the slave nodes in your cluster

cp slaves.template slaves

vi slaves

Add the following to the file:

slave1

slave2
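
Note that slave1 and slave2 must be hostnames the master can resolve (typically via /etc/hosts on every node) and reach over passwordless SSH. This file is only read by Spark's own standalone start scripts; when jobs are submitted to YARN, as in this post, it is not used.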

6. Send the extracted package to the slave nodes

scp -r /home/spark/spark-2.2.1-bin-hadoop2.6 root@slave1:/home/spark/spark-2.2.1-bin-hadoop2.6

scp -r /home/spark/spark-2.2.1-bin-hadoop2.6 root@slave2:/home/spark/spark-2.2.1-bin-hadoop2.6

7. Remember to configure /etc/profile on the slave nodes as well

8. Run spark-shell to check whether Spark installed successfully; output like the following indicates success:

[root@master scala-2.12.4]# spark-shell

Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

18/03/28 16:33:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Spark context Web UI available at http://172.19.0.189:4040

Spark context available as 'sc' (master = local[*], app id = local-1522269233515).

Spark session available as 'spark'.

Welcome to

     ____              __

    / __/__  ___ _____/ /__

   _\ \/ _ \/ _ `/ __/  '_/

  /___/ .__/\_,_/_/ /_/\_\   version 2.2.1

     /_/

        

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)

Type in expressions to have them evaluated.

Type :help for more information.
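
One thing worth noting in this banner: spark-shell reports Scala 2.11.8 because the prebuilt Spark 2.2.1 binaries are compiled against Scala 2.11. The Scala 2.12.4 installed earlier only provides the standalone scala command; Scala applications submitted to this Spark cluster should therefore be built against Scala 2.11.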

 

The WordCount Example Program

       Spark has several run modes. Since I already have Hadoop and YARN running, I run the WordCount example directly in YARN mode. Because the job runs on YARN, there is no need to start Spark's own daemons; for other run modes you would first run start-all.sh.
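
One caveat if you do use the standalone scripts: Hadoop also ships its own (deprecated) start-all.sh, so it is safer to call Spark's script by its full path, e.g.:

$SPARK_HOME/sbin/start-all.sh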

1. In Eclipse on your local machine, create a new Java Project and a class named WordCount. In the examples folder that ships with the Spark package, find the WordCount example program, copy it into the WordCount class, and export a jar; I named the exported jar spark-test.jar. The WordCount code I used is also shown here:

import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
public class WordCount {
	private static final Pattern SPACE = Pattern.compile(" ");

	  public static void main(String[] args) throws Exception {

	    if (args.length < 1) {
	      System.err.println("Usage: JavaWordCount <file>");
	      System.exit(1);
	    }

	    SparkSession spark = SparkSession
	      .builder()
	      .appName("JavaWordCount_1")
	      .getOrCreate();

	    // Read the input file(s) into an RDD of lines.
	    JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD();

	    // Split each line on spaces into individual words.
	    JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());

	    // Pair each word with a count of 1, then sum the counts per word.
	    JavaPairRDD<String, Integer> ones = words.mapToPair(s -> new Tuple2<>(s, 1));

	    JavaPairRDD<String, Integer> counts = ones.reduceByKey((i1, i2) -> i1 + i2);

	    // Collect the results to the driver and print them.
	    List<Tuple2<String, Integer>> output = counts.collect();
	    for (Tuple2<String, Integer> tuple : output) {
	      System.out.println(tuple._1() + ": " + tuple._2());
	    }
	    spark.stop();
	  }
}
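
If you prefer the command line to Eclipse, a rough equivalent for producing the jar (a sketch only, assuming the code above is saved as WordCount.java and $SPARK_HOME is set as configured earlier) would be:

javac -cp "$SPARK_HOME/jars/*" WordCount.java

jar cf spark-test.jar WordCount*.class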

2. Create a directory in HDFS:

hadoop fs -mkdir /wordcount

hadoop fs -mkdir /wordcount/input

Put two files, f1 and f2, into the newly created HDFS directory; their contents can be anything.
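
For example, files along these lines would produce counts similar to the sample output shown later (the exact words are arbitrary):

echo "hello world bye zjw1" > f1

echo "hello world bye zjw2" > f2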

hadoop fs -put f1 /wordcount/input

hadoop fs -put f2 /wordcount/input

3. Run it with the following command:

spark-submit --master yarn-client --name spark-test --class WordCount --executor-memory 1G --total-executor-cores 2 /home/spark/spark-2.2.1-bin-hadoop2.6/input/spark-test.jar hdfs://master:9000/wordcount/input/
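
Note that yarn-client as a --master value still works in Spark 2.x but is deprecated in favor of --master yarn --deploy-mode client, and --total-executor-cores only applies to standalone/Mesos clusters; on YARN the executor count and cores are set with YARN-specific options. A more idiomatic Spark 2.x form of the same submission (the executor numbers here are only illustrative) would be:

spark-submit --master yarn --deploy-mode client --name spark-test --class WordCount --executor-memory 1G --num-executors 2 --executor-cores 1 /home/spark/spark-2.2.1-bin-hadoop2.6/input/spark-test.jar hdfs://master:9000/wordcount/input/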

4. The output during the run looks roughly like this:

18/03/28 15:24:37 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 155 ms on slave2 (executor 2) (1/2)

18/03/28 15:24:37 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 23 ms on slave2 (executor 2) (2/2)

18/03/28 15:24:37 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool

18/03/28 15:24:37 INFO DAGScheduler: ResultStage 1 (collect at WordCount.java:33) finished in 0.179 s

18/03/28 15:24:37 INFO DAGScheduler: Job 0 finished: collect at WordCount.java:33, took 7.068927 s

zjw1: 1

hello: 2

bye: 2

world: 2

zjw2: 1

18/03/28 15:24:37 INFO SparkUI: Stopped Spark web UI at http://172.19.0.189:4040

18/03/28 15:24:37 INFO YarnClientSchedulerBackend: Interrupting monitor thread

18/03/28 15:24:37 INFO YarnClientSchedulerBackend: Shutting down all executors
