For installing and configuring Hadoop 2.6.1, see my earlier post: setting up a distributed environment with RedHat 7 + Hadoop 2.6.1 + JDK 1.8 and running the WordCount example.
1. Download the Scala package
2. Create a new directory, copy the Scala package into it, and extract it with the following command:
tar -zxvf scala-2.12.4.tgz
3. Configure the environment variables
vi /etc/profile
Add the following at the end of the file:
export SCALA_HOME=/home/spark/scala-2.12.4
Then make the change take effect:
source /etc/profile
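For the scala command to resolve from any directory, $SCALA_HOME/bin also needs to be on PATH. A minimal profile snippet for this step (the PATH line is my addition for completeness; the combined PATH shown later in the Spark section already includes it):
export SCALA_HOME=/home/spark/scala-2.12.4
export PATH=$PATH:$SCALA_HOME/bin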
4. Send the extracted package to the slave nodes
scp -r /home/spark/scala-2.12.4 root@slave1:/home/spark/scala-2.12.4
scp -r /home/spark/scala-2.12.4 root@slave2:/home/spark/scala-2.12.4
5. Remember to configure /etc/profile on the slave nodes as well
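If the directory layout is the same on every node, one way to append the export over ssh instead of editing each file by hand (a sketch, assuming root access and the install path above):
ssh root@slave1 'echo "export SCALA_HOME=/home/spark/scala-2.12.4" >> /etc/profile'
ssh root@slave2 'echo "export SCALA_HOME=/home/spark/scala-2.12.4" >> /etc/profile'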
6. Check that Scala is installed correctly; output like the following indicates success:
[root@master scala-2.12.4]# scala
Welcome to Scala 2.12.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131).
Type in expressions for evaluation. Or try :help.
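A non-interactive check also works, and is convenient for confirming the slave nodes; for example (the full binary path avoids depending on PATH on the slaves):
scala -version
ssh slave1 /home/spark/scala-2.12.4/bin/scala -version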
With Scala in place, install Spark next.
1. Download a Spark package that matches your Hadoop version
The following configuration is done on the master node; once everything is set up, it is sent to the slave nodes.
2. Create a new directory, copy the Spark package into it, and extract it with the following command:
tar -zxvf spark-2.2.1-bin-hadoop2.6.tgz
3. Configure the environment variables
vi /etc/profile
Add the following at the end of the file:
export SPARK_HOME=/home/spark/spark-2.2.1-bin-hadoop2.6/
export SCALA_HOME=/home/spark/scala-2.12.4
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin:$SPARK_HOME/bin:$SCALA_HOME/bin
CLASSPATH=:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib/dt.jar
export JAVA_HOME JRE_HOME PATH CLASSPATH
Then make the change take effect:
source /etc/profile
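Before going further, a quick sanity check that the new variables are picked up (a sketch; spark-env.sh is not configured yet, but the version banner does not need it):
echo $SPARK_HOME
spark-submit --version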
4. Go into the conf directory under the Spark home and configure spark-env.sh. A fresh Spark install ships a template file, so copy it and then edit it:
cp spark-env.sh.template spark-env.sh
vi spark-env.sh
Add the following to the file:
export JAVA_HOME=/usr/java/jdk1.8.0_131
export SCALA_HOME=/home/spark/scala-2.12.4
export HADOOP_CONF_DIR=/home/hadoop/hadoop-2.6.1/etc/hadoop   # where Spark finds the Hadoop/YARN configuration
export SPARK_MASTER_IP=172.19.0.189    # older name for the setting below, kept for compatibility
export SPARK_MASTER_HOST=172.19.0.189
export SPARK_LOCAL_IP=172.19.0.189     # the address this node binds to
export SPARK_WORKER_MEMORY=1g          # memory per worker (standalone mode)
export SPARK_WORKER_CORES=2            # cores per worker (standalone mode)
export SPARK_HOME=/home/spark/spark-2.2.1-bin-hadoop2.6
5. Configure the slaves file to list the slave nodes of your cluster:
cp slaves.template slaves
vi slaves
Add the slave hostnames to the file:
slave1
slave2
6. Send the extracted package to the slave nodes
scp -r /home/spark/spark-2.2.1-bin-hadoop2.6 root@slave1:/home/spark/spark-2.2.1-bin-hadoop2.6
scp -r /home/spark/spark-2.2.1-bin-hadoop2.6 root@slave2:/home/spark/spark-2.2.1-bin-hadoop2.6
7. Remember to configure /etc/profile on the slave nodes as well
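One caveat when the whole directory is copied as-is: spark-env.sh above sets SPARK_LOCAL_IP to the master's address, and that variable is the address a node binds to. It does not matter for the YARN run below, but if you later start the standalone daemons, edit the file on each slave so it points at that node's own IP, for example:
vi /home/spark/spark-2.2.1-bin-hadoop2.6/conf/spark-env.sh   # on each slave: set SPARK_LOCAL_IP to the slave's own address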
8. Check that Spark is installed correctly; output like the following indicates success:
[root@master scala-2.12.4]# spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/03/28 16:33:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://172.19.0.189:4040
Spark context available as 'sc' (master = local[*], app id = local-1522269233515).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.
Spark supports several deployment modes. I already have a Hadoop/YARN environment, so I run the WordCount example directly in YARN mode. Because the job runs on YARN, there is no need to start the Spark daemons separately; for the other modes you would need to run Spark's start-all.sh first.
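For reference, Spark's standalone scripts live under its sbin directory (Hadoop ships a start-all.sh of its own, so use the full path to avoid picking up the wrong one); a minimal sketch with the paths used in this post:
/home/spark/spark-2.2.1-bin-hadoop2.6/sbin/start-all.sh
jps   # the master should now list a Master process, each slave a Worker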
1. In Eclipse on your local machine, create a new Java Project and a class named WordCount. In the examples folder that ships with the Spark package, find the wordcount example program, copy it into WordCount, and export it as a jar; I named the exported jar spark-test.jar. The wordcount code I used is included below.
import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class WordCount {
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: JavaWordCount <file>");
            System.exit(1);
        }

        SparkSession spark = SparkSession
                .builder()
                .appName("JavaWordCount_1")
                .getOrCreate();

        // Read the input path (args[0]) and split each line on spaces
        JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD();
        JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());

        // Pair each word with 1, then sum the counts per word
        JavaPairRDD<String, Integer> ones = words.mapToPair(s -> new Tuple2<>(s, 1));
        JavaPairRDD<String, Integer> counts = ones.reduceByKey((i1, i2) -> i1 + i2);

        // Collect the results to the driver and print them
        List<Tuple2<String, Integer>> output = counts.collect();
        for (Tuple2<?, ?> tuple : output) {
            System.out.println(tuple._1() + ": " + tuple._2());
        }
        spark.stop();
    }
}
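Note that args[0] is the input path: the spark-submit command in step 3 below passes hdfs://master:9000/wordcount/input/ as this argument, so the job counts words across every file in that directory.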
2. Create a new directory in HDFS:
hadoop fs -mkdir /wordcount
hadoop fs -mkdir /wordcount/input
Put two files, f1 and f2, into the newly created HDFS directory; their contents can be anything.
hadoop fs -put f1 /wordcount/input
hadoop fs -put f2 /wordcount/input
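The contents of f1 and f2 really are arbitrary; for illustration, the following pair happens to be consistent with the word counts printed at the end of this post, and hadoop fs -ls confirms the upload:
echo "hello world bye zjw1" > f1
echo "hello world bye zjw2" > f2
hadoop fs -ls /wordcount/input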
3. Run the job with the following command:
spark-submit --master yarn-client --name spark-test --class WordCount --executor-memory 1G --total-executor-cores 2 /home/spark/spark-2.2.1-bin-hadoop2.6/input/spark-test.jar hdfs://master:9000/wordcount/input/
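yarn-client as a master URL still works in Spark 2.x but has been deprecated since 2.0; the same submission in the newer form would look roughly like this (all paths and options unchanged):
spark-submit --master yarn --deploy-mode client --name spark-test --class WordCount \
  --executor-memory 1G --total-executor-cores 2 \
  /home/spark/spark-2.2.1-bin-hadoop2.6/input/spark-test.jar \
  hdfs://master:9000/wordcount/input/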
4. The run produces output roughly like the following:
18/03/28 15:24:37 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 155 ms on slave2 (executor 2) (1/2)
18/03/28 15:24:37 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 23 ms on slave2 (executor 2) (2/2)
18/03/28 15:24:37 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
18/03/28 15:24:37 INFO DAGScheduler: ResultStage 1 (collect at WordCount.java:33) finished in 0.179 s
18/03/28 15:24:37 INFO DAGScheduler: Job 0 finished: collect at WordCount.java:33, took 7.068927 s
zjw1: 1
hello: 2
bye: 2
world: 2
zjw2: 1
18/03/28 15:24:37 INFO SparkUI: Stopped Spark web UI at http://172.19.0.189:4040
18/03/28 15:24:37 INFO YarnClientSchedulerBackend: Interrupting monitor thread
18/03/28 15:24:37 INFO YarnClientSchedulerBackend: Shutting down all executors
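Because the job went through YARN, it also shows up on the YARN side; for example, it can be checked from the command line, or in the ResourceManager web UI (http://master:8088 by default):
yarn application -list -appStates FINISHED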