Installation
Official site: http://spark.apache.org/
Download the release, unpack it, and run it directly:
./bin/spark-shell
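Terminology
RDD (Resilient Distributed Dataset)
A resilient distributed dataset: an in-memory, read-only, partitioned collection of data
Dataset / DataFrame
A higher-level abstraction than the RDD that offers a more abstract API (e.g. SQL queries)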
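Transformations
Operations that derive a new RDD from an existing one, including map, flatMap, filter, groupByKey, and reduceByKey. They are lazy: until an Action is triggered, Spark only records the operation
Actions
Operations that execute the computation on an RDD, including collect, count, reduce, and take. An Action triggers all recorded operations (e.g. reading the data) and returns the result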
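Broadcast
A broadcast variable: the Driver sends one copy to each Executor instead of one copy per task
Lambda
An anonymous function; syntactic sugar for functional-style programming (e.g. x -> x + 1; (x, y) -> x + y)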
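The split between lazy Transformations and eager Actions can be seen in miniature with plain Java streams, whose intermediate operations are also lazy and whose terminal operations trigger the work. The sketch below is only an analogy built on java.util.stream, not the Spark API; the class name LazyPipelineSketch and the mapCalls counter are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Plain-Java analogy only: java.util.stream, not Spark.
class LazyPipelineSketch {

    // Counts how many times the map lambda has actually run.
    static final AtomicInteger mapCalls = new AtomicInteger();

    // Builds the pipeline. Like a Spark Transformation, this only records the steps.
    static Stream<Integer> buildPipeline(List<Integer> input) {
        return input.stream()
                .map(x -> { mapCalls.incrementAndGet(); return x + 1; })
                .filter(x -> x % 2 == 0);
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = buildPipeline(Arrays.asList(1, 2, 3, 4));
        // No element has been processed yet, so the counter is still zero.
        System.out.println("map calls before terminal op: " + mapCalls.get());

        // The terminal operation plays the role of a Spark Action: it runs everything.
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println("result: " + result + ", map calls: " + mapCalls.get());
    }
}
```

The lambdas passed to map and filter are the same kind of anonymous function described under Lambda.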
Examples
word count
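import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;
// sc is an existing JavaSparkContext
JavaRDD<String> textFile = sc.textFile("test.txt");
JavaPairRDD<String, Integer> counts = textFile
    .flatMap(s -> Arrays.asList(s.split(" ")).iterator())  // split each line into words
    .mapToPair(word -> new Tuple2<>(word, 1))              // pair every word with a count of 1
    .reduceByKey((a, b) -> a + b);                         // sum the counts per word
counts.saveAsTextFile("...");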
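To make clear what each stage of the word count computes, here is a minimal single-machine sketch of the same logic on an in-memory list. It uses java.util.stream rather than Spark, so it says nothing about partitioning or shuffling; the class and method names (WordCountSketch, countWords) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Single-machine analogue of the Spark word count; not the Spark API.
class WordCountSketch {

    static Map<String, Integer> countWords(List<String> lines) {
        return lines.stream()
                // flatMap step: one line becomes many words
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // mapToPair + reduceByKey steps: pair each word with 1, then sum per word
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("a b a", "b c")));
    }
}
```

As in the Spark version, splitting on a single space means consecutive spaces produce empty-string tokens that are counted like any other word.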
Transformations
map
JavaRDD<String> rdd1 = sc.textFile("test.txt");
JavaRDD<String> rdd2 = rdd1.map(x -> x + " ");
mapToPair
JavaPairRDD<String, Integer> rdd3 = rdd2.mapToPair(x -> new Tuple2<>(x, 1));
filter
JavaPairRDD<String, Integer> rdd4 = rdd3.filter(x -> !x._1.equals(" "));
groupByKey
JavaPairRDD<String, Iterable<Integer>> rdd5 = rdd4.groupByKey();
mapValues
JavaPairRDD<String, Integer> rdd6 = rdd5.mapValues(x -> StreamSupport.stream(x.spliterator(), false)
    .collect(Collectors.toList()).size());
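The shape of the data after groupByKey and mapValues is easy to lose track of, so the following sketch reproduces those two steps on an in-memory list. It is plain Java with hypothetical names (GroupByKeySketch, pairAndGroup, groupSizes) and only mirrors the semantics; it does not use Spark.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of mapToPair -> groupByKey -> mapValues; not the Spark API.
class GroupByKeySketch {

    // mapToPair + groupByKey analogue: pair each key with 1, then collect the 1s per key.
    static Map<String, List<Integer>> pairAndGroup(List<String> keys) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String key : keys) {
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // mapValues analogue: keep every key, replace its group of values with the group size.
    static Map<String, Integer> groupSizes(Map<String, List<Integer>> grouped) {
        Map<String, Integer> sizes = new TreeMap<>();
        grouped.forEach((key, values) -> sizes.put(key, values.size()));
        return sizes;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = pairAndGroup(Arrays.asList("a ", "b ", "a "));
        System.out.println(grouped);
        System.out.println(groupSizes(grouped));
    }
}
```

The intermediate map corresponds to rdd5 (one group of values per key) and the final map to rdd6 (one count per key).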
Actions
collect
rdd6.collect().forEach(x -> repo.save(x));
Spark SQL
Dataset<Row> msgs = MongoSpark.load(jsc, readConfig).toDF();
msgs.createOrReplaceTempView("msg_table");
Dataset<Row> msgs1 = sparkServer.getSparkSession().sql(
    XString.format("select id, ts, sd from msg_table " +
        "where ts >= {} and ts <= {}", startTs, endTs));
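XString.format in the snippet above appears to be a project-specific helper that fills {} placeholders. For readers without that helper, the query string can be built with the JDK alone. The sketch below assumes startTs and endTs are integer timestamps (the original does not state their type) and uses hypothetical names (MsgQuerySketch, buildQuery).

```java
import java.util.Locale;

// Builds the msg_table range query with String.format instead of the XString helper.
class MsgQuerySketch {

    // Assumption: timestamps are long values; adjust the format if they are not.
    static String buildQuery(long startTs, long endTs) {
        if (startTs > endTs) {
            throw new IllegalArgumentException("startTs must not be greater than endTs");
        }
        // Locale.ROOT keeps the digits plain regardless of the JVM's default locale.
        return String.format(Locale.ROOT,
                "select id, ts, sd from msg_table where ts >= %d and ts <= %d",
                startTs, endTs);
    }

    public static void main(String[] args) {
        System.out.println(buildQuery(100L, 200L));
    }
}
```

The returned string can then be passed to SparkSession.sql in place of the XString.format call.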
Cluster component diagram
Cluster Manager
- Standalone
- Apache Mesos
- Hadoop YARN
- Kubernetes
Cluster deploy modes
Client Mode
Cluster Mode
Standalone configuration
- Set up SSH keys so the Master can log in to each Worker without a password
- Install the Java 8 JRE
Spark Master configuration
/etc/hosts
x.x.x.x worker1
$SPARK_HOME/conf/slaves
worker1
$SPARK_HOME/conf/spark-env.sh
export SPARK_MASTER_HOST=__master_ip__
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=4
Spark Worker configuration
~/.profile
export SPARK_HOME=/var/services/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
$SPARK_HOME/conf/spark-env.sh
export SPARK_MASTER_HOST=__master_ip__
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=4
Start the Spark cluster from the Spark Master
dev@master:~$ $SPARK_HOME/sbin/start-all.sh
References:
http://spark.apache.org/docs/latest/cluster-overview.html
https://www.cnblogs.com/chengjunhao/p/8028264.html