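Spark Getting-Started Tutorial

This article is a translation of the official programming guide at http://spark.apache.org/docs/latest/programming-guide.html

When reposting, please credit: ylf13@元子

1. Overview

In a Spark application, a driver program runs the user-defined main function and executes various parallel operations on the cluster. The main abstraction Spark provides is the RDD (resilient distributed dataset): a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs can be created in several ways, for example from a file in the Hadoop file system or from an existing Scala collection (a data structure provided by the Scala language). Users can also ask Spark to persist an RDD in memory so that it can be reused quickly across parallel operations. (Translator's note: this greatly speeds up iterative computation. Some machine learning algorithms, for example, reuse intermediate results repeatedly; as a Hadoop MR job each iteration would write to HDFS, and the heavy disk I/O hurts performance.) Finally, RDDs are fault tolerant and recover automatically from node failures.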

 

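The second abstraction is shared variables, which can be used in parallel operations. By default, when Spark runs a function in parallel on different nodes, it ships a copy of each variable the function needs to every node, and each task computes with its local copy. Sometimes, however, a variable needs to be shared across tasks, or between tasks and the driver program. (Translator's note: the same problem exists in Hadoop MR, where tasks cannot share a static variable. Workarounds include storing the value on HDFS and reading it on each call, or using a cache such as Redis.) Spark provides two types of shared variables:

(1) Broadcast variables: cached in memory on all nodes

(2) Accumulators: variables that can only be added to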

 

2. Linking with Spark

The following uses Java as the example.

To write a Spark application in Java, you need to add a dependency on Spark. Spark is available through Maven Central at:

groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.3.0

In addition, if you wish to access an HDFS cluster, you need to add a dependency on hadoop-client for your version of HDFS. Some common HDFS version tags are listed on the third party distributions page.

groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>

Finally, you need to import some Spark classes into your program. Add the following lines:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;

 


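Of course, if you are not using Maven, you can configure external JARs in Eclipse and import the JAR files under the spark/lib directory directly.


3. Initializing Spark

First create a JavaSparkContext object, which tells Spark how to access a cluster. Before creating the JavaSparkContext, build a SparkConf configuration object that holds information about your application.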

SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);

JavaSparkContext sc = new JavaSparkContext(conf);

 

appName: the name shown in the cluster UI

master: a Spark, Mesos, or YARN cluster URL, or the special string "local" to run in local mode.

 

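4. Using the Shell

Spark ships with an interactive shell. In the shell a SparkContext is already created for you in the variable sc. To choose which master the context connects to, pass the argument --master URL.

To add JARs to the classpath, pass --jars.

 

./bin/spark-shell --master local[4] --jars code.jar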

 

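5. Introduction to RDDs

There are two ways to create an RDD:

(1) Parallelize an existing collection in the driver program

(2) Reference a dataset in an external storage system (HDFS, HBase, or another data source)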

 

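5.1 Parallelized Collections

A parallelized collection is created by calling JavaSparkContext's parallelize() method on an existing collection. The elements of the collection are copied to other nodes so that they can be operated on in parallel.

List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);

JavaRDD<Integer> distData = sc.parallelize(data);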

 

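We can now operate on distData in parallel:

distData.reduce((a, b) -> a + b);

Note that the function above uses Java 8 lambda syntax. Older versions of Java lack this feature, so there you must implement the interfaces in org.apache.spark.api.java.function instead.

 

Another important parameter when parallelizing a collection is the number of partitions. How many pieces Spark cuts the collection into affects performance. Spark runs one task for each partition, and typically you want 2-4 partitions for each CPU in the cluster, so you can choose the partition count accordingly:

sc.parallelize(data, 10); // set the number of partitions explicitly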
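
To make the idea of cutting a collection into partitions concrete, here is a minimal plain-Java sketch. It does not call Spark; the class name SliceSketch and the method slice are hypothetical, and the contiguous, nearly equal split is an assumption made for illustration rather than a statement of how Spark places data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: splits a local list into contiguous slices,
// the way one might picture a collection being divided into partitions.
class SliceSketch {

    /** Splits data into numSlices contiguous slices of nearly equal size. */
    static <T> List<List<T>> slice(List<T> data, int numSlices) {
        if (numSlices < 1) {
            throw new IllegalArgumentException("numSlices must be >= 1");
        }
        List<List<T>> slices = new ArrayList<>();
        int n = data.size();
        for (int i = 0; i < numSlices; i++) {
            // Integer arithmetic spreads any remainder across the slices.
            int start = (int) ((long) i * n / numSlices);
            int end = (int) ((long) (i + 1) * n / numSlices);
            slices.add(new ArrayList<>(data.subList(start, end)));
        }
        return slices;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(slice(data, 2)); // prints [[1, 2], [3, 4, 5]]
    }
}
```

Each slice would then be handled by one task, which is why the slice count bounds how much work can run at the same time.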

 

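5.2 External Datasets

Spark supports many external data sources: the local file system, HDFS, HBase, Amazon S3, and others. Supported file formats include text files, SequenceFiles, and any other Hadoop InputFormat.

For example, a local text file can be read as a collection of lines:

JavaRDD<String> distFile = sc.textFile("data.txt");

 

textFile also accepts a directory or a path containing wildcards:

sc.textFile("/home/ylf/examples/");

sc.textFile("/home/ylf/examples/*.txt");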

 

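6. RDD Operations

RDDs support two types of operations: transformations and actions.

(1) Transformations: create a new dataset from an existing one

(2) Actions: run a computation on the dataset and return a value to the driver program

For example, map is a transformation: it passes each element of the existing dataset through a function and returns a new RDD.

reduce, by contrast, is an action that aggregates all the elements of the RDD. There is one exception to note: reduceByKey is not an action; it returns an RDD.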

 

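All transformations are lazy: they are not executed at the moment they are called. They run only when an action needs their result, which lets Spark run more efficiently.

 

By default a transformed RDD may be recomputed each time an action runs on it. To avoid this, keep it in memory with persist or cache.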

 

Here is a simple example:

JavaRDD<String> lines = sc.textFile("data.txt");

JavaRDD<Integer> lineLengths = lines.map(line -> line.length());

int totalLength = lineLengths.reduce((a, b) -> a + b);

 

In this example lineLengths is not computed immediately, because transformations are lazy. The whole computation starts only when reduce is called.

If we need to use lineLengths again later, we can persist it:

lineLengths.persist(StorageLevel.MEMORY_ONLY());

Add this line before the reduce; lineLengths is then saved in memory after the first time it is computed.
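
The lazy behaviour described above can be observed without a cluster. The following is only an analogy using java.util.stream, whose intermediate operations are also deferred until a terminal operation runs. It is not Spark code, and the class name LazyStreamSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Analogy only: Java streams, like RDD transformations, do no work
// until a terminal operation (the counterpart of an action) is invoked.
class LazyStreamSketch {

    /** Returns {map calls before the reduce, map calls after it, total length}. */
    static int[] run(List<String> lines) {
        AtomicInteger calls = new AtomicInteger();

        // "Transformation": declared here, but not executed yet.
        Stream<Integer> lineLengths = lines.stream().map(s -> {
            calls.incrementAndGet();
            return s.length();
        });
        int before = calls.get();

        // "Action": the reduce triggers the whole pipeline.
        int total = lineLengths.reduce(0, Integer::sum);
        int after = calls.get();

        return new int[] {before, after, total};
    }

    public static void main(String[] args) {
        int[] r = run(Arrays.asList("ab", "cde"));
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // prints 0 2 5
    }
}
```

One difference worth keeping in mind: a Java stream can be consumed only once, whereas an RDD can be reused by several actions, which is where persisting it pays off.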

 

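7. Passing Functions to Spark

Spark relies on functions passed from the driver program to do its computation. In Java there are two ways to provide such functions:

(1) Implement the interfaces in the org.apache.spark.api.java.function package

(2) Use Java 8 lambda expressions

 

We have already used lambdas above; here is the traditional approach: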

// map function: maps a line to its length
class GetLineLength implements Function<String, Integer> {
  public Integer call(String s) {
    return s.length();
  }
}

// reduce function: adds two partial results
class Sum implements Function2<Integer, Integer, Integer> {
  public Integer call(Integer a, Integer b) {
    return a + b;
  }
}

// driver code
JavaRDD<String> file = sc.textFile("xxx");

JavaRDD<Integer> lineLengths = file.map(new GetLineLength());

int total = lineLengths.reduce(new Sum());

 

 

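8. Working with Key-Value Pairs

To count word frequencies we need key-value pairs that store the count for each word. Spark RDDs can hold many types of objects, but a few operations are available only on RDDs of key-value pairs. In Java, a key-value pair is represented by the scala.Tuple2 class; simply call new Tuple2(a, b) to create one.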

 

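The type of the resulting RDD changes as well, to JavaPairRDD:

JavaRDD<String> file = sc.textFile("xxx");

JavaPairRDD<String, Integer> pairs = file.mapToPair(line -> new Tuple2<>(line, 1));

JavaPairRDD<String, Integer> counters = pairs.reduceByKey((a, b) -> a + b);

counters.sortByKey(); // returns a new RDD sorted by key

counters.collect() brings the result back to the driver as a Java List of Tuple2 objects.

 

If you use a custom object as the key, remember to override both hashCode() and equals().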
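
The sketch below shows, in plain Java, why a custom key needs both methods. Without them, two keys holding the same values would not be recognized as equal and their counts would never be merged. The names WordKey and CustomKeySketch are hypothetical, and a local HashMap stands in for the per-key aggregation that reduceByKey performs; no Spark API is called.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical custom key: two instances with the same fields must be equal.
final class WordKey {
    private final String word;
    private final String language;

    WordKey(String word, String language) {
        this.word = word;
        this.language = language;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof WordKey)) return false;
        WordKey other = (WordKey) o;
        return word.equals(other.word) && language.equals(other.language);
    }

    @Override
    public int hashCode() {
        // Must be consistent with equals: equal keys produce equal hashes.
        return Objects.hash(word, language);
    }
}

class CustomKeySketch {

    /** Counts occurrences per key, merging values with (a, b) -> a + b. */
    static Map<WordKey, Integer> countByKey(String[] words, String language) {
        Map<WordKey, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(new WordKey(w, language), 1, (a, b) -> a + b);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<WordKey, Integer> counts = countByKey(new String[] {"a", "b", "a"}, "en");
        System.out.println(counts.get(new WordKey("a", "en"))); // prints 2
    }
}
```

Removing either override from WordKey would typically leave three separate entries with a count of 1 each, which is the kind of silent error the warning above is about.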

 

9. Common Transformations and Actions

Transformations

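The following lists some of the transformations Spark supports.

Transformation

Meaning

map(func)

Return a new distributed dataset formed by passing each element of the source dataset through the function func.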

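filter(func)

Return a new RDD formed by selecting the elements of the source dataset that satisfy the given condition.

flatMap(func)

Similar to map, but each input item can produce zero or more output items, and the results are flattened into a single dataset.

mapPartitions(func)

Similar to map, but operates on each partition (block) of the RDD rather than on each element, so func must be of type Iterator<T> => Iterator<U> when run on an RDD of type T.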

mapPartitionsWithIndex(func)

Similar to mapPartitions, but func also receives the integer index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U>.

sample(withReplacement, fraction, seed)

Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.

union(otherDataset)

Return a new dataset that contains the union of the elements in the source dataset and the argument.

intersection(otherDataset)

Return a new RDD that contains the intersection of elements in the source dataset and the argument.

distinct([numTasks])

Return a new dataset that contains the distinct elements of the source dataset.

groupByKey([numTasks])

When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.
Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.

reduceByKey(func, [numTasks])

When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])

When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

sortByKey([ascending], [numTasks])

When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

join(otherDataset, [numTasks])

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

cogroup(otherDataset, [numTasks])

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.

cartesian(otherDataset)

When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

pipe(command, [envVars])

Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.

coalesce(numPartitions)

Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

repartition(numPartitions)

Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.

repartitionAndSortWithinPartitions(partitioner)

Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.

 

Actions

The following table lists some of the common actions supported by Spark. Refer to the RDD API doc (Scala, Java, Python) and pair RDD functions doc (Scala, Java) for details.

Action

Meaning

reduce(func)

Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

collect()

Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

count()

Return the number of elements in the dataset.

first()

Return the first element of the dataset (similar to take(1)).

take(n)

Return an array with the first n elements of the dataset.

takeSample(withReplacement, num, [seed])

Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.

takeOrdered(n, [ordering])

Return the first n elements of the RDD using either their natural order or a custom comparator.

saveAsTextFile(path)

Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.

saveAsSequenceFile(path)
(Java and Scala)

Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).

saveAsObjectFile(path)
(Java and Scala)

Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().

countByKey()

Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.

foreach(func)

Run a function func on each element of the dataset. This is usually done for side effects such as updating an accumulator variable (see below) or interacting with external storage systems.
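
The requirement that the function passed to reduce be commutative and associative can be checked with a small plain-Java experiment. The sketch below reduces a list once in a single pass and once as two separately reduced halves that are then combined, which loosely mimics per-partition reduction. The names ReduceOrderSketch, reduceAll and reduceInTwoParts are hypothetical and no Spark API is involved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

// Illustration only: compares a single-pass reduce with a "per-partition" reduce.
class ReduceOrderSketch {

    /** Left-to-right reduce over the whole (non-empty) list. */
    static int reduceAll(List<Integer> data, BinaryOperator<Integer> f) {
        int acc = data.get(0);
        for (int i = 1; i < data.size(); i++) {
            acc = f.apply(acc, data.get(i));
        }
        return acc;
    }

    /** Reduces data[0, split) and data[split, n) separately, then combines both results. */
    static int reduceInTwoParts(List<Integer> data, int split, BinaryOperator<Integer> f) {
        int left = reduceAll(data.subList(0, split), f);
        int right = reduceAll(data.subList(split, data.size()), f);
        return f.apply(left, right);
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4);

        // Addition is associative and commutative: both strategies agree.
        System.out.println(reduceAll(data, (a, b) -> a + b));           // prints 10
        System.out.println(reduceInTwoParts(data, 2, (a, b) -> a + b)); // prints 10

        // Subtraction is neither: the result depends on how the data is grouped.
        System.out.println(reduceAll(data, (a, b) -> a - b));           // prints -8
        System.out.println(reduceInTwoParts(data, 2, (a, b) -> a - b)); // prints 0
    }
}
```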

 

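10. RDD Persistence

Another strength of Spark is persisting datasets in memory. When an RDD is persisted, each node in the cluster caches its own partitions and reuses them in later actions, which makes those actions much faster (often by 10x or more). Caching also makes iterative algorithms fast.

 

Persisting an RDD is simple: call persist() or cache(). After the first action runs, the computed result is kept for later reuse. The cache is also fault tolerant: if data is lost on some node, the cluster rebuilds it from the recorded transformations.

 

Caching comes in several storage levels. Data can be persisted not only in memory but also on disk, and it can be stored in serialized form. Pass the desired level to persist(); cache() uses the default in-memory level.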

Storage Level

Meaning

MEMORY_ONLY

Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.

MEMORY_AND_DISK

Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

MEMORY_ONLY_SER

Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

MEMORY_AND_DISK_SER

Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.

DISK_ONLY

Store the RDD partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.

Same as the levels above, but replicate each partition on two cluster nodes.

OFF_HEAP (experimental)

Store RDD in serialized format in Tachyon. Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage collection overhead and allows executors to be smaller and to share a pool of memory, making it attractive in environments with large heaps or multiple concurrent applications. Furthermore, as the RDDs reside in Tachyon, the crash of an executor does not lead to losing the in-memory cache. In this mode, the memory in Tachyon is discardable. Thus, Tachyon does not attempt to reconstruct a block that it evicts from memory.

 

 

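Memory is of course limited. Spark automatically monitors memory usage on each node and uses an LRU (least-recently-used) policy to drop old results. You can also remove an RDD manually with rdd.unpersist().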
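
For readers unfamiliar with LRU eviction, here is a minimal plain-Java sketch of the policy, built on LinkedHashMap in access order. It illustrates only the eviction rule and does not model how Spark tracks partitions or memory sizes. The class name LruCacheSketch and the entry-count capacity are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration of least-recently-used eviction with a fixed number of entries.
class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCacheSketch(int capacity) {
        // accessOrder = true: iteration order goes from least to most recently used.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; evict once the capacity is exceeded.
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCacheSketch<String, Integer> cache = new LruCacheSketch<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");     // "a" becomes the most recently used entry
        cache.put("c", 3);  // capacity exceeded: "b" is evicted
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```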

 

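11. Shared Variables

When the machines in a cluster execute tasks, each node keeps its own copy of the data it needs and does not synchronize with the others, because synchronization would reduce efficiency. Some situations do require shared variables, however, so Spark compromises and provides just two kinds: broadcast variables and accumulators.

11.1 Broadcast Variables

Broadcast variables let the program keep a read-only variable cached on each machine, instead of shipping a copy of it with every task.

A broadcast variable is created from an ordinary variable v by calling SparkContext.broadcast(v). Its value can be read with value() but cannot be modified; it is read-only.

Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3});

broadcastVar.value();  // returns [1, 2, 3]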

 

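11.2 Accumulators

An accumulator is a variable that supports an "add" operation across distributed tasks.

Accumulator<Integer> accum = sc.accumulator(0);

sc.parallelize(Arrays.asList(1, 2, 3, 4)).foreach(x -> accum.add(x));

accum.value();  // returns 10

Even though the data is processed in a distributed environment, the final value of accum is consistent. Only the order of the additions may differ,

for example 1+2+3+4 or 2+4+3+1.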
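
The order-independence described above can be reproduced locally. The following analogy uses a parallel Java stream with an AtomicInteger as a stand-in for the accumulator; it is not the Spark Accumulator API, and the class name OrderIndependentSumSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Analogy only: several threads add to one shared counter in an unspecified order.
class OrderIndependentSumSketch {

    /** Adds every element to a shared counter from a parallel stream. */
    static int parallelSum(List<Integer> data) {
        AtomicInteger accum = new AtomicInteger(0);
        // The order in which elements are added is not defined,
        // but addition is commutative, so the final value does not depend on it.
        data.parallelStream().forEach(x -> accum.addAndGet(x));
        return accum.get();
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(Arrays.asList(1, 2, 3, 4))); // prints 10
    }
}
```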

 

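The accumulator above uses the built-in Integer type, but we can also define our own type by implementing AccumulatorParam.

AccumulatorParam has two methods: zero, which provides a "zero value", and addInPlace, which performs the addition.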

class VectorAccumulatorParam implements AccumulatorParam<Vector> {
  public Vector zero(Vector initialValue) {
    return Vector.zeros(initialValue.size());
  }
  public Vector addInPlace(Vector v1, Vector v2) {
    v1.addInPlace(v2); return v1; // defines how two custom objects are added together
  }
}

 

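With this in place we can create an accumulator of our own type:

Accumulator<Vector> vecAccum = sc.accumulator(new Vector(..), new VectorAccumulatorParam());

The first argument is the initial value; the second is the addition logic we defined.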

 

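Addendum: how to run a packaged JAR

You can use the run-example script in the bin directory of your Spark installation as a reference:

$ cd $SPARK_HOME/bin
$ vi run-example

The last command it runs is:

./spark-submit --master xxxx --class $Main-class $Jar [arguments]
So we only need to specify the master, the JAR file, and the class that contains the main function.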
