Spark Streaming

  1. Overview

Spark Streaming, like Apache Storm, is used for processing streaming data. According to the official documentation (https://spark.apache.org/streaming/), Spark Streaming is characterized by high throughput and strong fault tolerance.

Spark Streaming supports many input sources, such as Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. Once ingested, the data can be processed with Spark's high-level primitives such as map, reduce, join, and window, and the results can be saved to many destinations, such as HDFS or databases. Spark Streaming also integrates seamlessly with MLlib (machine learning) and GraphX.
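As a quick illustration, here is a minimal streaming word count in Scala. It is a sketch, not production code: the socket source on localhost:9999, the 5-second batch interval, and the application name are all placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, at least one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    // Group incoming data into 5-second batches (illustrative interval)
    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholder source: lines of text from a TCP socket (e.g. started with `nc -lk 9999`)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Classic word count expressed with DStream primitives
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()          // print the first few results of each batch

    ssc.start()             // start the computation
    ssc.awaitTermination()  // wait for it to terminate
  }
}

Instead of print(), the results could be written out with, for example, saveAsTextFiles, matching the HDFS and database sinks mentioned above.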

  2. DStream

A Discretized Stream (DStream) is the basic abstraction of Spark Streaming. It represents a continuous data stream, either the input stream itself or the result stream produced by applying Spark primitives to it. Internally, a DStream is represented as a sequence of contiguous RDDs, each containing the data of one time interval, as shown in the figure below:

Operations on the data are performed in units of RDDs.

The actual computation is carried out by the Spark engine.

The figure also shows that the input data is divided into batches for processing. Strictly speaking, Spark Streaming is not a true real-time framework, because it processes data batch by batch (micro-batching).
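Because each batch is an ordinary RDD, you can also drop down to the RDD level explicitly. Continuing the word-count sketch above, foreachRDD exposes the RDD behind every batch together with its batch time; the printout is illustrative.

// Continuing the sketch above: inspect the RDD behind each 5-second batch
lines.foreachRDD { (rdd, time) =>
  println(s"Batch at $time contains ${rdd.count()} records")
}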

  3. Operations on DStreams

3.1 Transformations on DStreams

Transformation    Meaning

map(func)
    Return a new DStream by passing each element of the source DStream through a function func.

flatMap(func)
    Similar to map, but each input item can be mapped to 0 or more output items.

filter(func)
    Return a new DStream by selecting only the records of the source DStream on which func returns true.

repartition(numPartitions)
    Changes the level of parallelism in this DStream by creating more or fewer partitions.

union(otherStream)
    Return a new DStream that contains the union of the elements in the source DStream and otherDStream.

count()
    Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.

reduce(func)
    Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel.

countByValue()
    When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.

reduceByKey(func, [numTasks])
    When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: by default, this uses Spark's default number of parallel tasks (2 for local mode; in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.

join(otherStream, [numTasks])
    When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.

cogroup(otherStream, [numTasks])
    When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.

transform(func)
    Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.

updateStateByKey(func)
    Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.
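As an example of the stateful transformation at the end of the table, here is a sketch of a running word count with updateStateByKey. Stateful operations require a checkpoint directory; the path, socket source, and batch interval below are placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StatefulWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/spark-checkpoint")  // placeholder checkpoint directory, required for state

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // Merge this batch's counts for each key into the running total kept as state
    val runningCounts = pairs.updateStateByKey[Int] {
      (newValues: Seq[Int], state: Option[Int]) =>
        Some(newValues.sum + state.getOrElse(0))
    }

    runningCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Unlike reduceByKey, which aggregates only within a single batch, updateStateByKey carries the per-key totals across batches, which is why checkpointing is mandatory.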
