Common Spark Operators

Here we summarize the usage of Spark RDD operators from the perspective of their source code.

Table of Contents

  • Single-value Transformation Operators
    • map
    • flatMap
    • mapPartitions
    • mapPartitionsWithIndex
    • foreach
    • foreachPartition
    • glom
    • union
    • cartesian
    • groupBy
    • filter
    • distinct
    • subtract
    • persist and cache
    • sample
  • Key-value Transformation Operators
    • groupByKey
    • combineByKey
    • reduceByKey
    • sortByKey
    • cogroup
    • join
  • Action Operators
    • collect
    • reduce
    • take
    • top
    • count
    • takeSample
    • saveAsTextFile
    • countByKey
    • aggregate
    • fold

Single-value Transformation Operators

map

  /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

The source contains a call to sc.clean(). Its job is to strip out external references captured by the closure that cannot be serialized. Scala supports closures: a closure saves the objects it references from the enclosing scope inside itself, so it can be used on its own without worrying about leaving that scope. In a distributed setting like Spark, however, this becomes a problem: if an external reference is not serializable, the closure cannot be shipped correctly to the worker nodes; other captured references may never be used at all and do not need to be sent to the workers either. Under the hood, sc.clean() calls ClosureCleaner.clean(), which recursively traverses the references inside the closure, checks for non-serializable ones, and removes the unused references.
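
As an illustration of why closure cleaning matters (a sketch not taken from the original article; the class and method names are made up), referencing a field of a non-serializable enclosing object pulls the whole object into the closure, whereas copying the value into a local variable keeps the closure small and serializable:

  import org.apache.spark.rdd.RDD

  class Multiplier(factor: Int) {                 // not Serializable
    // `factor` compiles to `this.factor`, so the whole Multiplier instance is
    // captured by the closure and the job typically fails with
    // "Task not serializable".
    def scaleBad(rdd: RDD[Int]): RDD[Int] = rdd.map(x => x * factor)

    // Copying the field into a local val means only an Int is captured.
    def scaleGood(rdd: RDD[Int]): RDD[Int] = {
      val f = factor
      rdd.map(x => x * f)
    }
  }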

map is a coarse-grained operation: for an RDD, it walks each partition with an iterator and applies the function f you supply to that partition, returning a new RDD. In effect, every element of the RDD has the same operation applied to it.

scala> val array = Array(1,2,3,4,5,6)
array: Array[Int] = Array(1, 2, 3, 4, 5, 6)

scala> val rdd = sc.parallelize(array, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26

scala> val mapRdd = rdd.map(x => x * 2)  
mapRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:25

scala> mapRdd.collect().foreach(println)
2
4
6
8
10
12

flatMap

flatMap is similar to map, but allows a single input element to produce multiple output elements, rather than the one-to-one transformation of map.

  /**
   *  Return a new RDD by first applying a function to all elements of this
   *  RDD, and then flattening the results.
   */
  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
  }
scala> val a = sc.parallelize(1 to 10, 5)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24

scala> a.flatMap(num => 1 to num).collect
res1: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

mapPartitions

mapPartitions is another flavor of map: map's input function is applied to every element of the RDD, whereas mapPartitions' input function is applied to each partition, taking the whole partition's contents as its input.

 /**
   * Return a new RDD by applying a function to each partition of this RDD.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
      preservesPartitioning)
  }
scala> def myfunc[T](iter: Iterator[T]):Iterator[(T,T)]={ 
     | var res = List[(T,T)]()
     | var pre = iter.next
     | while(iter.hasNext){
     |   var cur = iter.next
     |   res .::= (pre, cur)
     |   pre = cur
     | }
     | res.iterator
     | }
myfunc: [T](iter: Iterator[T])Iterator[(T, T)]

scala> val a = sc.parallelize(1 to 9,3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> a.mapPartitions(myfunc).collect
res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))

mapPartitionsWithIndex

mapPartitionsWithIndex is similar to mapPartitions, except that it also tracks the index of the original partition, so you can tell which partition each element comes from. Its parameter is a function whose inputs are the integer partition index and an iterator over the partition.

  /**
   * Return a new RDD by applying a function to each partition of this RDD, while tracking the index
   * of the original partition.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitionsWithIndex[U: ClassTag](
      f: (Int, Iterator[T]) => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),
      preservesPartitioning)
  }
scala> val x = sc.parallelize(1 to 10, 3)
x: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24

scala> def myFunc(index:Int, iter:Iterator[Int]):Iterator[String]={
     |   iter.toList.map(x => index + "," + x).iterator
     | }
myFunc: (index: Int, iter: Iterator[Int])Iterator[String]

scala> x.mapPartitionsWithIndex(myFunc).collect
res1: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9, 2,10)

foreach

foreach loops over every element of the RDD and applies the given function to it; it can be used, for example, to print the RDD's elements.

  /**
   * Applies a function f to all elements of this RDD.
   */
  def foreach(f: T => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
  }

scala> var x = sc.parallelize(List(1 to 9), 3)
x: org.apache.spark.rdd.RDD[scala.collection.immutable.Range.Inclusive] = ParallelCollectionRDD[5] at parallelize at <console>:24

scala> x.foreach(print)
Range(1, 2, 3, 4, 5, 6, 7, 8, 9)
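
Note that the RDD above contains a single Range object, which is why print shows the whole Range at once. A variant over individual elements (a sketch, not part of the original transcript; in cluster mode the output goes to the executors' stdout rather than the driver, and the order depends on which partition runs first):

  sc.parallelize(1 to 9, 3).foreach(print)   // e.g. 123456789, element order not guaranteed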

foreachPartition

foreachPartition works like mapPartitions: the function receives an iterator over each partition's data. The difference is that foreachPartition's function has no return value and is executed purely for its side effects.

  /**
   * Applies a function f to each partition of this RDD.
   */
  def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
  }

scala> val b = sc.parallelize(List(1,2,3,4,5,6), 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> b.foreachPartition(x => println(x.reduce((a,b) => a +b)))
7
3
11

glom

glom is similar to collect: collect turns the whole RDD directly into an array, whereas glom packs the data of each partition into an array inside a new RDD of arrays, one array per partition; an RDD with n partitions yields an RDD containing n arrays.

  /**
   * Return an RDD created by coalescing all elements within each partition into an array.
   */
   
  def glom(): RDD[Array[T]] = withScope {
    new MapPartitionsRDD[Array[T], T](this, (context, pid, iter) => Iterator(iter.toArray))
  }

In the example below, RDD a has three partitions, and glom turns a into an RDD made up of three arrays.

scala> val a = sc.parallelize(1 to 9, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24

scala> a.glom.collect
res5: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9))

scala> a.glom
res6: org.apache.spark.rdd.RDD[Array[Int]] = MapPartitionsRDD[4] at glom at <console>:26

union

union is equivalent to the ++ method: it takes the union of two RDDs without removing duplicates.

  /**
   * Return the union of this RDD and another one. Any identical elements will appear multiple
   * times (use `.distinct()` to eliminate them).
   */
  def union(other: RDD[T]): RDD[T] = withScope {
    sc.union(this, other)
  }

  /**
   * Return the union of this RDD and another one. Any identical elements will appear multiple
   * times (use `.distinct()` to eliminate them).
   */
  def ++(other: RDD[T]): RDD[T] = withScope {
    this.union(other)
  }

scala> val a = sc.parallelize(1 to 4,2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:24

scala> val b = sc.parallelize(2 to 5,1)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> a.union(b).collect
res7: Array[Int] = Array(1, 2, 3, 4, 2, 3, 4, 5)


cartesian

Computes the Cartesian product of the elements of two RDDs.

  /**
   * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
   * elements (a, b) where a is in `this` and b is in `other`.
   */
  def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
    new CartesianRDD(sc, this, other)
  }

scala> val a = sc.parallelize(1 to 4,2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:24

scala> val b = sc.parallelize(2 to 5,1)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> a.cartesian(b).collect
res8: Array[(Int, Int)] = Array((1,2), (1,3), (1,4), (1,5), (2,2), (2,3), (2,4), (2,5), (3,2), (3,3), (3,4), (3,5), (4,2), (4,3), (4,4), (4,5))


groupBy

groupBy has three overloads. It turns each element into a key-value pair with the supplied map function and then aggregates the pairs by key with groupByKey.

/**
   * Return an RDD of grouped items. Each group consists of a key and a sequence of elements
   * mapping to that key. The ordering of elements within each group is not guaranteed, and
   * may even differ each time the resulting RDD is evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   */
  def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
    groupBy[K](f, defaultPartitioner(this))
  }

  /**
   * Return an RDD of grouped elements. Each group consists of a key and a sequence of elements
   * mapping to that key. The ordering of elements within each group is not guaranteed, and
   * may even differ each time the resulting RDD is evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   */
  def groupBy[K](
      f: T => K,
      numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
    groupBy(f, new HashPartitioner(numPartitions))
  }

  /**
   * Return an RDD of grouped items. Each group consists of a key and a sequence of elements
   * mapping to that key. The ordering of elements within each group is not guaranteed, and
   * may even differ each time the resulting RDD is evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   */
  def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
      : RDD[(K, Iterable[T])] = withScope {
    val cleanF = sc.clean(f)
    this.map(t => (cleanF(t), t)).groupByKey(p)
  }

scala> val a = sc.parallelize(1 to 9,2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24

scala> a.groupBy(x => {if(x % 2 == 0) "even" else "odd"}).collect
res9: Array[(String, Iterable[Int])] = Array((even,CompactBuffer(2, 4, 6, 8)), (odd,CompactBuffer(1, 3, 5, 7, 9)))

scala> def myfunc(a: Int):Int={
     | a % 2
     | }
myfunc: (a: Int)Int

scala> a.groupBy(myfunc).collect
res10: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(2, 4, 6, 8)), (1,CompactBuffer(1, 3, 5, 7, 9)))

scala> a.groupBy(myfunc(_), 1).collect
res11: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(2, 4, 6, 8)), (1,CompactBuffer(1, 3, 5, 7, 9)))


filter

filter filters the input elements. Its argument is a function that returns a Boolean: if the function evaluates to true for an element, the element passes through; otherwise it is filtered out and does not enter the result.

  /**
   * Return a new RDD containing only the elements that satisfy a predicate.
   */
  def filter(f: T => Boolean): RDD[T] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[T, T](
      this,
      (context, pid, iter) => iter.filter(cleanF),
      preservesPartitioning = true)
  }

scala> val a = sc.parallelize(List("we", "are", "from", "China", "not", "from", "America"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[16] at parallelize at <console>:24

scala> val b = a.filter(x => x.length >= 4)
b: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[17] at filter at <console>:25

scala> b.collect.foreach(println)
from
China
from
America

distinct

distinct removes duplicate elements from the RDD, keeping only the unique ones.

  /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
  }

scala> val a = sc.parallelize(List("we", "are", "from", "China", "not", "from", "America"))
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[18] at parallelize at <console>:24

scala> val b = a.map(x => x.length)
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[19] at map at <console>:25

scala> val c = b.distinct
c: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[22] at distinct at <console>:25

scala> c.foreach(println)
5
4
2
3
7

subtract

subtract computes the set difference A - B: it removes from A every element that also appears in B and returns what remains.

 /**
   * Return an RDD with the elements from `this` that are not in `other`.
   *
   * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
   * RDD will be <= us.
   */
  def subtract(other: RDD[T]): RDD[T] = withScope {
    subtract(other, partitioner.getOrElse(new HashPartitioner(partitions.length)))
  }

  /**
   * Return an RDD with the elements from `this` that are not in `other`.
   */
  def subtract(other: RDD[T], numPartitions: Int): RDD[T] = withScope {
    subtract(other, new HashPartitioner(numPartitions))
  }

  /**
   * Return an RDD with the elements from `this` that are not in `other`.
   */
  def subtract(
      other: RDD[T],
      p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    if (partitioner == Some(p)) {
      // Our partitioner knows how to handle T (which, since we have a partitioner, is
      // really (K, V)) so make a new Partitioner that will de-tuple our fake tuples
      val p2 = new Partitioner() {
        override def numPartitions: Int = p.numPartitions
        override def getPartition(k: Any): Int = p.getPartition(k.asInstanceOf[(Any, _)]._1)
      }
      // Unfortunately, since we're making a new p2, we'll get ShuffleDependencies
      // anyway, and when calling .keys, will not have a partitioner set, even though
      // the SubtractedRDD will, thanks to p2's de-tupled partitioning, already be
      // partitioned by the right/real keys (e.g. p).
      this.map(x => (x, null)).subtractByKey(other.map((_, null)), p2).keys
    } else {
      this.map(x => (x, null)).subtractByKey(other.map((_, null)), p).keys
    }
  }

scala> val a = sc.parallelize(1 to 9, 2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[23] at parallelize at <console>:24

scala> val b = sc.parallelize(2 to 5, 4)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24] at parallelize at <console>:24

scala> val c = a.subtract(b)
c: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[28] at subtract at <console>:27

scala> c.collect
res14: Array[Int] = Array(6, 8, 1, 7, 9)

persist and cache

cache caches the RDD in memory so that it can be reused the next time it is needed. persist persists the RDD at a storage level specified by its argument; called without an argument it uses the default level, which keeps data in memory only and is therefore equivalent to cache.
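
A minimal usage sketch (not from the original article); StorageLevel.MEMORY_AND_DISK is just one of the available levels:

  import org.apache.spark.storage.StorageLevel

  val a = sc.parallelize(1 to 100, 2)
  a.cache()                                // shorthand for persist(StorageLevel.MEMORY_ONLY)
  a.count()                                // the first action materializes the cache

  val b = sc.parallelize(1 to 100, 2)
  b.persist(StorageLevel.MEMORY_AND_DISK)  // spill partitions to disk when memory is short
  b.count()
  b.unpersist()                            // drop the cached data when it is no longer needed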

sample

sample randomly samples the elements of the RDD to obtain a new sub-RDD. Its parameters specify whether to sample with replacement, the expected fraction of the data to keep, and the random seed.

 /**
   * Return a sampled subset of this RDD.
   *
   * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
   * @param fraction expected size of the sample as a fraction of this RDD's size
   *  without replacement: probability that each element is chosen; fraction must be [0, 1]
   *  with replacement: expected number of times each element is chosen; fraction must be greater
   *  than or equal to 0
   * @param seed seed for the random number generator
   *
   * @note This is NOT guaranteed to provide exactly the fraction of the count
   * of the given [[RDD]].
   */
  def sample(
      withReplacement: Boolean,
      fraction: Double,
      seed: Long = Utils.random.nextLong): RDD[T] = {
    require(fraction >= 0,
      s"Fraction must be nonnegative, but got ${fraction}")

    withScope {
      require(fraction >= 0.0, "Negative fraction value: " + fraction)
      if (withReplacement) {
        new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
      } else {
        new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
      }
    }
  }

scala> val a = sc.parallelize(1 to 100, 2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:24

scala> val b = a.sample(false, 0.2, 0)
b: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[32] at sample at <console>:25

scala> b.foreach(println)
5
19
20
26
27
29
30
57
40
61
45
68
73
50
75
79
81
85
89
99

Key-value Transformation Operators

groupByKey

Similar to groupBy, groupByKey gathers all values that share the same key into a sequence; it can use either the default partitioner or a custom one.

  /**
   * Group the values for each key in the RDD into a single sequence. Allows controlling the
   * partitioning of the resulting key-value pair RDD by passing a Partitioner.
   * The ordering of elements within each group is not guaranteed, and may even differ
   * each time the resulting RDD is evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   *
   * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
   * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
   */
  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }

  /**
   * Group the values for each key in the RDD into a single sequence. Hash-partitions the
   * resulting RDD with into `numPartitions` partitions. The ordering of elements within
   * each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   *
   * @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
   * key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.
   */
  def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(new HashPartitioner(numPartitions))
  }

scala> val a = sc.parallelize(List("mk", "zq", "xwc", "fig", "dcp", "snn"), 2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[33] at parallelize at <console>:24

scala> val b = a.keyBy(x => x.length)
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[34] at keyBy at <console>:25

scala> b.groupByKey.collect
res17: Array[(Int, Iterable[String])] = Array((2,CompactBuffer(mk, zq)), (3,CompactBuffer(xwc, fig, dcp, snn)))


combineByKey

combineByKey merges the values of a key-value RDD that share the same key into a combined value of another type; the caller can supply a custom partitioner and decide whether to aggregate on the map side.

  /**
   * Generic function to combine the elements for each key using a custom set of aggregation
   * functions. This method is here for backward compatibility. It does not provide combiner
   * classtag information to the shuffle.
   *
   * @see `combineByKeyWithClassTag`
   */
  def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
      partitioner, mapSideCombine, serializer)(null)
  }

  /**
   * Simplified version of combineByKeyWithClassTag that hash-partitions the output RDD.
   * This method is here for backward compatibility. It does not provide combiner
   * classtag information to the shuffle.
   *
   * @see `combineByKeyWithClassTag`
   */
  def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      numPartitions: Int): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners, numPartitions)(null)
  }

scala> val a = sc.parallelize(List("xwc", "fig","wc", "dcp", "zq", "znn", "mk", "zl", "hk", "lp"), 2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[36] at parallelize at <console>:24

scala> val b = sc.parallelize(List(1,2,2,3,2,1,2,2,2,3),2)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[37] at parallelize at <console>:24

scala> val c = b.zip(a)
c: org.apache.spark.rdd.RDD[(Int, String)] = ZippedPartitionsRDD2[38] at zip at <console>:27


scala> val d = c.combineByKey(List(_), (x:List[String], y:String)=>y::x, (x:List[String], y:List[String])=>x::: y)
d: org.apache.spark.rdd.RDD[(Int, List[String])] = ShuffledRDD[39] at combineByKey at <console>:25

scala> d.collect
res18: Array[(Int, List[String])] = Array((2,List(zq, wc, fig, hk, zl, mk)), (1,List(xwc, znn)), (3,List(dcp, lp)))


The example above uses the three-argument overload. The first argument, createCombiner, converts an element V into an element of another type C; here List(_) wraps the input element in a List. The second argument, mergeValue, merges an element V into a C; here (x: List[String], y: String) => y :: x prepends the string y to the list x. The third argument, mergeCombiners, merges two C values; here (x: List[String], y: List[String]) => x ::: y concatenates the two lists.
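
As a further sketch (the data here is made up, not part of the original transcript), combineByKey can compute a per-key average by carrying a (sum, count) pair as the combiner:

  val scores = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("a", 3.0), ("b", 4.0)), 2)

  val avg = scores.combineByKey(
      (v: Double) => (v, 1),                                             // createCombiner: first value seen for a key
      (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),       // mergeValue: fold a value into (sum, count)
      (x: (Double, Int), y: (Double, Int)) => (x._1 + y._1, x._2 + y._2) // mergeCombiners: merge partial (sum, count)
    ).mapValues { case (sum, count) => sum / count }

  avg.collect()   // Array((a,2.0), (b,3.0)), ordering may vary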

reduceByKey

reduceByKey aggregates the values of each key with a reduce function; the values are merged locally on each mapper before the results are sent to the reducer. Under the hood it calls one of the overloads of combineByKey.

 /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce.
   */
  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }

  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
   */
  def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
    reduceByKey(new HashPartitioner(numPartitions), func)
  }

  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
   * parallelism level.
   */
  def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    reduceByKey(defaultPartitioner(self), func)
  }

scala> val a = sc.parallelize(List("dcp","fjg","snn","wc", "za"), 2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val b = a.map(x => (x.length,x))
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[2] at map at <console>:25

scala> b.reduceByKey((a, b) => a + b ).collect
res1: Array[(Int, String)] = Array((2,wcza), (3,dcpfjgsnn))

sortByKey

sortByKey sorts key-value pairs by key: string keys are sorted lexicographically and numeric keys numerically; a parameter selects ascending or descending order.

scala> val a = sc.parallelize(List("dcp","fjg","snn","wc", "za"), 2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[4] at parallelize at <console>:24

scala> val b = sc.parallelize(1 to a.count.toInt,2)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:26

scala> val c = a.zip(b)
c: org.apache.spark.rdd.RDD[(String, Int)] = ZippedPartitionsRDD2[6] at zip at <console>:27

scala> c.sortByKey(true).collect
res2: Array[(String, Int)] = Array((dcp,1), (fjg,2), (snn,3), (wc,4), (za,5))
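
As a complement (not part of the original transcript), passing false to sortByKey on the same RDD c sorts in descending order:

  c.sortByKey(false).collect()   // Array((za,5), (wc,4), (snn,3), (fjg,2), (dcp,1))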

cogroup

cogroup groups the data of two key-value RDDs by key: for every key that appears in either RDD, the result pairs that key with two iterables holding its values from each RDD, giving an RDD of (K, (Iterable[V], Iterable[W])).

scala> val a = sc.parallelize(List(1,2,2,3,1,3),2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24

scala> val b = a.map(x => (x, "b"))
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[11] at map at <console>:25

scala> val c = a.map(x => (x, "c"))
c: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[12] at map at <console>:25

scala> b.cogroup(c).collect
res3: Array[(Int, (Iterable[String], Iterable[String]))] = Array((2,(CompactBuffer(b, b),CompactBuffer(c, c))), (1,(CompactBuffer(b, b),CompactBuffer(c, c))), (3,(CompactBuffer(b, b),CompactBuffer(c, c))))

scala> val a = sc.parallelize(List(1,2,2,2,1,3),1)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at <console>:24

scala> val b = a.map(x => (x, "b"))
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[16] at map at <console>:25

scala> val c = a.map(x => (x, "c"))
c: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[17] at map at <console>:25

scala> b.cogroup(c).collect
res4: Array[(Int, (Iterable[String], Iterable[String]))] = Array((1,(CompactBuffer(b, b),CompactBuffer(c, c))), (3,(CompactBuffer(b),CompactBuffer(c))), (2,(CompactBuffer(b, b, b),CompactBuffer(c, c, c))))


join

join first cogroups the two RDDs, then takes the Cartesian product of each key's values from the two sides, and finally flattens the result with flatMapValues.

scala> val a= sc.parallelize(List("fjg","wc","xwc"),2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[20] at parallelize at <console>:24

scala> val c = sc.parallelize(List("fig", "wc", "sbb", "zq","xwc","dcp"), 2)
c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[22] at parallelize at <console>:24

scala> val d = c.keyBy(_.length)
d: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[23] at keyBy at <console>:25

scala> val b = a.keyBy(_.length)
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[24] at keyBy at <console>:25

scala> b.join(d).collect
res6: Array[(Int, (String, String))] = Array((2,(wc,wc)), (2,(wc,zq)), (3,(fjg,fig)), (3,(fjg,sbb)), (3,(fjg,xwc)), (3,(fjg,dcp)), (3,(xwc,fig)), (3,(xwc,sbb)), (3,(xwc,xwc)), (3,(xwc,dcp)))

Action Operators

collect

Returns the elements of the RDD to the driver as an array.

  /**
   * Return an array that contains all of the elements in this RDD.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

scala> val a = sc.parallelize(List("a", "b", "c"),2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[28] at parallelize at <console>:24

scala> a.collect
res7: Array[String] = Array(a, b, c)


reduce

reduce aggregates the elements with a two-argument function and returns a single result. The binary operation should be commutative and associative so that the parallel computation produces a correct result.

  /**
   * Reduces the elements of this RDD using the specified commutative and
   * associative binary operator.
   */
  def reduce(f: (T, T) => T): T = withScope {
    val cleanF = sc.clean(f)
    val reducePartition: Iterator[T] => Option[T] = iter => {
      if (iter.hasNext) {
        Some(iter.reduceLeft(cleanF))
      } else {
        None
      }
    }
    var jobResult: Option[T] = None
    val mergeResult = (index: Int, taskResult: Option[T]) => {
      if (taskResult.isDefined) {
        jobResult = jobResult match {
          case Some(value) => Some(f(value, taskResult.get))
          case None => taskResult
        }
      }
    }
    sc.runJob(this, reducePartition, mergeResult)
    // Get the final result out of our Option, or throw an exception if the RDD was empty
    jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
  }

scala> val a = sc.parallelize(1 to 10, 2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[29] at parallelize at <console>:24

scala> a.reduce((a, b) => a + b)
res8: Int = 55

take

take returns the first n elements of the RDD. It first scans one partition and uses the result from that partition to estimate whether n is already satisfied; if not, it keeps scanning additional partitions.

/**
   * Take the first num elements of the RDD. It works by first scanning one partition, and use the
   * results from that partition to estimate the number of additional partitions needed to satisfy
   * the limit.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   *
   * @note Due to complications in the internal implementation, this method will raise
   * an exception if called on an RDD of `Nothing` or `Null`.
   */
  def take(num: Int): Array[T] = withScope {
    val scaleUpFactor = Math.max(conf.getInt("spark.rdd.limit.scaleUpFactor", 4), 2)
    if (num == 0) {
      new Array[T](0)
    } else {
      val buf = new ArrayBuffer[T]
      val totalParts = this.partitions.length
      var partsScanned = 0
      while (buf.size < num && partsScanned < totalParts) {
        // The number of partitions to try in this iteration. It is ok for this number to be
        // greater than totalParts because we actually cap it at totalParts in runJob.
        var numPartsToTry = 1L
        val left = num - buf.size
        if (partsScanned > 0) {
          // If we didn't find any rows after the previous iteration, quadruple and retry.
          // Otherwise, interpolate the number of partitions we need to try, but overestimate
          // it by 50%. We also cap the estimation in the end.
          if (buf.isEmpty) {
            numPartsToTry = partsScanned * scaleUpFactor
          } else {
            // As left > 0, numPartsToTry is always >= 1
            numPartsToTry = Math.ceil(1.5 * left * partsScanned / buf.size).toInt
            numPartsToTry = Math.min(numPartsToTry, partsScanned * scaleUpFactor)
          }
        }

        val p = partsScanned.until(math.min(partsScanned + numPartsToTry, totalParts).toInt)
        val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)

        res.foreach(buf ++= _.take(num - buf.size))
        partsScanned += p.size
      }
      buf.toArray
    }
  }

scala> val a = sc.parallelize(1 to 10, 2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[30] at parallelize at <console>:24

scala> a.take(5)
res9: Array[Int] = Array(1, 2, 3, 4, 5)

top

top uses an implicit ordering to return the largest n elements.

  /**
   * Returns the top k (largest) elements from this RDD as defined by the specified
   * implicit Ordering[T] and maintains the ordering. This does the opposite of
   * [[takeOrdered]]. For example:
   * {{{
   *   sc.parallelize(Seq(10, 4, 2, 12, 3)).top(1)
   *   // returns Array(12)
   *
   *   sc.parallelize(Seq(2, 3, 4, 5, 6)).top(2)
   *   // returns Array(6, 5)
   * }}}
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   *
   * @param num k, the number of top elements to return
   * @param ord the implicit ordering for T
   * @return an array of top elements
   */
  def top(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
    takeOrdered(num)(ord.reverse)
  }
  /**
   * Returns the first k (smallest) elements from this RDD as defined by the specified
   * implicit Ordering[T] and maintains the ordering. This does the opposite of [[top]].
   * For example:
   * {{{
   *   sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(1)
   *   // returns Array(2)
   *
   *   sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2)
   *   // returns Array(2, 3)
   * }}}
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   *
   * @param num k, the number of elements to return
   * @param ord the implicit ordering for T
   * @return an array of top elements
   */
  def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
    if (num == 0) {
      Array.empty
    } else {
      val mapRDDs = mapPartitions { items =>
        // Priority keeps the largest elements, so let's reverse the ordering.
        val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
        queue ++= collectionUtils.takeOrdered(items, num)(ord)
        Iterator.single(queue)
      }
      if (mapRDDs.partitions.length == 0) {
        Array.empty
      } else {
        mapRDDs.reduce { (queue1, queue2) =>
          queue1 ++= queue2
          queue1
        }.toArray.sorted(ord)
      }
    }
  }

scala> val c = sc.parallelize(Array(1,2,3,5,3,8,7,97,32),2)
c: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:24

scala> c.top(3)
res10: Array[Int] = Array(97, 32, 8)

count

count returns the number of elements in the RDD.

  /**
   * Return the number of elements in the RDD.
   */
  def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

scala> val c = sc.parallelize(Array(1,2,3,5,3,8,7,97,32),2)
c: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[33] at parallelize at <console>:24

scala> c.count
res11: Long = 9


takeSample

Returns a fixed-size sampled subset of the RDD as an array, with the order of the returned elements randomly shuffled.

 /**
   * Return a fixed-size sampled subset of this RDD in an array
   *
   * @param withReplacement whether sampling is done with replacement
   * @param num size of the returned sample
   * @param seed seed for the random number generator
   * @return sample of specified size in an array
   *
   * @note this method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def takeSample(
      withReplacement: Boolean,
      num: Int,
      seed: Long = Utils.random.nextLong): Array[T] = withScope {
    val numStDev = 10.0

    require(num >= 0, "Negative number of elements requested")
    require(num <= (Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt),
      "Cannot support a sample size > Int.MaxValue - " +
      s"$numStDev * math.sqrt(Int.MaxValue)")

    if (num == 0) {
      new Array[T](0)
    } else {
      val initialCount = this.count()
      if (initialCount == 0) {
        new Array[T](0)
      } else {
        val rand = new Random(seed)
        if (!withReplacement && num >= initialCount) {
          Utils.randomizeInPlace(this.collect(), rand)
        } else {
          val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
            withReplacement)
          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()

          // If the first sample didn't turn out large enough, keep trying to take samples;
          // this shouldn't happen often because we use a big multiplier for the initial size
          var numIters = 0
          while (samples.length < num) {
            logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
            samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
            numIters += 1
          }
          Utils.randomizeInPlace(samples, rand).take(num)
        }
      }
    }
  }

scala> val c = sc.parallelize(Array(1,2,3,5,3,8,7,97,32),2)
c: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[34] at parallelize at <console>:24

scala> c.takeSample(true,3, 1)
res14: Array[Int] = Array(1, 3, 7)

saveAsTextFile

Saves the RDD as a text file, writing one element per line.
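
A minimal sketch (the output paths are hypothetical); each element is converted with toString and written as one line, producing one part file per partition. The target directory must not already exist:

  val a = sc.parallelize(1 to 6, 2)
  a.saveAsTextFile("/tmp/spark-output")
  // A Hadoop compression codec can also be supplied:
  // a.saveAsTextFile("/tmp/spark-output-gz", classOf[org.apache.hadoop.io.compress.GzipCodec])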

countByKey

Similar to count, but countByKey counts the number of values for each key and returns the result as a Map.

  /**
   * Count the number of elements for each key, collecting the results to a local Map.
   *
   * @note This method should only be used if the resulting map is expected to be small, as
   * the whole thing is loaded into the driver's memory.
   * To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which
   * returns an RDD[T, Long] instead of a map.
   */
  def countByKey(): Map[K, Long] = self.withScope {
    self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
  }

scala> val c = sc.parallelize(List("fig", "wc", "sbb", "zq","xwc","dcp"), 2)
c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[36] at parallelize at <console>:24

scala> val d = c.keyBy(_.length)
d: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[37] at keyBy at <console>:25

scala> d.countByKey
res15: scala.collection.Map[Int,Long] = Map(2 -> 2, 3 -> 4)
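
For very large results, the docstring quoted above suggests keeping the counts distributed instead of collecting a local Map; a sketch on the same RDD d:

  val counts = d.mapValues(_ => 1L).reduceByKey(_ + _)   // RDD[(Int, Long)] instead of a local Map
  counts.collect()                                       // Array((2,2), (3,4)), ordering may vary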


aggregate

  /**
   * Aggregate the elements of each partition, and then the results for all the partitions, using
   * given combine functions and a neutral "zero value". This function can return a different result
   * type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U
   * and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are
   * allowed to modify and return their first argument instead of creating a new U to avoid memory
   * allocation.
   *
   * @param zeroValue the initial value for the accumulated result of each partition for the
   *                  `seqOp` operator, and also the initial value for the combine results from
   *                  different partitions for the `combOp` operator - this will typically be the
   *                  neutral element (e.g. `Nil` for list concatenation or `0` for summation)
   * @param seqOp an operator used to accumulate results within a partition
   * @param combOp an associative operator used to combine results from different partitions
   */
  def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = withScope {
    // Clone the zero value since we will also be serializing it as part of tasks
    var jobResult = Utils.clone(zeroValue, sc.env.serializer.newInstance())
    val cleanSeqOp = sc.clean(seqOp)
    val cleanCombOp = sc.clean(combOp)
    val aggregatePartition = (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)
    val mergeResult = (index: Int, taskResult: U) => jobResult = combOp(jobResult, taskResult)
    sc.runJob(this, aggregatePartition, mergeResult)
    jobResult
  }
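
The article gives no example for aggregate, so here is a short sketch (the data is made up): seqOp folds each element into a per-partition (sum, count) pair, and combOp merges the partial pairs from different partitions, computing a sum and a count in a single pass:

  val a = sc.parallelize(1 to 10, 2)

  val (sum, count) = a.aggregate((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: within a partition
      (x, y)   => (x._1 + y._1, x._2 + y._2)) // combOp: across partitions

  // sum = 55, count = 10, so the average is sum.toDouble / count = 5.5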

fold

/**
   * Aggregate the elements of each partition, and then the results for all the partitions, using a
   * given associative function and a neutral "zero value". The function
   * op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object
   * allocation; however, it should not modify t2.
   *
   * This behaves somewhat differently from fold operations implemented for non-distributed
   * collections in functional languages like Scala. This fold operation may be applied to
   * partitions individually, and then fold those results into the final result, rather than
   * apply the fold to each element sequentially in some defined ordering. For functions
   * that are not commutative, the result may differ from that of a fold applied to a
   * non-distributed collection.
   *
   * @param zeroValue the initial value for the accumulated result of each partition for the `op`
   *                  operator, and also the initial value for the combine results from different
   *                  partitions for the `op` operator - this will typically be the neutral
   *                  element (e.g. `Nil` for list concatenation or `0` for summation)
   * @param op an operator used to both accumulate results within a partition and combine results
   *                  from different partitions
   */
  def fold(zeroValue: T)(op: (T, T) => T): T = withScope {
    // Clone the zero value since we will also be serializing it as part of tasks
    var jobResult = Utils.clone(zeroValue, sc.env.closureSerializer.newInstance())
    val cleanOp = sc.clean(op)
    val foldPartition = (iter: Iterator[T]) => iter.fold(zeroValue)(cleanOp)
    val mergeResult = (index: Int, taskResult: T) => jobResult = op(jobResult, taskResult)
    sc.runJob(this, foldPartition, mergeResult)
    jobResult
  }
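
Similarly, a short sketch of fold (data made up for illustration). Note that the zero value is applied once per partition and once more when the partial results are merged, so it should be a neutral element:

  val a = sc.parallelize(1 to 10, 2)
  a.fold(0)(_ + _)   // 55: zeroValue 0 is neutral for addition
  // With a non-neutral zero value the result depends on the number of partitions:
  // here a.fold(1)(_ + _) returns 55 + (numPartitions + 1) = 58, not 56.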
