Spark RDD Operators: A Source Code Walkthrough

Based on the Spark 1.5.0 source code, this article briefly introduces the functionality, underlying principles, and call interfaces of Spark's RDD operators. Only the basic operators defined on a plain RDD are covered here; the extended operators on RDD[(K, V)] (i.e. the methods in PairRDDFunctions) are not included and will be covered in a follow-up article.

  • ++
    Returns an RDD containing the elements of this RDD and `other` combined; duplicate elements are kept.
  /**
   * Return the union of this RDD and another one. Any identical elements will appear multiple
   * times (use `.distinct()` to eliminate them).
   */
  def ++(other: RDD[T]): RDD[T] = withScope {
    this.union(other)
  }
  • aggregate
    Aggregates the elements of the RDD. A zero value must be supplied as the initial accumulator (elements are accumulated, so an initial value is required). Elements are first aggregated within each partition with seqOp (folding elements of type T into a partition-level result of type U), then the per-partition results are merged across partitions with combOp, and the final value of type U is returned. A short usage sketch follows the excerpt below.
  /**
   * Aggregate the elements of each partition, and then the results for all the partitions, using
   * given combine functions and a neutral "zero value". This function can return a different result
   * type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U
   * and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are
   * allowed to modify and return their first argument instead of creating a new U to avoid memory
   * allocation.
   */
  def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
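    A minimal usage sketch (my own example, not from the Spark source), assuming a SparkContext `sc` such as the one provided by spark-shell; it computes a sum and a count in a single pass:
  val nums = sc.parallelize(1 to 100, 4)

  // seqOp folds an Int into a (sum, count) accumulator inside each partition;
  // combOp merges the per-partition accumulators.
  val (sum, count) = nums.aggregate((0L, 0L))(
    (acc, x) => (acc._1 + x, acc._2 + 1),
    (a, b) => (a._1 + b._1, a._2 + b._2))

  val avg = sum.toDouble / count   // 50.5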
  • cache
    Caches the RDD in memory (persist with the default MEMORY_ONLY storage level).
  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def cache(): this.type = persist()
  • cartesian
    Returns an RDD whose elements are the Cartesian product of the elements of the two RDDs.
    For example:
    RDD1 = {a, c}, RDD2 = {b, d}  =>  RDD3 = {(a,b), (a,d), (c,b), (c,d)}
  /**
   * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
   * elements (a, b) where a is in `this` and b is in `other`.
   */
  def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
    new CartesianRDD(sc, this, other)
  }
  • checkpoint
    Marks the RDD for checkpointing: its contents will be written to a distributed file system (such as HDFS) and its dependency chain (lineage) will be removed. This is a classic space-for-time trade-off: the data is stored so that, on failure, it does not have to be recomputed through the lineage. Its semantics differ from those of persist. A short usage sketch follows the excerpt below.
  /**
   * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
   * directory set with `SparkContext#setCheckpointDir` and all references to its parent
   * RDDs will be removed. This function must be called before any job has been
   * executed on this RDD. It is strongly recommended that this RDD is persisted in
   * memory, otherwise saving it on a file will require recomputation.
   */
  def checkpoint(): Unit = RDDCheckpointData.synchronized {
    // NOTE: we use a global lock here due to complexities downstream with ensuring
    // children RDD partitions point to the correct parent partitions. In the future
    // we should revisit this consideration.
    if (context.checkpointDir.isEmpty) {
      throw new SparkException("Checkpoint directory has not been set in the SparkContext")
    } else if (checkpointData.isEmpty) {
      checkpointData = Some(new ReliableRDDCheckpointData(this))
    }
  }
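    A minimal usage sketch (my own example, not from the Spark source), assuming `sc` and a checkpoint directory of your own (the HDFS path below is hypothetical):
  sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

  val rdd = sc.parallelize(1 to 1000).map(_ * 2)
  rdd.cache()        // recommended, so the data is not recomputed when it is written out
  rdd.checkpoint()   // only marks the RDD; the data is materialized by the next job
  rdd.count()        // runs a job, writes the checkpoint, and truncates the lineage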
  • coalesce
    Repartitions the elements of the RDD.
    When the partition count is reduced moderately, e.g. 1000 -> 100, Spark treats every 10 of the original partitions as one new partition; the parallelism becomes 100 and no shuffle is triggered.
    When the partition count is reduced drastically, e.g. 1000 -> 1, the parallelism of the computation drops to 1. To avoid this loss of parallelism, set shuffle to true: the computation before the shuffle and the computation after it then run in separate stages, with parallelism 1000 and 1 respectively.
    When the partition count is increased, a shuffle is unavoidable, so shuffle must be set to true. A short usage sketch follows the excerpt below.
  /**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   *
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * Note: With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
      : RDD[T]
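    A minimal usage sketch (my own example, not from the Spark source), assuming `sc`:
  val wide = sc.parallelize(1 to 100000, 1000)

  val narrowed  = wide.coalesce(100)                   // narrow dependency, no shuffle
  val collapsed = wide.coalesce(1, shuffle = true)     // upstream stage still runs with 1000 tasks,
                                                       // then a shuffle brings the data to 1 partition
  val widened   = wide.coalesce(2000, shuffle = true)  // growing the partition count requires shuffle = true

  println(narrowed.partitions.length)    // 100
  println(collapsed.partitions.length)   // 1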
  • collect
    Returns an array containing all the elements of the RDD. When the RDD has a very large number of elements, the driver node may run out of memory.
  /**
   * Return an array that contains all of the elements in this RDD.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

  /**
   * Return an RDD that contains all matching values by applying `f`.
   */
  def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    filter(cleanF.isDefinedAt).map(cleanF)
  }
  • count
    Returns the number of elements in the RDD.
  /**
   * Return the number of elements in the RDD.
   */
  def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
  • countByValue
    Returns a local map from each distinct element of the RDD to the number of times it occurs.
  /**
   * Return the count of each unique value in this RDD as a local map of (value, count) pairs.
   *
   * Note that this method should only be used if the resulting map is expected to be small, as
   * the whole thing is loaded into the driver's memory.
   * To handle very large results, consider using rdd.map(x => (x, 1L)).reduceByKey(_ + _), which
   * returns an RDD[T, Long] instead of a map.
   */
  def countByValue()(implicit ord: Ordering[T] = null): Map[T, Long] = withScope {
    map(value => (value, null)).countByKey()
  }
  • dependencies
    Returns the RDD's list of dependencies, taking checkpointing into account: once the RDD has been checkpointed, the lineage starts from the checkpoint.
  /**
   * Get the list of dependencies of this RDD, taking into account whether the
   * RDD is checkpointed or not.
   */
  final def dependencies: Seq[Dependency[_]] = {
    checkpointRDD.map(r => List(new OneToOneDependency(r))).getOrElse {
      if (dependencies_ == null) {
        dependencies_ = getDependencies
      }
      dependencies_
    }
  }
  • distinct
    Returns a new RDD containing the distinct elements of this RDD (duplicates removed).
  /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(): RDD[T] = withScope {
    distinct(partitions.length)
  }

  def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
  }
  • filter
    Filters the RDD, returning a new RDD containing only the elements that satisfy the predicate f.
  /**
   * Return a new RDD containing only the elements that satisfy a predicate.
   */
  def filter(f: T => Boolean): RDD[T] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[T, T](
      this,
      (context, pid, iter) => iter.filter(cleanF),
      preservesPartitioning = true)
  }
  • filterWith
    Filters the RDD like filter, but the predicate also receives a per-partition argument produced by applying constructA to the partition index, so elements in different partitions can be filtered differently.
    Deprecated; see mapPartitionsWithIndex.
  /**
   * Filters this RDD with p, where p takes an additional parameter of type A. This
   * additional parameter is produced by constructA, which is called in each
   * partition with the index of that partition.
   */
  @deprecated("use mapPartitionsWithIndex and filter", "1.0.0")
  def filterWith[A](constructA: Int => A)(p: (T, A) => Boolean): RDD[T] = withScope {
    val cleanP = sc.clean(p)
    val cleanA = sc.clean(constructA)
    mapPartitionsWithIndex((index, iter) => {
      val a = cleanA(index)
      iter.filter(t => cleanP(t, a))
    }, preservesPartitioning = true)
  }
  • first
    Returns the first element of the RDD.
    Throws an exception if the RDD is empty.
  /**
   * Return the first element in this RDD.
   */
  def first(): T = withScope {
    take(1) match {
      case Array(t) => t
      case _ => throw new UnsupportedOperationException("empty collection")
    }
  }
  • firstParent
    Returns the first parent RDD in this RDD's dependency list.
  /** Returns the first parent RDD */
  protected[spark] def firstParent[U: ClassTag]: RDD[U] = {
    dependencies.head.rdd.asInstanceOf[RDD[U]]
  }
  • flatMap
    Produces a collection for each element of the RDD and returns an RDD made up of all the elements of all those collections (map, then flatten).
  • flatMapWith
    Like flatMap, but the collection-producing function also receives a per-partition argument produced by applying constructA to the partition index, so elements in different partitions can be handled differently.
    Deprecated; see mapPartitionsWithIndex.
  /**
   * Return a new RDD by first applying a function to all elements of this
   * RDD, and then flattening the results.
   */
  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
  }

  /**
   * FlatMaps f over this RDD, where f takes an additional parameter of type A. This
   * additional parameter is produced by constructA, which is called in each
   * partition with the index of that partition.
   */
  @deprecated("use mapPartitionsWithIndex and flatMap", "1.0.0")
  def flatMapWith[A, U: ClassTag]
      (constructA: Int => A, preservesPartitioning: Boolean = false)
      (f: (T, A) => Seq[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    val cleanA = sc.clean(constructA)
    mapPartitionsWithIndex((index, iter) => {
      val a = cleanA(index)
      iter.flatMap(t => cleanF(t, a))
    }, preservesPartitioning)
  }
  • fold
    Similar in purpose to aggregate: aggregates the elements of the RDD.
    fold uses a single function op both to aggregate the elements within each partition and to merge the per-partition results (all values involved are of type T), and returns the aggregated result. A short usage sketch follows the excerpt below.
  /**
   * Aggregate the elements of each partition, and then the results for all the partitions, using a
   * given associative and commutative function and a neutral "zero value". The function
   * op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object
   * allocation; however, it should not modify t2.
   *
   * This behaves somewhat differently from fold operations implemented for non-distributed
   * collections in functional languages like Scala. This fold operation may be applied to
   * partitions individually, and then fold those results into the final result, rather than
   * apply the fold to each element sequentially in some defined ordering. For functions
   * that are not commutative, the result may differ from that of a fold applied to a
   * non-distributed collection.
   */
  def fold(zeroValue: T)(op: (T, T) => T): T = withScope {
    // Clone the zero value since we will also be serializing it as part of tasks
    var jobResult = Utils.clone(zeroValue, sc.env.closureSerializer.newInstance())
    val cleanOp = sc.clean(op)
    val foldPartition = (iter: Iterator[T]) => iter.fold(zeroValue)(cleanOp)
    val mergeResult = (index: Int, taskResult: T) => jobResult = op(jobResult, taskResult)
    sc.runJob(this, foldPartition, mergeResult)
    jobResult
  }
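    A minimal usage sketch (my own example, not from the Spark source), assuming `sc`. Note that the zero value is folded in once per partition and once more when the partition results are merged, so it must be a true identity element for op:
  val nums = sc.parallelize(Seq(1, 2, 3, 4, 5), 2)
  val total = nums.fold(0)(_ + _)   // 15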
  • foreach
    Applies the function f to every element of the RDD and returns Unit.
    It is run for its side effects, such as updating external state or printing data.
    Note: data printed on a worker node goes to that worker's own terminal and is not visible on the master/driver.
  • foreachPartition
    Applies f to the element iterator of each partition of the RDD and returns Unit.
    It is run for its side effects, such as updating external state or printing data.
    Note: data printed on a worker node goes to that worker's own terminal and is not visible on the master/driver.
  • foreachWith
    Applies f to every element of the RDD and returns Unit. f also receives a per-partition argument produced by applying constructA to the partition index, so elements in different partitions can be treated differently.
    Deprecated; see mapPartitionsWithIndex.
  /**
   * Applies a function f to all elements of this RDD.
   */
  def foreach(f: T => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
  }

  /**
   * Applies a function f to each partition of this RDD.
   */
  def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
  }

  /**
   * Applies f to each element of this RDD, where f takes an additional parameter of type A.
   * This additional parameter is produced by constructA, which is called in each
   * partition with the index of that partition.
   */
  @deprecated("use mapPartitionsWithIndex and foreach", "1.0.0")
  def foreachWith[A](constructA: Int => A)(f: (T, A) => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    val cleanA = sc.clean(constructA)
    mapPartitionsWithIndex { (index, iter) =>
      val a = cleanA(index)
      iter.map(t => {cleanF(t, a); t})
    }
  }
  • getCheckpointFile
    Returns the path of the directory where this RDD's checkpoint data is stored (None if the RDD is not reliably checkpointed).
  /**
   * Gets the name of the directory to which this RDD was checkpointed.
   * This is not defined if the RDD is checkpointed locally.
   */
  def getCheckpointFile: Option[String] = {
    checkpointData match {
      case Some(reliable: ReliableRDDCheckpointData[T]) => reliable.getCheckpointDir
      case _ => None
    }
  }
  • getDependencies
    Returns the sequence of dependencies describing how this RDD depends on its parent RDDs (the dependency chain); implemented by subclasses.
  /**
   * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   */
  protected def getDependencies: Seq[Dependency[_]] = deps
  • getPartitions
    Returns the array of partition objects of this RDD; implemented by subclasses.
  /**
   * Implemented by subclasses to return the set of partitions in this RDD. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   */
  protected def getPartitions: Array[Partition]
  • getPreferredLocations
    Returns the preferred locations of the partition `split` in the cluster (its data blocks are typically replicated on several nodes for fault tolerance).
  /**
   * Optionally overridden by subclasses to specify placement preferences.
   */
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
  • getStorageLevel
    Returns the RDD's current storage level (StorageLevel.NONE if none has been set).
  /** Get the RDD's current storage level, or StorageLevel.NONE if none is set. */
  def getStorageLevel: StorageLevel = storageLevel
  • glom
    Gathers the elements of each partition into an array; returns an RDD in which every partition contains exactly one element, namely that array.
  /**
   * Return an RDD created by coalescing all elements within each partition into an array.
   */
  def glom(): RDD[Array[T]] = withScope {
    new MapPartitionsRDD[Array[T], T](this, (context, pid, iter) => Iterator(iter.toArray))
  }
  • groupBy
    Groups the elements of the RDD by the key computed by f; elements that map to the same key are collected into one group. Returns an RDD of (key, collection of elements) pairs.
    For example, grouping key-value pairs by key:
    1 -> a
    1 -> b
    2 -> c
    2 -> b
    3 -> 2
    is transformed into:
    1 -> (a, b)
    2 -> (c, b)
    3 -> (2)
    Because every value has to be buffered, the whole operation is expensive. For cumulative computations such as sum or average, only a running accumulator is needed rather than all the values, so aggregateByKey or reduceByKey should be preferred. A short usage sketch follows the excerpts below.
  /**
   * Return an RDD of grouped items. Each group consists of a key and a sequence of elements
   * mapping to that key. The ordering of elements within each group is not guaranteed, and
   * may even differ each time the resulting RDD is evaluated.
   *
   * Note: This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
   * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
   */
  def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
    groupBy[K](f, defaultPartitioner(this))
  }

  def groupBy[K](
      f: T => K,
      numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
    groupBy(f, new HashPartitioner(numPartitions))
  }

  def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
      : RDD[(K, Iterable[T])] = withScope {
    val cleanF = sc.clean(f)
    this.map(t => (cleanF(t), t)).groupByKey(p)
  }
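    A minimal usage sketch (my own example, not from the Spark source), assuming `sc`; it groups words by their first letter:
  val words = sc.parallelize(Seq("apple", "banana", "avocado", "blueberry", "cherry"))

  val byInitial = words.groupBy(_.head)   // RDD[(Char, Iterable[String])]
  // The order of groups and of elements within a group is not guaranteed.
  byInitial.collect().foreach { case (k, vs) => println(s"$k -> ${vs.mkString(", ")}") }
  // a -> apple, avocado
  // b -> banana, blueberry
  // c -> cherry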
  • intersection
    Returns an RDD containing the elements common to both RDDs; the result contains no duplicates.
  /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * Note that this method performs a shuffle internally.
   */
  def intersection(other: RDD[T]): RDD[T] = withScope {
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }

  def intersection(
      other: RDD[T],
      partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)), partitioner)
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }

  def intersection(other: RDD[T], numPartitions: Int): RDD[T] = withScope {
    intersection(other, new HashPartitioner(numPartitions))
  }
  • isCheckpointed
    Returns whether this RDD has been checkpointed (either reliably or locally).
  /**
   * Return whether this RDD is marked for checkpointing, either reliably or locally.
   */
  def isCheckpointed: Boolean = checkpointData.exists(_.isCheckpointed)
  • isEmpty
    Returns whether the RDD contains no elements.
  /**
   * @note due to complications in the internal implementation, this method will raise an
   * exception if called on an RDD of `Nothing` or `Null`. This may be come up in practice
   * because, for example, the type of `parallelize(Seq())` is `RDD[Nothing]`.
   * (`parallelize(Seq())` should be avoided anyway in favor of `parallelize(Seq[T]())`.)
   * @return true if and only if the RDD contains no elements at all. Note that an RDD
   * may be empty even when it has at least 1 partition.
   */
  def isEmpty(): Boolean = withScope {
    partitions.length == 0 || take(1).length == 0
  }
  • keyBy
    Uses f to compute a key for each element of the RDD and returns an RDD of (key, element) pairs.
  /**
   * Creates tuples of the elements in this RDD by applying `f`.
   */
  def keyBy[K](f: T => K): RDD[(K, T)] = withScope {
    val cleanedF = sc.clean(f)
    map(x => (cleanedF(x), x))
  }
  • map
    Returns a new RDD whose elements are the results of applying f to each element of this RDD.
  /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }
  • mapPartitions
    Applies f to the element iterator of each partition of the RDD to produce a new iterator, and returns an RDD whose partitions are built from those new iterators. A short usage sketch follows the excerpt below.
    preservesPartitioning indicates whether the returned RDD keeps the partitioner. It may be set to true only when the RDD is a key-value RDD and f does not modify the keys. A non key-value RDD generally has no partitioner, and once the keys of a key-value RDD are modified the elements no longer satisfy the partitioner's placement; in those cases it must remain false, meaning the returned RDD is not considered partitioned by any partitioner.
  /**
   * Return a new RDD by applying a function to each partition of this RDD.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
      preservesPartitioning)
  }
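    A minimal usage sketch (my own example, not from the Spark source), assuming `sc`. The typical use case is amortizing expensive per-partition setup (a connection, a parser, ...) over all the elements of the partition:
  val lines = sc.parallelize(Seq("1", "2", "3", "4"), 2)

  val parsed = lines.mapPartitions { iter =>
    // Imagine expensive setup here, executed once per partition rather than once per element.
    val parse: String => Int = _.toInt
    iter.map(parse)
  }
  parsed.collect()   // Array(1, 2, 3, 4)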
  • mapPartitionsWithIndex
    One of the most important operators in Spark.
    Applies f to the element iterator of each partition of the RDD to produce a new iterator, and returns an RDD whose partitions are built from those new iterators.
    f also receives the partition index as an argument, so it can treat different partitions differently. A short usage sketch follows the excerpt below.
    preservesPartitioning has the same meaning as in mapPartitions: set it to true only for a key-value RDD whose keys f does not modify.
  /**
   * Return a new RDD by applying a function to each partition of this RDD, while tracking the index
   * of the original partition.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitionsWithIndex[U: ClassTag](
      f: (Int, Iterator[T]) => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),
      preservesPartitioning)
  }
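    A minimal usage sketch (my own example, not from the Spark source), assuming `sc`; it tags every element with the partition it lives in:
  val data = sc.parallelize(1 to 8, 4)

  val tagged = data.mapPartitionsWithIndex { (pid, iter) =>
    iter.map(x => (pid, x))
  }
  tagged.collect()   // Array((0,1), (0,2), (1,3), (1,4), (2,5), (2,6), (3,7), (3,8))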
  • max
    Returns the largest element of the RDD.
  /**
   * Returns the max of this RDD as defined by the implicit Ordering[T].
   * @return the maximum element of the RDD
   */
  def max()(implicit ord: Ordering[T]): T = withScope {
    this.reduce(ord.max)
  }
  • min
    Returns the smallest element of the RDD.
  /**
   * Returns the min of this RDD as defined by the implicit Ordering[T].
   * @return the minimum element of the RDD
   */
  def min()(implicit ord: Ordering[T]): T = withScope {
    this.reduce(ord.min)
  }
  • parent
    Returns the j-th parent RDD in this RDD's dependency list.
  /** Returns the jth parent RDD: e.g. rdd.parent[T](0) is equivalent to rdd.firstParent[T] */
  protected[spark] def parent[U: ClassTag](j: Int) = {
    dependencies(j).rdd.asInstanceOf[RDD[U]]
  }
  • partitions
    Returns the array of partition objects of this RDD, taking checkpointing into account.
  /**
   * Get the array of partitions of this RDD, taking into account whether the
   * RDD is checkpointed or not.
   */
  final def partitions: Array[Partition] = {
    checkpointRDD.map(_.partitions).getOrElse {
      if (partitions_ == null) {
        partitions_ = getPartitions
      }
      partitions_
    }
  }
  • persist
    Persists the RDD at the given storage level (the no-argument version defaults to MEMORY_ONLY).
  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

  /**
   * Set this RDD's storage level to persist its values across operations after the first time
   * it is computed. This can only be used to assign a new storage level if the RDD does not
   * have a storage level set yet. Local checkpointing is an exception.
   */
  def persist(newLevel: StorageLevel): this.type = {
    if (isLocallyCheckpointed) {
      // This means the user previously called localCheckpoint(), which should have already
      // marked this RDD for persisting. Here we should override the old storage level with
      // one that is explicitly requested by the user (after adapting it to use disk).
      persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
    } else {
      persist(newLevel, allowOverride = false)
    }
  }
  • preferredLocations
    Returns the preferred locations of the partition `split` in the cluster, taking checkpointing into account.
  /**
   * Get the preferred locations of a partition, taking into account whether the
   * RDD is checkpointed.
   */
  final def preferredLocations(split: Partition): Seq[String] = {
    checkpointRDD.map(_.getPreferredLocations(split)).getOrElse {
      getPreferredLocations(split)
    }
  }
  • reduce
    Reduces the elements of the RDD with the binary function f (which must be commutative and associative) and returns the result; throws an exception if the RDD is empty.
  /**
   * Reduces the elements of this RDD using the specified commutative and
   * associative binary operator.
   */
  def reduce(f: (T, T) => T): T = withScope {
    val cleanF = sc.clean(f)
    val reducePartition: Iterator[T] => Option[T] = iter => {
      if (iter.hasNext) {
        Some(iter.reduceLeft(cleanF))
      } else {
        None
      }
    }
    var jobResult: Option[T] = None
    val mergeResult = (index: Int, taskResult: Option[T]) => {
      if (taskResult.isDefined) {
        jobResult = jobResult match {
          case Some(value) => Some(f(value, taskResult.get))
          case None => taskResult
        }
      }
    }
    sc.runJob(this, reducePartition, mergeResult)
    // Get the final result out of our Option, or throw an exception if the RDD was empty
    jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
  }
  • repartition
    Repartitions the RDD into exactly numPartitions partitions; this always involves a shuffle.
    When the target partition count is smaller than the current one, consider coalesce instead, which can avoid the shuffle.
  /**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }
  • sample
    Returns an RDD made up of sampled elements. Sampling can be done with or without replacement: with replacement, fraction >= 0 and gives the expected number of times each element is sampled; without replacement, fraction lies in [0, 1] and gives the probability that each element is sampled. A short usage sketch follows the excerpt below.
  /**
   * Return a sampled subset of this RDD.
   *
   * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
   * @param fraction expected size of the sample as a fraction of this RDD's size
   *   without replacement: probability that each element is chosen; fraction must be [0, 1]
   *   with replacement: expected number of times each element is chosen; fraction must be >= 0
   * @param seed seed for the random number generator
   */
  def sample(
      withReplacement: Boolean,
      fraction: Double,
      seed: Long = Utils.random.nextLong): RDD[T] = withScope {
    require(fraction >= 0.0, "Negative fraction value: " + fraction)
    if (withReplacement) {
      new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
    } else {
      new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
    }
  }
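    A minimal usage sketch (my own example, not from the Spark source), assuming `sc`. The sample size is only approximate:
  val data = sc.parallelize(1 to 10000)

  val without = data.sample(withReplacement = false, fraction = 0.01)   // each element kept with probability 0.01
  val withRep = data.sample(withReplacement = true, fraction = 2.0)     // each element drawn about twice on average

  println(without.count())   // roughly 100
  println(withRep.count())   // roughly 20000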
  • saveAsObjectFile
    Saves the RDD at path in serialized binary form (a SequenceFile of serialized objects).
  /**
   * Save this RDD as a SequenceFile of serialized objects.
   */
  def saveAsObjectFile(path: String): Unit = withScope {
    this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
      .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
      .saveAsSequenceFile(path)
  }
  • saveAsTextFile
    Saves the RDD at path as text, using the string representation of each element.
  /**
   * Save this RDD as a text file, using string representations of elements.
   */
  def saveAsTextFile(path: String): Unit = withScope {
    // https://issues.apache.org/jira/browse/SPARK-2075
    //
    // NullWritable is a `Comparable` in Hadoop 1.+, so the compiler cannot find an implicit
    // Ordering for it and will use the default `null`. However, it's a `Comparable[NullWritable]`
    // in Hadoop 2.+, so the compiler will call the implicit `Ordering.ordered` method to create an
    // Ordering for `NullWritable`. That's why the compiler will generate different anonymous
    // classes for `saveAsTextFile` in Hadoop 1.+ and Hadoop 2.+.
    //
    // Therefore, here we provide an explicit Ordering `null` to make sure the compiler generate
    // same bytecodes for `saveAsTextFile`.
    val nullWritableClassTag = implicitly[ClassTag[NullWritable]]
    val textClassTag = implicitly[ClassTag[Text]]
    val r = this.mapPartitions { iter =>
      val text = new Text()
      iter.map { x =>
        text.set(x.toString)
        (NullWritable.get(), text)
      }
    }
    RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
      .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
  }

  def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit
  • sortBy
    Returns the RDD sorted by the value that f computes for each element.
    By default the sort is ascending and the number of partitions is unchanged. A short usage sketch follows the excerpt below.
  /**
   * Return this RDD sorted by the given key function.
   */
  def sortBy[K](
      f: (T) => K,
      ascending: Boolean = true,
      numPartitions: Int = this.partitions.length)
      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
    this.keyBy[K](f)
        .sortByKey(ascending, numPartitions)
        .values
  }
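    A minimal usage sketch (my own example, not from the Spark source), assuming `sc`:
  val people = sc.parallelize(Seq(("alice", 31), ("bob", 25), ("carol", 40)))

  people.sortBy(_._2).collect()                      // ascending by age: bob, alice, carol
  people.sortBy(_._2, ascending = false).collect()   // descending by age: carol, alice, bob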
  • subtract
    Returns an RDD of the elements that exist in this RDD but not in `other`.
  /**
   * Return an RDD with the elements from `this` that are not in `other`.
   *
   * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
   * RDD will be <= us.
   */
  def subtract(other: RDD[T]): RDD[T] = withScope {
    subtract(other, partitioner.getOrElse(new HashPartitioner(partitions.length)))
  }

  /**
   * Return an RDD with the elements from `this` that are not in `other`.
   */
  def subtract(other: RDD[T], numPartitions: Int): RDD[T] = withScope {
    subtract(other, new HashPartitioner(numPartitions))
  }

  /**
   * Return an RDD with the elements from `this` that are not in `other`.
   */
  def subtract(
      other: RDD[T],
      p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
  • take
    Returns an array of the first num elements of the RDD; no sorting is applied.
  /**
   * Take the first num elements of the RDD. It works by first scanning one partition, and use the
   * results from that partition to estimate the number of additional partitions needed to satisfy
   * the limit.
   *
   * @note due to complications in the internal implementation, this method will raise
   * an exception if called on an RDD of `Nothing` or `Null`.
   */
  def take(num: Int): Array[T]
  • takeOrdered
    Returns an array of the num smallest elements of the RDD, i.e. the first num elements after sorting in ascending order.
  /**
   * Returns the first k (smallest) elements from this RDD as defined by the specified
   * implicit Ordering[T] and maintains the ordering. This does the opposite of [[top]].
   * For example:
   * {{{
   *   sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(1)
   *   // returns Array(2)
   *
   *   sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2)
   *   // returns Array(2, 3)
   * }}}
   *
   * @param num k, the number of elements to return
   * @param ord the implicit ordering for T
   * @return an array of top elements
   */
  def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
    if (num == 0) {
      Array.empty
    } else {
      val mapRDDs = mapPartitions { items =>
        // Priority keeps the largest elements, so let's reverse the ordering.
        val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
        queue ++= util.collection.Utils.takeOrdered(items, num)(ord)
        Iterator.single(queue)
      }
      if (mapRDDs.partitions.length == 0) {
        Array.empty
      } else {
        mapRDDs.reduce { (queue1, queue2) =>
          queue1 ++= queue2
          queue1
        }.toArray.sorted(ord)
      }
    }
  }
  • takeSample
    Returns an array of num elements sampled from the RDD, with or without replacement. Unlike sample, the result has a fixed size and is collected to the driver as an array.
  /**
   * Return a fixed-size sampled subset of this RDD in an array
   *
   * @param withReplacement whether sampling is done with replacement
   * @param num size of the returned sample
   * @param seed seed for the random number generator
   * @return sample of specified size in an array
   */
  // TODO: rewrite this without return statements so we can wrap it in a scope
  def takeSample(
      withReplacement: Boolean,
      num: Int,
      seed: Long = Utils.random.nextLong): Array[T]
  • toLocalIterator
    Returns an iterator over all the elements of the RDD. It consumes at most as much memory as the largest partition, but it runs one Spark job per partition, so the RDD should usually be cached first.
  /**
   * Return an iterator that contains all of the elements in this RDD.
   *
   * The iterator will consume as much memory as the largest partition in this RDD.
   *
   * Note: this results in multiple Spark jobs, and if the input RDD is the result
   * of a wide transformation (e.g. join with different partitioners), to avoid
   * recomputing the input RDD should be cached first.
   */
  def toLocalIterator: Iterator[T] = withScope {
    def collectPartition(p: Int): Array[T] = {
      sc.runJob(this, (iter: Iterator[T]) => iter.toArray, Seq(p)).head
    }
    (0 until partitions.length).iterator.flatMap(i => collectPartition(i))
  }
  • top
    Returns an array of the top num elements of the RDD, i.e. the first num elements after sorting in descending order.
  /**
   * Returns the top k (largest) elements from this RDD as defined by the specified
   * implicit Ordering[T]. This does the opposite of [[takeOrdered]]. For example:
   * {{{
   *   sc.parallelize(Seq(10, 4, 2, 12, 3)).top(1)
   *   // returns Array(12)
   *
   *   sc.parallelize(Seq(2, 3, 4, 5, 6)).top(2)
   *   // returns Array(6, 5)
   * }}}
   *
   * @param num k, the number of top elements to return
   * @param ord the implicit ordering for T
   * @return an array of top elements
   */
  def top(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
    takeOrdered(num)(ord.reverse)
  }
  • treeAggregate
    Like aggregate, but merges the per-partition results in a multi-level tree pattern (with the suggested depth, default 2) instead of sending them all directly to the driver.
  /**
   * Aggregates the elements of this RDD in a multi-level tree pattern.
   *
   * @param depth suggested depth of the tree (default: 2)
   * @see [[org.apache.spark.rdd.RDD#aggregate]]
   */
  def treeAggregate[U: ClassTag](zeroValue: U)(
      seqOp: (U, T) => U,
      combOp: (U, U) => U,
      depth: Int = 2): U
  • treeReduce
    Like reduce, but merges the per-partition results in a multi-level tree pattern (with the suggested depth, default 2).
  /**
   * Reduces the elements of this RDD in a multi-level tree pattern.
   *
   * @param depth suggested depth of the tree (default: 2)
   * @see [[org.apache.spark.rdd.RDD#reduce]]
   */
  def treeReduce(f: (T, T) => T, depth: Int = 2): T
  • union
    Returns an RDD containing the union of the elements of the two RDDs; duplicates are not removed.
  /**
   * Return the union of this RDD and another one. Any identical elements will appear multiple
   * times (use `.distinct()` to eliminate them).
   */
  def union(other: RDD[T]): RDD[T] = withScope {
    if (partitioner.isDefined && other.partitioner == partitioner) {
      new PartitionerAwareUnionRDD(sc, Array(this, other))
    } else {
      new UnionRDD(sc, Array(this, other))
    }
  }
  • unpersist
    Marks the RDD as non-persistent and removes all of its blocks from memory and disk.
  /**
   * Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
   *
   * @param blocking Whether to block until all blocks are deleted.
   * @return This RDD.
   */
  def unpersist(blocking: Boolean = true): this.type = {
    logInfo("Removing RDD " + id + " from persistence list")
    sc.unpersistRDD(id, blocking)
    storageLevel = StorageLevel.NONE
    this
  }
  • zip
    Pairs up the elements of this RDD with the elements of other one by one and returns an RDD of the resulting pairs.
    The two RDDs must have the same number of partitions and the same number of elements in every corresponding partition.
    The name is apt: the elements of the two RDDs interlock like the teeth of a zipper. A short usage sketch follows the excerpt below.
  /**
   * Zips this RDD with another one, returning key-value pairs with the first element in each RDD,
   * second element in each RDD, etc. Assumes that the two RDDs have the *same number of
   * partitions* and the *same number of elements in each partition* (e.g. one was made through
   * a map on the other).
   */
  def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
    zipPartitions(other, preservesPartitioning = false) { (thisIter, otherIter) =>
      new Iterator[(T, U)] {
        def hasNext: Boolean = (thisIter.hasNext, otherIter.hasNext) match {
          case (true, true) => true
          case (false, false) => false
          case _ => throw new SparkException("Can only zip RDDs with " +
            "same number of elements in each partition")
        }
        def next(): (T, U) = (thisIter.next(), otherIter.next())
      }
    }
  }
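    A minimal usage sketch (my own example, not from the Spark source), assuming `sc`. Both RDDs have 2 partitions with matching element counts; otherwise the job fails at runtime:
  val a = sc.parallelize(Seq(1, 2, 3), 2)
  val b = sc.parallelize(Seq("a", "b", "c"), 2)

  a.zip(b).collect()   // Array((1,a), (2,b), (3,c))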
  • zipPartitions
    Pairs up the partition iterators of two or more RDDs and applies a function to them to produce a new iterator; returns an RDD whose partitions are built from those iterators.
    The RDDs must have the same number of partitions, but the corresponding partitions do not need to contain the same number of elements.
  /**
   * Zip this RDD's partitions with one (or more) RDD(s) and return a new RDD by
   * applying a function to the zipped partitions. Assumes that all the RDDs have the
   * *same number of partitions*, but does *not* require them to have the same number
   * of elements in each partition.
   */
  def zipPartitions[B: ClassTag, V: ClassTag]
      (rdd2: RDD[B], preservesPartitioning: Boolean)
      (f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V] = withScope {
    new ZippedPartitionsRDD2(sc, sc.clean(f), this, rdd2, preservesPartitioning)
  }

  def zipPartitions[B: ClassTag, V: ClassTag]
      (rdd2: RDD[B])
      (f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V] = withScope {
    zipPartitions(rdd2, preservesPartitioning = false)(f)
  }

  def zipPartitions[B: ClassTag, C: ClassTag, V: ClassTag]
      (rdd2: RDD[B], rdd3: RDD[C], preservesPartitioning: Boolean)
      (f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]): RDD[V] = withScope {
    new ZippedPartitionsRDD3(sc, sc.clean(f), this, rdd2, rdd3, preservesPartitioning)
  }

  def zipPartitions[B: ClassTag, C: ClassTag, V: ClassTag]
      (rdd2: RDD[B], rdd3: RDD[C])
      (f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]): RDD[V] = withScope {
    zipPartitions(rdd2, rdd3, preservesPartitioning = false)(f)
  }
  • zipWithIndex
    Pairs each element of the RDD with its index (a 0-based sequence of natural numbers) and returns an RDD of the resulting (element, index) pairs.
    Because the starting index of every partition must be determined first, Spark counts the number of elements in each partition and accumulates the counts by partition id; this triggers an extra job when the RDD has more than one partition.
  /**
   * Zips this RDD with its element indices. The ordering is first based on the partition index
   * and then the ordering of items within each partition. So the first item in the first
   * partition gets index 0, and the last item in the last partition receives the largest index.
   *
   * This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type.
   * This method needs to trigger a spark job when this RDD contains more than one partitions.
   *
   * Note that some RDDs, such as those returned by groupBy(), do not guarantee order of
   * elements in a partition. The index assigned to each element is therefore not guaranteed,
   * and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee
   * the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
   */
  def zipWithIndex(): RDD[(T, Long)] = withScope {
    new ZippedWithIndexRDD(this)
  }
  • zipWithUniqueId
    Pairs each element of the RDD with a unique Long id and returns an RDD of the resulting (element, id) pairs.
    In the k-th partition (partition ids start at 0) the ids are assigned as follows:
    k, k+n, k+2n, k+3n, k+4n, k+5n, ... (where n is the number of partitions)
    Because no per-partition starting offsets need to be computed, this operator does not trigger an extra job. A short usage sketch follows the excerpt below.
  /**
   * Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
   * 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
   * won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
   *
   * Note that some RDDs, such as those returned by groupBy(), do not guarantee order of
   * elements in a partition. The unique ID assigned to each element is therefore not guaranteed,
   * and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee
   * the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
   */
  def zipWithUniqueId(): RDD[(T, Long)] = withScope {
    val n = this.partitions.length.toLong
    this.mapPartitionsWithIndex { case (k, iter) =>
      iter.zipWithIndex.map { case (item, i) =>
        (item, i * n + k)
      }
    }
  }
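    A minimal usage sketch (my own example, not from the Spark source), assuming `sc`, with 2 partitions (n = 2); it contrasts zipWithIndex and zipWithUniqueId:
  val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)   // partition 0: a, b   partition 1: c, d

  rdd.zipWithIndex().collect()
  // Array((a,0), (b,1), (c,2), (d,3))   -- contiguous indices, triggers an extra job

  rdd.zipWithUniqueId().collect()
  // Array((a,0), (b,2), (c,1), (d,3))   -- ids follow k, k+n, k+2n, ...; no extra job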

About the Author

Tang Lizhe (唐黎哲) is a graduate student at the National Key Laboratory for Parallel and Distributed Processing (PDL), National University of Defense Technology. He has been working with Spark since enrolling in 2014, plans to contribute code to Spark-related open-source communities during the remainder of his graduate studies, and intends to continue research in this area after graduation.
Email: [email protected]
