Spark-core Transformation Operators (9)

Transformations Operators Explained, Part 2



  In the previous post we mainly looked at the simpler transformation operators; in this one we analyze the more commonly used ones.

1. The groupBy operator

  As its name suggests, groupBy is a grouping operator, but we have to supply the grouping function ourselves. It differs from groupByKey, which groups directly by the key.
Source:

  def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
    groupBy[K](f, defaultPartitioner(this))
  }
  def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
      : RDD[(K, Iterable[T])] = withScope {
    val cleanF = sc.clean(f)
    this.map(t => (cleanF(t), t)).groupByKey(p)
  }

The source shows that groupBy is still built on groupByKey: map first tags each element with the key produced by our function, turning it into a key-value pair, and groupByKey then groups by that key, yielding a result of type RDD[(K, Iterable[T])].
Usage:

    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
    val sc = new SparkContext(conf)
    val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9))
    val value: RDD[(Boolean, Iterable[Int])] = rdd.groupBy((x: Int) => x % 2 == 0)
    println(value.count())
    value.mapValues((iter: Iterable[Int]) => iter.max).foreach(println)   // max within each group
    value.foreach((iter: (Boolean, Iterable[Int])) => {
      println(iter._2.toList.mkString(","))
    })

After the groupBy above we can compute per-group statistics such as max, min or count,
or simply pass the grouped result on to the next operation.

2. The groupByKey operator

  groupByKey is similar to groupBy in that it groups data, but groupByKey groups a key-value RDD directly by its keys. The return type is RDD[(K, Iterable[V])].
Source:

  def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(defaultPartitioner(self))
  }
  
  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }

Usage:

      val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
      val sc = new SparkContext(conf)
      val rdd: RDD[String] = sc.makeRDD(List("spark", "hello", "hadoop", "hello"))
      rdd.map((_: String, 1))
        .groupByKey()
        .map((x: (String, Iterable[Int])) => (x._1, x._2.sum))
        .foreach(println)

The example above is the classic word-count program.
groupByKey here simply groups by key: it takes no parameters and performs no intra-partition (map-side) aggregation, so as the data volume grows the shuffle may run out of memory (OOM).

3. The reduceByKey operator

  reduceByKey is a grouping-and-aggregation operator for key-value RDDs. Its biggest difference from groupByKey is that it takes an aggregation function and aggregates within each partition before the shuffle, which reduces the amount of data shuffled.
Source:

  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
   */
  def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
    reduceByKey(new HashPartitioner(numPartitions), func)
  }

  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
   * parallelism level.
   */
  def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    reduceByKey(defaultPartitioner(self), func)
  }
  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce.
   */
  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }

The execution flow of reduceByKey works like this: before the shuffle, the supplied function is used to aggregate within each partition, and then again to aggregate between partitions. Note that only one function is passed in, so the intra-partition and inter-partition aggregation functions are identical. This reduces the amount of data shuffled between partitions and, accordingly, the risk of OOM.
The return type of reduceByKey is RDD[(K, V)], where K is the key we grouped on and V is the aggregated value.
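
A minimal usage sketch (assuming the same local SparkContext sc as in the earlier examples; the data is made up for illustration):

    // reduceByKey word count: the single function is used both within and between partitions
    val words = sc.makeRDD(List("spark", "hello", "hadoop", "hello"))
    words.map(w => (w, 1))       // build (word, 1) pairs
      .reduceByKey(_ + _)        // combine inside each partition first, then merge across partitions
      .foreach(println)          // e.g. (hello,2) (spark,1) (hadoop,1) -- printed order not guaranteed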


4. The aggregateByKey operator

  aggregateByKey is similar to reduceByKey: both group and aggregate. The differences are:

  1. aggregateByKey is a curried operator, which simply means it takes two parameter lists. The first parameter list holds the zero value, i.e. the initial value used for intra-partition aggregation. In reduceByKey the first value of each key effectively serves as the initial value, whereas aggregateByKey lets you define it yourself.
  2. aggregateByKey allows the intra-partition and inter-partition functions to differ: the second parameter list takes two aggregation functions.

Source:

  /**
   * Aggregate the values of each key, using given combine functions and a neutral "zero value".
   * This function can return a different result type, U, than the type of the values in this RDD,
   * V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
   * as in scala.TraversableOnce. The former operation is used for merging values within a
   * partition, and the latter is used for merging values between partitions. To avoid memory
   * allocation, both of these functions are allowed to modify and return their first argument
   * instead of creating a new U.
   */
  def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
  }
  /**
   * Aggregate the values of each key, using given combine functions and a neutral "zero value".
   * This function can return a different result type, U, than the type of the values in this RDD,
   * V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
   * as in scala.TraversableOnce. The former operation is used for merging values within a
   * partition, and the latter is used for merging values between partitions. To avoid memory
   * allocation, both of these functions are allowed to modify and return their first argument
   * instead of creating a new U.
   */
  def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    aggregateByKey(zeroValue, new HashPartitioner(numPartitions))(seqOp, combOp)
  }

  /**
   * Aggregate the values of each key, using given combine functions and a neutral "zero value".
   * This function can return a different result type, U, than the type of the values in this RDD,
   * V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
   * as in scala.TraversableOnce. The former operation is used for merging values within a
   * partition, and the latter is used for merging values between partitions. To avoid memory
   * allocation, both of these functions are allowed to modify and return their first argument
   * instead of creating a new U.
   */
  def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    // Serialize the zero value to a byte array so that we can get a new clone of it on each key
    val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
    val zeroArray = new Array[Byte](zeroBuffer.limit)
    zeroBuffer.get(zeroArray)

    lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
    val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))

    // We will clean the combiner closure later in `combineByKey`
    val cleanedSeqOp = self.context.clean(seqOp)
    combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
      cleanedSeqOp, combOp, partitioner)
  }

As the source shows, the first parameter list of aggregateByKey is (zeroValue: U, numPartitions: Int): zeroValue is the initial value and numPartitions is the number of partitions.
The second parameter list is (seqOp: (U, V) => U, combOp: (U, U) => U),
where seqOp is the intra-partition aggregation function and combOp is the inter-partition aggregation function.
Usage:

      val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
      val sc = new SparkContext(conf)

      // groupByKey
      val rdd: RDD[String] = sc.makeRDD(List("spark", "hadoop", "spark", "hadoop", "hello", "hadoop", "hello"))
      rdd.map((_: String, 1))
        .groupByKey()
        .map((x: (String, Iterable[Int])) => (x._1, x._2.sum))
        .foreach(println)
      // reduceByKey
      rdd.map((_: String, 1)).reduceByKey((_: Int) + (_: Int)).foreach(println)
      // aggregateByKey
      rdd.map((_: String, 1)).aggregateByKey(1, 2)(_ + _ * 2, _ + _).foreach(println)

output:
  (hello,6)
  (spark,6)
  (hadoop,9)

Note that the aggregateByKey result depends on how the input is partitioned: the zero value (1 here) is applied once per key in every partition that contains that key, so the same code can print different numbers under a different partitioning.
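
Because the two functions are separate, aggregateByKey can also express aggregations that reduceByKey cannot, such as taking the maximum inside each partition and summing those maxima between partitions. A small sketch (data and partition count made up for illustration):

    // Different seqOp and combOp: max inside each partition, sum of the maxima across partitions
    val pairs = sc.makeRDD(List(("a", 3), ("a", 5), ("b", 2), ("b", 7)), 2)
    pairs
      .aggregateByKey(0)((u, v) => math.max(u, v), (u1, u2) => u1 + u2)
      .foreach(println)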


5. The foldByKey operator

  foldByKey sits between aggregateByKey and reduceByKey. Like aggregateByKey it is curried: the first parameter list is the initial value, and the second parameter list is the aggregation function used both within and between partitions (since the intra-partition and inter-partition functions are the same, the second list has only one parameter).
Source:


  /**
   * Merge the values for each key using an associative function and a neutral "zero value" which
   * may be added to the result an arbitrary number of times, and must not change the result
   * (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.).
   */
  def foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    foldByKey(zeroValue, new HashPartitioner(numPartitions))(func)
  }

  /**
   * Merge the values for each key using an associative function and a neutral "zero value" which
   * may be added to the result an arbitrary number of times, and must not change the result
   * (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.).
   */
  def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    foldByKey(zeroValue, defaultPartitioner(self))(func)
  }
  /**
   * Merge the values for each key using an associative function and a neutral "zero value" which
   * may be added to the result an arbitrary number of times, and must not change the result
   * (e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.).
   */
  def foldByKey(
      zeroValue: V,
      partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    // Serialize the zero value to a byte array so that we can get a new clone of it on each key
    val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
    val zeroArray = new Array[Byte](zeroBuffer.limit)
    zeroBuffer.get(zeroArray)

    // When deserializing, use a lazy val to create just one instance of the serializer per task
    lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
    val createZero = () => cachedSerializer.deserialize[V](ByteBuffer.wrap(zeroArray))

    val cleanedFunc = self.context.clean(func)
    combineByKeyWithClassTag[V]((v: V) => cleanedFunc(createZero(), v),
      cleanedFunc, cleanedFunc, partitioner)
  }

Usage:

      val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
      val sc = new SparkContext(conf)

      // groupByKey
      val rdd: RDD[String] = sc.makeRDD(List("spark", "hadoop", "spark", "hadoop", "hello", "hadoop", "hello"))
      rdd.map((_: String, 1))
        .groupByKey()
        .map((x: (String, Iterable[Int])) => (x._1, x._2.sum))
        .foreach(println)
      // reduceByKey
      rdd.map((_: String, 1)).reduceByKey((_: Int) + (_: Int)).foreach(println)
      // aggregateByKey
      rdd.map((_: String, 1)).aggregateByKey(0, 2)(_ + _ * 2, _ + _).foreach(println)
      // foldByKey
      rdd.map((_: String, 1)).foldByKey(0, 2)(_ + _).foreach(println)

output:
(hello,2)
(spark,2)
(hadoop,3)
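
As with aggregateByKey, the zero value of foldByKey is folded in once per key in every partition that contains that key, so a non-zero initial value makes the result depend on how the data is partitioned. A small illustrative sketch (data made up):

    // The zero value 10 is applied once per key per partition, so these "counts"
    // come out larger than the real word counts and vary with the partitioning.
    val wc = sc.makeRDD(List("spark", "spark", "hello"), 2)
    wc.map(w => (w, 1))
      .foldByKey(10)(_ + _)
      .foreach(println)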


6. The combineByKey operator

  combineByKey lets the first value of each key be converted into a different structure when the existing one does not fit the computation, and the intra-partition and inter-partition rules can differ. When aggregating, you specify the target data structure yourself.
Let's look at the source first:


  /**
   * Generic function to combine the elements for each key using a custom set of aggregation
   * functions. This method is here for backward compatibility. It does not provide combiner
   * classtag information to the shuffle.
   *
   * @see `combineByKeyWithClassTag`
   */
  def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
      partitioner, mapSideCombine, serializer)(null)
  }

  /**
   * Simplified version of combineByKeyWithClassTag that hash-partitions the output RDD.
   * This method is here for backward compatibility. It does not provide combiner
   * classtag information to the shuffle.
   *
   * @see `combineByKeyWithClassTag`
   */
  def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      numPartitions: Int): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners, numPartitions)(null)
  }

From the source we can see that three parameters are required, createCombiner, mergeValue and mergeCombiners, plus an optional numPartitions (number of partitions).

  • createCombiner: V => C — takes the current value as its argument; here we can perform extra work on it (such as a type conversion) and return it (this step acts like initialization).
  • mergeValue: (C, V) => C — merges an element V into the previously built C (from createCombiner); this runs inside each partition.
  • mergeCombiners: (C, C) => C — merges two C values; this runs between partitions.

combineByKey is a fairly central operator, since the other grouping-and-aggregation operators are built on it underneath. To make it clearer, we use computing an average as the example:
      val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
      val sc = new SparkContext(conf)

      // groupByKey
      val rdd: RDD[String] = sc.makeRDD(List("spark", "hadoop", "spark", "hadoop", "hello", "hadoop", "hello"))
      val rdd1: RDD[(String, Double)] = sc.makeRDD(List(("Tom", 88.0), ("Tom", 90.0), ("jack", 86.0), ("jack", 85.0), ("jack", 82.0), ("Tom", 90.0)))
      rdd.map((_: String, 1))
        .groupByKey()
        .map((x: (String, Iterable[Int])) => (x._1, x._2.sum))
        .foreach(println)
      // reduceByKey
      rdd.map((_: String, 1)).reduceByKey((_: Int) + (_: Int)).foreach(println)
      // aggregateByKey
      rdd.map((_: String, 1)).aggregateByKey(0, 2)((_: Int) + (_: Int) * 2, (_: Int) + (_: Int)).foreach(println)
      // foldByKey
      rdd.map((_: String, 1)).foldByKey(0, 2)((_: Int) + (_: Int)).foreach(println)
      // combineByKey as a word count: the combiner type C is simply Int
      rdd.map((_: String, 1))
        .combineByKey((v: Int) => v, (c: Int, v: Int) => c + v, (c1: Int, c2: Int) => c1 + c2)
        .foreach(println)
      // combineByKey computing an average score: C is (count, sum)
      val value: RDD[(String, Double)] = rdd1.combineByKey(
        (score: Double) => (1, score),
        (c: (Int, Double), newValue: Double) => (c._1 + 1, c._2 + newValue),
        (c1: (Int, Double), c2: (Int, Double)) => (c1._1 + c2._1, c1._2 + c2._2)
      ).map((x: (String, (Int, Double))) => (x._1, x._2._2 / x._2._1))
      value.foreach(println)
output:
		(jack,84.33333333333333)
		(Tom,89.33333333333333)

Having covered these five grouping (and aggregation) operators, here is a summary:

  • groupByKey — grouping only. Works on key-value RDDs. It has no aggregation capability: all data is shuffled and grouped across partitions, which can produce a large shuffle and carries a higher risk of OOM.
  • reduceByKey — grouping and aggregation. Works on key-value RDDs and aggregates both within and between partitions. It takes a single main parameter (the aggregation function) and is not curried; the same function is used inside and between partitions. It aggregates within each partition before the shuffle and then merges those results between partitions, which reduces the shuffle volume. The first value of each key acts as the initial value. Its underlying implementation is combineByKey.
  • aggregateByKey — grouping and aggregation. Works on key-value RDDs and aggregates both within and between partitions. Its parameters are curried into two lists: the first holds the initial value and the number of partitions, the second holds two aggregation functions, one for within partitions and one for between partitions. Like reduceByKey it aggregates within partitions first and then between partitions; the main difference is the explicit initial value, which helps in specific aggregation scenarios.
  • foldByKey — grouping and aggregation. Works on key-value RDDs and aggregates both within and between partitions. Like aggregateByKey its parameters are curried: the first list is the initial value and the number of partitions, the second list is a single aggregation function used both within and between partitions. It sits between aggregateByKey and reduceByKey in capability.
  • combineByKey — grouping and aggregation. Works on key-value RDDs and aggregates both within and between partitions. combineByKey is the most general of the five and is not curried; it takes three main parameters, createCombiner, mergeValue and mergeCombiners. createCombiner converts the first value of each key into the structure used for aggregation (think of it as drawing the table before filling it in), mergeValue folds subsequent values into that structure within a partition, and mergeCombiners merges the per-partition results between partitions.



7. The sortBy and sortByKey operators
  We cover sortBy and sortByKey together because they are so alike: sortBy is implemented on top of sortByKey.
sortByKey, as the name says, sorts by key and is very simple to use. Let's look at the source first.
Source:


  /**
   * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
   * `collect` or `save` on the resulting RDD will return or output an ordered list of records
   * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
   * order of the keys).
   */
  // TODO: this currently doesn't work on P other than Tuple2!
  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)
    new ShuffledRDD[K, V, V](self, part)
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }

  /**
   * Return this RDD sorted by the given key function.
   */
  def sortBy[K](
      f: (T) => K,
      ascending: Boolean = true,
      numPartitions: Int = this.partitions.length)
      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
    this.keyBy[K](f)
        .sortByKey(ascending, numPartitions)
        .values
  }

As the source shows, sortByKey's parameters are just the sort order and the number of partitions; it works on key-value RDDs and sorts by key. sortBy first extracts a sort key via keyBy and then calls sortByKey. Both are simple operators.
A quick usage demonstration:

    // TODO create the execution environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("create")
    val sc = new SparkContext(conf)

    val rdd: RDD[(String, Double)] = sc.makeRDD(List(("Tom", 88.0), ("Tom", 90.0), ("jack", 86.0), ("Toni", 85.0), ("jack", 82.0), ("Tom", 93.0)),2)
    rdd.sortByKey(ascending = true).foreach(println)
    rdd.sortBy(_._1,ascending = false).foreach(println)
output:
// sortByKey
(Tom,88.0)
(Toni,85.0)
(Tom,90.0)
(jack,86.0)
(Tom,93.0)
(jack,82.0)
// sortBy
(Tom,88.0)
(Tom,90.0)
(Tom,93.0)
(jack,86.0)
(jack,82.0)
(Toni,85.0)

Note that the printed order above does not look globally sorted: foreach prints each partition's records in parallel, so the console order does not necessarily reflect the sort; collect the RDD first if you want to see the global order. Now a summary of the two operators:

  1. sortBy: works on ordinary RDDs; you must supply a key-extraction function saying which value to sort by, which makes it more flexible. The sort order and the number of partitions can be specified. (See the sketch below.)
  2. sortByKey: works only on RDDs of pairs and sorts by the key, i.e. the first element of each tuple. It is less flexible than sortBy but more convenient to use; the sort order and the number of partitions can also be specified.
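
Since sortBy works on ordinary RDDs and lets you pick the sort key freely, here is a minimal sketch sorting plain integers and then sorting pairs by value instead of key (data made up); collect() is used so the printed order reflects the global sort:

    // sortBy on a non key-value RDD, and on pairs sorted by the value rather than the key
    val nums = sc.makeRDD(List(3, 1, 4, 1, 5), 2)
    nums.sortBy(x => x).collect().foreach(println)                      // 1, 1, 3, 4, 5

    val grades = sc.makeRDD(List(("Tom", 88.0), ("jack", 86.0), ("Toni", 85.0)))
    grades.sortBy(_._2, ascending = false).collect().foreach(println)  // highest score first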


8. The cogroup, join and cartesian operators

  cogroup, join and cartesian all combine RDDs; their main use is merging multiple RDDs.
  • cogroup
    cogroup can combine several RDDs at once. It merges by key: for each key there are as many CompactBuffers as there are participating RDDs; values coming from the same RDD end up in the same CompactBuffer, and values from different RDDs end up in different ones.

Source:


  /**
   * For each key k in `this` or `other1` or `other2` or `other3`,
   * return a resulting RDD that contains a tuple with the list of values
   * for that key in `this`, `other1`, `other2` and `other3`.
   */
  def cogroup[W1, W2, W3](other1: RDD[(K, W1)],
      other2: RDD[(K, W2)],
      other3: RDD[(K, W3)],
      partitioner: Partitioner)
      : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))] = self.withScope {
    if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
    val cg = new CoGroupedRDD[K](Seq(self, other1, other2, other3), partitioner)
    cg.mapValues { case Array(vs, w1s, w2s, w3s) =>
       (vs.asInstanceOf[Iterable[V]],
         w1s.asInstanceOf[Iterable[W1]],
         w2s.asInstanceOf[Iterable[W2]],
         w3s.asInstanceOf[Iterable[W3]])
    }
  }
  • join
    join only merges two RDDs. It is implemented on top of cogroup, post-processing the cogroup result. Let's look at the source first.

  /**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
   */
  def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues( pair =>
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
    )
  }

On top of the cogroup result, join iterates over the two iterables and yields their combinations:
for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w). If either iterable is empty, that key produces nothing, so join only keeps keys that exist in both RDDs and drops the rest, just like an inner join in MySQL. Similarly, leftOuterJoin and rightOuterJoin behave like MySQL's left and right outer joins.
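
For keys that exist in only one of the two RDDs, leftOuterJoin and rightOuterJoin keep that side and wrap the missing side in an Option. A minimal sketch (data made up):

    // leftOuterJoin keeps every key from the left RDD; missing right-side values become None
    val left  = sc.makeRDD(List(("Tom", 88.0), ("jack", 86.0)))
    val right = sc.makeRDD(List(("Tom", 85.0)))
    left.leftOuterJoin(right).foreach(println)
    //    (Tom,(88.0,Some(85.0)))
    //    (jack,(86.0,None))   -- printed order not guaranteed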

  • cartesian
    cartesian is the Cartesian product: no key matching is involved; every element of one RDD is paired with every element of the other. The source makes this clearest:

  /**
   * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
   * elements (a, b) where a is in `this` and b is in `other`.
   */
  def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
    new CartesianRDD(sc, this, other)
  }


private[spark]
class CartesianRDD[T: ClassTag, U: ClassTag](
    sc: SparkContext,
    var rdd1 : RDD[T],
    var rdd2 : RDD[U])
  extends RDD[(T, U)](sc, Nil)
  with Serializable {

  val numPartitionsInRdd2 = rdd2.partitions.length

  override def getPartitions: Array[Partition] = {
    // create the cross product split
    val array = new Array[Partition](rdd1.partitions.length * rdd2.partitions.length)
    for (s1 <- rdd1.partitions; s2 <- rdd2.partitions) {
      val idx = s1.index * numPartitionsInRdd2 + s2.index
      array(idx) = new CartesianPartition(idx, rdd1, rdd2, s1.index, s2.index)
    }
    array
  }

  override def getPreferredLocations(split: Partition): Seq[String] = {
    val currSplit = split.asInstanceOf[CartesianPartition]
    (rdd1.preferredLocations(currSplit.s1) ++ rdd2.preferredLocations(currSplit.s2)).distinct
  }

  override def compute(split: Partition, context: TaskContext): Iterator[(T, U)] = {
    val currSplit = split.asInstanceOf[CartesianPartition]
    for (x <- rdd1.iterator(currSplit.s1, context);
         y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
  }

  override def getDependencies: Seq[Dependency[_]] = List(
    new NarrowDependency(rdd1) {
      def getParents(id: Int): Seq[Int] = List(id / numPartitionsInRdd2)
    },
    new NarrowDependency(rdd2) {
      def getParents(id: Int): Seq[Int] = List(id % numPartitionsInRdd2)
    }
  )

  override def clearDependencies(): Unit = {
    super.clearDependencies()
    rdd1 = null
    rdd2 = null
  }
}

The key part is:

  override def compute(split: Partition, context: TaskContext): Iterator[(T, U)] = {
    val currSplit = split.asInstanceOf[CartesianPartition]
    for (x <- rdd1.iterator(currSplit.s1, context);
         y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
  }

Usage:

    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
    val sc = new SparkContext(conf)

    val rdd1: RDD[(String, Double)] = sc.makeRDD(List(("Tom", 88.0), ("Tom", 90.0), ("jack", 86.0), ("Toni", 85.0), ("jack", 82.0), ("Tom", 93.0)))
    val rdd2: RDD[(String, Double)] = sc.makeRDD(List(("Tom", 85.0), ("jack", 82.0), ("Toni", 93.0)))
    val value: RDD[(String, (Iterable[Double], Iterable[Double]))] = rdd1.cogroup(rdd2)
    value.foreach(println)
            //    (Toni,(CompactBuffer(85.0),CompactBuffer(93.0)))
            //    (Tom,(CompactBuffer(88.0, 90.0, 93.0),CompactBuffer(85.0)))
            //    (jack,(CompactBuffer(86.0, 82.0),CompactBuffer(82.0)))
    val value1: RDD[(String, (Double, Double))] = rdd1.join(rdd2)
    value1.foreach(println)
            //    (jack,(86.0,82.0))
            //    (Toni,(85.0,93.0))
            //    (jack,(82.0,82.0))
            //    (Tom,(88.0,85.0))
            //    (Tom,(90.0,85.0))
            //    (Tom,(93.0,85.0))
    val value2: RDD[((String, Double), (String, Double))] = rdd1.cartesian(rdd2)
    value2.foreach(println)
          //    ((Tom,88.0),(Tom,85.0))
          //    ((Tom,88.0),(jack,82.0))
          //    ((Tom,88.0),(Toni,93.0))
          //    ((Tom,90.0),(Tom,85.0))
          //    ((Tom,90.0),(jack,82.0))
          //    ((Tom,90.0),(Toni,93.0))
          //    ((jack,86.0),(Tom,85.0))
          //    ((jack,86.0),(jack,82.0))
          //    ((jack,86.0),(Toni,93.0))


9. The pipe operator

  pipe runs the RDD through an external program; you pass in a shell command or the path of an external script. The official description is: "Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings."
This operator is fairly obscure and I have not used it myself, so my understanding is shallow; I will fill in the full source and usage once I have studied it more deeply.
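
For reference, the call shape is roughly as follows; this is only a minimal sketch, assuming a Unix-like environment where the cat command is available:

    // Each element of a partition is written to cat's stdin; each line of its stdout
    // becomes an element of the resulting RDD[String]. A script path also works here.
    val lines = sc.makeRDD(List("spark", "hadoop", "hello"))
    val piped = lines.pipe("cat")
    piped.foreach(println)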

10. The coalesce, repartition and repartitionAndSortWithinPartitions operators

  All three operators manipulate partitions. coalesce and repartition simply repartition the data, while repartitionAndSortWithinPartitions additionally sorts the records inside each partition.
Since the principles are straightforward, let's look at the source first:

coalesce:


  /**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   *
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions. If a larger number
   * of partitions is requested, it will stay at the current number of partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * @note With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner. The optional partition coalescer
   * passed in must be serializable.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](
          mapPartitionsWithIndexInternal(distributePartition, isOrderSensitive = true),
          new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }

repartition:


  /**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

repartitionAndSortWithinPartitions:

  def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
    new ShuffledRDD[K, V, V](self, partitioner).setKeyOrdering(ordering)
  }

As the source shows, repartition is simply coalesce with shuffle = true.
Now the typical usage scenarios:

Suppose an RDD has N partitions and needs to be repartitioned into M partitions:

    N < M: the N partitions usually hold unevenly distributed data, and a HashPartitioner is used to redistribute it into M partitions. In this case shuffle must be set to true, because the repartitioning is a wide dependency and a shuffle will occur; use coalesce(shuffle = true), or simply repartition().

    N > M with N and M of similar magnitude (say N is 1000 and M is 100): several of the N partitions can be merged into each new partition, finally producing M partitions. This is a narrow dependency, so coalesce(shuffle = false) is enough.

    N > M with a huge gap between them: if shuffle is set to false, the parent and child RDDs are narrowly dependent and sit in the same stage, which may leave the Spark job with too little parallelism and hurt performance. In particular, when M is 1, set shuffle to true so that the operations before the coalesce keep a better degree of parallelism.

Summary
  If the requested number of partitions is larger than the current one and shuffle is false, the number of partitions does not change; in other words, without a shuffle you cannot increase the number of partitions of an RDD. (This part is based on: Spark学习-Coalesce()方法和rePartition()方法.)
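
A minimal sketch of the effect on partition counts (assuming the local sc from the earlier examples):

    // coalesce without shuffle can only reduce the partition count;
    // repartition (coalesce with shuffle = true) can also increase it.
    val data = sc.makeRDD(1 to 100, 8)
    println(data.coalesce(2).getNumPartitions)                   // 2  (narrow dependency, no shuffle)
    println(data.coalesce(16).getNumPartitions)                  // stays at 8: cannot grow without a shuffle
    println(data.coalesce(16, shuffle = true).getNumPartitions)  // 16
    println(data.repartition(16).getNumPartitions)               // 16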

11. The mapValues operator

  mapValues is a standard key-value operator. It is much like map: both are narrow, one-to-one, shuffle-free operators; the only difference is that mapValues applies the function to the values of a key-value RDD only. The source and usage are both very simple; the source is shown below:


  /**
   * Pass each value in the key-value pair RDD through a map function without changing the keys;
   * this also retains the original RDD's partitioning.
   */
  def mapValues[U](f: V => U): RDD[(K, U)] = self.withScope {
    val cleanF = self.context.clean(f)
    new MapPartitionsRDD[(K, U), (K, V)](self,
      (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
      preservesPartitioning = true)
  }

In (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) } we can see that when the iterator is mapped, k is left unchanged and only v is turned into a new value by the function we pass in.
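
For completeness, a one-line usage sketch (assuming the sc from earlier examples; data made up):

    // Keys are untouched; only the values go through the function
    val grades = sc.makeRDD(List(("Tom", 88.0), ("jack", 86.0)))
    grades.mapValues(_ + 1).foreach(println)   // (Tom,89.0), (jack,87.0) -- order not guaranteed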

12. The flatMapValues operator

  flatMapValues is a key-value operator much like flatMap: it flattens the values. For each value, the function we pass in produces a series of key-value pairs that all keep the original key.
The source:


  /**
   * Pass each value in the key-value pair RDD through a flatMap function without changing the
   * keys; this also retains the original RDD's partitioning.
   */
  def flatMapValues[U](f: V => TraversableOnce[U]): RDD[(K, U)] = self.withScope {
    val cleanF = self.context.clean(f)
    new MapPartitionsRDD[(K, U), (K, V)](self,
      (context, pid, iter) => iter.flatMap { case (k, v) =>
        cleanF(v).map(x => (k, x))
      },
      preservesPartitioning = true)
  }

Usage:

    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("BOKE")
    val sc = new SparkContext(conf)
    val rdd2: RDD[(String, Int)] = sc.makeRDD(List(("Tom", 1), ("Tom", 2), ("jack", 3)))
    rdd2.flatMapValues(_ to 5).foreach(println)
          //    (jack,3)
          //    (Tom,1)
          //    (Tom,2)
          //    (Tom,2)
          //    (jack,4)
          //    (Tom,3)
          //    (Tom,3)
          //    (Tom,4)
          //    (jack,5)
          //    (Tom,5)
          //    (Tom,4)
          //    (Tom,5)


13. The keys and values operators

  keys and values simply return the keys or the values of a key-value RDD, forming a new RDD[K] or RDD[V] respectively.
The usage and source are trivial, so we won't dwell on them.
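
For quick reference, a two-line sketch (assuming the sc from earlier examples; data made up):

    // keys gives RDD[K], values gives RDD[V]
    val pairs = sc.makeRDD(List(("Tom", 1), ("jack", 3)))
    pairs.keys.foreach(println)     // Tom, jack
    pairs.values.foreach(println)   // 1, 3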



That is it for the transformations operators for now; I will keep filling this in as I learn more. Next up are the action operators. Bye for now!

