Spark Programming Basics (Scala Edition): RDD Programming

1. RDD Programming Basics

1.1 Creating RDDs

Spark uses the textFile() method to load data from a file system and create an RDD:

val lines = sc.textFile("file:///usr/local/spark/mycode/rdd/word.txt")   // load from the local file system
val lines = sc.textFile("hdfs://localhost:9000/user/hadoop/word.txt")    // load from HDFS

1.2 RDD Operations

Transformations

  • filter(func): returns a new dataset containing the elements that satisfy func
val  lines = sc.textFile("file:///usr/local/spark/mycode/rdd/word.txt")
val  linesWithSpark=lines.filter(line => line.contains("Spark"))
  • map(func): passes each element to the function func and returns a new dataset
val data = Array(1,2,3,4,5)
val rdd1 = sc.parallelize(data)
val rdd2 = rdd1.map(x => x + 10)   // each element increased by 10
  • flatMap(func): first map each element with func, then flatten the results, as in the sketch below
[Figure: flatMap]
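A minimal sketch (small sample lines assumed): each line is mapped to an array of words, and the arrays are flattened into a single RDD of words.
val lines = sc.parallelize(List("Hadoop is good", "Spark is fast"))
val words = lines.flatMap(line => line.split(" "))
// words contains: "Hadoop", "is", "good", "Spark", "is", "fast"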
  • groupByKey(): applied to a dataset of (K, V) pairs, returns a dataset of (K, Iterable) pairs in which the values sharing a key are grouped together
[Figure: groupByKey]
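A minimal sketch (sample pairs assumed):
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
val grouped = pairs.groupByKey()   // ("a", Iterable(1, 3)), ("b", Iterable(2))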
  • reduceByKey(func): applied to a dataset of (K, V) pairs, returns a new (K, V) dataset in which the values of each key are aggregated with func
words.reduceByKey((a,b) => a+b)
[Figures: reduce and reduceByKey]

Actions

Actions trigger the actual computation. Common actions include count(), first(), take(n), reduce(func), collect(), and foreach(func), as shown in the REPL session below:
scala> val rdd = sc.parallelize(Array(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> rdd.count()
res0: Long = 5
scala> rdd.first()
res1: Int = 1
scala> rdd.take(3)
res2: Array[Int] = Array(1,2,3)
scala> rdd.reduce((a,b)=>a+b)
res3: Int = 15
scala> rdd.collect()
res4: Array[Int] = Array(1,2,3,4,5)
scala> rdd.foreach(elem=>println(elem))
1
2
3
4
5

1.3 Persistence

By default, each time Spark encounters an action it recomputes the result from scratch, starting from the original data.

  • The persistence mechanism avoids this repeated-computation overhead: calling persist() marks an RDD for persistence, and the RDD is actually cached after the first action on it is triggered
scala> val  list = List("Hadoop","Spark","Hive")
list: List[String] = List(Hadoop, Spark, Hive)
scala> val  rdd = sc.parallelize(list)
  • When the RDD is no longer needed, call unpersist() to remove it from the cache; see the sketch below
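Continuing from the rdd created above, a minimal sketch of the full persistence workflow (cache() is shorthand for persist() with the default MEMORY_ONLY storage level):
scala> rdd.cache()          // marks the RDD for persistence; nothing is cached yet
scala> println(rdd.count()) // first action: triggers computation and caches rdd
3
scala> println(rdd.collect().mkString(",")) // reuses the cached data instead of recomputing
Hadoop,Spark,Hive
scala> rdd.unpersist()      // remove rdd from the cache when it is no longer needed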

1.4 Partitioning

An RDD is divided into many partitions that are stored on different nodes. Partitioning an RDD increases parallelism and reduces communication overhead.

  • For example, before joining two datasets on userid, partitioning the data by userid in advance reduces network traffic, as sketched below
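A minimal sketch of this idea, assuming two hypothetical pair RDDs (userData and events) keyed by userid:
import org.apache.spark.HashPartitioner

// hypothetical pair RDDs keyed by userid
val userData = sc.parallelize(List((1, "Alice"), (2, "Bob")))
val events   = sc.parallelize(List((1, "click"), (2, "view"), (1, "buy")))

// partition userData by userid once and persist it,
// so the join below does not have to reshuffle userData
val partitionedUsers = userData.partitionBy(new HashPartitioner(8)).persist()
val joined = partitionedUsers.join(events)   // (userid, (name, event))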

As a rule of thumb, the number of partitions should be close to the number of CPU cores in the cluster:

scala> val  array = Array(1,2,3,4,5)
scala> val rdd = sc.parallelize(array, 2)   // create an RDD with 2 partitions
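To verify or change the number of partitions afterwards, rdd.partitions.size and repartition() can be used; a short sketch:
scala> rdd.partitions.size             // 2, as specified above
scala> val rdd5 = rdd.repartition(5)   // reshuffle the data into 5 partitions
scala> rdd5.partitions.size            // 5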
  • Partitioning applies to (key, value) data; data that is not in key-value form is usually converted to key-value form first
  • To implement a custom partitioner, extend org.apache.spark.Partitioner and implement numPartitions and getPartition(key: Any); overriding equals() is also recommended so Spark can compare partitioners
import org.apache.spark.{Partitioner, SparkContext, SparkConf}

// a custom partitioner must extend org.apache.spark.Partitioner
class MyPartitioner(numParts: Int) extends Partitioner {
  // total number of partitions
  override def numPartitions: Int = numParts
  // map a key to a partition id (here: the last digit of the key)
  override def getPartition(key: Any): Int = {
    key.toString.toInt % 10
  }
}

object TestPartitioner {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("TestPartitioner")
    val sc = new SparkContext(conf)
    // simulate data spread over 5 partitions
    val data = sc.parallelize(1 to 10, 5)
    // repartition by the last digit into 10 partitions, written as 10 output files;
    // the numbers are first wrapped as (1,1), (2,1), ... because partitionBy needs a pair RDD,
    // and the keys (_._1) are extracted again after partitioning
    data.map((_, 1)).partitionBy(new MyPartitioner(10)).map(_._1).saveAsTextFile("file:///usr/local/spark/mycode/rdd/partitioner")
  }
}

1.5 A Complete Example

    scala> val  lines = sc. 
    |  textFile("file:///usr/local/spark/mycode/wordcount/word.txt")
    scala> val wordCount = lines.flatMap(line => line.split(" ")).
    |  map(word => (word, 1)).reduceByKey((a, b) => a + b)
    scala> wordCount.collect()
    scala> wordCount.foreach(println)
[Figure: word count on a cluster]

2. Pair RDDs (Key-Value RDDs)

A pair RDD is an RDD in which every element is a (key, value) pair.
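For example, a minimal sketch of creating the pairRDD used in the examples below, with a small in-memory word list assumed:
val list = List("Hadoop", "Spark", "Hive", "Spark")
val pairRDD = sc.parallelize(list).map(word => (word, 1))
// pairRDD elements: ("Hadoop",1), ("Spark",1), ("Hive",1), ("Spark",1)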

2.1 Common Pair RDD Transformations

  • reduceByKey(func): merges the values that share the same key using func
    pairRDD.reduceByKey((a,b)=>a+b)
  • groupByKey(): groups the values that share the same key into a value list, which can then be processed further with map and other operations
    scala>  val  words = Array("one", "two", "two", "three", "three", "three") 
    scala>  val  wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
    scala>  val  wordCountsWithReduce = wordPairsRDD.reduceByKey(_+_)
    scala>  val  wordCountsWithGroup = wordPairsRDD.
    |  groupByKey().map(t => (t._1, t._2.sum)) // t._1 is the key, t._2 is the value list
  • keys: returns a new RDD consisting of the keys
  • values: returns a new RDD consisting of the values
  • sortByKey() / sortBy(func): sortBy can sort by something other than the key, for example by value:
    scala> val d2 = sc.parallelize(Array(("c",8),("b",25),("c",17),("a",42),("b",4),("d",9),("e",17),("c",2),("f",29),("g",21),("b",9)))
    scala> d2.reduceByKey(_+_).sortBy(_._2,false).collect
    res4: Array[(String, Int)] = Array((a,42),(b,38),(f,29),(c,27),(g,21),(e,17),(d,9))
  • mapValues(func): applies func to each value, leaving the key unchanged
  • join: joining an RDD of (K, V1) pairs with an RDD of (K, V2) pairs returns (K, (V1, V2)) pairs for the keys that appear in both (see the short sketch after the combineByKey example below)
  • combineByKey
    combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine)
    createCombiner: called the first time a key is seen; turns that key's value of type V into a combiner of type C
    mergeValue: called when the same key is seen again; merges the existing C with the new V into a C
    mergeCombiners: merges two Cs
    partitioner: the partitioner to use
    mapSideCombine: whether to combine on the map side
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
object Combine {
    def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("Combine").setMaster("local")
        val sc = new SparkContext(conf)
        val data = sc.parallelize(Array(("company-1",88),("company-1",96),("company-1",85),("company-2",94),("company-2",86),("company-2",74),("company-3",86),("company-3",88),("company-3",92)),3)
        val res = data.combineByKey(
            (income) => (income,1), // createCombiner: turn each value into a (sum, count) pair
            ( acc:(Int,Int), income ) => ( acc._1+income, acc._2+1 ), // mergeValue: add a value into the (sum, count) accumulator
            ( acc1:(Int,Int), acc2:(Int,Int) ) => ( acc1._1+acc2._1, acc1._2+acc2._2 ) // mergeCombiners: merge two (sum, count) accumulators
        ).map({ case (key, value) => (key, value._1, value._1/value._2.toFloat) }) // (company, total income, average income)
        res.repartition(1).saveAsTextFile("file:///usr/local/spark/mycode/rdd/result")
    }
}
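A short sketch (sample data assumed) showing keys, values, sortByKey, mapValues, and join on small pair RDDs:
val pairRDD1 = sc.parallelize(List(("spark", 1), ("spark", 2), ("hadoop", 3), ("hadoop", 5)))
val pairRDD2 = sc.parallelize(List(("spark", "fast")))

pairRDD1.keys.collect()                    // the keys: spark, spark, hadoop, hadoop
pairRDD1.values.collect()                  // the values: 1, 2, 3, 5
pairRDD1.sortByKey().collect()             // sorted by key in ascending order
pairRDD1.mapValues(x => x + 1).collect()   // (spark,2), (spark,3), (hadoop,4), (hadoop,6)
pairRDD1.join(pairRDD2).collect()          // (spark,(1,fast)), (spark,(2,fast))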

2.2 An Example

val  rdd = sc.parallelize(Array(("spark",2),("hadoop",6),("hadoop",4),("spark",6)))
rdd.mapValues(x => (x,1)).reduceByKey((x,y) => (x._1+y._1,x._2 + y._2)).mapValues(x => (x._1 / x._2)).collect()
This pipeline computes the average value for each key, producing ("spark", 4) and ("hadoop", 5).

3. Reading and Writing Data

3.1 Reading and Writing Files

val textFile = sc.textFile("file:///usr/local/spark/mycode/wordcount/word123.txt")   // read a local text file
textFile.saveAsTextFile("file:///usr/local/spark/mycode/wordcount/writeback")        // writeback is a directory; the data is written as part-xxxxx files

3.2 HDFS

val textFile = sc.textFile("hdfs://localhost:9000/user/hadoop/word.txt")   // read from HDFS
textFile.saveAsTextFile("writeback")   // a relative path resolves to the current user's HDFS home directory

3.3 JSON

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import scala.util.parsing.json.JSON
object JSONRead {
    def main(args: Array[String]) {
        val inputFile = "file:///usr/local/spark/examples/src/main/resources/people.json"
        val conf = new SparkConf().setAppName("JSONRead")
        val sc = new SparkContext(conf)
        val jsonStrs = sc.textFile(inputFile)                 // each line is one JSON record
        val result = jsonStrs.map(s => JSON.parseFull(s))     // Some(...) on success, None on failure
        result.foreach( {r => r match {
                        case Some(map: Map[String, Any]) => println(map)
                        case None => println("Parsing failed")
                        case other => println("Unknown data structure: " + other)
                }
        }
        )
    }
}

3.4 HBase

  • The RDD read from HBase has elements of type (ImmutableBytesWritable, Result)
import org.apache.hadoop.conf.Configuration 
import org.apache.hadoop.hbase._
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.mapreduce.TableInputFormat 
import org.apache.hadoop.hbase.util.Bytes 
import org.apache.spark.SparkContext 
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf 
object SparkOperateHBase {
def main(args: Array[String]) {
    val conf = HBaseConfiguration.create()
    val sc = new SparkContext(new SparkConf())
    // the name of the table to read
    conf.set(TableInputFormat.INPUT_TABLE, "student")
    val stuRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
    val count = stuRDD.count()
    println("Students RDD Count:" + count)
    stuRDD.cache()   // mark the RDD for caching so that later actions can reuse it
    // iterate over the rows and print them
    stuRDD.foreach({ case (_,result) =>
        val key = Bytes.toString(result.getRow)
        val name = Bytes.toString(result.getValue("info".getBytes,"name".getBytes))
        val gender = Bytes.toString(result.getValue("info".getBytes,"gender".getBytes))
        val age = Bytes.toString(result.getValue("info".getBytes,"age".getBytes))
        println("Row key:"+key+" Name:"+name+" Gender:"+gender+" Age:"+age)
    })
}
}
  • Each record to write has the form row_key cf:col_1 cf:col_2 (here: the row key plus the info:name, info:gender, and info:age columns)
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.spark._
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
object SparkWriteHBase {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("SparkWriteHBase").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val tablename = "student"
    sc.hadoopConfiguration.set(TableOutputFormat.OUTPUT_TABLE, tablename)
    val job = Job.getInstance(sc.hadoopConfiguration)   // Job.getInstance replaces the deprecated new Job(...)
    job.setOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setOutputValueClass(classOf[Result])
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
    // build two sample records, one per line: row_key,name,gender,age
    val indataRDD = sc.makeRDD(Array("3,Rongcheng,M,26", "4,Guanhua,M,27"))
    val rdd = indataRDD.map(_.split(",")).map{ arr => {
      // the first field is the row key
      val put = new Put(Bytes.toBytes(arr(0)))
      // info:name column
      put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes(arr(1)))
      // info:gender column
      put.add(Bytes.toBytes("info"), Bytes.toBytes("gender"), Bytes.toBytes(arr(2)))
      // info:age column (stored as a string so the read example above can decode it with Bytes.toString)
      put.add(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes(arr(3)))
      // each element of the RDD is a (key, Put) pair
      (new ImmutableBytesWritable, put)
    }}
    rdd.saveAsNewAPIHadoopDataset(job.getConfiguration())
  }
}

4. A Comprehensive Example: Secondary Sort

Perform a secondary sort on a file with two integer columns: sort primarily by the first column and, when the first column is equal, by the second column.

  • Scala's Ordered trait plays the same role as Java's Comparable interface
class SecondarySortKey(val first: Int, val second: Int) extends Ordered[SecondarySortKey] with Serializable {
  // compare by the first field; when the first fields are equal, compare by the second
  def compare(other: SecondarySortKey): Int = {
    if (this.first - other.first != 0) {
      this.first - other.first
    } else {
      this.second - other.second
    }
  }
}

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object SecondarySortApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SecondarySortApp").setMaster("local")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("file:///usr/local/spark/mycode/rdd/examples/file1.txt", 1)
    // build (SecondarySortKey, originalLine) pairs so sortByKey can sort on the composite key
    val pairWithSortKey = lines.map(line => (new SecondarySortKey(line.split(" ")(0).toInt, line.split(" ")(1).toInt), line))
    // false = descending order
    val sorted = pairWithSortKey.sortByKey(false)
    // keep only the original line
    val sortedResult = sorted.map(sortedLine => sortedLine._2)
    sortedResult.collect().foreach(println)
  }
}
