1. RDD Programming Basics
1.1 Creating RDDs
Spark uses the textFile() method to load data from a file system and create an RDD:
val lines = sc.textFile("file:///usr/local/spark/mycode/rdd/word.txt")
val lines = sc.textFile("hdfs://localhost:9000/user/hadoop/word.txt")
1.2 RDD Operations
Transformations
- filter(func): keeps only the elements that satisfy func and returns a new dataset
val lines = sc.textFile("file:///usr/local/spark/mycode/rdd/word.txt")
val linesWithSpark=lines.filter(line => line.contains("Spark"))
- map(func): passes each element to the function func and returns a new dataset
val data = Array(1,2,3,4,5)
val rdd1= sc.parallelize(data)
val rdd2=rdd1.map(x=>x+10)
- flatMap(func): first applies map, then flattens the results (see the sketch after this list)
- groupByKey(): applied to a dataset of (K, V) pairs, returns a dataset of (K, Iterable[V]) pairs
- reduceByKey(func): applied to a dataset of (K, V) pairs, returns a new (K, V) dataset in which each value is the result of aggregating all values of the same key with func
words.reduceByKey((a,b) => a+b)
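A minimal sketch of flatMap and groupByKey, assuming the same word.txt as above (its contents are not shown here); words below has the same (word, 1) shape that the reduceByKey line above operates on:
scala> val lines = sc.textFile("file:///usr/local/spark/mycode/rdd/word.txt")
scala> val words = lines.flatMap(line => line.split(" ")).map(word => (word, 1))
scala> words.groupByKey().map(t => (t._1, t._2.sum)).collect()  // the same word count, computed via groupByKey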
Actions
scala> val rdd = sc.parallelize(Array(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> rdd.count()
res0: Long = 5
scala> rdd.first()
res1: Int = 1
scala> rdd.take(3)
res2: Array[Int] = Array(1,2,3)
scala> rdd.reduce((a,b)=>a+b)
res3: Int = 15
scala> rdd.collect()
res4: Array[Int] = Array(1,2,3,4,5)
scala> rdd.foreach(elem=>println(elem))
1
2
3
4
5
1.3 Persistence
Every time Spark encounters an action, it recomputes the whole lineage from scratch.
- The persistence mechanism avoids this repeated-computation overhead: calling persist() marks an RDD for persistence, and the RDD is actually persisted once the first action on it is triggered (see the sketch below).
scala> val list = List("Hadoop","Spark","Hive")
list: List[String] = List(Hadoop, Spark, Hive)
scala> val rdd = sc.parallelize(list)
- When an RDD is no longer needed, unpersist() removes it from the cache.
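A minimal sketch of the full persistence cycle, continuing the list example above (cache() is shorthand for persist() with the default MEMORY_ONLY storage level):
scala> rdd.cache()                            // mark the RDD for persistence; nothing is cached yet
scala> println(rdd.count())                   // first action: computes the RDD and caches it
scala> println(rdd.collect().mkString(","))   // reuses the cached data instead of recomputing
scala> rdd.unpersist()                        // remove the RDD from the cache when it is no longer needed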
1.4 Partitioning
An RDD is divided into many partitions that are stored on different nodes. Partitioning an RDD increases the degree of parallelism and reduces communication overhead.
- For example, before a join on userid, the data can be partitioned by userid so that less data has to be shuffled across the network.
The number of partitions should ideally match the number of CPU cores in the cluster.
scala> val array = Array(1,2,3,4,5)
scala> val rdd = sc.parallelize(array,2)
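A minimal sketch for inspecting and changing the number of partitions of the rdd created above:
scala> rdd.partitions.size          // 2, the number requested in parallelize
scala> val rdd2 = rdd.repartition(4)
scala> rdd2.partitions.size         // 4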
- Partitioning mainly applies to key-value data; RDDs that are not in key-value form are usually converted to key-value form first.
- To implement a custom partitioner, extend org.apache.spark.Partitioner and override numPartitions and getPartition (overriding equals() is also recommended).
import org.apache.spark.{Partitioner, SparkContext, SparkConf}
// Custom partitioner: must extend org.apache.spark.Partitioner
class MyPartitioner(numParts: Int) extends Partitioner {
  // total number of partitions
  override def numPartitions: Int = numParts
  // map a key to a partition id (here: by the last digit of the key)
  override def getPartition(key: Any): Int = {
    key.toString.toInt % 10
  }
}
object TestPartitioner {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    // simulate data spread over 5 partitions
    val data = sc.parallelize(1 to 10, 5)
    // repartition by last digit into 10 partitions and write one output file per partition;
    // each number is first wrapped as (n, 1) so partitionBy can be applied, then the key (_._1) is extracted again
    data.map((_, 1)).partitionBy(new MyPartitioner(10)).map(_._1).saveAsTextFile("file:///usr/local/spark/mycode/rdd/partitioner")
  }
}
1.5 A Complete Example: Word Count
scala> val lines = sc.
| textFile("file:///usr/local/spark/mycode/wordcount/word.txt")
scala> val wordCount = lines.flatMap(line => line.split(" ")).
| map(word => (word, 1)).reduceByKey((a, b) => a + b)
scala> wordCount.collect()
scala> wordCount.foreach(println)
2. Key-Value Pair RDDs
A key-value pair RDD (pair RDD) is an RDD whose elements are all (key, value) pairs.
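For instance, a pair RDD can be built by mapping every word of a text file to a (word, 1) pair (a minimal sketch; the file path is a placeholder):
scala> val lines = sc.textFile("file:///usr/local/spark/mycode/pairrdd/word.txt")
scala> val pairRDD = lines.flatMap(line => line.split(" ")).map(word => (word, 1))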
2.1 Common Pair RDD Transformations
- reduceByKey(func): merges the values of each key using func
pairRDD.reduceByKey((a,b)=>a+b)
- groupByKey(): groups the values that share the same key into a value list; the list can then be processed further with map and similar operations
scala> val words = Array("one", "two", "two", "three", "three", "three")
scala> val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
scala> val wordCountsWithReduce = wordPairsRDD.reduceByKey(_+_)
scala> val wordCountsWithGroup = wordPairsRDD.
| groupByKey().map(t => (t._1, t._2.sum)) // t._1 is the key, t._2 is the value list
- keys: returns a new RDD consisting of the keys
- values: returns a new RDD consisting of the values
- sortByKey() / sortBy(): sortBy can sort by something other than the key
scala> val d2 = sc.parallelize(Array(("c",8),("b",25),("c",17),("a",42),("b",4),("d",9),("e",17),("c",2),("f",29),("g",21),("b",9)))
scala> d2.reduceByKey(_+_).sortBy(_._2,false).collect
res4: Array[(String, Int)] = Array((a,42),(b,38),(f,29),(c,27),(g,21),(e,17),(d,9))
- mapValues(func): applies func to each value, leaving the keys unchanged
- join: joins (K, V1) with (K, V2) and returns (K, (V1, V2)) (a short sketch of keys, values, mapValues and join follows the combineByKey example below)
- combineByKey
combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine)
createCombiner: invoked the first time a key is encountered; converts a value of type V into a combiner of type C
mergeValue: invoked when the same key is encountered again; merges the existing combiner C with the new value V into a C
mergeCombiners: merges two combiners C
partitioner: the partitioner to use
mapSideCombine: whether to combine on the map side
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
object Combine {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Combine").setMaster("local")
    val sc = new SparkContext(conf)
    val data = sc.parallelize(Array(("company-1",88),("company-1",96),("company-1",85),("company-2",94),("company-2",86),("company-2",74),("company-3",86),("company-3",88),("company-3",92)),3)
    val res = data.combineByKey(
      (income) => (income, 1),                                                        // createCombiner: wrap each value as a (sum, count) pair
      (acc: (Int, Int), income) => (acc._1 + income, acc._2 + 1),                     // mergeValue: fold a new value into the running (sum, count)
      (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)  // mergeCombiners: add two (sum, count) pairs
    ).map({ case (key, value) => (key, value._1, value._1 / value._2.toFloat) })      // (company, total income, average income)
    res.repartition(1).saveAsTextFile("file:///usr/local/spark/mycode/rdd/result")
  }
}
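As mentioned above, a minimal sketch of keys, values, mapValues and join on small, hypothetical pair RDDs:
scala> val pairRDD1 = sc.parallelize(Array(("spark",1),("spark",2),("hadoop",3),("hadoop",5)))
scala> pairRDD1.keys.collect()                    // Array(spark, spark, hadoop, hadoop)
scala> pairRDD1.values.collect()                  // Array(1, 2, 3, 5)
scala> pairRDD1.mapValues(x => x + 1).collect()   // Array((spark,2), (spark,3), (hadoop,4), (hadoop,6))
scala> val pairRDD2 = sc.parallelize(Array(("spark","fast")))
scala> pairRDD1.join(pairRDD2).collect()          // e.g. Array((spark,(1,fast)), (spark,(2,fast)))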
2.2 An Example
Given the pairs ("spark",2), ("hadoop",6), ("hadoop",4) and ("spark",6), compute the average value for each key:
val rdd = sc.parallelize(Array(("spark",2),("hadoop",6),("hadoop",4),("spark",6)))
rdd.mapValues(x => (x,1)).reduceByKey((x,y) => (x._1+y._1,x._2 + y._2)).mapValues(x => (x._1 / x._2)).collect()
// mapValues turns each value v into (v, 1); reduceByKey sums the values and the counts per key;
// the final mapValues divides sum by count, yielding e.g. Array((spark,4), (hadoop,5))
3. Reading and Writing Data
3.1 Local File Read and Write
val textFile = sc.textFile("file:///usr/local/spark/mycode/wordcount/word123.txt")
textFile.saveAsTextFile("file:///usr/local/spark/mycode/wordcount/writeback")  // writeback is created as a directory; Spark writes one part-* file per partition into it
3.2 HDFS
val textFile = sc.textFile("hdfs://localhost:9000/user/hadoop/word.txt")
textFile.saveAsTextFile("writeback")  // a relative path is resolved against the current user's HDFS home directory, e.g. /user/hadoop/writeback
3.3 JSON
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import scala.util.parsing.json.JSON
object JSONRead {
  def main(args: Array[String]) {
    val inputFile = "file:///usr/local/spark/examples/src/main/resources/people.json"
    val conf = new SparkConf().setAppName("JSONRead")
    val sc = new SparkContext(conf)
    val jsonStrs = sc.textFile(inputFile)
    // parse each line independently; JSON.parseFull returns an Option
    val result = jsonStrs.map(s => JSON.parseFull(s))
    result.foreach({ r =>
      r match {
        case Some(map: Map[String, Any]) => println(map)
        case None => println("Parsing failed")
        case other => println("Unknown data structure: " + other)
      }
    })
  }
}
3.4 HBase
- Rows read from HBase arrive as (ImmutableBytesWritable, Result) pairs
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase._
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SparkOperateHBase {
  def main(args: Array[String]) {
    val conf = HBaseConfiguration.create()
    val sc = new SparkContext(new SparkConf())
    // set the table to read
    conf.set(TableInputFormat.INPUT_TABLE, "student")
    val stuRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])
    val count = stuRDD.count()
    println("Students RDD Count:" + count)
    stuRDD.cache()
    // iterate over the rows and print selected columns
    stuRDD.foreach({ case (_, result) =>
      val key = Bytes.toString(result.getRow)
      val name = Bytes.toString(result.getValue("info".getBytes, "name".getBytes))
      val gender = Bytes.toString(result.getValue("info".getBytes, "gender".getBytes))
      val age = Bytes.toString(result.getValue("info".getBytes, "age".getBytes))
      println("Row key:" + key + " Name:" + name + " Gender:" + gender + " Age:" + age)
    })
  }
}
- Data to be written is organized as row_key plus column values under a column family, i.e. row_key, cf:col_1, cf:col_2
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.spark._
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
object SparkWriteHBase {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("SparkWriteHBase").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val tablename = "student"
    sc.hadoopConfiguration.set(TableOutputFormat.OUTPUT_TABLE, tablename)
    val job = new Job(sc.hadoopConfiguration)
    job.setOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setOutputValueClass(classOf[Result])
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
    // build two rows of sample data
    val indataRDD = sc.makeRDD(Array("3,Rongcheng,M,26","4,Guanhua,M,27"))
    val rdd = indataRDD.map(_.split(",")).map{ arr => {
      // the first field becomes the row key
      val put = new Put(Bytes.toBytes(arr(0)))
      // info:name column
      put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes(arr(1)))
      // info:gender column
      put.add(Bytes.toBytes("info"), Bytes.toBytes("gender"), Bytes.toBytes(arr(2)))
      // info:age column (stored as a 4-byte integer)
      put.add(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes(arr(3).toInt))
      // build the (key, Put) pair that becomes one element of the output RDD
      (new ImmutableBytesWritable, put)
    }}
    rdd.saveAsNewAPIHadoopDataset(job.getConfiguration())
  }
}
4. A Complete Example: Secondary Sort
Perform a secondary sort on a file with two numeric columns: sort by the first column, and when the first column is equal, sort by the second column.
- Scala's Ordered trait plays the same role here as Java's Comparable interface.
// Sort key that compares by the first column, then by the second
class SecondarySortKey(val first: Int, val second: Int) extends Ordered[SecondarySortKey] with Serializable {
  def compare(other: SecondarySortKey): Int = {
    if (this.first - other.first != 0) {
      this.first - other.first
    } else {
      this.second - other.second
    }
  }
}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object SecondarySortApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SecondarySortApp").setMaster("local")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("file:///usr/local/spark/mycode/rdd/examples/file1.txt", 1)
    // key each line by a SecondarySortKey built from its two columns
    val pairWithSortKey = lines.map(line => (new SecondarySortKey(line.split(" ")(0).toInt, line.split(" ")(1).toInt), line))
    // false = descending order
    val sorted = pairWithSortKey.sortByKey(false)
    val sortedResult = sorted.map(sortedLine => sortedLine._2)
    sortedResult.collect().foreach(println)
  }
}
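For example, if file1.txt contained the hypothetical lines 5 3, 1 6, 4 9, 8 3, 4 7, 5 6 and 3 2, the program would print 8 3, 5 6, 5 3, 4 9, 4 7, 3 2 and 1 6: descending by the first column, and by the second column when the first is equal.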