2018-10-29 18:38:28 ERROR YarnClientSchedulerBackend:70 - Yarn application has already exited with state FINISHED!
2018-10-29 18:38:28 ERROR TransportClient:233 - Failed to send RPC 4830215201639506599 to /192.168.29.128:48563: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2018-10-29 18:38:28 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint:91 - Sending RequestExecutors(0,0,Map(),Set()) to AM was unsuccessful
java.io.IOException: Failed to send RPC 4830215201639506599 to /192.168.29.128:48563: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
[root@CentOS ~]# spark-shell --master yarn --deploy-mode client
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-10-29 18:45:44 WARN Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://CentOS:4040
Spark context available as 'sc' (master = yarn, app id = application_1540809754248_0001).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.0
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
[root@CentOS ~]# cd /usr/spark-2.3.0/
[root@CentOS spark-2.3.0]# ./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/spark-2.3.0/logs/spark-root-org.apache.spark.deploy.master.Master-1-CentOS.out
[root@CentOS spark-2.3.0]# ./sbin/start-slave.sh spark://CentOS:7077
starting org.apache.spark.deploy.worker.Worker, logging to /usr/spark-2.3.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-CentOS.out
Connect to the Spark cluster for computation
[root@CentOS spark-2.3.0]# ./bin/spark-shell --master spark://CentOS:7077 --total-executor-cores 5
2018-11-02 19:33:51 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://CentOS:4040
Spark context available as 'sc' (master = spark://CentOS:7077, app id = app-20181102193402-0000).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.0
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.
scala>sc.textFile("file:///root/worlds.log").flatMap(_.split(" ")).map((_,1)).groupByKey().map(x=>(x._1,x._2.sum)).collect().foreach(println)
Local simulation (local mode)
[root@CentOS spark-2.3.0]# ./bin/spark-shell --master local[5]
2018-11-02 19:51:43 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://CentOS:4040
Spark context available as 'sc' (master = local[5], app id = local-1541159513505).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.0
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Here the Spark AppMaster plays the role of the SchedulerBackend in Standalone mode, and the Executor corresponds to Standalone's ExecutorBackend; the Spark AppMaster contains the DAGScheduler and the YarnClusterScheduler. For the execution flow of Spark on YARN, see the Spark-on-YARN section of http://www.csdn.net/article/2013-12-04/2817706--YARN.
scala> var list=Array("hello world","ni hao")
list: Array[String] = Array(hello world, ni hao)
scala> var rdd1=sc.parallelize(list) // parallelize the collection into an RDD
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[6] at parallelize at <console>:27
scala> rdd1.partitions.length // the number of partitions comes from the local[5] used for local mode above
res5: Int = 5
Note: the number of slices/partitions can also be specified manually: var rdd1=sc.parallelize(list,3)
scala> sc.parallelize(List(1,2,4,5),3).partitions.length
res13: Int = 3
Create an RDD from external data
scala> sc.textFile("hdfs://CentOS:9000/demo/src") res16: org.apache.spark.rdd.RDD[String] = hdfs://CentOS:9000/demo/src MapPartitionsRDD[22] at textFile at :25
scala> sc.textFile("hdfs://CentOS:9000/demo/src").map(.split(" ").length).reduce(+_) res19: Int = 13
mapPartitions(func)
Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
mapPartitionsWithIndex(func)
Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
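For example, a small mapPartitionsWithIndex sketch (illustrative, not from the original session) that tags each element with its partition index:
scala> sc.parallelize(1 to 6, 3).mapPartitionsWithIndex((idx, it) => it.map(v => (idx, v))).collect()
// expected: Array((0,1), (0,2), (1,3), (1,4), (2,5), (2,6))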
reduceByKey(func, [numPartitions])
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
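A minimal reduceByKey sketch (illustrative, not from the original session), summing the amounts per key:
scala> sc.parallelize(Array(("张三",1000),("李四",100),("张三",500))).reduceByKey(_+_).collect()
// expected (order may vary): Array((张三,1500), (李四,100))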
aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])
When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
scala> var rdd=sc.parallelize(Array(("张三",1000),("李四",100),("赵六",300),("张三",500)))
scala> rdd.aggregateByKey(0)((x,y)=>x+y,(x,y)=>x+y).collect()
sortByKey([ascending], [numPartitions])
When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
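A quick sortByKey illustration (not from the original session):
scala> sc.parallelize(Array(("b",2),("a",1),("c",3))).sortByKey().collect()
// expected: Array((a,1), (b,2), (c,3))
scala> sc.parallelize(Array(("b",2),("a",1),("c",3))).sortByKey(false).collect()
// expected: Array((c,3), (b,2), (a,1))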
join(otherDataset, [numPartitions])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
cogroup(otherDataset, [numPartitions])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples.
scala> var rdd1=sc.parallelize(Array(("001","张三"),("002","李四"),("003","王五")))
scala> var rdd2=sc.parallelize(Array(("001","苹果"),("002","手机"),("001","橘子")))
scala> rdd1.cogroup(rdd2).collect()
res24: Array[(String, (Iterable[String], Iterable[String]))] = Array((002,(CompactBuffer(李四),CompactBuffer(手机))), (003,(CompactBuffer(王五),CompactBuffer())), (001,(CompactBuffer(张三),CompactBuffer(苹果, 橘子))))
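For comparison, an inner join on the same rdd1 and rdd2 keeps only keys present in both (illustrative, not from the original session):
scala> rdd1.join(rdd2).collect()
// expected (order may vary): Array((001,(张三,苹果)), (001,(张三,橘子)), (002,(李四,手机))) -- 003 is dropped because it has no match in rdd2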
Action operators
reduce(func)
Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
scala> var rdd=sc.parallelize(List("a","b","c"))
scala> rdd.reduce(_+","+_)
res27: String = a,b,c
collect()
Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
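A quick collect() illustration (not from the original session):
scala> sc.parallelize(1 to 100).filter(_ % 10 == 0).collect()
// expected: Array(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)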
count()
Return the number of elements in the dataset.
scala> var rdd=sc.parallelize(List("a","b","c"))
scala> rdd.count()
res28: Long = 3
first()|take(n)
Return the first element of the dataset (similar to take(1)).
scala> var rdd=sc.parallelize(List("a","b","c"))
scala> rdd.first()
res29: String = a
scala> rdd.take(1)
res30: Array[String] = Array(a)
scala> rdd.take(2)
res31: Array[String] = Array(a, b)
takeOrdered(n, [ordering])
Return the first n elements of the RDD using either their natural order or a custom comparator.
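A quick takeOrdered illustration (not from the original session):
scala> sc.parallelize(List(5,3,1,4,2)).takeOrdered(3)
// expected: Array(1, 2, 3)
scala> sc.parallelize(List(5,3,1,4,2)).takeOrdered(3)(Ordering[Int].reverse)
// expected: Array(5, 4, 3)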
saveAsTextFile(path)
Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
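For example, a word-count result written back to disk (the output path is illustrative, not from the original session):
scala> sc.textFile("file:///root/worlds.log").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("file:///root/worlds_result")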
countByKey()
Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
scala> sc.textFile("file:///root/worlds.log").flatMap(_.split(" ")).map(x=>(x,1)).countByKey()
res55: scala.collection.Map[String,Long] = Map(this -> 1, demo -> 1, is -> 1, good -> 2, up -> 1, a -> 1, come -> 1, babay -> 1, on -> 1, day -> 2, study -> 1)
foreach(func)
Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.
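A small sketch of the Accumulator use case mentioned above (illustrative; uses the Spark 2.x longAccumulator API):
scala> val acc = sc.longAccumulator("sum")
scala> sc.parallelize(1 to 10).foreach(x => acc.add(x))
scala> acc.value
// expected: 55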
Census example
Requirement: a directory contains log files named yob<year>.txt, and each record has the following format:
name,sex,count
....
For each year, compute the numbers of newborn boys and girls (their ratio) and produce a report.
Hadoop MapReduce
Read the input with TextInputFormat
//Mapper
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper
import org.apache.hadoop.mapreduce.lib.input.FileSplit

class UserSexCountMapper extends Mapper[LongWritable, Text, Text, Text] {
  override def map(key: LongWritable, value: Text, context: Mapper[LongWritable, Text, Text, Text]#Context): Unit = {
    // the year is taken from the file name, e.g. yob2017.txt -> 2017
    val path = context.getInputSplit.asInstanceOf[FileSplit].getPath
    val filename = path.getName
    val year = filename.substring(filename.lastIndexOf(".") - 4, filename.lastIndexOf("."))
    val tokens = value.toString.split(",")
    context.write(new Text(year), new Text(tokens(1) + ":" + tokens(2)))
  }
}
//Reducer
import java.lang
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Reducer
import scala.collection.JavaConversions._

class UserSexReducer extends Reducer[Text, Text, Text, Text] {
  override def reduce(key: Text, values: lang.Iterable[Text], context: Reducer[Text, Text, Text, Text]#Context): Unit = {
    var mtotal = 0
    var ftotal = 0
    // each value looks like "M:14772" or "F:18745"
    for (i <- values) {
      var value: Text = i
      var sex = value.toString.split(":")(0)
      if (sex.equals("M")) {
        mtotal += value.toString.split(":")(1).toInt
      } else {
        ftotal += value.toString.split(":")(1).toInt
      }
    }
    context.write(key, new Text("男:" + mtotal + ",女:" + ftotal))
  }
}
//Submit the job
......
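The submission driver is elided above; a rough sketch of what it could look like (the object name and the input/output paths are assumptions, not part of the original):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

object SubmitJobDriver {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val job = Job.getInstance(conf, "sex count per year")
    job.setJarByClass(classOf[UserSexCountMapper])
    job.setInputFormatClass(classOf[TextInputFormat])       // read with TextInputFormat, as stated above
    job.setMapperClass(classOf[UserSexCountMapper])
    job.setReducerClass(classOf[UserSexReducer])
    job.setMapOutputKeyClass(classOf[Text])
    job.setMapOutputValueClass(classOf[Text])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[Text])
    FileInputFormat.addInputPath(job, new Path("/demo/names"))        // assumed input directory
    FileOutputFormat.setOutputPath(job, new Path("/demo/names_out"))  // assumed output directory
    job.waitForCompletion(true)
  }
}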
Spark solution
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

/**
 * A case class SexAndCountVector holding the number of boys (m) and girls (f).
 * A case class does not need any method implementations.
 * @param m
 * @param f
 */
case class SexAndCountVector(var m: Int, var f: Int)

object TestNamesDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local[10]") // local mode --> for remote deployment: "spark://CentOS:7077"
    conf.setAppName("names counts")
    val sc = new SparkContext(conf)
    var cacheRDD = sc.wholeTextFiles("file:///D:/demo/names")
      .map(tuple => (getYear(tuple._1), tuple._2.split("\r\n")))
      .flatMap(tuple => for (i <- tuple._2) yield (tuple._1, {
        var s = i.split(",")
        //s(1)+":"+s(2) // earlier approach: turn "2016,F,14772" into "2016,F:14772"
        if (s(1).equals("M")) {
          new SexAndCountVector(s(2).toInt, 0) // vector for the male count
        } else {
          new SexAndCountVector(0, s(2).toInt) // vector for the female count
        }
      })).reduceByKey((s1, s2) => {
        s1.f = s1.f + s2.f // sum the female components
        s1.m = s1.m + s2.m // sum the male components
        s1                 // return s1
      }, 1).persist(StorageLevel.DISK_ONLY) // use 1 partition; cache()/persist() to cache -> cacheRDD.unpersist() to clear the cache
    cacheRDD.map(tuple => tuple._1 + "\t" + "男:" + tuple._2.m + ",女:" + tuple._2.f) // format as "2017 男:1,女:1"
      .saveAsTextFile("file:///D:/demo/names_result") // for remote deployment: "file:///root/names_result"
    sc.stop()
  }

  def getYear(name: String): String = {
    val i = name.lastIndexOf(".")
    name.substring(i - 4, i) // for yob2017.txt this is the substring at indices 3 to 7
  }
}
/**
* Submit a job to the job scheduler and get a JobWaiter object back. The JobWaiter object
* can be used to block until the job finishes executing or can be used to cancel the job.
*/
def submitJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
resultHandler: (Int, U) => Unit,
properties: Properties): JobWaiter[U] = {
// Check to make sure we are not launching a task on a partition that does not exist.
val maxPartitions = rdd.partitions.length
partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
throw new IllegalArgumentException(
"Attempting to access a non-existent partition: " + p + ". " +
"Total number of partitions: " + maxPartitions)
}
val jobId = nextJobId.getAndIncrement()
if (partitions.size == 0) {
return new JobWaiter[U](this, jobId, 0, resultHandler)
}
assert(partitions.size > 0)
val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
eventProcessLoop.post(JobSubmitted(
jobId, rdd, func2, partitions.toArray, callSite, waiter,
SerializationUtils.clone(properties)))
waiter
}
There is something important in this code: eventProcessLoop (DAGSchedulerEventProcessLoop is an inner class of DAGScheduler). Its post method is called to send a JobSubmitted message. From the source you can see that, after receiving the message, the DAGScheduler's handleJobSubmitted method is invoked. This method is the core entry point of the DAGScheduler's job scheduling.
private[scheduler] def handleJobSubmitted(jobId: Int,
finalRDD: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
callSite: CallSite,
listener: JobListener,
properties: Properties) {
var finalStage: ResultStage = null
try {
// New stage creation may throw an exception if, for example, jobs are run on a
// HadoopRDD whose underlying HDFS files have been deleted.
//Step 1: create the finalStage from the last RDD that triggered the job. This simply creates a stage
//and adds it to the DAGScheduler's cache (the stage holds an important variable, isShuffleMap).
finalStage = newResultStage(finalRDD, partitions.length, jobId, callSite)
} catch {
case e: Exception =>
logWarning("Creating new stage failed due to exception - job: " + jobId, e)
listener.jobFailed(e)
return
}
if (finalStage != null) {
//Step 2: create a job from the finalStage
val job = new ActiveJob(jobId, finalStage, func, partitions, callSite, listener, properties)
clearCacheLocs()
logInfo("Got job %s (%s) with %d output partitions".format(
job.jobId, callSite.shortForm, partitions.length))
logInfo("Final stage: " + finalStage + "(" + finalStage.name + ")")
logInfo("Parents of final stage: " + finalStage.parents)
logInfo("Missing parents: " + getMissingParentStages(finalStage))
val jobSubmissionTime = clock.getTimeMillis()
jobIdToActiveJob(jobId) = job
//Step 3: add the job to the in-memory cache
activeJobs += job
finalStage.resultOfJob = Some(job)
val stageIds = jobIdToStageIds(jobId).toArray
val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
listenerBus.post(
SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
//Step 4 (crucial): submit the finalStage with submitStage.
//This causes the first stage to be submitted; the other stages go into the waitingStages queue, and parent stages are submitted first via recursion.
submitStage(finalStage)
}
//Submit the queue of waiting stages
submitWaitingStages()
}
Now look at the submitStage method called in step 4. It is the entry point of the stage-splitting algorithm, which is made up of the submitStage and getMissingParentStages methods together.
private def submitStage(stage: Stage) {
val jobId = activeJobForStage(stage)
if (jobId.isDefined) {
logDebug("submitStage(" + stage + ")")
if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
//a crucial line: call getMissingParentStages to get this stage's parent stages
val missing = getMissingParentStages(stage).sortBy(_.id)
logDebug("missing: " + missing)
//this recurses until the earliest stage has no parent stages; the remaining stages are kept in waitingStages
if (missing.isEmpty) {
logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
//this is the method that actually submits a stage; it is covered later
submitMissingTasks(stage, jobId.get)
} else {
//if the list is not empty there are parent stages: recursively call submitStage to submit them first. This is the essence of the stage-splitting algorithm.
for (parent <- missing) {
submitStage(parent)
}
//and put the current stage into the queue of stages waiting to run
waitingStages += stage
}
}
} else {
abortStage(stage, "No active job for stage " + stage.id, None)
}
}
Now take a look at the getMissingParentStages method.
private def getMissingParentStages(stage: Stage): List[Stage] = {
val missing = new HashSet[Stage]
val visited = new HashSet[RDD[_]]
// We are manually maintaining a stack here to prevent StackOverflowError
// caused by recursively visiting (the stack is first-in, last-out)
val waitingForVisit = new Stack[RDD[_]]
//define the visit method, which is called in the loop below on the stage's RDDs
def visit(rdd: RDD[_]) {
if (!visited(rdd)) {
visited += rdd
val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
if (rddHasUncachedPartitions) {
//iterate over the RDD's dependencies
for (dep <- rdd.dependencies) {
dep match {
//for a wide (shuffle) dependency, create a new stage from that RDD, and its isShuffleMap variable is set to true
//by default only the last stage is not a ShuffleMap stage
case shufDep: ShuffleDependency[_, _, _] =>
val mapStage = getShuffleMapStage(shufDep, stage.firstJobId)
if (!mapStage.isAvailable) {
//record the stage as a missing parent stage
missing += mapStage
}
//for a narrow dependency, push the rdd onto the stack; the loop pops one element, but another is pushed here
case narrowDep: NarrowDependency[_] =>
waitingForVisit.push(narrowDep.rdd)
}
}
}
}
}
//first, push the stage's last RDD onto the stack
waitingForVisit.push(stage.rdd)
//loop
while (waitingForVisit.nonEmpty) {
//call the locally defined visit method on the RDD popped off the stack
visit(waitingForVisit.pop())
}
//return the new stages immediately
missing.toList
}
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PersistApp {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName(PersistApp.class.getSimpleName()).setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> linesRDD = sc.textFile("E:\\test\\scala\\access_2016-05-30.log");
        // cache the RDD so the second action reuses in-memory data instead of re-reading the file
        linesRDD.cache();
        long start = System.currentTimeMillis();
        List<String> list = linesRDD.take(10);
        long end = System.currentTimeMillis();
        System.out.println("first time cost " + (end - start) + "ms");
        System.out.println("-----------------------------------");
        start = System.currentTimeMillis();
        long count = linesRDD.count();
        end = System.currentTimeMillis();
        System.out.println("second time cost " + (end - start) + "ms");
        sc.close();
    }
}
Calling SparkContext.broadcast(obj) on an object obj of type T creates a broadcast variable of type Broadcast[T]; obj must be Serializable. The value is accessed through the broadcast variable's value method.
In addition, broadcasting can become a bottleneck because of the time spent serializing the variable or transferring the serialized variable, and the default Java serialization used by Spark/Scala is often inefficient. This step can be optimized by configuring a more efficient serializer (such as Kryo) for the relevant data types via the spark.serializer property.
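For instance, a possible configuration sketch (the settings shown are illustrative, not the only valid ones):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("BroadCastApp")
  .setMaster("local[2]")
  // switch from the default Java serialization to Kryo
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// frequently serialized classes can additionally be pre-registered with conf.registerKryoClasses(...)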
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable

object BroadCastApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("BroadCastApp")
    val sc = new SparkContext(conf)
    val list = List(1, 2, 4, 6, 0, 9)
    val set = mutable.HashSet[Int]()
    val num = 7
    val bset = sc.broadcast(set)
    val bNum = sc.broadcast(7)
    val listRDD = sc.parallelize(list)
    listRDD.map(x => {
      // mutating the broadcast value only affects the executor-side copy
      bset.value.+=(x)
      x * bNum.value
    }).foreach(x => print(x + " "))
    println("----------------------")
    // the driver-side set stays empty: broadcast variables are effectively read-only
    for (s <- set) {
      println(s)
    }
    sc.stop()
  }
}
//Convert an arbitrary collection to a Dataset
var persons = new Person("zhansgan", 18) :: new Person("wangwu", 26) :: Nil
var personDataset = persons.toDS()
personDataset.map(person => person.name).collect().foreach(println)
Convert a DataFrame to a Dataset
//Convert a DataFrame to a Dataset
val personDataset = spark.read.json("file:///D:/examples/src/main/resources/people.json").as[Person]
personDataset.show()
val value = List(1,2,3).toDS()
value.show()
Creating a DataFrame
Any RDD can be converted into a DataFrame, and a DataFrame corresponds to a table.
Create a DataFrame by loading a JSON file
//Load a JSON file
val df = spark.read.json("file:///path/to/people.json")
df.show()
Create a DataFrame directly by converting RDD elements to a case class
The case class defines the table schema. The parameter names of the case class are read via reflection and become the column names. Case classes can also be nested or contain complex types such as Seq or Array. The RDD can be implicitly converted to a DataFrame and registered as a table, which can then be used in subsequent SQL statements.
//Create a DataFrame directly by converting RDD elements to a case class
val dataFrame = spark.sparkContext.textFile("file:///D:/person.txt")
.map(line => line.split(","))
.map(tokens => new Person(tokens(0).toInt, tokens(1), tokens(2).toBoolean, tokens(3).toInt, tokens(4).toFloat))
.toDF()
Create a DataFrame directly from an RDD of tuples
//Create a DataFrame directly from an RDD of tuples
val dataFrame = spark.sparkContext.textFile("file:///D:/person.txt")
  .map(line => line.split(","))
  .map(tokens => (tokens(0), tokens(1), tokens(2), tokens(3), tokens(4)))
  .toDF("id", "name", "sex", "age", "salary")
Create a DataFrame programmatically
//Create a DataFrame programmatically
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val dataRDD = spark.sparkContext.textFile("file:///D:/person.txt")
  .map(line => line.split(","))
  .map(tokens => Row(tokens(0).toInt, tokens(1), tokens(2).toBoolean, tokens(3).toInt, tokens(4).toFloat))
var fields = StructField("id", IntegerType, true) :: StructField("name", StringType, true) :: StructField("sex", BooleanType, true) :: StructField("age", IntegerType, true) :: StructField("salary", FloatType, true) :: Nil
var schema = StructType(fields)
val dataFrame = spark.createDataFrame(dataRDD, schema)
val dataFrame = spark.sparkContext.textFile("file:///D:/person.txt")
.map(line => line.split(","))
.map(tokens=>(tokens(0).toInt,tokens(1),tokens(2).toBoolean,tokens(3).toInt,tokens(4).toFloat))
.toDF("id","name","sex","age","salary")
//Create a local temporary view; it is visible only to the current SparkSession
dataFrame.createOrReplaceTempView("t_user")
spark.sql("select * from t_user where id=1 or name='lisi' order by salary desc limit 2").show()
//Create a global temporary view; it can be accessed across sessions, but queries must prefix the name with global_temp
dataFrame.createGlobalTempView("t_user")
spark.sql("select * from global_temp.t_user where id=1 or name='lisi' order by salary desc limit 2").show()
//Get the value of a specific column; remember to import spark.implicits._
spark.sql("select * from global_temp.t_user where id=1 or name='lisi' order by salary desc limit 2")
.map(row => row.getAs[String]("name"))
.foreach(name=>println("name:"+name))
//Get several values at once; by default no implicit Encoder is provided for Map[String,Any], so supply one
implicit var e=Encoders.kryo[Map[String,Any]]
spark.sql("select * from t_user where id=1 or name='lisi' order by salary desc limit 2")
.map(row => row.getValuesMap(List("id","name","sex")))
.foreach(row => println(row))
User-defined aggregate functions
1,苹果,4.5,2,001
2,橘子,2.5,5,001
3,机械键盘,800,1,002
val dataFrame = spark.sparkContext.textFile("file:///D:/order.log")
.map(line => line.split(","))
.map(tokens=>(tokens(0).toInt,tokens(1),tokens(2).toFloat,tokens(3).toInt,tokens(4)))
.toDF("id","name","price","count","uid")
dataFrame.createTempView("t_order")
spark.sql("select uid,sum(price * count) cost from t_order group by uid").show()
Similar to that of RDDs, transformations allow the data from the input DStream to be modified. DStreams support many of the transformations available on normal Spark RDD’s. Some of the common ones are as follows.
map(func)
Return a new DStream by passing each element of the source DStream through a function func.
flatMap(func)
Similar to map, but each input item can be mapped to 0 or more output items.
filter(func)
Return a new DStream by selecting only the records of the source DStream on which func returns true.
repartition(numPartitions)
Changes the level of parallelism in this DStream by creating more or fewer partitions.
union(otherStream)
Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
count()
Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce(func)
Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel.
countByValue()
When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
reduceByKey(func, [numTasks])
When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
join(otherStream, [numTasks])
When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup(otherStream, [numTasks])
When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
transform(func)
Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
updateStateByKey(func)
Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.
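For instance, a small transform sketch (illustrative; ssc is assumed to be a StreamingContext and blacklistRDD an RDD[String] of words to drop):
val lines = ssc.socketTextStream("CentOS", 9999)
// apply an arbitrary RDD-to-RDD operation on every batch: drop blacklisted words, then count
val counts = lines.flatMap(_.split(" "))
  .transform(rdd => rdd.subtract(blacklistRDD))
  .map((_, 1))
  .reduceByKey(_ + _)
counts.print()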
Define the state - The state can be an arbitrary data type.
Define the state update function - Specify with a function how to update the state using the previous state and the new values from an input stream.
val updateFunction: (Seq[Int], Option[Int]) => Option[Int] = (newValues, runningCount) => {
  val newCount = runningCount.getOrElse(0) + newValues.sum
  Some(newCount)
}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QuickExample {
  def main(args: Array[String]): Unit = {
    var checkpoint = "file:///D://checkpoint"
    def createStreamContext(): StreamingContext = {
      val conf = new SparkConf()
        .setMaster("local[2]")
        .setAppName("NetworkWordCount")
      var sc = new SparkContext(conf)
      sc.setLogLevel("FATAL")
      val ssc = new StreamingContext(sc, Seconds(3))
      ssc.checkpoint(checkpoint)
      ssc.socketTextStream("CentOS", 9999)
        .flatMap(_.split(" "))
        .map((_, 1))
        .updateStateByKey(updateFunction)
        .checkpoint(Seconds(30)) // checkpoint interval; 5~10 times the batch interval is recommended
        .print()
      ssc
    }
    val ssc = StreamingContext.getOrCreate(checkpoint, createStreamContext _)
    // start the computation
    ssc.start()
    ssc.awaitTermination()
  }
  val updateFunction: (Seq[Int], Option[Int]) => Option[Int] = (newValues, runningCount) => {
    val newCount = runningCount.getOrElse(0) + newValues.sum
    Some(newCount)
  }
}
Checkpoint notes
Window Operations
Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data. The following figure illustrates this sliding window.
window(windowLength, slideInterval)
Return a new DStream which is computed based on windowed batches of the source DStream.
countByWindow(windowLength, slideInterval)
Return a sliding window count of elements in the stream.
reduceByWindow(func, windowLength, slideInterval)
Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel.
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and "inverse reducing" the old data that leaves the window. An example would be that of "adding" and "subtracting" counts of keys as the window slides. However, it is applicable only to "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
countByValueAndWindow(windowLength, slideInterval, [numTasks])
When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.
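A small windowed-count sketch (illustrative; pairs is assumed to be a DStream[(String, Int)] such as the word/1 pairs built earlier, with a batch interval that divides both durations):
// word counts over the last 30 seconds, recomputed every 10 seconds
val windowedCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()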
Output Operations on DStreams
Output operations allow DStream’s data to be pushed out to external systems like a database or a file systems. Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs). Currently, the following output operations are defined:
print()
Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging. (Python API: this is called pprint().)
saveAsTextFiles(prefix, [suffix])
Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsObjectFiles(prefix, [suffix])
Save this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". (Not available in the Python API.)
saveAsHadoopFiles(prefix, [suffix])
Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". (Not available in the Python API.)
foreachRDD(func)
The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
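A foreachRDD sketch that pushes each batch to an external system (wordCounts and the commented ConnectionPool calls are illustrative, not a real API):
wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    // open one connection per partition rather than per record (hypothetical ConnectionPool)
    // val conn = ConnectionPool.getConnection()
    iter.foreach { case (word, count) =>
      // conn.send(word, count)
      println(s"$word -> $count")
    }
    // conn.close()
  }
}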
Basic sources:Sources directly available in the StreamingContext API. Examples: file systems, and socket connections.
Advanced sources: Sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes. These require linking against extra dependencies as discussed in the linking section.
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets
import org.apache.spark.internal.Logging
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
class CustomReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {
def onStart() {
// Start the thread that receives data over a connection
new Thread("Socket Receiver") {
override def run() { receive() }
}.start()
}
def onStop() {
// There is nothing much to do as the thread calling receive()
// is designed to stop by itself if isStopped() returns true
}
/** Create a socket connection and receive data until receiver is stopped */
private def receive() {
var socket: Socket = null
var userInput: String = null
try {
// Connect to host:port
socket = new Socket(host, port)
// Until stopped or connection broken continue reading
val reader = new BufferedReader(
new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))
userInput = reader.readLine()
while(!isStopped && userInput != null) {
store(userInput)
userInput = reader.readLine()
}
reader.close()
socket.close()
// Restart in an attempt to connect again when server is active again
restart("Trying to connect again")
} catch {
case e: java.net.ConnectException =>
// restart if could not connect to server
restart("Error connecting to " + host + ":" + port, e)
case t: Throwable =>
// restart if there is any other error
restart("Error receiving data", t)
}
}
}
import org.apache.spark._
import org.apache.spark.streaming._
object QuickExample {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[2]")
.setAppName("NetworkWordCount")
conf.set("spark.io.compression.codec","lz4")
var sc=new SparkContext(conf)
sc.setLogLevel("FATAL")
var ssc= new StreamingContext(sc,Seconds(1))
ssc.receiverStream(new CustomReceiver("CentOS",9999)).flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).print()
//start the computation
ssc.start()
ssc.awaitTermination()
}
}