Difference between client and cluster modes:
After unpacking, go to the bin directory and run spark-shell to start Spark in local (single-machine) mode.
sc.textFile("file:///export/data/spark/words")
.flatMap(_.split(" "))
.map(_ -> 1)
.reduceByKey(_ + _)
.collect
Standalone mode is Spark's self-contained mode: it ships with a complete set of services and can be deployed on a cluster by itself, without depending on any other resource management system; it only supports a FIFO scheduler. To some extent it is the foundation of Spark on YARN and Spark on Mesos. In standalone mode there is no AM, NM, or RM: the client node talks to the Master directly, and the driver requests resources from the Master and takes care of resource allocation and scheduling.
4.1 Pseudo-distributed: one machine runs multiple processes; the Master and the Workers are on the same machine.
4.2 Fully distributed: the Master and the Workers are spread across different machines.
Configuration: edit slaves and spark-env.sh.
After starting the cluster, connect a shell to it:
spark-shell --master spark://node1:7077
4.3 Highly available (HA) distributed: the Master and the Workers are on different machines, and there is more than one Master.
First, edit the configuration file spark-env.sh.
Second, start the cluster:
start-all.sh
Third, start a standalone Master on node2:
start-master.sh
Note that a Spark HA cluster does not require designating the active Master explicitly, and there is no vote among Masters: whichever Master starts first becomes the active one, so the second Master started in step three above is the standby Master.
6.1 Code
System.setProperty("HADOOP_USER_NAME", "root")
val conf = new SparkConf()
.setAppName("WordCount")
// submit in yarn-client mode
.setMaster("yarn")
// set the ResourceManager address
///.set("yarn.resourcemanager.hostname","node1")
// .set("yarn.resourcemanager.scheduler.address","node1:8030")
// .set("yarn.resourcemanager.resource-tracker.address","node1:8031")
// set the number of executors
.set("spark.executor.instances","2")
// set the executor memory size
.set("spark.executor.memory", "1024M")
//
// .set("spark.yarn.app.id","qqw")
// set the YARN queue for the job
.set("spark.yarn.queue","prod")
// set the driver's IP address
.set("spark.driver.host","192.168.239.1")
.set("hadoop.yarn.timeline-service.enabled","false")
// set the application jar path; other dependency jars can be added here, comma-separated
.setJars(List("D:\\ideaprojects\\target\\spark-1.0-SNAPSHOT.jar"))
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val value = this.getClass.getClassLoader.loadClass("org.apache.spark.scheduler.cluster.YarnClusterManager")
val context = new SparkContext(conf)
val file = context.textFile("hdfs://node1:8020/spark/data/words")
file.collect().foreach(println)
6.2 Package with Maven and point the configuration at the resulting jar:
conf.setJars(List("D:\\ideaprojects\\target\\spark-1.0-SNAPSHOT.jar"))
6.3 Add three configuration files under the resources directory.
6.4 A local Spark runtime is needed, i.e. download and unpack the Spark distribution.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>${spark.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-yarn_2.11</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>
7.1 RDD caching: rdd.persist(), rdd.unpersist(), rdd.cache()
7.2 RDD persistence: rdd.checkpoint
7.3 Recovering an RDD from a checkpoint:
SparkContext has an API for recovering from a checkpoint path, checkpointFile, but that method is protected and cannot be accessed from outside the package.
Why cache: if an application has multiple actions, it triggers multiple jobs, and those jobs do not share the computation of their common parent lineage; after caching they do share it, which saves compute.
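A minimal sketch of this caching point, assuming an existing SparkContext sc (the file path follows the examples later in these notes):
val words = sc.textFile("data/input/words.txt")
.flatMap(_.split("\\s+"))
.map(_ -> 1)
.cache() // mark for caching; nothing is computed yet
val total = words.count() // job 1: computes the lineage and fills the cache
val counts = words.reduceByKey(_ + _).collect() // job 2: reuses the cached data instead of re-reading the file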
package org.apache.spark
import org.apache.spark.rdd.RDD
object RddUtil {
def readFromCk[T](sc:SparkContext,path:String) = {
val value: RDD[Any] = sc.checkpointFile(path)
value.asInstanceOf[T]
}
}
package spark2
import org.apache.spark.rdd.RDD
import org.apache.spark.{RddUtil, SparkConf, SparkContext}
object RecoverFromCheckpoint {
def main(args: Array[String]): Unit = {
val name = new SparkConf().setMaster("local[*]")
.setAppName(this.getClass.getName)
val context = new SparkContext(name)
context.setLogLevel("WARN")
val value: RDD[(String, Int)]= RddUtil.readFromCk(context, "data/ck/value6/0e5f6c72-87ec-4470-b0d7-8f3fd18e3836/rdd-10")
value.foreach(println)
}
}
persons.foreachPartition(it=>{
val url = "jdbc:mysql://192.168.88.163:3306/spark?characterEncoding=UTF-8"
val connection: Connection = DriverManager.getConnection(url, "root", "123456")
val statement: PreparedStatement = connection.prepareStatement("insert into person(id,name) value(?,?)")
it.foreach(p=>{
statement.setInt(1,p._1)
statement.setString(2,p._2)
statement.execute();
})
})
val personsFromMysql = new JdbcRDD[Person](
sc,
() => {
val url = "jdbc:mysql://192.168.88.163:3306/spark?characterEncoding=UTF-8"
DriverManager.getConnection(url, "root", "123456")
},
"select id,name from person where id >= ? and id <= ?",
19,
30,
3,//numPartitions=3
res => Person(res.getInt(1), res.getString(2))
)
Note the argument commented with numPartitions=3: this is the partitioning parameter, and the id range is split evenly by it. With id in [19, 30] and three partitions, the range is divided into three roughly equal sub-ranges, [19,22], [23,26], [27,30].
One action creates one DAG, and one action corresponds to one job; what does a job correspond to?
A task is a unit of task scheduling: it may be a pipeline made of several operators, or a pipeline made of a single operator.
A stage is a logical concept: operators that satisfy certain conditions are merged into one stage to improve efficiency. A task is a runtime concept derived from a stage. Tasks have a count, and the number of tasks is determined by the number of partitions of the current RDD, which can also be read as the parallelism. If the parallelism is 1, one stage is effectively one task; that is just a way of phrasing the relationship between stages and tasks.
All tasks produced from the same stage are called a taskset.
A pipeline is another view of a stage or a task: a chain of operators executed one after another, hence the name.
Spark divides stages at wide dependencies: within a stage all RDD dependencies are narrow, and the dependencies before and after a wide dependency are split into two different stages.
The criterion for wide vs. narrow is the relationship between parent and child RDD partitions: if and only if each partition of the parent RDD is depended on by exactly one partition of the child RDD, the dependency is narrow.
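A small sketch of checking this in code, assuming a local SparkContext sc: rdd.dependencies reports the dependency objects, so in this example a narrow dependency shows up as OneToOneDependency and a wide one as ShuffleDependency.
val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3))
val mapped = pairs.mapValues(_ + 1) // narrow: each child partition depends on exactly one parent partition
val reduced = mapped.reduceByKey(_ + _) // wide: a child partition depends on many parent partitions
println(mapped.dependencies) // List(org.apache.spark.OneToOneDependency@...)
println(reduced.dependencies) // List(org.apache.spark.ShuffleDependency@...)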
Actions trigger computation, e.g. take/collect (sortBy, although a transformation, also launches a sampling job, which is why it appears to trigger computation).
Transformations create RDDs but do not trigger computation, e.g. map/reduceByKey.
Definition: a resilient, distributed dataset.
Five properties (see the sketch after this list):
1. Every RDD is backed by a list of partitions;
2. Every RDD has a function that is applied to the data of each partition;
3. RDDs have dependencies on one another (the lineage);
4. Key-value RDDs have a Partitioner (HashPartitioner or RangePartitioner);
5. Computation is placed as close to the data as possible (data locality).
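A quick sketch that touches most of these properties, assuming an existing SparkContext sc:
val nums = sc.parallelize(1 to 10, 4) // property 1: a list of partitions
println(nums.partitions.length) // 4
println(nums.partitioner) // None: not a key-value RDD, so no partitioner
val byKey = nums.map(i => (i % 3, i)).reduceByKey(_ + _)
println(byKey.partitioner) // Some(HashPartitioner(...)): property 4
println(byKey.toDebugString) // property 3: the dependency lineage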
In Spark, "in-memory computing" has two layers of meaning:
During computation Spark does not write intermediate results to disk; Hadoop also has caching, but MapReduce does write intermediate results to disk.
Whether a shuffle happens depends on whether the partitioners of the two adjacent RDDs are the same: if they are, there is no shuffle. Only key-value RDDs have a partitioner. If one partition of an RDD draws its data from several parent partitions, there is a shuffle.
Staying within one stage: the child RDD inherits the parent's partitioning, in other words the child's partitions line up with the parent's partitions.
Operators that cause a shuffle: repartitioning, keyBy, join.
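A sketch of the partitioner rule above, assuming an existing SparkContext sc: when the child reuses the parent's partitioner, the dependency stays narrow and no extra shuffle is introduced.
import org.apache.spark.HashPartitioner
val kv = sc.parallelize(Seq((1, 1), (2, 1), (1, 1))).partitionBy(new HashPartitioner(4)) // one shuffle here
val summed = kv.reduceByKey(_ + _) // picks up the parent's HashPartitioner(4)
println(summed.partitioner == kv.partitioner) // true: same partitioner, so no second shuffle
println(summed.dependencies) // OneToOneDependency: stays in the same stage as kv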
Shuffle internals
Stage division
Spark job flow
In MapReduce, intermediate job results are written to disk, which causes heavy I/O.
Spark schedules the job as a DAG, chains the operators together, and makes full use of caching, which greatly reduces disk I/O.
Parallelism can be set in the following three ways; tune it according to the actual memory, CPU, data volume, and application logic.
1. Pass the number of partitions directly to an operator (highest priority), e.g.:
testRDD.groupByKey(24)
spark.sparkContext.parallelize(Seq(), 3)
2. Set "spark.default.parallelism" on the SparkConf (next in priority):
val conf = new SparkConf()
conf.set("spark.default.parallelism", "24")
3. Configure "spark.default.parallelism" in "$SPARK_HOME/conf/spark-defaults.conf" (lowest priority):
spark.default.parallelism 24
Spark prefers on-heap memory. Off-heap and on-heap memory cannot be fully mixed: a task that uses off-heap memory cannot use on-heap memory at the same time.
The main problem with on-heap memory is that GC pauses make Spark's memory estimates inaccurate.
Off-heap memory is allocated directly through Unsafe's allocateMemory and freeMemory, so memory estimates are more accurate.
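A minimal sketch of the switches that move execution/storage memory off-heap (the size is just an example value):
val conf = new org.apache.spark.SparkConf()
.set("spark.memory.offHeap.enabled", "true") // let Tungsten allocate off-heap memory
.set("spark.memory.offHeap.size", "2g") // total off-heap memory per executor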
Spark task scheduling happens in five steps:
1. The DAGScheduler builds the DAG from the user code and divides it into stages;
2. The DAGScheduler generates tasks and tasksets from the DAG;
3. The SchedulerBackend collects idle cluster resources and offers them as WorkerOffers;
4. The TaskScheduler works out the scheduling plan from the stage/task dependencies and priorities;
5. The SchedulerBackend submits the tasks to the executors according to the result of step 4.
The TaskScheduler has two scheduling modes: FIFO and fair scheduling.
When scheduling, the TaskScheduler also considers data locality; there are four locality levels: process-local, node-local, rack-local, and Any.
Anything that involves a shuffle easily becomes:
A task is thread-level and an executor is process-level; one executor can run multiple tasks.
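A minimal sketch of the related knobs, with example values: spark.scheduler.mode switches between FIFO and FAIR, and spark.locality.wait controls how long the TaskScheduler waits for a better locality level before downgrading.
val conf = new org.apache.spark.SparkConf()
.set("spark.scheduler.mode", "FAIR") // default is FIFO
.set("spark.locality.wait", "3s") // wait up to 3s for a better locality level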
1. Spark data locality
How is a dependency classified? Look at the operator, case by case: one class of operators records the wide/narrow dependency directly, the other decides by checking whether its partitioner matches the parent dependency's.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration}
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
/**
* Read data from an HBase table and wrap it in an RDD
*/
object SparkReadHBase {
def main(args: Array[String]): Unit = {
val sparkConf: SparkConf = new SparkConf()
.setAppName(this.getClass.getSimpleName.stripSuffix("$"))
.setMaster("local[*]")
val sc: SparkContext = new SparkContext(sparkConf)
sc.setLogLevel("WARN")
// HBase client configuration
val conf: Configuration = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "node1")
conf.set("hbase.zookeeper.property.clientPort", "2181")
conf.set("zookeeper.znode.parent", "/hbase")
// set the table to read
conf.set(TableInputFormat.INPUT_TABLE, "htb_wordcount")
/*
def newAPIHadoopRDD[K, V, F <: NewInputFormat[K, V]](
conf: Configuration = hadoopConfiguration,
fClass: Class[F],
kClass: Class[K],
vClass: Class[V]
): RDD[(K, V)]
*/
val resultRDD: RDD[(ImmutableBytesWritable, Result)] = sc.newAPIHadoopRDD(
conf,
classOf[TableInputFormat],
classOf[ImmutableBytesWritable],
classOf[Result]
)
println(s"Count = ${resultRDD.count()}")
resultRDD
.take(5)
.foreach { case (rowKey, result) =>
println(s"RowKey = ${Bytes.toString(rowKey.get())}")
// each HBase row is wrapped in a Result object; extract the value of each cell
result.rawCells().foreach { cell =>
val cf = Bytes.toString(CellUtil.cloneFamily(cell))
val column = Bytes.toString(CellUtil.cloneQualifier(cell))
val value = Bytes.toString(CellUtil.cloneValue(cell))
val version = cell.getTimestamp
println(s"\t $cf:$column = $value, version = $version")
}
}
// the application is finished; release resources
sc.stop()
}
}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
/**
* Save RDD data into an HBase table
*/
object SparkWriteHBase {
def main(args: Array[String]): Unit = {
val sparkConf: SparkConf = new SparkConf()
.setAppName(this.getClass.getSimpleName.stripSuffix("$"))
.setMaster("local[*]")
val sc: SparkContext = new SparkContext(sparkConf)
sc.setLogLevel("WARN")
// build the RDD
val list = List(("hadoop", 234), ("spark", 3454), ("hive", 343434), ("ml", 8765))
val outputRDD: RDD[(String, Int)] = sc.parallelize(list, numSlices = 2)
// write the data to the HBase table with saveAsNewAPIHadoopFile, which requires a (key, value) RDD
// assemble an RDD[(ImmutableBytesWritable, Put)]
/**
* HBase table design:
* table name: htb_wordcount
* rowkey: word
* column family: info
* column name: count
*/
val putsRDD: RDD[(ImmutableBytesWritable, Put)] = outputRDD.mapPartitions { iter =>
iter.map { case (word, count) =>
// create the Put object
val put = new Put(Bytes.toBytes(word))
// add the column
put.addColumn(
// in real projects, convert every field value to String first, then to a byte array with Bytes
Bytes.toBytes("info"), Bytes.toBytes("count"), Bytes.toBytes(count.toString)
)
// return the tuple
(new ImmutableBytesWritable(put.getRow), put)
}
}
// build the HBase client configuration
val conf: Configuration = HBaseConfiguration.create()
// ZooKeeper connection properties
conf.set("hbase.zookeeper.quorum", "node1")
conf.set("hbase.zookeeper.property.clientPort", "2181")
conf.set("zookeeper.znode.parent", "/hbase")
// set the HBase table the data will be saved into
conf.set(TableOutputFormat.OUTPUT_TABLE, "htb_wordcount")
/*
def saveAsNewAPIHadoopFile(
path: String, // output path
keyClass: Class[_], // key type
valueClass: Class[_], // value type
outputFormatClass: Class[_ <: NewOutputFormat[_, _]], // OutputFormat implementation
conf: Configuration = self.context.hadoopConfiguration // configuration
): Unit
*/
putsRDD.saveAsNewAPIHadoopFile(
"datas/spark/htb-output-" + System.nanoTime(), //
classOf[ImmutableBytesWritable], //
classOf[Put], //
classOf[TableOutputFormat[ImmutableBytesWritable]], //
conf
)
// the application is finished; release resources
sc.stop()
}
}
Prerequisites for converting an RDD to a Dataset/DataFrame:
1. import spark.implicits._
2. the RDD's element type must be a case class or a tuple, not a plain bean.
// assumes a case class WordCount(word: String, count: Int) and an rdd: RDD[WordCount]
import spark.implicits._
val rddToDf: DataFrame= rdd.toDF()
val rddToDs: Dataset[WordCount] = rdd.toDS()
val df2dS: Dataset[WordCount] = rddToDf.as[WordCount]
val rdd1: RDD[Row] = rddToDf.rdd
val rdd2: RDD[WordCount] = df2dS.rdd
val frame: DataFrame = df2dS.toDF()
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder().appName("movie")
.master("local[*]")
.getOrCreate()
val sc: SparkContext = spark.sparkContext
sc.setLogLevel("warn")
val sourceDf: DataFrame = spark.read
.option("sep", "\\t")
.schema("userId int,movieId int, rating int,timestamp long")
.csv("data/input/rating_100k.data")
// top10
//import org.apache.spark.sql.functions._
val frame: DataFrame = sourceDf
.groupBy("movieId")
.agg(
avg("rating") as ("avgRating"),
count("movieId") as ("cnt")
)
.filter("cnt > 200")
.select("movieId", "avgRating", "cnt")
.orderBy(col("avgRating").desc)
frame.write.csv("data/output/movie")
// sql
sourceDf.createOrReplaceTempView("movies")
val sql =
"""select movieId,avg(rating) as avg_r,count(1) as cnt
|from movies group by movieId
|having cnt > 200
|order by avg_r desc""".stripMargin
spark.sql(sql).take(10).foreach(row=>println("sql",row))
}
Batch processing handles the whole dataset in one go.
Stream processing handles a continuous, unbounded flow of data.
updateStateByKey requires a checkpoint directory to be set.
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName(this.getClass.getName)
val sc: SparkContext = new SparkContext(conf)
sc.setLogLevel("warn")
sc.setCheckpointDir("data/ck/streamingwithstatus")
val ssc = new StreamingContext(sc,Duration(10000))
val ds: ReceiverInputDStream[String] = ssc.socketTextStream("localhost", 8001)
val words: DStream[(String, Int)] = ds.flatMap(_.split("\\s+")).map(_ -> 1)
val wordsCnt: DStream[(String, Int)] = words.updateStateByKey((currentVals,hisVals)=>{
var cnt = 0
if (hisVals != null) {
hisVals.foreach(f=> cnt += f)
}
currentVals.foreach(f=> cnt += f)
Some(cnt)
})
wordsCnt.print()
ssc.start()
ssc.awaitTermination()
}
mapWithState
def main(args: Array[String]): Unit = {
// the original omits the setup; the same conf/ssc/words as in the updateStateByKey example above are assumed
val conf = new SparkConf().setMaster("local[*]").setAppName(this.getClass.getName)
val sc: SparkContext = new SparkContext(conf)
sc.setLogLevel("warn")
sc.setCheckpointDir("data/ck/streamingwithstatus") // mapWithState also needs a checkpoint directory
val ssc = new StreamingContext(sc, Duration(10000))
val ds: ReceiverInputDStream[String] = ssc.socketTextStream("localhost", 8001)
val words: DStream[(String, Int)] = ds.flatMap(_.split("\\s+")).map(_ -> 1)
val mappingFunction: (String, Option[Int], State[Int]) =>(String,Int)
= (key, value, state) => {
val cnt = value.getOrElse(0)
val old: Int = state.getOption().getOrElse(0)
state.update(cnt+old)
(key, cnt+old)
}
val spec = StateSpec.function(mappingFunction).numPartitions(10)
val wordsCnt: DStream[(String, Int)] = words
.reduceByKey(_+_)
.mapWithState(spec)
wordsCnt.print()
ssc.start()
ssc.awaitTermination()
}
def main(args: Array[String]): Unit = {
val ckPath = "data/ck/recover" // example checkpoint path; the original does not define ckPath
val ssc = StreamingContext.getOrCreate(ckPath, createSC)
ssc.start()
ssc.awaitTermination()
ssc.stop(true,true)
}
def createSC() = {
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName(this.getClass.getName)
val sc: SparkContext = new SparkContext(conf)
sc.setLogLevel("warn")
sc.setCheckpointDir(ckPath)
val ssc = new StreamingContext(sc,Duration(3000))
val ds: ReceiverInputDStream[String] = ssc.socketTextStream("localhost", 8001)
val words: DStream[(String, Int)] = ds.flatMap(_.split("\\s+")).map(_ -> 1)
val wordsCnt: DStream[(String, Int)] = words.updateStateByKey((currentVals,hisVals)=>{
var cnt = 0
if (hisVals != null) {
hisVals.foreach(f=> cnt += f)
}
currentVals.foreach(f=> cnt += f)
Some(cnt)
})
wordsCnt.print()
ssc
}
After applying a window, the window merges several batches into one RDD; use transform to get hold of that RDD and then run arbitrary RDD operations on it.
For example, sorting cannot be applied to the stream directly; it has to be applied to the RDD wrapped by the stream, which is exactly what transform is for.
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName(this.getClass.getName)
val sc: SparkContext = new SparkContext(conf)
sc.setLogLevel("warn")
val ssc = new StreamingContext(sc,Duration(2000))
val ds: ReceiverInputDStream[String] = ssc.socketTextStream("localhost", 8001)
val words: DStream[(String, Int)] = ds.flatMap(_.split("\\s+")).map(_ -> 1)
val keyValues: DStream[(String, Int)] = words
.window(Duration(8000),Duration(8000))
.reduceByKey(_+_)
// .reduceByKeyAndWindow((_:Int) + (_:Int),Duration(10000),Duration(8000))
keyValues.transform(rdd=>{
val sortedRDD: RDD[(String, Int)] = rdd.sortBy(_._2, false)
sortedRDD.take(3).foreach(x=>println("top3:"+x))
sortedRDD
}).print()
ssc.start()
ssc.awaitTermination()
}
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName(this.getClass.getName)
val sc: SparkContext = new SparkContext(conf)
sc.setLogLevel("warn")
sc.setCheckpointDir("data/ck/window")
val ssc = new StreamingContext(sc,Duration(2000))
val ds: ReceiverInputDStream[String] = ssc.socketTextStream("localhost", 8001)
val words: DStream[(String, Int)] = ds.flatMap(_.split("\\s+")).map(_ -> 1)
val keyValues: DStream[(String, Int)] = words
.window(Duration(8000),Duration(8000))
.updateStateByKey((currentVals,hisVals)=>{
var cnt = 0
if (hisVals != null) {
hisVals.foreach(f=> cnt += f)
}
currentVals.foreach(f=> cnt += f)
Some(cnt)
})
keyValues.print()
ssc.start()
ssc.awaitTermination()
}
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName(this.getClass.getName)
val sc: SparkContext = new SparkContext(conf)
sc.setLogLevel("warn")
sc.setCheckpointDir("data/ck/window")
val ssc = new StreamingContext(sc,Duration(2000))
val ds: ReceiverInputDStream[String] = ssc.socketTextStream("localhost", 8001)
val words: DStream[(String, Int)] = ds.flatMap(_.split("\\s+")).map(_ -> 1)
val keyValues: DStream[(String, Int)] = words
.window(Duration(8000),Duration(8000))
.updateStateByKey((currentVals,hisVals)=>{
var cnt = 0
if (hisVals != null) {
hisVals.foreach(f=> cnt += f)
}
currentVals.foreach(f=> cnt += f)
Some(cnt)
})
keyValues.foreachRDD(rdd=>{
// JDBC objects are not serializable, so create the connection on the executor side,
// inside foreachPartition, and execute/close it there as well
rdd.foreachPartition(it=>{
val conn = DriverManager.getConnection("jdbc:mysql://node1:3306/spark?characterEncoding=UTF-8","root","root")
val sql:String = "REPLACE INTO `person` (`id`, `name`) VALUES (?, ?);"
val ps: PreparedStatement = conn.prepareStatement(sql)
it.foreach(item=>{
ps.setInt(1, item._2)
ps.setString(2, item._1)
ps.addBatch()
})
ps.executeBatch()
ps.close()
conn.close()
})
// also save to an HDFS file
rdd.saveAsTextFile("hdfs://node01:8020/spark/tmp/dstream")
})
ssc.start()
ssc.awaitTermination()
val session: SparkSession = SparkSession.builder().appName(this.getClass.getSimpleName).master("local[*]").getOrCreate()
val sparkContext: SparkContext = session.sparkContext
sparkContext.setLogLevel("warn")
val streamingContext = new StreamingContext(sparkContext, Duration(10000l))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "node1:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "use_a_separate_group_id_for_each_stream",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("spark-integrate")
val stream = KafkaUtils.createDirectStream[String, String](
streamingContext,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
stream.foreachRDD(rdd => {
rdd.foreachPartition(it => {
it.foreach(re=>{
println(s"${re.key}-${re.value}")
})
})
})
streamingContext.start()
streamingContext.awaitTermination()
1. Only processing time can be used.
2. Only the low-level API is available: many operations (saving, sorting, etc.) have to be done by converting to RDDs.
3. Only consistency inside Spark can be guaranteed, not end-to-end consistency.
4. The batch and streaming APIs are not unified.
kafka/socket/file/rate
rate
def main(args: Array[String]): Unit = {
val spark: SparkSession = getSparkSession()
val dataFrame: DataFrame = spark.readStream
.format("rate")
.option("rowsPerSecond","10")
.option("rampUpTime","0s")
.option("numPartitions",2)
.load
dataFrame.writeStream
.outputMode("append")
.format("console")
.start()
.awaitTermination()
}
file
val spark: SparkSession = getSparkSession()
val dataFrame: DataFrame = spark.readStream
.format("csv")
.option("sep",";")
.schema("name String,age int,hobby string")
.load("data/input/persons")
dataFrame.writeStream
.outputMode("append")
.format("console")
.start()
.awaitTermination()
socket: for testing only; it does not support fault tolerance
val spark = getSparkSession()
val frame: DataFrame = spark.readStream
.format("socket")
.option("host", "localhost")
.option("port", 8001)
.load()
frame.writeStream
.outputMode("append")
.format("console")
.trigger(Trigger.ProcessingTime("5 seconds"))
.start()
.awaitTermination()
The output modes differ in whether they support aggregation and sorting:
Complete mode outputs all of the data and supports aggregation and sorting.
Append mode outputs only newly added rows; it suits data that does not change once produced, and it does not support aggregation.
How does this differ from cache?
def main(args: Array[String]): Unit = {
val spark = getSparkSession()
val frame: DataFrame = spark.readStream
.format("socket")
.option("host", "localhost")
.option("port", 8001)
.load()
import spark.implicits._
import org.apache.spark.sql.functions._
val words: DataFrame = frame.toDF("words").flatMap(
row => row.getString(0).split(" ")
).toDF("word")
val wordCountDF: DataFrame = words.groupBy($"word")
.agg(
count($"word") as "wordCnt"
)
.select("word", "wordCnt")
val query: StreamingQuery = wordCountDF.writeStream
.outputMode("complete")
.format("memory")
.queryName("wordCnt")
.start()
while(true) {
Thread.sleep(1000)
spark.sql("select * from wordCnt").show()
}
query.awaitTermination()
}
Structured Streaming provides several kinds of output sinks:
df
.writeStream
.format("parquet")
.option("checkpointLocation", "path/to/checkpoint/dir")
.option("path", "path/to/destination/dir")
.start()
df
.writeStream
.format("console")
.start()
df
.writeStream
.queryName("aggregates")
.outputMode("complete")
.format("memory")
.start()
spark.sql("select * from aggregates").show()
writeStream
.foreach(...)
.start()
def main(args: Array[String]): Unit = {
val spark = getSparkSession()
val frame: DataFrame = spark.readStream
.format("socket")
.option("host", "localhost")
.option("port", 8001)
.load()
import spark.implicits._
import org.apache.spark.sql.functions._
val words: DataFrame = frame.toDF("words").flatMap(
row => row.getString(0).split(" ")
).toDF("word")
val wordCountDF: DataFrame = words.groupBy($"word")
.agg(
count($"word") as "wordCnt"
)
.select("word", "wordCnt")
val query: StreamingQuery = wordCountDF.writeStream
// ******* set the checkpoint location
.option("checkpointLocation","data/output/cks")
.outputMode("complete")
.format("memory")
.queryName("wordCnt")
.start()
while(true) {
Thread.sleep(1000)
spark.sql("select * from wordCnt").show()
}
query.awaitTermination()
}
dropDuplicates causes a shuffle.
import org.apache.spark.SparkContext
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.{DataFrame, SparkSession}
/**
* Desc: demonstrates Spark's global deduplication operator (dropDuplicates)
* {"eventTime": "2016-01-10 10:01:50","eventType": "browse","userID":"1"}
* {"eventTime": "2016-01-10 10:01:50","eventType": "click","userID":"1"}
* {"eventTime": "2016-01-10 10:01:55","eventType": "browse","userID":"1"}
* {"eventTime": "2016-01-10 10:01:55","eventType": "click","userID":"1"}
* {"eventTime": "2016-01-10 10:01:50","eventType": "browse","userID":"1"}
* {"eventTime": "2016-01-10 10:01:50","eventType": "aaa","userID":"1"}
* {"eventTime": "2016-01-10 10:02:00","eventType": "click","userID":"1"}
* {"eventTime": "2016-01-10 10:01:50","eventType": "browse","userID":"1"}
* {"eventTime": "2016-01-10 10:01:50","eventType": "click","userID":"1"}
* {"eventTime": "2016-01-10 10:01:51","eventType": "click","userID":"1"}
* {"eventTime": "2016-01-10 10:01:50","eventType": "browse","userID":"1"}
* {"eventTime": "2016-01-10 10:01:50","eventType": "click","userID":"3"}
* {"eventTime": "2016-01-10 10:01:51","eventType": "aaa","userID":"2"}
*/
object Demo10_Deduplication {
def main(args: Array[String]): Unit = {
// 0. Init Env
val spark: SparkSession = SparkSession.builder()
.appName(this.getClass.getSimpleName)
.master("local[*]")
// this setting is often used during testing:
// it controls the default number of partitions in the shuffle stage (200 if you do not set it)
.config("spark.sql.shuffle.partitions", "2")
.getOrCreate()
val sc: SparkContext = spark.sparkContext
sc.setLogLevel("WARN")
// the two implicit imports we usually need
import spark.implicits._
import org.apache.spark.sql.functions._
val socketDF: DataFrame = spark.readStream
.format("socket") // socket 不支持容错
.option("host", "node3")
.option("port", 9999)
.load()
// DF to DS: use as[...] to replace the type parameter
val resultDF: DataFrame = socketDF.as[String]
// the input is JSON, so use the SQL get_json_object function to extract fields from it
.select(
// the default column name is value
// extract the JSON fields; $ refers to the JSON root
get_json_object($"value", "$.eventTime").as("event_time"),
get_json_object($"value", "$.eventType").as("event_type"),
get_json_object($"value", "$.userID").as("user_id")
// this select + get_json_object turns the data into a DataFrame with three columns
)
.dropDuplicates("user_id", "event_type")
.groupBy("user_id")
.count()
resultDF.writeStream
.format("console")
.outputMode(OutputMode.Complete())
.start()
.awaitTermination()
sc.stop()
}
}
repartition always shuffles.
coalesce can avoid a shuffle when the target number of partitions is smaller than the current one. For example, if the RDD currently has 1000 partitions and you coalesce to 10, Spark does not merge the 1000 partitions into 10 by shuffling; it groups them into 10 groups and the RDD's compute iterates over the partitions within each group. This effectively lowers the parallelism of the parent RDD.
If the parent RDD has 1000 partitions and you want to grow it to 2000 with coalesce, setting shuffle to false has no effect.
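A small sketch of that behaviour, assuming an existing SparkContext sc:
val big = sc.parallelize(1 to 10000, 1000)
println(big.coalesce(10).getNumPartitions) // 10, no shuffle: the 1000 partitions are grouped into 10
println(big.coalesce(2000).getNumPartitions) // still 1000: growing without shuffle has no effect
println(big.coalesce(2000, shuffle = true).getNumPartitions) // 2000, with a shuffle
println(big.repartition(2000).getNumPartitions) // 2000: repartition is coalesce(..., shuffle = true)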
cache and checkpoint are both lazy, like transformations: an action has to trigger them, so call count (or another action) after cache/checkpoint to actually materialize the cache and the checkpoint.
cache does not cut the DAG; checkpoint does.
package spark
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}
object RDDCheckpiont {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName(this.getClass.getName)
val sc: SparkContext = new SparkContext(conf)
sc.setCheckpointDir("data/output/ck")
sc.setLogLevel("warn")
val rdd: RDD[(String, Int)] = sc
.textFile("data/input/words.txt")
.flatMap(_.split("\\s+"))
.map(
word=>{
println(word+"77777")
(word,1)
}
)
val rdd1 = rdd.cache()
rdd1.checkpoint()
rdd1.count()
rdd1.reduceByKey(_+_).collect
rdd1.aggregateByKey(0)(_ + _, _ +_).collect()
rdd1.aggregateByKey(0)(_ + _, _ +_).collect()
Thread.sleep(100000000)
}
}
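To see the "checkpoint cuts the DAG" claim from above, compare the lineage before and after the first action; a sketch that could be pasted into the main above, using its rdd1:
println(rdd1.toDebugString) // before the action: the lineage goes back to textFile/flatMap/map
rdd1.count() // first action: runs the job and writes the checkpoint
println(rdd1.toDebugString) // afterwards: the lineage starts from a ReliableCheckpointRDD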
Dataset finalDS = newDS.join(oldDataSet,
oldDataSet.col(Constant.COMPANY_ID).equalTo(newDS.col(Constant.COMPANY_ID))
.and(oldDataSet.col(Constant.EVENT_TYPE).equalTo(newDS.col(Constant.EVENT_TYPE)))
.and(oldDataSet.col(Constant.EVENT_GROUP_ID).equalTo(newDS.col(Constant.EVENT_GROUP_ID)))
// .and(oldDataSet.col("index").equalTo(newDS.col("index")))
, Constant.FULL_JOIN).select(columns(oldDataSet, newDS));
private static Seq columns(Dataset oldDataSet, Dataset newDS) {
StructField[] fields = newDS.schema().fields();
List columns = new ArrayList<>();
for (StructField field : fields) {
String name = field.name();
if (Constant.EVENT_GROUP_ID.equals(name) || Constant.EVENT_TYPE.equals(name) || Constant.COMPANY_ID.equals(name)) {
Column column = functions.nanvl(newDS.col(name), oldDataSet.col(name)).as(name);
columns.add(column);
} else {
columns.add(newDS.col(name));
}
}
return JavaConverters.asScalaIteratorConverter(columns.listIterator()).asScala().toSeq();
}
val l = System.currentTimeMillis()
val spark = SparkUtils.sparkSession(SparkUtils.sparkConf("name"))
val sql = s"(select ${FieldToString.toStringSql} from hits_v1) tmp"
val ckDF = spark.read.format("jdbc")
.option("url", "jdbc:clickhouse://node1:8123/datasets")
.option("dbtable", sql)
.option("numPartitions", "5")
.option("partitionColumn", "WatchID")
.option("lowerBound","4611686725751467379")
.option("upperBound","9223371678237104442")
.load()
// .repartition(10)
println("ckDF.rdd.partitions.size" + ckDF.rdd.partitions.size)
println("ckDF.rdd.partitions.size" + ckDF.rdd.partitioner)
ckDF.write.format("jdbc")
.option("url", "jdbc:clickhouse://node2:8123/default")
.option("dbtable", "hits_v1")
.option("numPartitions", "2")
.option("batchSize",50000)
.mode(SaveMode.Append)
.save()
val cgRDD3: RDD[(String, (Iterable[String], Iterable[String], Iterable[String]))] = value2.cogroup(value2, value2)
cgRDD3.reduceByKey((it1,it2)=>{
(it1._1.++(it2._1) ,it1._2.++(it2._2),it1._3)
})
1. On the Spark UI, the executor list has links to the stdout and stderr logs in the right-hand columns.
2. On disk: <spark dir>/work/<application id>/<executor id>/
Using Dataset.map directly does not work here: if the map changes the schema, the schema is lost and the new result has a single column named value, so this approach is a dead end.
Instead, convert the Dataset to an RDD first, then convert the RDD back to a Dataset.
Dataset<Row> text = spark.read().text("data/teachers.txt");
text.show();
ClassTag tag = scala.reflect.ClassTag$.MODULE$.apply(Teacher.class);
// Dataset to RDD
RDD rdd = text.rdd().map(new MyMapFunction(), tag);
// RDD to Dataset
Dataset<Row> dataFrame = spark.createDataFrame(rdd, Teacher.class);
dataFrame.printSchema();
public class Teacher implements KryoRegistrator {
private String id;
private String name;
private String age;
public Teacher(String id, String name, String age) {
this.id = id;
this.name = name;
this.age = age;
}
@Override
public void registerClasses(Kryo kryo) {
kryo.register(Teacher.class, new FieldSerializer(kryo, Teacher.class)); // register the custom class with the Kryo serialization library
}
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public String getAge() {
return age;
}
public void setAge(String age) {
this.age = age;
}
}
static class MyMapFunction implements scala.Function1<Row, Teacher>,Serializable {
@Override
public Teacher apply(Row row) {
String info = row.getString(0);
String[] split = info.split(",");
return new Teacher(split[0],split[1],split[2]);
}
@Override
public <A> Function1<A, Teacher> compose(Function1<A, Row> g) {
return Function1.super.compose(g);
}
@Override
public <A> Function1<Row, A> andThen(Function1<Teacher, A> g) {
return Function1.super.andThen(g);
}
}
JdbcDialects.registerDialect(new HiveDialet());
// 1. Connect to Hive
List confEntities = batchData.get(event);
BatchReadConfEntity confEntity = confEntities.get(0);
String table = EventTableRelationUtil.getTable(event);
String realTable = String.format("%s_%s_rt", odsOrDwd, table);
Dataset jdbcDF = spark.read()
.format("jdbc")
.option("url", "jdbc:hive2://10.49.2.135:7001/hudi_test?hive.resultset.use.unique.column.names=false")
.option("dbtable", realTable)
.option("user", "hadoop")
.option("header", "true")
.option("password", "HrKucX")
.option("driver", "org.apache.hive.jdbc.HiveDriver")
.load();
static class HiveDialet extends JdbcDialect {
@Override
public String quoteIdentifier(String colName) {
if(colName.contains(".")){
String colName1 = colName.substring(colName.indexOf(".")+1);
return "`"+colName1+"`";
}
return "`"+colName+"`";
}
@Override
public boolean canHandle(String url) {
return url.startsWith("jdbc:hive2");
}
}
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>3.0.0</version>
</dependency>
Define and register the UDFs:
private static void registerUdf(SparkSession spark, Map> sidInfoMap) {
UserDefinedFunction udfSid = functions.udf((UDF3) (companyId, countryCode, sellerId) -> {
List confEntities = sidInfoMap.get(ClickHouseUtil.composeKey4Sid(companyId, countryCode, sellerId));
if (confEntities == null || confEntities.isEmpty()) {
return "0";
}
return confEntities.get(0).getSid();
}, DataTypes.StringType);
UserDefinedFunction udfTimezone = functions.udf((UDF3) (companyId, countryCode, sellerId) -> {
List confEntities = sidInfoMap.get(ClickHouseUtil.composeKey4Sid(companyId, countryCode, sellerId));
if (confEntities == null || confEntities.isEmpty()) {
return null;
}
return confEntities.get(0).getTimezoneName();
}, DataTypes.StringType);
spark.udf().register("udfSid",udfSid);
spark.udf().register("udfTimezone",udfTimezone);
}
Using the UDFs:
Dataset finalDs = joinedDS.selectExpr("*", "udfSid(company_id,country_code,seller_id) as sid", "udfTimezone(company_id,country_code,seller_id) as timezone_name");
private static Seq columns(Dataset oldDataSet, Dataset newDS) {
StructField[] fields = newDS.schema().fields();
List columns = new ArrayList<>();
for (StructField field : fields) {
String name = field.name();
if (Constant.EVENT_GROUP_ID.equals(name) || Constant.EVENT_TYPE.equals(name) || Constant.COMPANY_ID.equals(name)) {
Column column = functions.coalesce(newDS.col(name), oldDataSet.col(name)).as(name);
columns.add(column);
} else if (Constant.POSTED_DATE.equals(name) || Constant.POSTED_DATE_LOCALE.equals(name)){
columns.add(functions.to_timestamp(functions.regexp_replace(newDS.col(name),"T"," "),"yyyy-MM-dd HH:mm:ss").as(name));
}
else if (Constant.QUANTITY.equals(name)){
columns.add(functions.coalesce(newDS.col(name),functions.lit(0)).cast("int").as(name));
} else {
columns.add(newDS.col(name));
}
}
return JavaConverters.asScalaIteratorConverter(columns.listIterator()).asScala().toSeq();
}
val plan = df.queryExecution.logical
val estimated: BigInt = spark
.sessionState
.executePlan(plan)
.optimizedPlan
.stats
.sizeInBytes
DATE_FORMAT(CAST(UNIX_TIMESTAMP('09/2021', 'MM/yyyy') AS TIMESTAMP), 'yyyy-MM') as pricing_month
// requires: import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
//           import org.apache.spark.sql.types.{StringType, StructType}
// print the schema
println(row.schema)
// take the values out of the Row and append the new field values
val buffer = Row.unapplySeq(row).get.map(_.asInstanceOf[String]).toBuffer
buffer.append("男") // add a gender
buffer.append("北京") // add an address
// take the original Row's schema and add the new field names and types to it
val schema: StructType = row.schema
.add("gender", StringType)
.add("address", StringType)
// create a new Row with the Row subclass GenericRowWithSchema
val newRow: Row = new GenericRowWithSchema(buffer.toArray, schema)
// return the new Row in place of the original one
newRow
spark-shell --version
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame
val hiveHost: String = _
// create the SparkSession instance
val spark = SparkSession.builder()
.config("hive.metastore.uris", s"thrift://$hiveHost:9083")
.enableHiveSupport()
.getOrCreate()
// read the Hive table into a DataFrame
val df: DataFrame = spark.sql("select * from salaries")
df.show
The spark-sql CLI integration comes with one extra restriction: in terms of deployment, the spark-sql CLI and the Hive Metastore must be installed on the same compute node. In other words, the spark-sql CLI can only access the Hive Metastore locally; there is no way to reach it remotely.
.select(split(col("value"),",").as("splitCol"))
.select(
col("splitcol").getItem(0).cast(IntegerType).as("id"),
col("splitcol").getItem(1).as("name"))
.createTempView("tmp")
Cluster node roles in standalone mode: Master, Worker.
In YARN mode: Driver, AM, RM, Executor.
// ===================================================== set the number of partitions to 4
val sogouRDD: RDD[String] = context.textFile("data/input/SogouQ.sample",4)
case class SoGouRecord(ts:String, userId:String,key:String, resultRank:String,clickRank:String,url:String)
val soGouRecordRDD: RDD[SoGouRecord] = sogouRDD.mapPartitions(it => {
it.filter(item => StringUtils.isNotBlank(item) && item.split("\\s+").length == 6)
.map(item => {
val columns: Array[String] = item.split("\\s+")
columns(2) = columns(2).replaceAll("\\[|\\]","")
SoGouRecord(columns(0), columns(1), columns(2), columns(3), columns(4), columns(5))
})
})
val keyCount: RDD[(String, Int)] =
soGouRecordRDD.map(record => record.userId + record.key -> 1)
.reduceByKey(_ + _)
.sortBy(_._2,false)
keyCount.take(3).foreach(println)
val value: RDD[(String, Int)] = soGouRecordRDD
.map(record => record.userId -> 1)
.reduceByKey(_ + _)
println(value.partitioner)
val userCount: RDD[(String, Int)] = value
.sortBy(_._2,false)
//======================== save the result to HDFS
userCount.saveAsTextFile("hdfs://node1:8020/spark/output2")
1. RDD fault tolerance: cache first, then checkpoint; the checkpoint is taken at the RDD level.
After a restart, data and computation are recovered from the checkpointed RDD.
2. Spark SQL DataFrame fault tolerance: same as for RDDs.
3. Spark Streaming fault tolerance: stateful operators plus a checkpoint directory set on the SparkContext.
4. Structured Streaming fault tolerance: set the checkpoint path on the output DataStreamWriter:
.option("checkpointLocation","data/output/cks")
1. RDD with MySQL
Read:
val personsFromMysql = new JdbcRDD[Tuple2[Int,String]](
sc,
() => {
val url = "jdbc:mysql://192.168.88.163:3306/spark?characterEncoding=UTF-8"
DriverManager.getConnection(url, "root", "123456")
},
"select id,name from person where id >= ? and id <= ?",
19,
30,
4,//numPartitions=4
res => (res.getInt(1), res.getString(2))
)
Write:
persons.foreachPartition(it=>{
val url = "jdbc:mysql://192.168.88.163:3306/spark?characterEncoding=UTF-8"
val connection: Connection = DriverManager.getConnection(url, "root", "123456")
val statement: PreparedStatement = connection.prepareStatement("insert into person(id,name) value(?,?)")
it.foreach(p=>{
statement.setInt(1,p._1)
statement.setString(2,p._2)
statement.addBatch();
})
statement.executeBatch()
statement.close()
connection.close()
})
2. Spark SQL with MySQL
val pro = new java.util.Properties() // JDBC connection properties (user/password as in the examples above)
pro.put("user", "root")
pro.put("password", "123456")
val frame: DataFrame = spark.read
.jdbc(
"jdbc:mysql://192.168.88.163:3306/spark?characterEncoding=UTF-8",
"person",
"id",
19,
30,
3, //numPartitions=3
pro
)
frame.write.jdbc
3. Spark Streaming with MySQL
MySQL cannot be the input; for output, drop down to the RDD:
keyValues.foreachRDD(rdd=>{
// as above, create the JDBC connection inside foreachPartition on the executor side
rdd.foreachPartition(it=>{
val conn = DriverManager.getConnection("jdbc:mysql://node1:3306/spark?characterEncoding=UTF-8","root","root")
val sql:String = "REPLACE INTO `person` (`id`, `name`) VALUES (?, ?);"
val ps: PreparedStatement = conn.prepareStatement(sql)
it.foreach(item=>{
ps.setInt(1, item._2)
ps.setString(2, item._1)
ps.addBatch()
})
ps.executeBatch()
ps.close()
conn.close()
})
// also save to an HDFS file
rdd.saveAsTextFile("hdfs://node1:8020/spark/tmp/dstream")
})
4. Structured Streaming with MySQL
MySQL cannot be the data source.
For output, use foreachBatch, which hands you a Dataset per micro-batch that can be written to MySQL:
cntDF.writeStream
.outputMode("complete")
.foreachBatch((ds,idx)=>{
ds.write
.mode("overwrite")
.format("jdbc")
.option("driver","com.mysql.cj.jdbc.Driver")
.option("url","jdbc:mysql://node3:3306/spark?serverTimezone=UTC&characterEncoding=utf8&useUnicode=true")
.option("user","root")
.option("password","123456")
.option("dbtable","person")
.save()
})
.start()
.awaitTermination()
This is not so much a technical limitation as a historical one: Spark's engine is built on the RDD, and it started out as a batch engine. As real-time requirements grew, stream processing was added on top, but everything in Spark is a layer of wrapping and optimization over RDDs, and the RDD is fundamentally a batch abstraction, so Spark can only do micro-batch, near-real-time processing.
SPARK_HOME=/export/server/spark
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 512m \
--executor-memory 512m \
--num-executors 1 \
--total-executor-cores 2 \
--class cn.hello.WordCount \
hdfs://node1:8020/spark/apps/wc.jar \
hdfs://node1:8020/wordcount/input/words.txt hdfs://node1:8020/wordcount/output
spark2-submit \ # line 1
--class com.google.datalake.TestMain \ # line 2
--master yarn \ # line 3
--deploy-mode client \ # line 4
--driver-memory 3g \ # line 5
--executor-memory 2g \ # line 6
--total-executor-cores 12 \ # line 7
--jars /home/jars/test-dep-1.0.0.jar,/home/jars/test-dep2-1.0.0.jar,/home/jars/test-dep3-1.0.0.jar \ # line 8
/home/release/jars/test-sql.jar \ # line 9
para1 \ # line 10
para2 \ # line 11
"test sql" \ # line 12
parax # line 13
Example analysis:
PS: line 9 names the jar that contains the entry class from line 2. It is the first argument that is not in key-value form: everything before it is key-value, and everything after it is a plain value; those values are collected into an array and passed to the main function.
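A sketch of how those trailing values reach the entry class from line 2; the body is hypothetical, only the class name comes from the command above:
package com.google.datalake
object TestMain {
def main(args: Array[String]): Unit = {
// with the submit command above: args = Array("para1", "para2", "test sql", "parax")
args.zipWithIndex.foreach { case (a, i) => println(s"args($i) = $a") }
}
}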
./bin/spark-submit
--class com.spark.Test.WordCount
--master spark://node1:7077
--executor-memory 4G
--total-executor-cores 6
/home/hadoop/data/jar-test/sql-1.0-yarnCluster.jar
source /test1/test2/test.sh
source $localroot/scripts/$config
appPid=0
start_process(){
spark-submit --master yarn-cluster \
--name apple.banana \
--principal $principal \
--keytab $keytab \
--num-executors 4 \
--executor-cores 2 \
--executor-memory 8G \
--driver-memory 8G \
--conf spark.locality.wait=10 \
--conf spark.kryoserializer.buffer.max=500MB \
--conf spark.serializer="org.apache.spark.serializer.KryoSerializer" \
--conf spark.streaming.backpressure.enabled=true \
--conf spark.task.maxFailures=8 \
--conf spark.streaming.kafka.maxRatePerPartition=1000 \
--conf spark.driver.maxResultSize=5g \
--conf spark.driver.extraClassPath=$localroot/config \
--conf spark.executor.userClassPathFirst=true \
--conf spark.yarn.executor.memoryOverhead=4g \
--conf spark.executor.extraJavaOptions="test" \
--conf spark.yarn.cluster.driver.extraJavaOptions="test" \
--files $localroot/config/test.properties,\
$localroot/config/test1.txt,\
$localroot/config/test2.txt,\
$localroot/config/test3.txt \
--class com.peanut.test \
--jars $SPARK_HOME/jars/streamingClient010/kafka_2.10-0.10.0.0.jar,\
$SPARK_HOME/jars/streamingClient010/kafka-clients-0.10.0.0.jar,\
$SPARK_HOME/jars/streamingClient010/spark-streaming-kafka-0-10_2.11-2.1.0.jar,\
$localroot/lib/test1.jar,\
$localroot/lib/test2.jar,\
$localroot/lib/test3.jar \
$jarpath \
$args1 \
$args2 \
$args3 &
appPid=$!
}
start_process
echo "pid is"
echo $appPid
exit
Spark data skew
cache exists for performance: cache the RDDs that are reused to speed up the computation.
checkpoint exists for fault tolerance: checkpoint the RDDs that are expensive to recompute, so that after a failure the job recovers from the checkpoint instead of recomputing from scratch.
cache does not truncate the DAG; checkpoint does.
Operations on the RDD/DStream itself run on the driver and essentially build the DAG, e.g. foreachRDD and transform;
anything that actually computes on the data runs on the executors.
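A sketch of that split in a Spark Streaming job, assuming a DStream named stream:
stream.foreachRDD { rdd =>
println("driver: runs once per batch") // builds the job on the driver
rdd.foreachPartition { it =>
println("executor: runs once per partition") // the data work happens on executors
it.foreach(record => () /* per-record processing */)
}
}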
val df: DataFrame = _
df.cache.count
val plan = df.queryExecution.logical
val estimated: BigInt = spark
.sessionState
.executePlan(plan)
.optimizedPlan
.stats
.sizeInBytes
package com.asinking.ch.ods;
import com.asinking.utils.ConfigUtil;
import com.asinking.utils.SparkUtils;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;
import scala.collection.JavaConverters;
import scala.collection.Seq;
import scala.reflect.ClassTag;
import javax.security.auth.login.Configuration;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
public class WriteMysqlTest {
public static void main(String[] args) throws SQLException {
Connection connection = DriverManager.getConnection(ConfigUtil.mysqlUrl(), ConfigUtil.mysqlUser(), ConfigUtil.mysqlPwd());
PreparedStatement preparedStatement = connection.prepareStatement("insert teacher(id,name,subject) values (?,?,?)");
preparedStatement.setLong(1,6);
preparedStatement.setString(2,"zhangxxxx666");
preparedStatement.setString(3,"mmmmmmath666");
preparedStatement.execute();
preparedStatement.clearParameters();
preparedStatement.setLong(1,7);
preparedStatement.setString(2,"zhangxxxx777");
preparedStatement.setString(3,"mmmmmmath777");
preparedStatement.execute();
preparedStatement.close();
}
public static void main2(String[] args) {
SparkSession spark = SparkUtils.sparkSession(SparkUtils.sparkConf("mysqltest"));
List<Teacher> teachers = new ArrayList<>();
Teacher teacher = new Teacher();
teacher.id = 1;
teacher.name = "zhangsan";
teacher.subject = "match";
teachers.add(teacher);
Seq<Teacher> tmpSeq = JavaConverters.asScalaIteratorConverter(teachers.iterator()).asScala().toSeq();
ClassTag tag = scala.reflect.ClassTag$.MODULE$.apply(Teacher.class);
RDD<Teacher> rdd = spark.sparkContext().makeRDD(tmpSeq, 10, tag);
Dataset<Row> dataFrame = spark.createDataFrame(rdd, Teacher.class);
dataFrame.printSchema();
dataFrame.show();
dataFrame.registerTempTable("tmp");
Dataset<Row> sql = spark.sql("select 'eng1111' as subject, 'zhangsi555' as name, 2 as id from tmp");
sql.write().format("jdbc")
.mode("append")
.option("url", ConfigUtil.mysqlUrl())
.option("user",ConfigUtil.mysqlUser())
.option("password",ConfigUtil.mysqlPwd())
.option("dbtable","teacher")
.save();
}
public static class Teacher{
private int id;
private String name;
private String subject;
public int getId() {
return id;
}
public void setId(int id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public String getSubject() {
return subject;
}
public void setSubject(String subject) {
this.subject = subject;
}
}
}
13.1 If there is 1 TB of data and a single machine finishes in 30 minutes, why might Spark on 4 nodes take two hours?
1) Data skew: a large amount of data goes to a few tasks, while many tasks get very little data.
2) Speculative execution is enabled.
3) Too many partitions and too many shuffles: scheduling time, network transfer time, and file I/O time all add up.
13.2 For ETL (data cleaning) jobs, does enabling speculative execution and retry mechanisms affect the final result?
Yes. The target database can end up with duplicate data.
Solutions:
1) Turn off the various speculation and retry mechanisms (see the sketch after this list).
2) Use a transactional table.
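A minimal sketch of the switches to turn off for such ETL jobs (example values; adjust as needed):
val conf = new org.apache.spark.SparkConf()
.set("spark.speculation", "false") // no speculative duplicate task attempts
.set("spark.task.maxFailures", "1") // do not re-run failed tasks
.set("spark.yarn.maxAppAttempts", "1") // on YARN: do not re-submit the whole application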
joinType can be "inner", "left", "right", or "full", corresponding to inner join, left join, right join, and full join; the default is "inner", i.e. an inner join.
Multiple jobs from the same thread run serially, and the stages of one job run serially.
Multiple jobs from different threads run serially when resources are insufficient and in parallel when resources are sufficient (see the sketch below).
https://blog.csdn.net/zwgdft/article/details/88349295
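A sketch of submitting two jobs from two threads against the same SparkContext sc (written with an explicit Runnable so it also compiles on Scala 2.11):
val a = sc.parallelize(1 to 1000000)
val b = sc.parallelize(1 to 1000000)
val t1 = new Thread(new Runnable { def run(): Unit = println("sum a = " + a.sum()) })
val t2 = new Thread(new Runnable { def run(): Unit = println("sum b = " + b.sum()) })
t1.start(); t2.start()
t1.join(); t2.join() // with enough executors/cores the two jobs run concurrently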
16.1 sc.parallelize()
16.2 Files (csv/txt/parquet)
16.3 Distributed file systems (HDFS, S3, COS)
16.4 Databases (MySQL, Oracle, ClickHouse, Kudu, ES, HBase)
16.7 Message queues (Kafka)
Method 1:
1. Convert the Dataset to an RDD and use the RDD's map to turn each Row into an entity:
RDD rdd = joinedDS.rdd().map(new MyMapFunction(sidInfoMap), scala.reflect.ClassTag$.MODULE$.apply(OdsSpFinanceEventsEntity.class));
2. Convert the new RDD back to a Dataset:
Dataset newWithSidDS = spark.createDataFrame(rdd, OdsSpFinanceEventsEntity.class);
This approach needs a lot of code, and whenever the corresponding table's columns change, the entity class has to change with them.
Method 2:
Use UDFs:
1. Define the UDFs
private static void registerUdf(SparkSession spark, Map> sidInfoMap) {
UserDefinedFunction udfSid = functions.udf((UDF3) (companyId, countryCode, sellerId) -> {
List confEntities = sidInfoMap.get(ClickHouseUtil.composeKey4Sid(companyId, countryCode, sellerId));
if (confEntities == null || confEntities.isEmpty()) {
return "0";
}
return confEntities.get(0).getSid();
}, DataTypes.StringType);
UserDefinedFunction udfTimezone = functions.udf((UDF3) (companyId, countryCode, sellerId) -> {
List confEntities = sidInfoMap.get(ClickHouseUtil.composeKey4Sid(companyId, countryCode, sellerId));
if (confEntities == null || confEntities.isEmpty()) {
return null;
}
return confEntities.get(0).getTimezoneName();
}, DataTypes.StringType);
spark.udf().register("udfSid",udfSid);
spark.udf().register("udfTimezone",udfTimezone);
}
2. Use the UDFs
Dataset finalDs = joinedDS.selectExpr("*", "udfSid(company_id,country_code,seller_id) as sid", "udfTimezone(company_id,country_code,seller_id) as timezone_name");
Listing leaf files and directories for 1847 paths:
cosn://akd-flink-test-1254213275/sp/ods/ods_sp_service_fee_event/90128259979419648/impB6jVXNI1tDP4an_ax7e0NjCjcVxtSN2Ckn8CZank/.hoodie_partition_metadata, ...
InMemoryFileIndex at HoodieSparkUtils.scala:113
When there are more than 32 paths, Spark launches a job to check whether the files exist, so that the driver is not put under too much pressure.