Spark Core: analyzing the architecture and workflow through log output, from the basics to the internals

This article starts from zero with Spark and works up to an in-depth understanding of Spark Core.

I. Concepts and Basics

================================================================

1. Prerequisites

Build the distribution:

./make-distribution.sh --tgz -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.2 -Phive -Phive-thriftserver
Running in local mode:
local: spark-shell --master local[4]
Running in standalone mode:

spark-shell --master spark://hadoop:7077
Standalone mode can essentially be seen as starting a resource manager and initializing/launching the JVM environments that run the computation; the actual computation itself is independent of the cluster manager.
2. Spark characteristics, related concepts, and core vocabulary

Spark: distributed, in-memory, iterative; a full technology stack (shell, SQL, streaming, RDD batch processing, machine learning, graph processing, R)
RDD: immutable, partitioned, computed in parallel, with dependencies (wide / narrow)
DAG, Task (a split corresponds to a pipeline; operators are relative to partitions; lazy evaluation; the smallest unit of the DAG is a pipeline of narrowly dependent operators). The Tasks of a Stage form a TaskSet; the TaskScheduler schedules each Stage's TaskSet, FIFO by default.
Worker: launches Executors, which run Tasks and store Blocks
cache (StorageLevel): worth using when computation is expensive, the lineage chain is long, before/after a shuffle, or together with checkpoint/persist; RDDs are coarse-grained (the DAG and Tasks are independent of resource management); repartition. See the persist sketch below.
Memory/disk storage, lineage-based fault tolerance, Task retry (4 attempts), Stage retry (only the corresponding failed Tasks are re-run, 3 times)
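A minimal persist sketch (assuming spark-shell, where sc is already available); the input path is the sample file used later in this article:

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("/home/hadoop/test/spark/input/sample_binary_classification_data.txt")
val words = lines.flatMap(_.split(" "))          // a long/expensive transformation chain worth caching
words.persist(StorageLevel.MEMORY_AND_DISK)      // spill to disk when memory is insufficient
words.count()                                    // the first action materializes and caches the blocks
words.filter(_.nonEmpty).count()                 // later actions reuse the cached partitions
words.unpersist()                                // release the blocks when they are no longer needed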

3.SparkContext

SparkContext holds four core objects: DAGScheduler, TaskScheduler, SchedulerBackend, and MapOutputTrackerMaster.
From the SparkConf parameter it creates the SparkEnv --> Spark UI --> new scheduler (TaskScheduler, SchedulerBackend) --> new DAGScheduler --> starts the TaskScheduler and DAGScheduler --> launches the Executors (a construction sketch follows this list).
a) DAGScheduler is the high-level, Stage-oriented scheduler for a Job.
b) TaskScheduler is an interface with different implementations for different cluster managers; in Standalone mode the concrete implementation is TaskSchedulerImpl.
c) SchedulerBackend is an interface with different implementations depending on the cluster manager; in Standalone mode the concrete implementation is SparkDeploySchedulerBackend (managed by TaskSchedulerImpl; it connects to the Master to register the current application, manages Executor registration, and sends Tasks to the specific Executors for execution);
d) MapOutputTrackerMaster manages the output and retrieval of shuffle data.
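A minimal sketch of building a SparkContext explicitly (spark-shell does this for you and exposes it as sc); the application name is a hypothetical placeholder, and the master URL matches the standalone example above:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("spark-core-demo")        // hypothetical application name
  .setMaster("spark://hadoop:7077")     // standalone master from the earlier example
val sc = new SparkContext(conf)         // wires up DAGScheduler, TaskScheduler, SchedulerBackend, MapOutputTrackerMaster internally
// ... submit jobs through sc ...
sc.stop()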

For reference, the spark-shell startup log:

16/08/27 09:48:46 INFO SparkContext: Running Spark version 1.6.1
16/08/27 09:48:46 INFO SecurityManager: Changing view acls to: hadoop
16/08/27 09:48:46 INFO SecurityManager: Changing modify acls to: hadoop
16/08/27 09:48:46 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
16/08/27 09:48:46 INFO Utils: Successfully started service 'sparkDriver' on port 36200.
16/08/27 09:48:46 INFO Slf4jLogger: Slf4jLogger started
16/08/27 09:48:46 INFO Remoting: Starting remoting
16/08/27 09:48:46 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:46556]
16/08/27 09:48:46 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 46556.
16/08/27 09:48:46 INFO SparkEnv: Registering MapOutputTracker
16/08/27 09:48:46 INFO SparkEnv: Registering BlockManagerMaster
16/08/27 09:48:46 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-93deff69-4f2c-4b93-b505-ba4d7facbad1
16/08/27 09:48:46 INFO MemoryStore: MemoryStore started with capacity 511.5 MB
16/08/27 09:48:47 INFO SparkEnv: Registering OutputCommitCoordinator
16/08/27 09:48:47 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/08/27 09:48:47 INFO SparkUI: Started SparkUI at http://192.168.0.3:4040
16/08/27 09:48:47 INFO AppClient$ClientEndpoint: Connecting to master spark://hadoop:7077...
16/08/27 09:48:47 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160827094847-0001
16/08/27 09:48:47 INFO AppClient$ClientEndpoint: Executor added: app-20160827094847-0001/0 on worker-20160827094621-192.168.0.3-7078 (192.168.0.3:7078) with 5 cores
16/08/27 09:48:47 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160827094847-0001/0 on hostPort 192.168.0.3:7078 with 5 cores, 4.0 GB RAM
16/08/27 09:48:47 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 36399.
16/08/27 09:48:47 INFO NettyBlockTransferService: Server created on 36399
16/08/27 09:48:47 INFO BlockManagerMaster: Trying to register BlockManager
16/08/27 09:48:47 INFO AppClient$ClientEndpoint: Executor updated: app-20160827094847-0001/0 is now RUNNING
16/08/27 09:48:47 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.3:36399 with 511.5 MB RAM, BlockManagerId(driver, 192.168.0.3, 36399)
16/08/27 09:48:47 INFO BlockManagerMaster: Registered BlockManager
16/08/27 09:48:47 INFO EventLoggingListener: Logging events to file:/opt/single/spark/data/history/app-20160827094847-0001
16/08/27 09:48:47 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
16/08/27 09:48:47 INFO SparkILoop: Created spark context..
Spark context available as sc.


4.Broadcast / Accumulator

val broadcastNumber = sc.broadcast(number)
val bn = data.map(_ * broadcastNumber.value)
Uses: large variables, joins, avoiding redundant per-Task copies and transfers, saving memory and preventing OOM, messaging, sharing, communication, synchronization.
Driver (ships the application-wide read-only variable) --> Executor memory --> Tasks

Accumulator:
val count = sc.accumulator(0)
Executors can only modify it (add to it) and cannot read it; only the Driver can read it. It is a globally unique piece of state. (A combined sketch follows.)
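A minimal sketch combining broadcast and accumulator (assuming spark-shell on Spark 1.x, where sc.accumulator is still the API; on Spark 2+ sc.longAccumulator would be used instead):

val number = 10
val broadcastNumber = sc.broadcast(number)      // shipped to each Executor once, read-only
val multiples = sc.accumulator(0)               // Spark 1.x accumulator API

val data = sc.parallelize(1 to 100)
val scaled = data.map { x =>
  if (x % 7 == 0) multiples += 1                // Tasks may only add; they cannot read the value
  x * broadcastNumber.value                     // every Task reads the broadcast value
}
scaled.count()                                  // an action must run before the accumulator is populated
println(multiples.value)                        // only the Driver reads the accumulated value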

Shuffle:
In the map phase the ShuffleManager's getWriter writes the data, and the BlockManager stores it in memory or on disk (MEMORY_AND_DISK is recommended); in the reduce phase the ShuffleManager's getReader asks the Driver where the previous Stage's output lives and fetches it.
1. Hash Shuffle (the key cannot be an Array; no sorting is required, so on small data it is faster than Sort-based Shuffle)
Drawback: it produces a huge number of small files and consumes a lot of memory --> mitigation: Spark's consolidation mechanism (reducing the file count to cores * R)
2. Sort-based Shuffle addresses the speed problem on massive data volumes
Drawbacks: with a very large number of map tasks it still produces many small files --> the reduce side has to deserialize a large number of records --> heavy memory consumption and GC; and there are two sorts, one on the mapper side and one on the reducer side. (A configuration sketch follows.)
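A minimal configuration sketch for Spark 1.6; spark.shuffle.manager selects between the two mechanisms above, and spark.shuffle.sort.bypassMergeThreshold controls when the sort shuffle falls back to a hash-style path (the values shown are the defaults):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-config-demo")                       // hypothetical application name
  .set("spark.shuffle.manager", "sort")                    // "hash" or "sort" in Spark 1.6 ("sort" is the default)
  .set("spark.shuffle.sort.bypassMergeThreshold", "200")   // with <= this many reduce partitions and no map-side combine, use the hash-style bypass path
val sc = new SparkContext(conf)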

II. Workflow and Architecture

================================================================

1.Demo

sc.textFile("/home/hadoop/test/spark/input/sample_binary_classification_data.txt").flatMap(_.split(" ")).map(word=>(word,1))
.reduceByKey(_+_).saveAsTextFile("/home/hadoop/test/spark/output")
2. Spark startup:
Start the Master (resource management and allocation; job submission and resource assignment) --> the worker nodes start Worker processes and report their status
--> when an Action is triggered through submit or spark-shell: the Driver (on the client or on the Master node) initializes the SparkContext and submits the App --> the Master assigns an App ID and resources (allocated and recorded according to the configuration; it does not track the actual resource usage, which has both advantages and drawbacks) --> by default a CoarseGrainedExecutorBackend is launched on every node (maximize CPU and memory usage, FIFO queueing; a cluster typically runs only one App at a time)
Application submission and runtime architecture:
App = Driver [SparkContext] + Executors [thread pool that runs Tasks]

The Job submission process:
spark-shell --> Action --> submit Job --> the Driver initializes the SparkContext [1. SparkDeploySchedulerBackend for resource management and scheduling, 2. DAGScheduler for high-level scheduling (a Job's Stages, data locality), 3. TaskSchedulerImpl for low-level scheduling inside each Stage (scheduling every Task, Task fault tolerance), 4. MapOutputTracker for managing shuffle output and retrieval]

The Task execution process:
Tasks run inside an Executor, and each Executor corresponds one-to-one with a CoarseGrainedExecutorBackend; the CoarseGrainedExecutorBackend receives the LaunchTask message sent by the TaskSetManager and deserializes the TaskDescription --> launchTask runs on the corresponding Executor --> TaskRunner

3. The execution flow from the spark-shell point of view:
Driver Program --> SparkContext (e.g. spark-shell / spark-submit) --> the SC requests Executor resources from the resource manager (standalone, YARN, Mesos) and StandaloneExecutorBackend is started --> the Executors register with the SC and ask for tasks --> once the SC has obtained the Executors it ships the application code to them --> the SC builds the RDD DAG, splits it into Stages, and submits the Stages to the TaskScheduler --> the TaskScheduler runs the Tasks on the Executors --> when the Tasks finish, the resources are released

Starting spark-shell (the key players are ClientEndpoint / SparkDeploySchedulerBackend) only starts the Application: it instantiates the SparkContext, registers the Application with the Master, and obtains ExecutorBackend compute resources; no Job has been triggered yet.
After an Action triggers a Job --> the DAGScheduler splits it into Stages --> TaskSchedulerImpl's TaskSetManager manages each Stage's TaskSet:
    allocates compute resources in a locality-aware way
    monitors Task execution status: retries (up to 4 attempts) and speculative execution of slow (straggler) tasks
Summary of the low-level task scheduling between TaskScheduler and SchedulerBackend:
The TaskSetManager applies the Task scheduling policy (FIFO/FAIR); TaskSchedulerImpl gathers the available resources (cores) of every ExecutorBackend in the cluster (reviveOffers) and converts them into launchable Tasks (TaskDescriptions); respecting shuffle load balancing and data-locality (locality-aware) priority, it launches the Tasks on the corresponding Executors (the serialized Task must be smaller than the Akka frame size of 128 MB minus 200 KB). The related configuration knobs are sketched below.
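A minimal sketch of the configuration properties behind the behaviour described above (all are documented Spark properties; the values shown are illustrative and mostly the defaults):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("scheduling-config-demo")     // hypothetical application name
  .set("spark.scheduler.mode", "FIFO")      // or "FAIR"
  .set("spark.task.maxFailures", "4")       // the Task retry limit mentioned above
  .set("spark.speculation", "true")         // speculative execution of straggler tasks (off by default)
  .set("spark.locality.wait", "3s")         // how long to wait for a better locality level
val sc = new SparkContext(conf)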

For reference, the log of running the demo in spark-shell:

16/08/27 09:49:50 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 191.1 KB, free 191.1 KB)
16/08/27 09:49:50 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 21.8 KB, free 212.9 KB)
16/08/27 09:49:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.0.3:36399 (size: 21.8 KB, free: 511.5 MB)
16/08/27 09:49:50 INFO SparkContext: Created broadcast 0 from textFile at :28
16/08/27 09:49:50 INFO FileInputFormat: Total input paths to process : 1
16/08/27 09:49:50 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/08/27 09:49:50 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/08/27 09:49:50 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/08/27 09:49:50 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/08/27 09:49:50 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/08/27 09:49:50 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
16/08/27 09:49:50 INFO SparkContext: Starting job: saveAsTextFile at :28
16/08/27 09:49:50 INFO DAGScheduler: Registering RDD 3 (map at :28)
16/08/27 09:49:50 INFO DAGScheduler: Got job 0 (saveAsTextFile at :28) with 2 output partitions
16/08/27 09:49:50 INFO DAGScheduler: Final stage: ResultStage 1 (saveAsTextFile at :28)
16/08/27 09:49:50 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
16/08/27 09:49:50 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
16/08/27 09:49:50 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at :28), which has no missing parents
16/08/27 09:49:50 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.2 KB, free 217.1 KB)
16/08/27 09:49:50 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 219.4 KB)
16/08/27 09:49:50 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.0.3:36399 (size: 2.3 KB, free: 511.5 MB)
16/08/27 09:49:50 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/08/27 09:49:50 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at :28)
16/08/27 09:49:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/08/27 09:49:50 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, hadoop, partition 0,PROCESS_LOCAL, 2163 bytes)
16/08/27 09:49:50 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, hadoop, partition 1,PROCESS_LOCAL, 2163 bytes)
16/08/27 09:49:50 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on hadoop:45173 (size: 2.3 KB, free: 2.7 GB)
16/08/27 09:49:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop:45173 (size: 21.8 KB, free: 2.7 GB)
16/08/27 09:49:51 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 923 ms on hadoop (1/2)
16/08/27 09:49:51 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 916 ms on hadoop (2/2)
16/08/27 09:49:51 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
16/08/27 09:49:51 INFO DAGScheduler: ShuffleMapStage 0 (map at :28) finished in 0.929 s
16/08/27 09:49:51 INFO DAGScheduler: looking for newly runnable stages
16/08/27 09:49:51 INFO DAGScheduler: running: Set()
16/08/27 09:49:51 INFO DAGScheduler: waiting: Set(ResultStage 1)
16/08/27 09:49:51 INFO DAGScheduler: failed: Set()
16/08/27 09:49:51 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at :28), which has no missing parents
16/08/27 09:49:51 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 71.6 KB, free 291.0 KB)
16/08/27 09:49:51 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 24.9 KB, free 315.9 KB)
16/08/27 09:49:51 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.0.3:36399 (size: 24.9 KB, free: 511.5 MB)
16/08/27 09:49:51 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/08/27 09:49:51 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at :28)
16/08/27 09:49:51 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
16/08/27 09:49:51 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, hadoop, partition 0,NODE_LOCAL, 1894 bytes)
16/08/27 09:49:51 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, hadoop, partition 1,NODE_LOCAL, 1894 bytes)
16/08/27 09:49:51 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on hadoop:45173 (size: 24.9 KB, free: 2.7 GB)
16/08/27 09:49:51 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to hadoop:47034
16/08/27 09:49:51 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 143 bytes
16/08/27 09:49:51 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 323 ms on hadoop (1/2)
16/08/27 09:49:51 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at :28) finished in 0.323 s
16/08/27 09:49:51 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 321 ms on hadoop (2/2)
16/08/27 09:49:51 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
16/08/27 09:49:51 INFO DAGScheduler: Job 0 finished: saveAsTextFile at :28, took 1.393126 s
4. From the process point of view:
Spark application: SparkSubmit (the process where the Driver lives) --> Driver --> main method, SparkContext
Master: allocates resources on each Worker to launch Executors
Worker process: ExecutorRunner -> CoarseGrainedExecutorBackend

CoarseGrainedExecutorBackend -> Executor -> thread pool (stateless; it represents compute resources and does not care what code it runs) -> Thread (Runnable interface) -> Task

Reference jps output:

hadoop@hadoop:logs$ jps
8209 Worker
10634 SparkSubmit
10742 CoarseGrainedExecutorBackend
24352 Jps
8055 Master
5. Spark on YARN (cluster mode: the Driver is a JVM process on some machine in the cluster; client mode: the Driver runs on the client)
bin/spark-submit --master yarn --deploy-mode client/cluster --class  ...
The Client submits the App --> the RM creates an ApplicationMaster for the App on an NM --> the SparkContext inside the AM (the Driver), via [DAGScheduler, (YarnClusterScheduler)], requests Containers --> the RM allocates Containers at fine granularity --> AM{SC[container]} --> the SC dispatches the Containers to the corresponding NMs and instructs them to launch Executors (an alternative description: the RM sends the Containers to the NMs directly) --> NM --> Executor. A filled-in example of the command is shown below.
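A hedged, filled-in version of the spark-submit command above; the class name, jar path, paths, and resource sizes are hypothetical placeholders:

bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.WordCount \
  --num-executors 3 \
  --executor-memory 2g \
  --executor-cores 2 \
  /home/hadoop/apps/wordcount.jar hdfs:///user/hadoop/input hdfs:///user/hadoop/output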

6. Spark HA:
Metadata stored in ZooKeeper: Workers, Drivers, Applications
When the active Master dies --> based on the global state kept in ZK, a standby is elected to become active --> that standby recovers its state from the metadata saved in ZK --> once recovery completes, the standby transitions from the recovering state to active.
During election and recovery Spark does not serve new requests, but everything already running continues normally (1. resources are allocated coarse-grained; 2. while an App is running, only the Driver and the Workers communicate, and the Master is not needed).
The four Master HA modes are ZOOKEEPER, FILESYSTEM, CUSTOM, and NONE; a configuration sketch for the ZooKeeper mode follows.
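A minimal spark-env.sh sketch for the ZOOKEEPER recovery mode (the property names are the documented spark.deploy.* ones; the ZooKeeper hosts and the znode directory are illustrative):

# conf/spark-env.sh on every Master candidate (hosts and path are hypothetical)
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark-ha"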
7. How Master HA works internally
    Standby Master 1, Standby Master 2 --> ZooKeeper is used to elect a leader among the Masters automatically --> the elected leader uses ZooKeeperPersistenceEngine to read the cluster state: Drivers, Applications, Workers, Executors, etc. --> it checks whether this metadata is empty --> the Drivers, Applications, Executors, and Workers obtained through the ZooKeeper persistence engine are re-registered into the Master's in-memory cache --> (to validate that this information is consistent with the cluster that is actually running) the state of the Applications and Workers is marked UNKNOWN, and the Master sends the Applications' Drivers and the Workers the address of the standby Master that is now the leader --> when the Drivers and Workers receive the new Master's address they respond to it --> after receiving these responses, the Master uses the key method completeRecovery() to deal with the Applications and Workers (Executors) that did not respond; when that finishes, the Master's state becomes state = RecoveryState.ALIVE and it can serve requests again --> at this point the Master calls its own schedule() method to allocate resources to the waiting Applications and Drivers.


III. Master/Driver/Executor registration and launch flow (registration -> resource allocation -> launch)

================================================================

(1) The Master accepts Worker/Driver/Application registrations
1. A Worker starts and registers with the Master
    --> the Master filters out Workers whose state is DEAD; for Workers in the UNKNOWN state it cleans up the stale Worker information and replaces it with the new information
    --> the registered Worker is added to the Master's in-memory data structures --> the registration is persisted through the persistence engine, e.g. ZooKeeper --> schedule()

2. A Driver starts and registers with the Master
    --> the Master puts the Driver's information into its in-memory cache
    --> the Driver is added to the queue waiting to be scheduled --> the registration is persisted through the persistence engine, e.g. ZooKeeper --> schedule()
3. An Application starts and registers with the Master
    --> after the Driver starts, it initializes the SparkContext --> this creates a SparkDeploySchedulerBackend, whose internal AppClient's ClientEndpoint sends a RegisterApplication message to the Master
    --> the Master puts the Application's information into its in-memory cache
    --> the Application is added to the queue of Applications waiting to be scheduled --> the registration is persisted through the persistence engine, e.g. ZooKeeper --> schedule()

Reference Master log:

16/08/27 10:33:35 INFO Master: Registering worker 192.168.0.3:7078 with 5 cores, 4.0 GB RAM
16/08/27 12:03:56 INFO Master: Registering app Spark shell
16/08/27 12:03:56 INFO Master: Registered app Spark shell with ID app-20160827120356-0000
16/08/27 12:03:56 INFO Master: Launching executor app-20160827120356-0000/0 on worker worker-20160827103334-192.168.0.3-7078
(2) The Master schedules resources for Executors. Note the order: the application obtains resources first --> then Task scheduling happens (DAG, Task, backend)
The (alive) Master launches the Driver --> resource scheduling (triggered when an App is submitted, the cluster's resources change, Executors are added or removed, Workers are added or removed, etc.)
    --> Executors are launched FIFO, satisfying the requested cores as far as possible
    --> the Master sends the Executor allocation information to the Workers via remote communication
    --> the Workers launch the ExecutorBackends
(3) The Master launches Drivers and Executors
    1. Master LaunchDriver --> Worker { [DriverRunner] uses a Thread internally to handle the Driver launch --> creates the Driver's working directory on the local file system --> assembles the Driver's launch command and starts the Driver through a ProcessBuilder } --> Driver process
    2. Master LaunchExecutor --> Worker { [ExecutorRunner] uses a Thread internally to handle the Executor launch --> creates the Executor's working directory on the local file system --> assembles the Executor's launch command and starts the Executor through a ProcessBuilder } --> Executor process: ExecutorBackend (CoarseGrainedExecutorBackend)
    3. Executor process --> (the Executor registers with the Driver's SchedulerBackend) --> Driver
(4) The Worker launches the Driver
1. When a Driver fails in cluster mode and supervise is true, the Worker that launched the Driver is responsible for restarting it
2. DriverRunner starts the process through a ProcessBuilder and waits on it with process.get.waitFor(); a generic sketch of this pattern follows
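A minimal, generic Scala sketch of the ProcessBuilder pattern described above, launching a child JVM and waiting for it to exit; the command, class name, and working directory are illustrative, not Spark's actual launch command:

import java.io.File
import scala.collection.JavaConverters._

// Hypothetical command; DriverRunner/ExecutorRunner assemble the real one from the Master's message.
val command = Seq("java", "-cp", "app.jar", "com.example.Main", "--arg", "value")
val builder = new ProcessBuilder(command.asJava)
builder.directory(new File("/tmp/driver-workdir"))   // working directory created for the child process
builder.redirectErrorStream(true)                    // merge stderr into stdout
val process = builder.start()                        // fork the child JVM
val exitCode = process.waitFor()                     // block until the child exits, like process.get.waitFor()
println(s"child exited with code $exitCode")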

Reference Worker log:

16/08/27 10:33:34 INFO Utils: Successfully started service 'sparkWorker' on port 7078.
16/08/27 10:33:34 INFO Worker: Starting Spark worker 192.168.0.3:7078 with 5 cores, 4.0 GB RAM
16/08/27 10:33:34 INFO Worker: Running Spark version 1.6.1
16/08/27 10:33:34 INFO Worker: Spark home: /opt/single/spark
16/08/27 10:33:35 INFO Utils: Successfully started service 'WorkerUI' on port 8011.
16/08/27 10:33:35 INFO WorkerWebUI: Started WorkerWebUI at http://192.168.0.3:8011
16/08/27 10:33:35 INFO Worker: Connecting to master hadoop:7077...
16/08/27 10:33:35 INFO Worker: Successfully registered with master spark://hadoop:7077
16/08/27 12:03:56 INFO Worker: Asked to launch executor app-20160827120356-0000/0 for Spark shell
16/08/27 12:03:56 INFO SecurityManager: Changing view acls to: hadoop
16/08/27 12:03:56 INFO SecurityManager: Changing modify acls to: hadoop
16/08/27 12:03:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
16/08/27 12:03:56 INFO ExecutorRunner: Launch command: "/usr/local/java/jdk1.7.0_80/bin/java" "-cp" "/opt/single/spark/conf/:/opt/single/spark/lib/spark-assembly-1.6.1-hadoop2.7.2.jar:/opt/single/spark/lib/datanucleus-api-jdo-3.2.6.jar:/opt/single/spark/lib/datanucleus-rdbms-3.2.9.jar:/opt/single/spark/lib/datanucleus-core-3.2.10.jar" "-Xms4096M" "-Xmx4096M" "-Dspark.driver.port=35366" "-XX:+PrintGCDetails" "-Dkey=value" "-Dnumbers=one two three" "-XX:MaxPermSize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://[email protected]:35366" "--executor-id" "0" "--hostname" "192.168.0.3" "--cores" "5" "--app-id" "app-20160827120356-0000" "--worker-url" "spark://[email protected]:7078"
The driver-related information can be inspected with the following commands:

hadoop@hadoop:logs$ ps -aux | grep 35366
hadoop   10742  0.4  2.5 8558532 418308 ?      Sl   12:03   0:24 /usr/local/java/jdk1.7.0_80/bin/java -cp /opt/single/spark/conf/:/opt/single/spark/lib/spark-assembly-1.6.1-hadoop2.7.2.jar:/opt/single/spark/lib/datanucleus-api-jdo-3.2.6.jar:/opt/single/spark/lib/datanucleus-rdbms-3.2.9.jar:/opt/single/spark/lib/datanucleus-core-3.2.10.jar -Xms4096M -Xmx4096M -Dspark.driver.port=35366 -XX:+PrintGCDetails -Dkey=value -Dnumbers=one two three -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://[email protected]:35366 --executor-id 0 --hostname 192.168.0.3 --cores 5 --app-id app-20160827120356-0000 --worker-url spark://[email protected]:7078
hadoop   15493  0.0  0.0  15972  2208 pts/30   S+   13:29   0:00 grep --color=auto 35366
hadoop@hadoop:logs$ netstat -apn | grep 35366
(Not all processes could be identified; non-owned process info will not be shown; you would have to be root to see it all.)
tcp6       0      0 192.168.0.3:35366       :::*                    LISTEN      10634/java      
tcp6       0      0 192.168.0.3:41088       192.168.0.3:35366       ESTABLISHED 10742/java      
tcp6       0      0 192.168.0.3:35366       192.168.0.3:41088       ESTABLISHED 10634/java 

IV. The Executor registration and launch process in detail

================================================================

(1) Executor launch and registration:
The Master instructs the Worker to launch an Executor (the launch trigger)
--> 1. The Worker receives the instruction and, via ExecutorRunner, starts a process to run the Executor
--> 2. CoarseGrainedExecutorBackend registers the ExecutorBackend with the Driver by sending RegisterExecutor (no Executor exists yet at this point)
--> 3. The Driver receives the RegisterExecutor message in the DriverEndpoint and completes the registration (in essence registering it with CoarseGrainedSchedulerBackend), then replies with RegisteredExecutor to the CoarseGrainedExecutorBackend
--> 4. CoarseGrainedExecutorBackend [creates the Executor, which is what actually performs the Task computation]
--> 5. Executor instantiation: a thread pool is instantiated for Task computation
--> 6. Thread pool: executes Tasks efficiently through concurrent multithreading and thread reuse (as follows)
(2) The Executor runs Tasks:
The Driver sends LaunchTask to the ExecutorBackend (the run trigger)
--> 1. Executor: launchTask runs the task --> 2. the thread pool wraps the Task in a TaskRunner
--> 3. TaskRunner: a concrete implementation of the Java Runnable interface; the real work is handed to a thread in the pool, i.e. its run method is invoked to execute the Task
--> 4. Task.run is called: the Task implementations are ShuffleMapTask and ResultTask, and the appropriate runTask is invoked depending on the case

Note: there are two important endpoints in the Driver process:
a) ClientEndpoint: registers the current application with the Master; it is an internal member of AppClient;
b) DriverEndpoint: the driver of the whole running application; it is an internal member of CoarseGrainedSchedulerBackend;


V. Relationships and flow: App/Action/Job/DAG/Stage/TaskScheduler/TaskSetManager/TaskSet

================================================================

(1) Relationships:
1. An App contains 0..n Actions, which trigger 0..n Jobs; each Job is split by its dependencies into 1..n Stages that form a DAG; one Stage corresponds to one TaskSet.
2. For each Stage's TaskSet the TaskScheduler creates and maintains one TaskSetManager, which:
tracks task locality and failure information;
retries straggler tasks on other nodes;
reports execution status to the DAGScheduler, including fetch-failed errors when shuffle output is lost.
(2) Flow:
1. The Driver is running and an Action occurs --> submitJob --> EventLoop.doOnReceive --> DAGSchedulerEventProcessLoop --> the final Stage is created and the parent Stage dependencies are established --> forming the DAG (the DAGScheduler uses getPreferredLocations to exploit data locality and improve efficiency) --> the SparkContext submits the Stages to the TaskScheduler
--> 2. The TaskScheduler (TaskSchedulerImpl) internally holds the SchedulerBackend (SparkDeploySchedulerBackend) --> the ClientEndpoint message loop is started --> ClientEndpoint registers the current application with the Master
--> 3. Master: after accepting ClientEndpoint's registration it runs the application
    --> generates the App ID; allocates compute resources through schedule(), based on the run mode and configuration such as memory and cores
    --> the Master sends the instructions to the Workers
--> 4. Worker [ExecutorRunner]: when allocating compute resources for the application, an ExecutorRunner is created
    --> the ExecutorRunner builds a ProcessBuilder inside a Thread --> which launches another JVM process, the Executor
--> 5. Executor: runs Tasks concurrently through a thread pool with thread reuse
Executor JVM startup in detail:
1. The class specified in the Command passed when the ClientEndpoint was created, i.e. CoarseGrainedExecutorBackend, is loaded and its main() runs
    --> CoarseGrainedExecutorBackend itself is instantiated
--> 2. Its onStart() callback sends RegisterExecutor to the DriverEndpoint to register the current CoarseGrainedExecutorBackend
--> 3. The DriverEndpoint stores the registration in SparkDeploySchedulerBackend's in-memory data structures
    --> the Driver has thereby obtained compute resources, and it sends RegisteredExecutor back to the CoarseGrainedExecutorBackend
--> 4. On receiving RegisteredExecutor, the CoarseGrainedExecutorBackend creates an Executor and runs Tasks through it


VI. Task execution

================================================================

Overview:

LaunchTask --> deserialize the TaskDescription --> Executor.launchTask --> the Executor runs the TaskRunner's run method on a thread from the thread pool
--> TaskRunner.run { statusUpdate reports the RUNNING state to the Driver; the Task's dependencies are deserialized and downloaded over the network; the Task itself is deserialized; Task.run invokes runTask() } --> the RDD is deserialized, and RDD.iterator iterates over the Task's partition, feeding the records to the user function
Overall flow (three deserializations):
1. ShuffleMapTask.runTask --> compute() the specific partition --> obtain a ShuffleWriter from the ShuffleManager and write the Task's results to files according to the ShuffleManager implementation --> send a MapStatus to the DAGScheduler, specifically to the MapOutputTracker
--> 2. The Driver [DAGScheduler (MapOutputTracker)] hands the ShuffleMapTask results over to the ResultTask
--> 3. The ResultTask shuffles (fetches) the previous Stage's results, serializes its own result (serializedResult), and sends it back to the Driver in different ways depending on resultSize
--> 4. The DriverEndpoint passes the result to TaskSchedulerImpl; the TaskResultGetter handles the different cases on its own threads and tells the DAGScheduler how the task finished


VII. BlockManager

================================================================

Related concepts:
1. When the Application starts --> SparkEnv registers the BlockManagerMaster (manages Block data for the whole cluster) and the MapOutputTracker (tracks the output of all mappers);
2. BlockManagerMasterEndpoint manages the BlockManagers of all nodes via remote messaging.
Every ExecutorBackend that starts instantiates a BlockManager --> which registers with the BlockManagerMaster via remote communication; in essence, the Executor-side BlockManager registers at startup with the BlockManagerMasterEndpoint on the Driver, and the BlockManagerMaster creates a BlockManagerInfo for it to manage its metadata.
3. MemoryStore is the class inside BlockManager dedicated to storing and reading/writing data in memory.
4. DiskStore is the class inside BlockManager dedicated to disk-based storage and reads/writes.
5. DiskBlockManager maintains the mapping between logical Blocks and physical blocks on disk, creates the disk files, and manages disk reads and writes.
6. A Block is the smallest abstraction of data from the running Spark application's point of view; Blocks can live in memory, on disk, off-heap, and so on. A storage-inspection sketch follows.
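A minimal sketch (assuming spark-shell) of caching an RDD and then inspecting, on the Driver, which of its blocks are stored and where; getRDDStorageInfo is a developer API, and the RDDInfo field names used below are those of Spark 1.x:

val words = sc.textFile("/home/hadoop/test/spark/input/sample_binary_classification_data.txt")
  .flatMap(_.split(" "))
words.setName("words").cache()      // cached blocks are named rdd_<rddId>_<partitionIndex>
words.count()                       // an action materializes the cached blocks

// Driver-side view of the cached blocks' metadata, kept via BlockManagerInfo/BlockStatus
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions} cached partitions, " +
    s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
}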

Structural layout:

Driver{
	BlockManagerMaster{
		BlockManagerMasterEndpoint[
			BlockManagerInfo(BlockStatus)
		]
	}
	BlockManager(MemoryStore,DiskStore)
}

ExecutorBackend{
	Executor[
		BlockManager(MemoryStore,DiskStore);
		(BlockManagerWorker,BlockTransferService)
	]
}

Executor[BlockManager]-->BlockManagerMasterEndpoint

Executor1 (BlockTransferService) <--(network connection to another Executor for reads/writes)--> Executor2 (BlockTransferService)
Executor1 (BlockTransferService) --(performs replication)--> Executor2
The MemoryStore stores broadcast variables.
The Driver manages, through BlockManagerInfo, the metadata of the BlockManager belonging to each ExecutorBackend in the cluster;
when information on an ExecutorBackend changes, a message must be sent to the BlockManagerMaster in the Driver so that the corresponding BlockManagerInfo is updated;
when the second Stage runs, it sends a message to the Driver's MapOutputTrackerMasterEndpoint requesting the previous Stage's output --> the MapOutputTrackerMaster sends the metadata of the previous Stage's output back to the requesting Stage.

When an Executor is instantiated:
1. BlockManager.initialize instantiates the Executor-side BlockManager --> a BlockManagerSlaveEndpoint message loop is created (to receive commands from the BlockManagerMaster in the Driver, e.g. to remove a Block)
--> 2. the Executor-side BlockManager registers with the BlockManagerMasterEndpoint on the Driver --> 3. the BlockManagerMasterEndpoint receives and processes the registration.

Summary of RDD operations:

RDD Operations:
Create RDD:
textFile(): a HadoopRDD, stored as distributed splits
sc.parallelize(List("a","b"))
RDD Transformations (lazy):
map(): a MapPartitionsRDD; based on the partitions produced by the HadoopRDD, with the key dropped
filter(func): filtering, e.g. func = _.contains("..") or _.startsWith("..")
flatMap(x => x.split(" ")): a MapPartitionsRDD; operates on each partition and flattens the results into one large collection
mapPartitions(func), mapPartitionsWithIndex(func), sample(withReplacement, fraction, seed)
union(otherDataset): union of two datasets
intersection(otherDataset)
distinct([numTasks]): removes duplicates
groupByKey([numTasks]): groups by key
reduceByKey(func, [numTasks]): reduces/merges values by key
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
sortByKey([ascending], [numTasks]): sorts by key
join(otherDataset, [numTasks]): joins by key
cogroup(otherDataset, [numTasks]), cartesian(otherDataset), pipe(command, [envVars]), coalesce(numPartitions), repartition(numPartitions), repartitionAndSortWithinPartitions(partitioner)
RDD Actions:
reduce(_ + _)
reduceByKey(_ + _): a ShuffledRDD; a global reduce that triggers a shuffle and produces two stages (strictly a transformation, listed here because it introduces the shuffle boundary)
collect(): returns the results as a local collection
count(): counts the elements
first(): the first element
take(n): returns the first n elements
takeSample(withReplacement, num, [seed]), takeOrdered(n, [ordering]), saveAsTextFile(path): saves as text files
saveAsSequenceFile(path) (Java and Scala)
saveAsObjectFile(path) (Java and Scala)
countByKey(): counts records per key
foreach(func): iterates over the elements
Other operations:
rdd.cache()
Word count sorted by frequency demo:
sc.setLogLevel("WARN")
sc.textFile("file:///home/hadoop/test/data").flatMap(_.split(" "))
.map(word=>(word,1)).reduceByKey(_+_,1)
.map(pair => (pair._2,pair._1)).sortByKey(false).map(pair => (pair._2,pair._1)).collect()
OrderedRDDFunctions.sortByKey()
Top-N demo:
sc.textFile("file:///home/hadoop/test/data/data2")
.map(line => (line.split("\t")(1).toInt, line.split("\t")(0).toString))
.sortByKey(false)
.map(item=>(item._2,item._1)) take (5).foreach(println)

To be continued.


