Spark core concepts

Application: an application built on Spark = 1 driver + executors

    User program built on Spark.
    Consists of a driver program and executors on the cluster.
    spark0402.py
    pyspark/spark-shell

Driver program

    The process running the main() function of the application
    and creating the SparkContext   // a process that runs the application's main() method and creates the SparkContext
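
A minimal PySpark sketch of what a driver program like spark0402.py might look like (the word-count logic below is only an illustrative assumption, not the actual content of that file):

    from pyspark import SparkConf, SparkContext

    if __name__ == "__main__":
        # The driver process starts here: main() creates the SparkContext.
        conf = SparkConf().setAppName("spark0402")
        sc = SparkContext(conf=conf)

        # The RDD operations below are executed as tasks on the executors.
        words = sc.parallelize(["hello", "world", "hello"])
        counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
        print(counts.collect())

        sc.stop()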

Cluster manager // acquires and manages resources for the application on the cluster

    An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN)
    spark-submit --master local[2]/spark://hadoop000:7077/yarn

Deploy mode

    Distinguishes where the driver process runs.
        In "cluster" mode, the framework launches the driver inside of the cluster. // on YARN the driver runs inside the ApplicationMaster
        In "client" mode, the submitter launches the driver outside of the cluster. // the driver is started on the node that submits the job

Worker node

    Any node that can run application code in the cluster
    standalone: the slave nodes, listed in the slaves configuration file
    yarn: NodeManager

Executor // a process on a worker node that runs tasks and caches data; each application has its own set of executors

    A process launched for an application on a worker node
    that runs tasks
    and keeps data in memory or disk storage across them
    Each application has its own executors.
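
Because each application gets its own executors, their size and number are configured per application; a hedged sketch using real configuration keys with placeholder values:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("executor-sizing-demo")
            .set("spark.executor.memory", "2g")       # memory per executor (placeholder value)
            .set("spark.executor.cores", "2")         # cores per executor (placeholder value)
            .set("spark.executor.instances", "4"))    # number of executors on YARN (placeholder value)
    sc = SparkContext(conf=conf)
    sc.stop()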

Resources are requested through the cluster manager; you can specify yarn, standalone, local, and so on, and the driver can be launched either locally in client mode or inside the cluster in cluster mode (deploy mode).

Task // a unit of work, sent from the driver over the network to an executor for execution

    A unit of work that will be sent to one executor

Job // a parallel computation made up of multiple tasks

    A parallel computation consisting of multiple tasks that
    gets spawned in response to a Spark action (e.g. save, collect); // transformations are lazy; only when an action is hit does the computation run on the cluster as a job
    you'll see this term used in the driver's logs.
    one action corresponds to one job (see the sketch after the Stage definition below)

Stage

    Each job gets divided into smaller sets of tasks called stages
    that depend on each other
    (similar to the map and reduce stages in MapReduce);
    you'll see this term used in the driver's logs.
    a stage's boundary usually runs from where the data is read up to the end of a shuffle
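
A small PySpark sketch of the terms above (the data is made up): the transformations are lazy, the collect() action triggers one job, and the reduceByKey() shuffle splits that job into two stages, each a set of tasks sent to the executors:

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "job-stage-demo")

    lines = sc.parallelize(["hello world", "hello spark"])   # nothing runs yet
    words = lines.flatMap(lambda line: line.split(" "))      # lazy transformation
    pairs = words.map(lambda w: (w, 1))                      # lazy transformation
    counts = pairs.reduceByKey(lambda a, b: a + b)           # shuffle -> stage boundary

    # The action triggers one job, split into two stages:
    # everything before the shuffle, and everything after it.
    print(counts.collect())

    sc.stop()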

An application consists of one driver and multiple executors. The executors run on worker nodes, and each executor runs a set of tasks sent over from the driver. Those tasks are triggered when a job runs, and a job is triggered by an action. A job is in turn split into subsets of tasks called stages; a task is the smallest unit of execution. At runtime you choose the cluster manager (local, standalone, YARN, ...) and specify whether the deploy mode is client or cluster.

A job is triggered by an action; a job may contain one or more stages; each stage holds a set of tasks, and those tasks run inside the executors.

    executors run on the worker nodes

Spark Cache
rdd.cache(): StorageLevel

cache, like a transformation, is lazy: no job is submitted to Spark until an action is hit

If an RDD may be reused in later computations, caching it is recommended

Under the hood, cache() calls persist() with StorageLevel.MEMORY_ONLY
cache = persist

unpersist: executed eagerly (takes effect immediately)
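
A short PySpark sketch of the cache behaviour described above (the RDD and the reuse pattern are made up for illustration):

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext("local[2]", "cache-demo")

    rdd = sc.parallelize(range(1, 1000)).map(lambda x: x * x)

    rdd.cache()                               # lazy: nothing is computed yet
    # rdd.persist(StorageLevel.MEMORY_ONLY)   # equivalent to cache()

    print(rdd.count())   # first action: computes the RDD and fills the cache
    print(rdd.sum())     # second action: reuses the cached partitions

    rdd.unpersist()      # eager: cached data is released immediately

    sc.stop()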

Narrow dependency: a partition of the parent RDD is used by at most one partition of the child RDD

Wide dependency: a partition of the parent RDD is used by multiple partitions of the child RDD, which requires a shuffle

Word-count shuffle example (wide dependency): the (word,1) pairs from both map-side partitions are shuffled so that all "hello" records land in one partition and all "world" records in another.

    map-side partitions              after the shuffle
    (hello,1)
    (hello,1)                        hello: (hello,1) (hello,1) (hello,1)
    (world,1)

    (hello,1)                        world: (world,1) (world,1)
    (world,1)
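
The same word-count shuffle sketched in PySpark (data made up): map() is a narrow dependency, while reduceByKey() moves records with the same key into the same partition and is therefore a wide dependency:

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "dependency-demo")

    words = sc.parallelize(["hello", "hello", "world", "hello", "world"], 2)

    # Narrow dependency: each partition of `pairs` depends on exactly one
    # partition of `words`, so no shuffle is needed.
    pairs = words.map(lambda w: (w, 1))

    # Wide dependency: records with the same key from several parent partitions
    # must end up in the same child partition, which requires a shuffle.
    counts = pairs.reduceByKey(lambda a, b: a + b)

    print(counts.collect())   # e.g. [('hello', 3), ('world', 2)]
    sc.stop()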
