Flink Source Code Study: Starting the Flink Master Node

Flink Master Node Startup

ClusterEntrypoint: the cluster startup entry point

Flink uses a master-slave architecture: the master node is the JobManager, and the worker nodes are the TaskManagers.

The JobManager is the master node of a Flink cluster and mainly consists of three components:

1. ResourceManager
	The cluster-wide resource manager of Flink. There is only one per cluster; it is responsible for managing and allocating slots, and it also provides heartbeat services.
	
2. Dispatcher
	Receives the JobGraph submitted by the user and then starts a JobMaster for it; the JobMaster plays a role similar to the AppMaster in a YARN cluster. The Dispatcher also contains a persistent service, the JobGraphStore, which stores information about the jobs submitted to the JobManager and is used to recover jobs after the master node fails.
	
3. WebMonitorEndpoint: the REST server (a Netty server)
	It maintains a large number of Handlers and starts a Netty server to receive external REST requests.
	When a client submits a job to the Flink cluster via flink run, the request is ultimately
	received and processed by the WebMonitorEndpoint, which resolves the route and decides which Handler handles it.
	For example: submitJob ===> JobSubmitHandler
		The Router has a large set of Handlers bound to it.

When a client submits a job to the cluster via REST (the client packages the job into a JobGraph object), the request is received by the WebMonitorEndpoint. Internally, the WebMonitorEndpoint uses a Router to resolve the route and find the corresponding Handler. After the request has been processed, it is handed over to the Dispatcher, which starts a JobMaster responsible for requesting slot resources and for deploying and executing the tasks of this job. The slot resources needed while the job runs are requested by the JobMaster from the ResourceManager.

The main startup class of the JobManager is ClusterEntrypoint.

All of the different entry point combinations eventually call ClusterEntrypoint.runClusterEntrypoint(...) to start the cluster, as shown in the sketch and call chain below.
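
For reference, in a standalone session cluster the concrete entry class is StandaloneSessionClusterEntrypoint; a simplified sketch of its main() method (argument parsing abbreviated) shows how every deployment ends up in runClusterEntrypoint:

    public static void main(String[] args) {
        // startup checks, signal handling and shutdown hooks
        EnvironmentInformation.logEnvironmentInfo(
                LOG, StandaloneSessionClusterEntrypoint.class.getSimpleName(), args);
        SignalHandler.register(LOG);
        JvmShutdownSafeguard.installAsShutdownHook(LOG);

        // parse the command line and load flink-conf.yaml (abbreviated here)
        Configuration configuration = loadConfiguration(...);

        StandaloneSessionClusterEntrypoint entrypoint =
                new StandaloneSessionClusterEntrypoint(configuration);

        // the common startup path shared by all entry points
        ClusterEntrypoint.runClusterEntrypoint(entrypoint);
    }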

// Start the cluster
ClusterEntrypoint.runClusterEntrypoint(entrypoint)
	// Start the components, configure the file system instance, etc.
    ClusterEntrypoint.startCluster();
		runCluster(configuration, pluginManager)
            // Step 1: initialize the various services (8 basic services); the most important ones
            // are the HaServices, the RpcService and the HeartbeatServices
            initializeServices(configuration, pluginManager)

            // Step 2: create the factories for the component instances, mainly the following three:
            // - the ResourceManager factory
            // - the Dispatcher factory
            // - the WebMonitorEndpoint factory
            createDispatcherResourceManagerComponentFactory(configuration);

            // Step 3: create the components the cluster needs to run: WebMonitorEndpoint, Dispatcher, ResourceManager, ...
            // - create and start the ResourceManager
            // - create and start the Dispatcher
            // - create and start the WebMonitorEndpoint
            clusterComponent = dispatcherResourceManagerComponentFactory.create(...)

The code inside initializeServices() (step 1):

            // Start the RpcService (the Akka-based RPC service)
            commonRpcService =
                    AkkaRpcServiceUtils.createRemoteRpcService(
                            configuration,
                            configuration.getString(JobManagerOptions.ADDRESS),
                            getRPCPortRange(configuration),
                            configuration.getString(JobManagerOptions.BIND_HOST),
                            configuration.getOptional(JobManagerOptions.RPC_BIND_PORT));

            // Start a JMXService so that clients can connect to the JobManager JVM for monitoring
            JMXService.startInstance(configuration.getString(JMXServerOptions.JMX_SERVER_PORT));

            // update the configuration used to create the high availability services
            configuration.setString(JobManagerOptions.ADDRESS, commonRpcService.getAddress());
            configuration.setInteger(JobManagerOptions.PORT, commonRpcService.getPort());

            // Create the I/O thread pool; its default size is 4 * number of CPU cores
            ioExecutor =
                    Executors.newFixedThreadPool(
                            ClusterEntrypointUtils.getPoolSize(configuration),
                            new ExecutorThreadFactory("cluster-io"));
            // Create the high-availability services
            haServices = createHaServices(configuration, ioExecutor);

            // The BlobServer mainly manages the upload and storage of large files (job JARs, etc.)
            blobServer = new BlobServer(configuration, haServices.createBlobStore());
            blobServer.start();
            // Heartbeat services; heartbeats are exchanged between:
            // - ResourceManager and TaskExecutor
            // - ResourceManager and JobMaster
            // - JobMaster and TaskExecutor
            heartbeatServices = createHeartbeatServices(configuration);
            // Metrics (monitoring) related services
            metricRegistry = createMetricRegistry(configuration, pluginManager);

            final RpcService metricQueryServiceRpcService =
                    MetricUtils.startRemoteMetricsRpcService(
                            configuration, commonRpcService.getAddress());
            metricRegistry.startQueryService(metricQueryServiceRpcService, null);

            final String hostname = RpcUtils.getHostname(commonRpcService);

            processMetricGroup =
                    MetricUtils.instantiateProcessMetricGroup(
                            metricRegistry,
                            hostname,
                            ConfigurationUtils.getSystemResourceMetricsProbingInterval(
                                    configuration));
            // Store for execution graph information; it is called executionGraphInfoStore since Flink 1.13, in 1.12 it was archivedExecutionGraphStore
            executionGraphInfoStore =
                    createSerializableExecutionGraphStore(
                            configuration, commonRpcService.getScheduledExecutor());	
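
As a concrete example of one of these services, createHeartbeatServices(configuration) essentially just reads the two heartbeat options and builds a HeartbeatServices instance. A minimal sketch, assuming it mirrors HeartbeatServices.fromConfiguration:

    // Sketch: how the heartbeat services are derived from the configuration
    long heartbeatInterval = configuration.getLong(HeartbeatManagerOptions.HEARTBEAT_INTERVAL); // heartbeat.interval, default 10 s
    long heartbeatTimeout  = configuration.getLong(HeartbeatManagerOptions.HEARTBEAT_TIMEOUT);  // heartbeat.timeout, default 50 s
    HeartbeatServices heartbeatServices = new HeartbeatServices(heartbeatInterval, heartbeatTimeout);

These two values drive all three heartbeat relationships listed above.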

Step 2: createDispatcherResourceManagerComponentFactory(configuration) initializes the factories for the main components (see the sketch after the list):

1. DispatcherRunnerFactory, default implementation DefaultDispatcherRunnerFactory, which produces a DefaultDispatcherRunner
2. ResourceManagerFactory, default implementation StandaloneResourceManagerFactory, which produces a StandaloneResourceManager
3. RestEndpointFactory, default implementation SessionRestEndpointFactory, which produces a DispatcherRestEndpoint
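
A minimal sketch of how these factories are wired together in a standalone session cluster, assuming the usual path through DefaultDispatcherResourceManagerComponentFactory.createSessionComponentFactory (exact accessor names may differ slightly between Flink versions):

    // In StandaloneSessionClusterEntrypoint#createDispatcherResourceManagerComponentFactory (sketch):
    return DefaultDispatcherResourceManagerComponentFactory.createSessionComponentFactory(
            StandaloneResourceManagerFactory.getInstance());

    // which internally builds roughly:
    new DefaultDispatcherResourceManagerComponentFactory(
            DefaultDispatcherRunnerFactory.createSessionRunner(SessionDispatcherFactory.INSTANCE), // -> DefaultDispatcherRunner
            StandaloneResourceManagerFactory.getInstance(),                                        // -> StandaloneResourceManager
            SessionRestEndpointFactory.INSTANCE);                                                  // -> DispatcherRestEndpoint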

Step 3: dispatcherResourceManagerComponentFactory.create(...) creates the three key components:

1. DispatcherRunner, implementation: DefaultDispatcherRunner
2. ResourceManager, implementation: StandaloneResourceManager
3. WebMonitorEndpoint, implementation: DispatcherRestEndpoint

The overall startup flow is illustrated by a figure in the original article (JobManager startup sequence; figure omitted).

WebMonitorEndpoint Initialization and Startup

Entry point:

DispatcherResourceManagerComponentFactory.create(....)

The code executed is as follows:

// Create the webMonitorEndpoint
webMonitorEndpoint =
     restEndpointFactory.createRestEndpoint(
            configuration,
            dispatcherGatewayRetriever,
            resourceManagerGatewayRetriever,
            blobServer,
            executor,
            metricFetcher,
            highAvailabilityServices.getClusterRestEndpointLeaderElectionService(),
            fatalErrorHandler);
// Start the webMonitorEndpoint
webMonitorEndpoint.start();

start(){
    synchronized (lock) {
            Preconditions.checkState(
                    state == State.CREATED, "The RestServerEndpoint cannot be restarted.");

            log.info("Starting rest endpoint.");
            // Create the Router
            final Router router = new Router();
            final CompletableFuture<String> restAddressFuture = new CompletableFuture<>();
            // Initialize the handlers
            handlers = initializeHandlers(restAddressFuture);

            /* sort the handlers such that they are ordered the following:
             * /jobs
             * /jobs/overview
             * /jobs/:jobid
             * /jobs/:jobid/config
             * /:*
             */
            // Sort the handlers
            Collections.sort(handlers, RestHandlerUrlComparator.INSTANCE);

            checkAllEndpointsAndHandlersAreUnique(handlers);
            // Register the handlers with the Router
            handlers.forEach(handler -> registerHandler(router, handler, log));

            ChannelInitializer<SocketChannel> initializer =
                    new ChannelInitializer<SocketChannel>() {

                        @Override
                        protected void initChannel(SocketChannel ch) {
                            RouterHandler handler = new RouterHandler(router, responseHeaders);

                            // SSL should be the first handler in the pipeline
                            if (isHttpsEnabled()) {
                                ch.pipeline()
                                        .addLast(
                                                "ssl",
                                                new RedirectingSslHandler(
                                                        restAddress,
                                                        restAddressFuture,
                                                        sslHandlerFactory));
                            }

                            ch.pipeline()
                                    .addLast(new HttpServerCodec())
                                    .addLast(new FileUploadHandler(uploadDir))
                                    .addLast(
                                            new FlinkHttpObjectAggregator(
                                                    maxContentLength, responseHeaders))
                                    .addLast(new ChunkedWriteHandler())
                                    .addLast(handler.getName(), handler)
                                    .addLast(new PipelineErrorHandler(log, responseHeaders));
                        }
                    };

            NioEventLoopGroup bossGroup =
                    new NioEventLoopGroup(
                            1, new ExecutorThreadFactory("flink-rest-server-netty-boss"));
            NioEventLoopGroup workerGroup =
                    new NioEventLoopGroup(
                            0, new ExecutorThreadFactory("flink-rest-server-netty-worker"));

            bootstrap = new ServerBootstrap();
            bootstrap
                    .group(bossGroup, workerGroup)
                    .channel(NioServerSocketChannel.class)
                    .childHandler(initializer);
            // Determine the port(s) for the Netty server
            Iterator<Integer> portsIterator;
            try {
                portsIterator = NetUtils.getPortRangeFromString(restBindPortRange);
            } catch (IllegalConfigurationException e) {
                throw e;
            } catch (Exception e) {
                throw new IllegalArgumentException(
                        "Invalid port range definition: " + restBindPortRange);
            }

            int chosenPort = 0;
            // Try to bind the Netty server to a port, until binding succeeds or all configured ports have been tried
            while (portsIterator.hasNext()) {
                try {
                    chosenPort = portsIterator.next();
                    final ChannelFuture channel;
                    if (restBindAddress == null) {
                        channel = bootstrap.bind(chosenPort);
                    } else {
                        channel = bootstrap.bind(restBindAddress, chosenPort);
                    }
                    serverChannel = channel.syncUninterruptibly().channel();
                    break;
                } catch (final Exception e) {
                    // continue if the exception is due to the port being in use, fail early
                    // otherwise
                    if (!(e instanceof org.jboss.netty.channel.ChannelException
                            || e instanceof java.net.BindException)) {
                        throw e;
                    }
                }
            }

            if (serverChannel == null) {
                throw new BindException(
                        "Could not start rest endpoint on any port in port range "
                                + restBindPortRange);
            }

            log.debug("Binding rest endpoint to {}:{}.", restBindAddress, chosenPort);

            final InetSocketAddress bindAddress = (InetSocketAddress) serverChannel.localAddress();
            final String advertisedAddress;
            if (bindAddress.getAddress().isAnyLocalAddress()) {
                advertisedAddress = this.restAddress;
            } else {
                advertisedAddress = bindAddress.getAddress().getHostAddress();
            }
            final int port = bindAddress.getPort();

            log.info("Rest endpoint listening at {}:{}", advertisedAddress, port);

            restBaseUrl = new URL(determineProtocol(), advertisedAddress, port, "").toString();

            restAddressFuture.complete(restBaseUrl);

            state = State.RUNNING;
 			// Invoke the endpoint's onStart() method.
        	// This runs leader election, writes the leader's information to ZooKeeper,
        	// and starts the periodic ExecutionGraph cleanup task
            startInternal();
        }
}


Startup workflow:

  1. Create a Router and a large number of Handlers; after checking for duplicates and sorting the handlers, register each of them with the Router (see the route sketch after this list).
  2. Start a Netty server.
  3. Start the internal services: run leader election and write the leader's information to ZooKeeper.
  4. Start the periodic ExecutionGraph cleanup task.
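
To make the Router-to-Handler mapping concrete, here is a minimal sketch of how the job submission route is registered, assuming it follows WebMonitorEndpoint#initializeHandlers (constructor arguments are abbreviated and named loosely here):

    // Sketch: one of the (spec, handler) pairs returned by initializeHandlers()
    JobSubmitHandler jobSubmitHandler =
            new JobSubmitHandler(
                    dispatcherGatewayRetriever,  // GatewayRetriever<DispatcherGateway>
                    timeout,
                    responseHeaders,
                    executor,
                    clusterConfiguration);

    // JobSubmitHeaders describes the REST endpoint: HTTP POST on /jobs
    handlers.add(Tuple2.of(JobSubmitHeaders.getInstance(), jobSubmitHandler));

    // Later, in start(): handlers.forEach(handler -> registerHandler(router, handler, log));
    // so an incoming "POST /jobs" request (e.g. from flink run) is routed to the JobSubmitHandler,
    // which deserializes the JobGraph and forwards it to the Dispatcher via DispatcherGateway.submitJob(...)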

ResourceManager Initialization and Startup

Core entry point:

DispatcherResourceManagerComponentFactory.create(....)
            // Create the ResourceManager
            resourceManager =
                    resourceManagerFactory.createResourceManager(
                            configuration,
                            ResourceID.generate(),
                            rpcService,
                            highAvailabilityServices,
                            heartbeatServices,
                            fatalErrorHandler,
                            new ClusterInformation(hostname, blobServer.getPort()),
                            webMonitorEndpoint.getRestBaseUrl(),
                            metricRegistry,
                            hostname,
                            ioExecutor);

			resourceManager.start(); // invokes ResourceManager's onStart() method

			// The method in which the ResourceManager actually starts
			ResourceManager.onStart(){
            	log.info("Starting the resource manager.");
            	// Start the ResourceManager services. This mainly starts the leader election service;
            	// once the election completes, the heartbeat services and the slot management services are started on the leader
            	startResourceManagerServices();
            }
			
			startResourceManagerServices() throws Exception {
        		try {
            		// Obtain the leader election service
            		leaderElectionService =
                    	highAvailabilityServices.getResourceManagerLeaderElectionService();

            		initialize();
					// Start the election; when it succeeds, ResourceManager's grantLeadership() method is called back
            		leaderElectionService.start(this);
            		jobLeaderIdService.start(new JobLeaderIdActionsImpl());

           			 registerMetrics();
        		} catch (Exception e) {
            		handleStartResourceManagerServicesException(e);
       		 }
    	}


		// The method called back once the election has been won
		grantLeadership(){
              final CompletableFuture<Boolean> acceptLeadershipFuture =
                clearStateFuture.thenComposeAsync(
                  // this starts the heartbeat services and the SlotManager
                        (ignored) -> tryAcceptLeadership(newLeaderSessionID),
                        getUnfencedMainThreadExecutor());

        final CompletableFuture<Void> confirmationFuture =
                acceptLeadershipFuture.thenAcceptAsync(
                        (acceptLeadership) -> {
                            if (acceptLeadership) {
                                // confirming the leader session ID might be blocking,
                                leaderElectionService.confirmLeadership(
                                        newLeaderSessionID, getAddress());
                            }
                        },
                        ioExecutor);

        confirmationFuture.whenComplete(
                (Void ignored, Throwable throwable) -> {
                    if (throwable != null) {
                        onFatalError(ExceptionUtils.stripCompletionException(throwable));
                    }
                });
        }

		tryAcceptLeadership(newLeaderSessionID){
          // Start the heartbeat services and the SlotManager on the leader node
          startServicesOnLeadership();   
        }
			
    startServicesOnLeadership() {
        // Start the heartbeat services
        startHeartbeatServices();

        slotManager.start(getFencingToken(), getMainThreadExecutor(), new ResourceActionsImpl());

        onLeadership();
    }

    startHeartbeatServices() {
        // Create and start the heartbeat sender towards the TaskManagers (runs on the main thread executor)
        taskManagerHeartbeatManager =
                heartbeatServices.createHeartbeatManagerSender(
                        resourceId,
                        new TaskManagerHeartbeatListener(),
                        getMainThreadExecutor(),
                        log);
        
		// Create and start the heartbeat sender towards the JobMasters
        jobManagerHeartbeatManager =
                heartbeatServices.createHeartbeatManagerSender(
                        resourceId,
                        new JobManagerHeartbeatListener(),
                        getMainThreadExecutor(),
                        log);
    }


	slotManager.start(getFencingToken(), getMainThreadExecutor(), new ResourceActionsImpl()){
        this.resourceManagerId = Preconditions.checkNotNull(newResourceManagerId);
        mainThreadExecutor = Preconditions.checkNotNull(newMainThreadExecutor);
        resourceActions = Preconditions.checkNotNull(newResourceActions);

        started = true;
		// Start the periodic task that checks whether TaskManagers have timed out (and checks redundancy)
        taskManagerTimeoutsAndRedundancyCheck =
                scheduledExecutor.scheduleWithFixedDelay(
                        () ->
                                mainThreadExecutor.execute(
                                        () -> checkTaskManagerTimeoutsAndRedundancy()),
                        0L,
                        taskManagerTimeout.toMilliseconds(),
                        TimeUnit.MILLISECONDS);
		// Start the periodic task that checks whether slot requests have timed out
        slotRequestTimeoutCheck =
                scheduledExecutor.scheduleWithFixedDelay(
                        () -> mainThreadExecutor.execute(() -> checkSlotRequestTimeouts()),
                        0L,
                        slotRequestTimeout.toMilliseconds(),
                        TimeUnit.MILLISECONDS);

	}
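
The two timeouts that drive these periodic checks come from the cluster configuration. A rough sketch of where they originate, assuming the usual SlotManagerConfiguration wiring (option classes and getters may differ slightly between Flink versions):

    // Sketch (assumed): the timeouts used by the two scheduled checks above
    Time taskManagerTimeout = Time.milliseconds(
            configuration.getLong(ResourceManagerOptions.TASK_MANAGER_TIMEOUT)); // resourcemanager.taskmanager-timeout, default 30 s
    Time slotRequestTimeout = Time.milliseconds(
            configuration.getLong(JobManagerOptions.SLOT_REQUEST_TIMEOUT));      // slot.request.timeout, default 5 min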

Startup flow analysis:

1. ResourceManager is a subclass of RpcEndpoint, so once the ResourceManager object has been constructed, start() is called on this RpcEndpoint, which in turn invokes its onStart() method.
2. ResourceManager is also a LeaderContender: it takes part in leader election through the LeaderElectionService. When it wins the election, the election driver's isLeader() callback fires, which eventually calls grantLeadership() on the ResourceManager (see the LeaderContender sketch after this list).
3. The services the ResourceManager needs are then started:
Two heartbeat services:
- the heartbeat between ResourceManager and TaskExecutor
- the heartbeat between ResourceManager and JobMaster
Two periodic tasks:
- checkTaskManagerTimeoutsAndRedundancy(), which checks TaskExecutor timeouts
- checkSlotRequestTimeouts(), which checks SlotRequest timeouts
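
For reference, the LeaderContender contract that the ResourceManager (and, below, the Dispatcher leader process) implements looks roughly like this. A simplified sketch; the exact method set varies slightly across Flink versions:

    // Simplified view of the callback interface used by the leader election service
    public interface LeaderContender {

        // called when this contender wins the election; the ResourceManager reacts by starting
        // its heartbeat services and the SlotManager (see startServicesOnLeadership above)
        void grantLeadership(UUID leaderSessionID);

        // called when leadership is lost again
        void revokeLeadership();

        // called when the election service hits an unrecoverable error
        void handleError(Exception exception);
    }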

Dispatcher Initialization and Startup

// Create and start the DispatcherRunner
dispatcherRunner = dispatcherRunnerFactory.createDispatcherRunner(
            highAvailabilityServices.getDispatcherLeaderElectionService(), fatalErrorHandler,
            // note the third argument: the JobGraphStore factory
            new HaServicesJobGraphStoreFactory(highAvailabilityServices),
            ioExecutor, rpcService, partialDispatcherServices
         );
dispatcher = createDispatcher();
dispatcher.start();
		// node change notification (ZooKeeper watcher)
		nodeChanged();
			leaderElectionEventHandler.onLeaderInformationChange();
				// write the leader's information to ZooKeeper
				leaderElectionDriver.writeLeaderInformation(confirmedLeaderInfo);
		// leader election callback
		isLeader();
			leaderElectionEventHandler.onGrantLeadership();
				leaderContender.grantLeadership(issuedLeaderSessionID);	
					startNewDispatcherLeaderProcess(leaderSessionID)
       					stopDispatcherLeaderProcess();
						dispatcherLeaderProcess = createNewDispatcherLeaderProcess(leaderSessionID);
						newDispatcherLeaderProcess::start
         					startInternal()
           					onStart();
								// start the JobGraphStore
								startServices();
								// recover jobs
								recoverJobsAsync()

Startup flow:

  1. Start the JobGraphStore service.
  2. Recover the jobs from the JobGraphStore and start the Dispatcher so that they can be executed (see the sketch below).
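
The recovery step is essentially an iteration over the JobGraphStore. A minimal sketch of what recoverJobsAsync() boils down to, assuming it follows SessionDispatcherLeaderProcess (simplified; in the real code this runs asynchronously on the I/O executor):

    // Sketch: recover all persisted JobGraphs from the JobGraphStore
    Collection<JobGraph> recoverJobs() throws Exception {
        Collection<JobID> jobIds = jobGraphStore.getJobIds();
        Collection<JobGraph> recoveredJobGraphs = new ArrayList<>();
        for (JobID jobId : jobIds) {
            // reads the serialized JobGraph back from the HA store (e.g. ZooKeeper plus a file-system state handle)
            recoveredJobGraphs.add(jobGraphStore.recoverJobGraph(jobId));
        }
        return recoveredJobGraphs;
    }
    // The recovered jobs are then handed to the newly created Dispatcher, which re-submits and runs them.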
