Spark Timeout Restart / Auto-Relaunch Script (Python)

Program Description

Without an auto-relaunch script

  1. Spark Streaming jobs are often unstable: network latency or the loss of a few nodes can make a batch run far longer than normal, which hurts the job's real-time behavior.
  2. Spark Streaming jobs can also go down for reasons outside our control, interrupting data processing.

What this script does

  1. It uses the Spark job's application ID in YARN to decide whether the job has gone down; if it has, the script launches the Spark program.
  2. It reads the current batch's processing time from the YARN web UI; if a batch runs too long, the script restarts the Spark program (a scheduling sketch for these checks follows this list).
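
These checks only help if they run on a schedule rather than once. A cron entry is the usual choice; as a self-contained sketch, a plain Python loop also works. The check_and_restart wrapper name and the 5-minute interval below are assumptions, not part of the original script:

# -*- coding: utf-8 -*-
# Minimal scheduling sketch: run the watchdog check every few minutes.
# check_and_restart() is a hypothetical wrapper around the monitoring logic
# shown in the complete code below; the 5-minute interval is an assumption.
import time

def check_and_restart():
    pass  # place the applicationId / batch-time checks from the full script here

while True:
    check_and_restart()
    time.sleep(5 * 60)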

Program Design

  1. Get the program's ApplicationId
  • Shell command:
yarn application -list | grep Kafka2Spark2HBase
  • Use Python's built-in subprocess library to run the shell command and read its output; index 0 of the split result is the applicationId.
# -*- coding: utf-8 -*-
import subprocess

# Run the yarn command and capture its stdout
application = subprocess.Popen('yarn application -list | grep Kafka2Spark2HBase ', shell=True, stdout=subprocess.PIPE)
application.wait()
line = str(application.stdout.read())
# The first field of the matching line is the applicationId
applicationId = line.split(' ', 1)[0]
print line
print applicationId
  • Notes

Kafka2Spark2HBase is the name of my Spark Streaming program; replace it with your own application's entry name.
If the result is empty, the program is not running, which triggers the crash-relaunch logic (included in the complete code). A slightly more defensive way to parse the yarn output is sketched below.
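
The yarn output columns may be separated by tabs or multiple spaces, so splitting on a single literal space can occasionally misparse the line. A more defensive variant (a sketch, not the original script's approach) splits on any whitespace and keeps the first token that looks like an application ID:

# -*- coding: utf-8 -*-
# Defensive variant of the applicationId extraction: split on any whitespace
# and take the first field that starts with "application_".
import subprocess

p = subprocess.Popen('yarn application -list | grep Kafka2Spark2HBase',
                     shell=True, stdout=subprocess.PIPE)
p.wait()
output = p.stdout.read()

applicationId = ''
for row in output.splitlines():
    fields = row.split()                      # handles tabs and repeated spaces
    if fields and fields[0].startswith('application_'):
        applicationId = fields[0]             # first matching application wins
        break

print(applicationId)                          # empty string: job is not running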

  2. Use the ApplicationId to fetch the YARN web page and read the elapsed time of the currently running batch
# (continues from step 1: applicationId is already set)
import urllib2
from bs4 import BeautifulSoup

url = 'http://ip:8088/proxy/' + applicationId
page = urllib2.urlopen(url)
contents = page.read()
# contents now holds the full page source
soup = BeautifulSoup(contents, "html.parser")

table = soup.find('table', id='activeJob-table')
tbody = table.find('tbody')
tds = tbody.find_all('td')
job_id = int(tds[0].get_text().strip())
job_time = tds[3].get_text().strip()
time_num = float(job_time.split(' ', 1)[0])   # numeric part, e.g. 2.5
time_unit = job_time.split(' ', 1)[1]         # unit: ms / s / min / h
print job_id, job_time
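
The duration string combines a number and a unit (ms, s, min or h). Instead of handling each unit with a separate case_* function, as the complete code below does, the value can also be normalized to seconds and compared against one threshold. This is a sketch; the 60-second limit is an assumption:

# -*- coding: utf-8 -*-
# Normalize the "<number> <unit>" duration from the YARN/Spark UI to seconds
# and compare it against a single threshold (60 s here is an assumption).
UNIT_TO_SECONDS = {'ms': 0.001, 's': 1.0, 'min': 60.0, 'h': 3600.0}

def duration_seconds(job_time):
    num, unit = job_time.split(' ', 1)
    # Unknown units fall back to hours, erring on the side of a restart
    return float(num) * UNIT_TO_SECONDS.get(unit, 3600.0)

def batch_is_healthy(job_time, limit_seconds=60):
    return duration_seconds(job_time) < limit_seconds

print(batch_is_healthy('45 s'))     # True
print(batch_is_healthy('3.5 min'))  # False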

Complete Code (Python)

# -*- coding: utf-8 -*-
import urllib2
import os
import time
import subprocess
from bs4 import BeautifulSoup

application = subprocess.Popen('yarn application -list | grep Kafka2Spark2HBase ', shell=True, stdout=subprocess.PIPE)
application.wait()
line = str(application.stdout.read())
applicationId = line.split(' ', 1)[0]

# An empty applicationId means the job is not running in YARN at all,
# so there is nothing to kill: just relaunch it.
if len(applicationId) == 0:
    print '============================='
    print time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
    os.system('/home/spark/streaming-start.sh ')
    exit("Job was down, relaunched it")

# Each case_* function decides whether a batch duration in the given unit is
# still acceptable (True = healthy, False = too long, restart needed).
def case_ms(t):
    # Milliseconds: always fine
    return True

def case_s(t):
    # Seconds: fine as long as the batch stays under 59 seconds
    return t < 59

def case_min(t):
    # Minutes: already too long
    return False

def case_h(t):
    # Hours: far too long
    return False

def default(t):
    return False

url = 'http://tz-nn-01.rdsp.com:8088/proxy/' + applicationId
page = urllib2.urlopen(url)
contents = page.read()
# contents now holds the full page source
soup = BeautifulSoup(contents, "html.parser")

table = soup.find('table', id='activeJob-table')
tbody = table.find('tbody')
tds = tbody.find_all('td')
job_id = int(tds[0].get_text().strip())
job_time = tds[3].get_text().strip()
time_num = float(job_time.split(' ', 1)[0])   # numeric part, e.g. 2.5
time_unit = job_time.split(' ', 1)[1]         # unit: ms / s / min / h

flag = False
if job_id > 0:
    # Not the first batch: judge by the unit of the reported duration
    if time_unit == 'ms':
        flag = case_ms(time_num)
    if time_unit == 's':
        flag = case_s(time_num)
    if time_unit == 'min':
        flag = case_min(time_num)
    if time_unit == 'h':
        flag = case_h(time_num)

if job_id == 0:
    # The first job (id 0) gets more leeway: only durations above 2 hours count as a timeout
    flag = True
    if time_unit == 'h':
        if time_num > 2:
            flag = False

print time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()), flag, applicationId, job_id, job_time

if flag is False:
    print '========================================'
    print time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()), flag, applicationId, job_id, job_time
    os.system('yarn application -kill ' + applicationId + ' ')
    os.system('/home/spark/streaming-start.sh ')
    print 'Batch timed out, job restarted. Batch ID:', job_id
    print '========================================'
