JStorm: ProcessLauncher processes that never exit

Running ps -elf|grep Launcher|wc -l showed a large number of ProcessLauncher processes that had not exited.
Check the supervisor's thread count:

jstack -l 44493|grep waitForProcessExit|wc -l

The supervisor, too, had a great many threads waiting for child processes to exit, and their count matched the number of stuck ProcessLaunchers.
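
Each of those waiting threads is the JVM-side half of a Process.waitFor(): starting a child spawns an internal "process reaper" thread that blocks in the native waitForProcessExit until the child dies. A minimal sketch of the pattern (the sleep child is a stand-in for a launched ProcessLauncher):

public class WaitDemo {
    public static void main(String[] args) throws Exception {
        // start() spawns a JVM-internal "process reaper" thread that sits in
        // the native UNIXProcess.waitForProcessExit until the child is gone.
        Process child = new ProcessBuilder("sleep", "3600").start();
        // waitFor() then blocks on the reaper's result: one stuck child,
        // one permanently waiting thread.
        System.out.println("child exited with " + child.waitFor());
    }
}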

Using jstack, we looked at where the program was stuck:

"main" #1 prio=5 os_prio=0 tid=0x00007fc93c00a000 nid=0x7db9 runnable [0x00007fc9438fb000]
   java.lang.Thread.State: RUNNABLE
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:326)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
	- locked <0x00000000fab1fd98> (a java.io.BufferedOutputStream)
	at java.io.PrintStream.write(PrintStream.java:480)
	- locked <0x00000000fab1fce8> (a java.io.PrintStream)
	at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
	at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
	at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
	- locked <0x00000000fab1ffc0> (a java.io.OutputStreamWriter)
	at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
	at java.io.PrintStream.write(PrintStream.java:527)
	- locked <0x00000000fab1fce8> (a java.io.PrintStream)
	at java.io.PrintStream.print(PrintStream.java:669)
	at java.io.PrintStream.println(PrintStream.java:806)
	- locked <0x00000000fab1fce8> (a java.io.PrintStream)
	at com.alibaba.jstorm.utils.ProcessLauncher.main(ProcessLauncher.java:130)

Or a stack like this one:

   java.lang.Thread.State: RUNNABLE
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:326)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
	- locked <0x00000000fab1fe20> (a java.io.BufferedOutputStream)
	at java.io.PrintStream.write(PrintStream.java:482)
	- locked <0x00000000fab1fd70> (a java.io.PrintStream)
	at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
	at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
	at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
	- locked <0x00000000fab20048> (a java.io.OutputStreamWriter)
	at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
	at java.io.PrintStream.write(PrintStream.java:527)
	- locked <0x00000000fab1fd70> (a java.io.PrintStream)
	at java.io.PrintStream.print(PrintStream.java:683)
	at ch.qos.logback.core.status.OnPrintStreamStatusListenerBase.print(OnPrintStreamStatusListenerBase.java:44)
	at ch.qos.logback.core.status.OnPrintStreamStatusListenerBase.addStatusEvent(OnPrintStreamStatusListenerBase.java:50)
	at ch.qos.logback.core.status.OnConsoleStatusListener.addStatusEvent(OnConsoleStatusListener.java:25)
	at ch.qos.logback.core.BasicStatusManager.fireStatusAddEvent(BasicStatusManager.java:87)
	- locked <0x00000000faea1948> (a ch.qos.logback.core.spi.LogbackLock)
	at ch.qos.logback.core.BasicStatusManager.add(BasicStatusManager.java:59)
	at ch.qos.logback.core.spi.ContextAwareBase.addStatus(ContextAwareBase.java:79)
	at ch.qos.logback.core.spi.ContextAwareBase.addInfo(ContextAwareBase.java:84)
	at ch.qos.logback.core.rolling.RollingPolicyBase.determineCompressionMode(RollingPolicyBase.java:50)
	at ch.qos.logback.core.rolling.TimeBasedRollingPolicy.start(TimeBasedRollingPolicy.java:62)
	at ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy.start(SizeAndTimeBasedRollingPolicy.java:23)
	at ch.qos.logback.core.joran.action.NestedComplexPropertyIA.end(NestedComplexPropertyIA.java:167)
	at ch.qos.logback.core.joran.spi.Interpreter.callEndAction(Interpreter.java:317)
	at ch.qos.logback.core.joran.spi.Interpreter.endElement(Interpreter.java:196)
	at ch.qos.logback.core.joran.spi.Interpreter.endElement(Interpreter.java:182)
	at ch.qos.logback.core.joran.spi.EventPlayer.play(EventPlayer.java:62)
	at ch.qos.logback.core.joran.GenericConfigurator.doConfigure(GenericConfigurator.java:149)
	- locked <0x00000000faea1b58> (a ch.qos.logback.core.spi.LogbackLock)
	at ch.qos.logback.core.joran.GenericConfigurator.doConfigure(GenericConfigurator.java:135)
	at ch.qos.logback.core.joran.GenericConfigurator.doConfigure(GenericConfigurator.java:99)
	at ch.qos.logback.core.joran.GenericConfigurator.doConfigure(GenericConfigurator.java:49)
	at ch.qos.logback.classic.util.ContextInitializer.configureByResource(ContextInitializer.java:75)
	at ch.qos.logback.classic.util.ContextInitializer.autoConfig(ContextInitializer.java:148)
	at org.slf4j.impl.StaticLoggerBinder.init(StaticLoggerBinder.java:85)
	at org.slf4j.impl.StaticLoggerBinder.<clinit>(StaticLoggerBinder.java:55)
	at org.slf4j.LoggerFactory.bind(LoggerFactory.java:128)
	at org.slf4j.LoggerFactory.performInitialization(LoggerFactory.java:107)
	at org.slf4j.LoggerFactory.getILoggerFactory(LoggerFactory.java:295)
	at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:269)
	at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:281)
	at backtype.storm.utils.Utils.<clinit>(Utils.java:103)
	at com.alibaba.jstorm.utils.ProcessLauncher.main(ProcessLauncher.java:157)

The code is stuck inside System.out.println. We first ruled out the usual suspects such as a full disk. Every other thread in the process is a daemon thread, so a thread deadlock is ruled out too.
First, check the open-file counts:

## system-wide limit on open files
cat /proc/sys/fs/file-max
3247228
## number of files currently open
lsof|wc -l
1132023

The total is clearly below the limit. Bear in mind that lsof semantics differ across operating systems; on CentOS, for example, files opened by each thread all show up in lsof, so the number above is heavily double-counted.
Now look at the files opened by one of the stuck processes:

java    3535 jboss5    0r  FIFO                0,8       0t0 2725987468 pipe
java    3535 jboss5    1w  FIFO                0,8       0t0 2725987469 pipe
java    3535 jboss5    2w  FIFO                0,8       0t0 2725987469 pipe

The process clearly has stdin, stdout and stderr open, all of them pipes whose other ends are held by the parent supervisor process. 0t0 means the pipe offset is 0 (a pipe is a ring buffer), which suggests nothing had actually been written into the pipe.

On a worker that had started successfully, we also observed in-process pipes that appear to be related to eventpoll:

java    52771 jboss5 2610u  a_inode                0,9         0       6372 [eventpoll]
java    52771 jboss5 2611r     FIFO                0,8       0t0 2460351397 pipe
java    52771 jboss5 2612w     FIFO                0,8       0t0 2460351397 pipe

Unable to explain that, we decided to just kill the ProcessLaunchers by hand and move on:

ps -ef|grep -i processlauncher |grep -v grep|cut -c 9-15|xargs kill -9

After the purge the world went quiet, and we went back to starting our workers. No luck: they still would not come up. So the stuck ProcessLaunchers are an effect of whatever is wrong with the cluster, not the cause.
Check the thread count first:

cat /proc/5056/status

It reported 115 threads.

The supervisor's GC also looked perfectly healthy:

$ jstat -gc 44493
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
52416.0 52416.0 472.2   0.0   419456.0 135049.7  524288.0   21829.5   17792.0 17285.1 1920.0 1796.8  10014  146.775   0      0.000  146.775

So, for some reason, the Launcher process's writes into its pipe were failing.
The cause could sit on the supervisor side or on the launcher side. Since the supervisor was running in production, we could only start from the launcher. The stack traces point at these two println calls, so we deleted them and switched off log output along the related code paths:

public static void main(String[] args) throws Exception {
    System.out.println("Enviroment:" + System.getenv());
    System.out.println("Properties:" + System.getProperties());

We rebuilt the jstorm-core jar, swapped it into production, and the problem went away.

—————————————— the truth-seeking divider ——————————————
While a launcher was blocked we tested ordinary file reads and writes; files behaved normally.
Next, the Python script below checks whether the operating system's pipes work at all. They do: the OS pipe facility is not the problem.

import os, time

pipe_name = 'pipe_test'

def child():
    # Writer: open the FIFO for writing and emit one line per second.
    pipeout = os.open(pipe_name, os.O_WRONLY)
    counter = 0
    while True:
        time.sleep(1)
        os.write(pipeout, ('Number %03d\n' % counter).encode())
        counter = (counter + 1) % 5

def parent():
    # Reader: block on the FIFO and echo whatever arrives.
    pipein = open(pipe_name, 'r')
    while True:
        line = pipein.readline()[:-1]
        print('Parent %d got "%s" at %s' % (os.getpid(), line, time.time()))

if not os.path.exists(pipe_name):
    os.mkfifo(pipe_name)
pid = os.fork()
if pid != 0:
    parent()
else:
    child()

Next we checked pipes from Java, and there the problem duly appeared:

// Parent: spawns Test2 via ProcessBuilder and never reads the child's
// stdout/stderr, which is exactly what supervisor does with ProcessLauncher.
public class Test {
    public static void main(String[] args) {
        System.out.println("Hello World.");
        String[] command = {"/usr/java/jdk1.8.0_102/bin/java", "-classpath", "/home/umelog", "Test2"};
        ProcessBuilder builder = new ProcessBuilder(command);
        try {
            builder.start();
        } catch (Exception e) {
            e.printStackTrace();
            System.out.println("got exception");
        }
        System.out.println("Hello World.");
        System.out.println(System.getenv().toString());
        System.out.println(System.getProperties().toString());
        while (true) {
            try {
                Thread.sleep(1000);
            } catch (Exception e) {
            }
        }
    }
}

// Child: writes a single >4 KB line (a captured properties dump) to stdout,
// then keeps printing. With a 4 KB pipe and a parent that never reads,
// the big println blocks forever.
public class Test2 {
    public static void main(String[] args) {
                    System.out.println("{java.runtime.name=Java(TM) SE Runtime Environment, sun.boot.library.path=/usr/java/jdk1.8.0_102/jre/lib/amd64, java.vm.version=25.102-b14, java.vm.vendor=Oracle Corporation, java.vendor.url=http://java.oracle.com/, path.separator=:, java.vm.name=Java HotSpot(TM) 64-Bit Server VM, file.encoding.pkg=sun.io, user.country=US, sun.java.launcher=SUN_STANDARD, sun.os.patch.level=unknown, topology.name=demo, java.vm.specification.name=Java Virtual Machine Specification, user.dir=/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/bin, java.runtime.version=1.8.0_102-b14, java.awt.graphicsenv=sun.awt.X11GraphicsEnvironment, java.endorsed.dirs=/usr/java/jdk1.8.0_102/jre/lib/endorsed, os.arch=amd64, java.io.tmpdir=/tmp, line.separator=\n" +
                    ", java.vm.specification.vendor=Oracle Corporation, os.name=Linux, sun.jnu.encoding=UTF-8, java.library.path=/usr/local/lib:/opt/local/lib:/usr/lib:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib, java.specification.name=Java Platform API Specification, java.class.version=52.0, sun.management.compiler=HotSpot 64-Bit Tiered Compilers, os.version=3.10.0-514.el7.x86_64, user.home=/home/jboss5, user.timezone=, java.awt.printerjob=sun.print.PSPrinterJob, file.encoding=UTF-8, java.specification.version=1.8, java.class.path=/opt/app/GDLLprocess/jstorm/jstorm-2.2.1/conf:/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/lib/netty-3.9.0.Final.jar:/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/lib/clojure-1.6.0.jar:/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/jstorm-core-2.2.1.jar:/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/lib/logback-core-1.0.13.jar:/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/lib/slf4j-api-1.7.5.jar:/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/lib/log4j-over-slf4j-1.6.6.jar:/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/lib/servlet-api-2.5.jar:/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/lib/logback-classic-1.0.13.jar:/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/lib/commons-logging-1.1.3.jar:/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/lib/rocksdbjni-4.3.1.jar::/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/data/supervisor/stormdist/demo-209-1537185839/stormjar.jar, user.name=jboss5, java.vm.specification.version=1.8, sun.java.command=com.alibaba.jstorm.utils.ProcessLauncher java -server -Xms2147483648 -Xmx2147483648 -Xmn1073741824 -XX:PermSize=67108864 -XX:MaxPermSize=134217728 -XX:ParallelGCThreads=4 -XX:SurvivorRatio=4 -XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:CMSFullGCsBeforeCompaction=5 -XX:+HeapDumpOnOutOfMemoryError -XX:+UseCMSCompactAtFullCollection -XX:CMSMaxAbortablePrecleanTime=5000 -Xloggc:/opt/applog/GDLLLog/jstorm/demo/demo-worker-6806-gc.log -verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:HeapDumpPath=/opt/applog/GDLLLog/jstorm/demo/java-demo-209-1537185839-20180917200403.hprof -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -Djstorm.home=/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1 -Djstorm.log.dir=/opt/applog/GDLLLog/jstorm -Dlogfile.name=demo-worker-6806.log -Dtopology.name=demo -Dlogback.configurationFile=/opt/app/GDLLprocess/jstorm/jstorm-2.2.1/conf/jstorm.logback.xml -cp /opt/app/GDLLprocess/jstorm/jstorm-2.2.1/conf:/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/lib/netty-3.9.0.Final.jar:/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/lib/clojure-1.6.0.jar:/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/jstorm-core-2.2.1.jar:/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/lib/logback-core-1.0.13.jar:/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/lib/slf4j-api-1.7.5.jar:/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/lib/log4j-over-slf4j-1.6.6.jar:/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/lib/servlet-api-2.5.jar:/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/lib/logback-classic-1.0.13.jar:/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/lib/commons-logging-1.1.3.jar:/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/lib/rocksdbjni-4.3.1.jar::/DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/data/supervisor/stormdist/demo-209-1537185839/stormjar.jar com.alibaba.jstorm.daemon.worker.Worker demo-209-1537185839 294af591-c9d2-4021-b776-12c83ac081b3 6806 cc17c798-a85b-4dcf-b07b-207fbb0eb228 /DATA/app/GDLLprocess/jstorm/jstorm-2.2.1/data/supervisor/stormdist/demo-209-1537185839/stormjar.jar &, java.home=/usr/java/jdk1.8.0_102/jre, sun.arch.data.model=64, 
user.language=en, java.specification.vendor=Oracle Corporation, awt.toolkit=sun.awt.X11.XToolkit, java.vm.info=mixed mode, java.version=1.8.0_102, jstorm.log.dir=/opt/applog/GDLLLog/jstorm, java.ext.dirs=/usr/java/jdk1.8.0_102/jre/lib/ext:/usr/java/packages/lib/ext, logfile.name=demo-worker-6806.log, sun.boot.class.path=/usr/java/jdk1.8.0_102/jre/lib/resources.jar:/usr/java/jdk1.8.0_102/jre/lib/rt.jar:/usr/java/jdk1.8.0_102/jre/lib/sunrsasign.jar:/usr/java/jdk1.8.0_102/jre/lib/jsse.jar:/usr/java/jdk1.8.0_102/jre/lib/jce.jar:/usr/java/jdk1.8.0_102/jre/lib/charsets.jar:/usr/java/jdk1.8.0_102/jre/lib/jfr.jar:/usr/java/jdk1.8.0_102/jre/classes, logback.configurationFile=/opt/app/GDLLprocess/jstorm/jstorm-2.2.1/conf/jstorm.logback.xml, java.vendor=Oracle Corporation, file.separator=/, java.vendor.url.bug=http://bugreport.sun.com/bugreport/, sun.io.unicode.encoding=UnicodeLittle, sun.cpu.endian=little, sun.cpu.isalist=}");

        System.out.println("xxx");
        while(true){
            try{
                Thread.sleep(1000);
                System.out.println("Hello World form 2");
            }catch(Exception e){
            }
        }
    }
}
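
Compile both classes into /home/umelog (the classpath Test hands to the child) and run java Test. The parent never drains the child's stdout, so once Test2 has pushed more bytes than the pipe can hold, its println blocks, exactly like the stuck ProcessLauncher. A healthy 64 KB pipe swallows the >4 KB dump; a 4 KB pipe cannot.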

ulimit reports the pipe size as 8 × 512-byte blocks, i.e. a 4 KB atomic write unit, and a pipe's maximum capacity defaults to 16 × 4 KB = 64 KB. Our single write exceeded 4 KB but stayed well below 64 KB, so having it block did not match my understanding at all. Time to probe the operating system's behavior from C:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
int main()
{
    int i = 0;
    for (; i < 2000; i++) {
        int _pipe[2];
        if (pipe(_pipe) == -1) {
            printf("pipe error\n");
            return 1;
        }
        int ret;
        int count = 0;
        /* make the write end non-blocking so write() fails instead of blocking */
        int flag = fcntl(_pipe[1], F_GETFL);
        fcntl(_pipe[1], F_SETFL, flag | O_NONBLOCK);
        while (1) {
            ret = write(_pipe[1], "A", 1);
            if (ret == -1) {
                printf("error %s\n", strerror(errno));
                break;
            }
            count++;
        }
        /* the pipe fds are deliberately left open, so the per-user
           pipe-buffer accounting keeps growing across iterations */
        printf("%d count=%d\n", i, count);
    }
    return 0;
}
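
Compiled with gcc and run as an unprivileged user, it prints count=65536 for the early iterations and then drops to count=4096. Note that the pipe fds are never closed: each iteration adds another pipe to the user's running tally, which is exactly what exposes the cliff.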

Running the probe shows that, for a single user, pipes get the full 64 KB capacity until roughly 1024 of them have been created; past that point every new pipe's capacity drops to 4 KB. Plenty of operating systems appear to carry a similar setting, yet the web has little to say about it. Nothing for it but to read the Linux source.
https://code.woboq.org/linux/linux/fs/pipe.c.html (highly recommended site)

//pipe.c
unsigned long pipe_user_pages_hard;
unsigned long pipe_user_pages_soft = PIPE_DEF_BUFFERS * INR_OPEN_CUR;
struct pipe_inode_info *alloc_pipe_info(void)
{
	struct pipe_inode_info *pipe;
	unsigned long pipe_bufs = PIPE_DEF_BUFFERS;
	struct user_struct *user = get_current_user();
	unsigned long user_bufs;
	unsigned int max_size = READ_ONCE(pipe_max_size);
	pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL_ACCOUNT);
	if (pipe == NULL)
		goto out_free_uid;
	if (pipe_bufs * PAGE_SIZE > max_size && !capable(CAP_SYS_RESOURCE))
		pipe_bufs = max_size >> PAGE_SHIFT;	
	user_bufs = account_pipe_buffers(user, 0, pipe_bufs); // tally the pipe buffers this user already holds
	if (too_many_pipe_buffers_soft(user_bufs) && is_unprivileged_user()) { // the key condition
		user_bufs = account_pipe_buffers(user, pipe_bufs, 1);
		pipe_bufs = 1;
	}
	...
}
static bool too_many_pipe_buffers_soft(unsigned long user_bufs)
{
	unsigned long soft_limit = READ_ONCE(pipe_user_pages_soft);
	return soft_limit && user_bufs > soft_limit; // i.e. user_bufs > soft_limit
}

Clearly, once the key condition fires, pipe_bufs drops from PIPE_DEF_BUFFERS (16 pages, 64 KB) to a single page (4 KB). And the soft limit itself is pipe_user_pages_soft = PIPE_DEF_BUFFERS * INR_OPEN_CUR = 16 × 1024 = 16384 pages, which is exactly 1024 full-size pipes: the threshold we measured.
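
On kernels that expose the limit as a sysctl (mainline 4.5 and later; the 3.10.0-514.el7 kernel in this environment evidently carries a backport of the same accounting), you can read it directly. A minimal sketch, assuming /proc/sys/fs/pipe-user-pages-soft exists:

import java.nio.file.Files;
import java.nio.file.Paths;

public class PipeSoftLimit {
    public static void main(String[] args) throws Exception {
        // The per-user soft limit, in pages; a full-size pipe costs
        // PIPE_DEF_BUFFERS = 16 pages.
        long softPages = Long.parseLong(
                Files.readAllLines(Paths.get("/proc/sys/fs/pipe-user-pages-soft")).get(0).trim());
        System.out.println("fs.pipe-user-pages-soft = " + softPages + " pages");
        System.out.println("full-size pipes before shrinking kicks in: " + softPages / 16);
    }
}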

Now the task is to figure out where the system's pipes come from. When you hear "pipe", don't think only of parent-child IPC: Java NIO's implementation cannot do without pipes either. Below is the KQueue Selector implementation from my Mac; you can see it creates a pipe via IOUtil whose two ends are read and written by the same process:

    KQueueSelectorImpl(SelectorProvider var1) {
        super(var1);
        long var2 = IOUtil.makePipe(false);
        this.fd0 = (int)(var2 >>> 32);
        this.fd1 = (int)var2;

The read end of that pipe is added to the selector's own select set. select is a blocking call and cannot react to an interrupt on its own, so the implementers register one of their own pipes in the set: when an interrupt (or wakeup) arrives, a byte is written into that pipe, select sees readable data, and the blocking call returns.
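
The wakeup path is easy to watch from the outside. A minimal sketch (class name mine): wakeup() pokes the self-pipe and the blocked select() returns.

import java.nio.channels.Selector;

public class WakeupDemo {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open(); // on a Mac: KQueueSelectorImpl, self-pipe included

        new Thread(() -> {
            try { Thread.sleep(2000); } catch (InterruptedException ignored) {}
            selector.wakeup(); // writes a byte into the selector's internal pipe
        }).start();

        long start = System.currentTimeMillis();
        selector.select(); // blocks until the self-pipe becomes readable
        System.out.println("select() returned after " + (System.currentTimeMillis() - start) + " ms");
        selector.close();
    }
}
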
Knowing this, Redisson, which our code uses to reach Redis, became the prime suspect: Redisson sits on Netty, which in turn sits on Java NIO. Time to count the pipes.

Redisson opens (number of servers in the Redis cluster) × 2 + 1 RedisClient instances; in our environment that works out to 45, so the JVM also holds 45 RedisClientConfig objects.

public class RedisClientConfig {
    private ExecutorService executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() * 2);
    private EventLoopGroup group = new NioEventLoopGroup();

Each RedisClientConfig object carries a NioEventLoopGroup built with default arguments. NioEventLoopGroup is a thread pool, and its default thread count is:

    private static final int DEFAULT_EVENT_LOOP_THREADS = Math.max(1, SystemPropertyUtil.getInt("io.netty.eventLoopThreads", NettyRuntime.availableProcessors() * 2));

My laptop reports 4 logical processors (2 hyperthreaded cores), so the default works out to 8.
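
A quick local sanity check of that default (a sketch assuming Netty 4.x on the classpath; each NioEventLoop opens its Selector eagerly in its constructor, so one event loop means one selector and, on platforms that use one, one wakeup pipe):

import io.netty.channel.nio.NioEventLoopGroup;

public class EventLoopCount {
    public static void main(String[] args) {
        // nThreads = 0 falls through to DEFAULT_EVENT_LOOP_THREADS (2 * logical CPUs)
        NioEventLoopGroup group = new NioEventLoopGroup();
        System.out.println("event loops (selectors): " + group.executorCount());
        group.shutdownGracefully();
    }
}
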
Part of NioEventLoopGroup's machinery:

// MultithreadEventLoopGroup.java
this.children = new EventExecutor[nThreads];
for (int i = 0; i < nThreads; ++i) {
    this.children[i] = this.newChild((Executor) executor, args);
}

// NioEventLoopGroup.java
protected EventLoop newChild(Executor executor, Object... args) throws Exception {
    return new NioEventLoop(this, executor, (SelectorProvider) args[0],
            ((SelectStrategyFactory) args[1]).newSelectStrategy(),
            (RejectedExecutionHandler) args[2]);
}
           

On a Mac, the SelectorProvider ultimately used is:

    public AbstractSelector openSelector() throws IOException {
        return new KQueueSelectorImpl(this);
    }

So a local test of this setup opens roughly 45 × 8 × 2 = 720 pipes in total, and production looks much the same. Start two such Java processes under one user and the pipe count blows past the 1024 threshold: pipe capacity drops to 4 KB, and ProcessLauncher ends up blocked writing its >4 KB startup dump into its stdout pipe.
