Apache Pig DataBag: OOM caused by too many spill files


Recently I ran into another failing Pig job. Pulling down the heap dump and analyzing it turned up something almost comical:

[Figure 1: heap dump analysis]

I would never have guessed that DeleteOnExitHook could be the cause of an OOM.

Looking at the source of java.io.File:

    public void deleteOnExit() {
        SecurityManager security = System.getSecurityManager();
        if (security != null) {
            security.checkDelete(path);
        }
        DeleteOnExitHook.add(path);
    }

And DeleteOnExitHook looks like this:

    class DeleteOnExitHook {
        static {
            sun.misc.SharedSecrets.getJavaLangAccess()
                .registerShutdownHook(2 /* Shutdown hook invocation order */,
                    new Runnable() {
                        public void run() {
                            runHooks();
                        }
                    });
        }

        private static LinkedHashSet<String> files = new LinkedHashSet<String>();

        private DeleteOnExitHook() {}

        static synchronized void add(String file) {
            if (files == null)
                throw new IllegalStateException("Shutdown in progress");

            files.add(file);
        }

        static void runHooks() {
            LinkedHashSet<String> theFiles;

            synchronized (DeleteOnExitHook.class) {
                theFiles = files;
                files = null;
            }

            ArrayList<String> toBeDeleted = new ArrayList<String>(theFiles);

            // reverse the list to maintain previous jdk deletion order.
            // Last in first deleted.
            Collections.reverse(toBeDeleted);
            for (String filename : toBeDeleted) {
                (new File(filename)).delete();
            }
        }
    }

The cause is now obvious: far too many files had been registered with deleteOnExit. Going by the previous incident and this heap dump, it is once again the Pig DataBag spill-file problem (see the earlier post: Apache Pig Reducer OOM fix -- DataBag). Note that DeleteOnExitHook only ever adds paths; there is no API to remove an entry once it has been registered.

So how does Pig DataBag write its spill files?

The spill-file creation logic lives in DefaultAbstractBag:

    protected DataOutputStream getSpillFile() throws IOException {
        if (mSpillFiles == null) {
            // We want to keep the list as small as possible.
            mSpillFiles = new FileList(1);
        }

        String tmpDirName = System.getProperties().getProperty("java.io.tmpdir");
        File tmpDir = new File(tmpDirName);

        // if the directory does not exist, create it.
        if (!tmpDir.exists()) {
            log.info("Temporary directory doesn't exists. Trying to create: " + tmpDir.getAbsolutePath());
            // Create the directory and see if it was successful
            if (tmpDir.mkdir()) {
                log.info("Successfully created temporary directory: " + tmpDir.getAbsolutePath());
            } else {
                // If execution reaches here, it means that we needed to create the directory but
                // were not successful in doing so.
                //
                // If this directory is created recently then we can simply
                // skip creation. This is to address a rare issue occuring in a cluster despite
                // the fact that spill() makes call to getSpillFile() in a synchronized
                // block.
                if (tmpDir.exists()) {
                    log.info("Temporary directory already exists: " + tmpDir.getAbsolutePath());
                } else {
                    int errCode = 2111;
                    String msg = "Unable to create temporary directory: " + tmpDir.getAbsolutePath();
                    throw new ExecException(msg, errCode, PigException.BUG);
                }
            }
        }
        File f = File.createTempFile("pigbag", null);
        f.deleteOnExit();
        mSpillFiles.add(f);
        return new DataOutputStream(new BufferedOutputStream(
            new FileOutputStream(f)));
    }

Pig's DataBag uses one spill file per element, which leads to an enormous number of deleteOnExit calls. On top of that, Hadoop reuses child JVM processes, so the spill-file registrations from one task after another keep piling up and are never released, and the process eventually runs out of memory. The sketch below illustrates the pattern.
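
To make the failure mode concrete, here is a minimal sketch of the pattern (illustrative only, not Pig code; the loop counts are made up): every spill registers a fresh temp file with deleteOnExit, and in a long-lived, reused child JVM those registrations outlive the tasks that created them.

    import java.io.File;
    import java.io.IOException;

    // Minimal sketch of the leaking pattern (illustrative only, not Pig code).
    public class SpillLeakSketch {
        public static void main(String[] args) throws IOException {
            for (int task = 0; task < 10; task++) {           // simulated tasks in a reused child JVM
                for (int spill = 0; spill < 1000; spill++) {  // simulated spills per task
                    File f = File.createTempFile("pigbag", null);
                    f.deleteOnExit();   // adds the path to DeleteOnExitHook's static set
                    f.delete();         // removes the file, but NOT the registered path
                }
            }
            // At this point DeleteOnExitHook is holding 10,000 path strings
            // that cannot be garbage collected until the JVM shuts down.
        }
    }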

One more thing to be aware of: once file.deleteOnExit() has been called, the JVM holds on to the file's path so it can delete the file at shutdown, even if the file has already been removed with file.delete(). The small probe below makes this visible.
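
The following probe is only a sketch: it pokes at the JDK-internal DeleteOnExitHook.files field shown above, so it assumes a JDK 7/8-style implementation and may need --add-opens java.base/java.io=ALL-UNNAMED on newer JDKs. It shows that delete() does not unregister the path:

    import java.io.File;
    import java.lang.reflect.Field;
    import java.util.Set;

    // Sketch: shows that delete() does not remove the deleteOnExit registration.
    public class DeleteOnExitProbe {
        public static void main(String[] args) throws Exception {
            File f = File.createTempFile("pigbag", null);
            f.deleteOnExit();
            f.delete();

            System.out.println("file exists: " + f.exists());  // false
            System.out.println("still registered: "
                + registeredPaths().contains(f.getPath()));    // true
        }

        // Reads DeleteOnExitHook's private static 'files' set via reflection.
        @SuppressWarnings("unchecked")
        private static Set<String> registeredPaths() throws Exception {
            Class<?> hook = Class.forName("java.io.DeleteOnExitHook");
            Field files = hook.getDeclaredField("files");
            files.setAccessible(true);
            return (Set<String>) files.get(null);
        }
    }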


The fix is simple. I submitted a patch, PIG-2812: write all DataBag spill files into a single directory and delete that directory from a shutdown hook, instead of registering every individual file with deleteOnExit.
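
The rough idea looks like this (a simplified sketch, not the actual PIG-2812 code; the class and method names are illustrative): create one spill directory per JVM, put every spill file inside it, and register a single shutdown hook that removes the whole directory.

    import java.io.File;
    import java.io.IOException;

    // Simplified sketch of the single-spill-directory idea (not the PIG-2812 code).
    public class SpillDirSketch {
        private static final File SPILL_DIR = createSpillDir();

        private static File createSpillDir() {
            try {
                final File dir = File.createTempFile("pigbag-spills", "");
                dir.delete();   // replace the temp file with a directory of the same name
                if (!dir.mkdir()) {
                    throw new IOException("Unable to create spill directory: " + dir);
                }
                // One shutdown hook for the whole JVM instead of one
                // DeleteOnExitHook entry per spill file.
                Runtime.getRuntime().addShutdownHook(new Thread() {
                    public void run() {
                        deleteRecursively(dir);
                    }
                });
                return dir;
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        // Every spill file goes into the shared directory; no deleteOnExit needed.
        public static File newSpillFile() throws IOException {
            return File.createTempFile("pigbag", null, SPILL_DIR);
        }

        private static void deleteRecursively(File f) {
            File[] children = f.listFiles();
            if (children != null) {
                for (File c : children) {
                    deleteRecursively(c);
                }
            }
            f.delete();
        }
    }

This way the JVM retains one shutdown hook and one directory path no matter how many bags spill, so a reused Hadoop child JVM no longer accumulates entries across tasks.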


--EOF--
