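I recently ran into yet another Pig job failure. I pulled down the heap dump, analyzed it, and the result was almost laughable:
I would never have guessed that a DeleteOnExitHook could cause an OOM.
Here is the relevant source from java.io.File: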
    public void deleteOnExit() {
        SecurityManager security = System.getSecurityManager();
        if (security != null) {
            security.checkDelete(path);
        }
        DeleteOnExitHook.add(path);
    }
And here is the code of DeleteOnExitHook:
    class DeleteOnExitHook {
        static {
            sun.misc.SharedSecrets.getJavaLangAccess()
                .registerShutdownHook(2 /* Shutdown hook invocation order */,
                    new Runnable() {
                        public void run() {
                            runHooks();
                        }
                    });
        }

        private static LinkedHashSet<String> files = new LinkedHashSet<String>();

        private DeleteOnExitHook() {}

        static synchronized void add(String file) {
            if(files == null)
                throw new IllegalStateException("Shutdown in progress");

            files.add(file);
        }

        static void runHooks() {
            LinkedHashSet<String> theFiles;

            synchronized (DeleteOnExitHook.class) {
                theFiles = files;
                files = null;
            }

            ArrayList<String> toBeDeleted = new ArrayList<String>(theFiles);

            // reverse the list to maintain previous jdk deletion order.
            // Last in first deleted.
            Collections.reverse(toBeDeleted);
            for (String filename : toBeDeleted) {
                (new File(filename)).delete();
            }
        }
    }
So how does a Pig DataBag write its spill files?
The spill file is created in DefaultAbstractDataBag, with the following code:
    protected DataOutputStream getSpillFile() throws IOException {
        if (mSpillFiles == null) {
            // We want to keep the list as small as possible.
            mSpillFiles = new FileList(1);
        }

        String tmpDirName= System.getProperties().getProperty("java.io.tmpdir") ;
        File tmpDir = new File(tmpDirName);

        // if the directory does not exist, create it.
        if (!tmpDir.exists()){
            log.info("Temporary directory doesn't exists. Trying to create: " + tmpDir.getAbsolutePath());
            // Create the directory and see if it was successful
            if (tmpDir.mkdir()){
                log.info("Successfully created temporary directory: " + tmpDir.getAbsolutePath());
            } else {
                // If execution reaches here, it means that we needed to create the directory but
                // were not successful in doing so.
                //
                // If this directory is created recently then we can simply
                // skip creation. This is to address a rare issue occuring in a cluster despite the
                // the fact that spill() makes call to getSpillFile() in a synchronized
                // block.
                if (tmpDir.exists()) {
                    log.info("Temporary directory already exists: " + tmpDir.getAbsolutePath());
                } else {
                    int errCode = 2111;
                    String msg = "Unable to create temporary directory: " + tmpDir.getAbsolutePath();
                    throw new ExecException(msg, errCode, PigException.BUG);
                }
            }
        }

        File f = File.createTempFile("pigbag", null);
        f.deleteOnExit();
        mSpillFiles.add(f);
        return new DataOutputStream(new BufferedOutputStream(
            new FileOutputStream(f)));
    }
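Pig's DataBag uses one spill file per element, which leads to far too many deleteOnExit calls. Each call adds a path to the static set in DeleteOnExitHook, and that set is only emptied when the JVM exits. Add Hadoop's reuse of child processes on top of that: the JVM does not exit between tasks, so the spill file entries from many tasks pile up without ever being removed, and the result is the OOM.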
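One more thing to be aware of: once file.deleteOnExit() has been called, the JVM keeps holding that file's path even if the file has since been deleted with file.delete(), so that it can attempt the deletion when the process exits.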
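To make the retention concrete, here is a small sketch. It does not touch the real java.io.DeleteOnExitHook, which is not accessible from application code. Instead, DeleteOnExitModel is a hypothetical, simplified stand-in that copies only the add() bookkeeping shown above, and simulateSpills is an illustrative helper of my own.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;

/** Hypothetical, simplified model of DeleteOnExitHook: paths are only ever added. */
final class DeleteOnExitModel {
    private final LinkedHashSet<String> files = new LinkedHashSet<String>();

    /** Mirrors DeleteOnExitHook.add(path): remember the path until JVM exit. */
    synchronized void add(String path) {
        files.add(path);
    }

    /** Number of paths still held in memory. */
    synchronized int pending() {
        return files.size();
    }

    /** Simulates n spills in one long-lived JVM and returns the retained path count. */
    static int simulateSpills(int n) {
        DeleteOnExitModel hook = new DeleteOnExitModel();
        try {
            for (int i = 0; i < n; i++) {
                File f = File.createTempFile("pigbag", null);
                hook.add(f.getPath()); // the bookkeeping that f.deleteOnExit() triggers
                f.delete();            // the file is gone, but its path is still retained
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hook.pending();
    }
}
```

In this model the retained count grows with every spill, no matter how promptly the files themselves are deleted, which is the growth pattern the heap dump showed.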
The fix is simple, and I have submitted a patch for it, PIG-2812: write the spill files of all DataBags into a single directory, and delete that directory in a shutdown hook.
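The idea can be sketched roughly as follows. This is not the actual PIG-2812 patch. SpillDir, newSpillFile and deleteRecursively are hypothetical names chosen for illustration, and the sketch assumes a single spill directory per JVM is acceptable.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

/** Hypothetical helper: one spill directory per JVM, removed by one shutdown hook. */
final class SpillDir {
    private static File dir; // created lazily on the first spill

    private SpillDir() {}

    /** Creates a spill file inside the shared directory, without calling deleteOnExit(). */
    static synchronized File newSpillFile() {
        try {
            if (dir == null) {
                dir = Files.createTempDirectory("pigbag").toFile();
                final File toDelete = dir;
                // Exactly one hook is registered, however many files are spilled.
                Runtime.getRuntime().addShutdownHook(
                    new Thread(() -> deleteRecursively(toDelete)));
            }
            return File.createTempFile("pigbag", null, dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deletes a file, or a directory together with everything below it. */
    static boolean deleteRecursively(File f) {
        File[] children = f.listFiles(); // null when f is not a directory
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        return f.delete();
    }
}
```

With this shape the JVM holds one directory reference and one hook, so the bookkeeping stays constant in size. A bag can still delete its own spill files early with File.delete(), and nothing is left behind in memory for them.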
--EOF--