Not many Java programmers know that socket connections are treated like files and use file descriptors, which are a limited resource. Different operating systems have different limits on the number of file handles they can manage. One of the common reasons for java.net.SocketException: Too many open files in Tomcat, WebLogic or any Java application server is too many clients connecting and disconnecting within a very short span of time. Socket connections use TCP internally, and TCP says that a socket can remain in the TIME_WAIT state for some time even after it is closed. One reason to keep a closed socket in TIME_WAIT is to make sure that delayed packets still reach the corresponding socket.
Different operating systems keep sockets in TIME_WAIT for different default periods: on Linux it's 60 seconds, while on Windows it is 4 minutes. Remember, the longer the timeout, the longer your closed socket keeps its file handle, which increases the chance of java.net.SocketException: Too many open files. This also means that if you are running Tomcat, WebLogic, WebSphere or any other web server on a Windows machine, you are more prone to this error than on UNIX-like systems such as Solaris or Ubuntu.
By the way, this error is the same as java.io.IOException: Too many open files, which is thrown by code from the IO package if you try to open a new FileInputStream or any other stream pointing to a file resource.
Now we know that this error comes up because clients are connecting and disconnecting frequently. If that seems unusual for your application, you can find the culprit client and stop it from reconnecting, but if it is something your application should expect and you want to handle it on your side, you have two options:
1) Increase the number of open file handles or file descriptors allowed per process.
2) Reduce the timeout for the TIME_WAIT state in your operating system.
In UNIX based operating systems e.g. Ubuntu or Solaris, you can use the command ulimit -a to find out how many open file handles are allowed per process.
$ ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
open files (-n) 256
pipe size (512 bytes, -p) 10
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 2048
virtual memory (kbytes, -v) unlimited
You can see the line open files (-n) 256, which means only 256 open file handles are allowed per process. If your Java program exceeds this limit (remember that Tomcat, WebLogic and other application servers are Java programs running on the JVM), it will throw java.net.SocketException: Too many open files.
You can raise this limit with ulimit -n, e.g. to 4096, but do it on the advice of your UNIX system administrator, and if you have a separate UNIX support team, better escalate it to them.
Another important thing to verify is that your process is not leaking file descriptors or handles. That is tedious to track down, but you can use the lsof command to check how many open file handles a particular process owns on UNIX or Linux. Run lsof with the PID of your process, which you can get from the ps command.
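If you would rather check the descriptor count from inside the JVM than from the shell, the sketch below may help. It assumes a HotSpot/OpenJDK-style JVM on a UNIX-like system, where the platform MXBean implements com.sun.management.UnixOperatingSystemMXBean; the class name FdUsage and the snapshot() helper are made up for illustration, and on other platforms the helper simply returns null.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper: reports how many descriptors this JVM holds and how many it may hold.
final class FdUsage {

    /** Returns {open, max} descriptor counts, or null when the JVM does not expose them. */
    static long[] snapshot() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // Assumption: only UNIX-like JVMs provide this extended bean.
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return null;
    }

    public static void main(String[] args) {
        long[] s = snapshot();
        if (s == null) {
            System.out.println("Descriptor counts are not available on this JVM/platform");
        } else {
            System.out.println("open=" + s[0] + " max=" + s[1]);
        }
    }
}
```

Logging these two numbers periodically gives roughly the same picture as running lsof against the PID by hand, and the max value should correspond to the per-process limit that ulimit -n reports for the shell that started the server.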
Similarly, you can change the TIME_WAIT timeout, but do it in consultation with UNIX support, because a really low value means you might miss delayed packets. On Linux, the setting usually pointed to is the /proc/sys/net/ipv4/tcp_fin_timeout file, though strictly speaking that value controls how long orphaned connections stay in FIN_WAIT_2, and the 60-second TIME_WAIT interval itself is fixed in the kernel. On Windows, you can see this setting in the Windows registry. You can change the TCP TIME_WAIT timeout on Windows with the following steps:
1) Open the Windows Registry Editor by typing regedit in the Run command window.
2) Find the key HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\tcpip\Parameters
3) Add a new DWORD value named TcpTimedWaitDelay, choose decimal, and set the desired timeout in seconds (60-240).
4) Restart your Windows machine.
Remember, you might not have permission to edit the Windows registry, and if you are not comfortable doing so, better not to. Instead, ask your Windows network support team, if you have one, to do it for you. The bottom line for fixing java.net.SocketException: Too many open files is either to increase the number of open file handles or to reduce the TCP TIME_WAIT timeout.
The java.net.SocketException: Too many open files issue is also common among FIX engines, where clients use TCP/IP to connect to brokers' FIX servers. A FIX engine needs correct incoming and outgoing sequence numbers to establish a FIX session, and if a client tries to connect with a smaller sequence number than the broker expects, the broker disconnects the session immediately. If the client is well behind and keeps retrying, increasing its sequence number by 1 each time, it can cause java.net.SocketException: Too many open files at the broker's end. To avoid this, let the FIX engine keep track of its sequence number across restarts.
In short, "java.net.SocketException: Too many open files" can be seen in any Java server application e.g. Tomcat, WebLogic, WebSphere etc., when clients connect and disconnect frequently.
1. Symptom
After Tomcat starts, the front-end pages become unreachable. The log shows the following error:
2011-03-01 02:30:00 [com.asiainfo.aiox.common.rest.RestClient]-[ERROR] java.net.SocketException: Too many open files
Exception in thread "Thread-1168" java.lang.NoClassDefFoundError: org/apache/log4j/spi/ThrowableInformation
at org.apache.log4j.spi.LoggingEvent.<init>(LoggingEvent.java:159)
at org.apache.log4j.Category.forcedLog(Category.java:391)
2. Cause
Linux allows 1024 open files per process by default, which is not enough under heavy concurrency.
(1) Check the system settings
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 253951
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 253951
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
The output above shows that the maximum number of open files is 1024.
For a site with fairly high concurrency, this limit is rather tight.
(2) Solution
Run this command:
ulimit -n 4096
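This raises the open-file limit to 4096. That did the trick, and the project was stable again.
However, ulimit -n 4096 only changes the open files value temporarily. It reverts after you log in again, so the value needs to be set permanently.
Many system limits can be changed by editing /etc/security/limits.conf. The file is thoroughly commented and explains how to modify it. To raise the open-file limit to 65535 for the processes of all users, add the following two lines: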
* soft nofile 65535
* hard nofile 65535
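Here * means all users and soft/hard mean the soft and hard limits. You can also change the limits for a single user or group only; see the comments in the file for details. The change takes effect for new login sessions; restarting the system is the surest way to apply it everywhere.
3. Commands involved
ulimit -a shows all current limits
ulimit -n shows the maximum number of file descriptors that can be opened
ulimit -n 4096 limits the maximum number of usable file descriptors to 4096
lsof -p [process ID] shows the files opened by that process. You can find the process ID with ps -ef | grep java, which lists the WebLogic process, and then pass that ID to lsof -p. In this case it revealed a huge pile of requests, so the Too many open files error was clearly caused by too many network requests. After appropriate tuning, the problem went away.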
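Guidelines for resolving the problem
The following are general guidelines and considerations:
Determine whether the total number of file descriptors is too low or whether some file descriptors are not being released properly.
You can diagnose this by checking the number of file descriptors in use at different times and seeing whether it goes down at times or keeps growing.
If the number goes down at times, descriptors are being released and the limit is simply too low, so increase the maximum number of file descriptors to prevent the problem from recurring.
This change can be combined with reducing the time a connection stays in the TIME_WAIT state after it is closed. On a busy server, the default value (240 seconds) delays other connection attempts and therefore limits the maximum number of connections.
If the number keeps growing, determine whether some descriptors are being held for too long (files not being closed properly) and whether too many files are being created (for example, a driver library that keeps loading files for every new JDBC connection).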
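The "goes down at times versus keeps growing" rule can be written down as a tiny classifier over descriptor counts sampled at different times. This is only a sketch of the reasoning: the class FdTrend, the Verdict names and the 90% closeness threshold are assumptions made for illustration, not values taken from any vendor guideline.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical classifier for a series of open-descriptor counts taken over time.
final class FdTrend {

    enum Verdict { LIKELY_LEAK, LIMIT_TOO_LOW, INCONCLUSIVE }

    static Verdict classify(List<Long> samples, long limit) {
        if (samples.size() < 3) {
            return Verdict.INCONCLUSIVE;          // too few samples to see a trend
        }
        boolean neverDrops = true;
        long peak = 0;
        for (int i = 0; i < samples.size(); i++) {
            long v = samples.get(i);
            peak = Math.max(peak, v);
            if (i > 0 && v < samples.get(i - 1)) {
                neverDrops = false;               // some descriptors were released
            }
        }
        boolean grew = samples.get(samples.size() - 1) > samples.get(0);
        if (neverDrops && grew) {
            return Verdict.LIKELY_LEAK;           // count only ever goes up
        }
        // Assumption: "close to the limit" means within 90% of it.
        if (!neverDrops && peak >= limit * 0.9) {
            return Verdict.LIMIT_TOO_LOW;         // releases happen, but peaks hit the ceiling
        }
        return Verdict.INCONCLUSIVE;
    }

    public static void main(String[] args) {
        System.out.println(classify(Arrays.asList(100L, 180L, 260L, 340L), 1024));  // LIKELY_LEAK
        System.out.println(classify(Arrays.asList(300L, 1000L, 400L, 990L), 1024)); // LIMIT_TOO_LOW
    }
}
```

In practice the samples would come from repeated lsof -p <pid> | wc -l runs or a similar counter, taken far enough apart that normal request bursts do not look like a leak.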
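Loading classes from jar files can also reduce the number of file descriptors in use: each jar file uses a single descriptor, whereas each individually loaded class file would use a descriptor of its own.
You can use the following guidelines to monitor and diagnose how descriptors are being used by a process (this depends on your operating system).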
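Checking open files
Unix platforms
Among other tools, the lsof (LiSt Open Files) Unix administration tool (available for Solaris, Tru64, HP-UX, Linux and AIX) displays information about open files and network file descriptors, including their type, size and i-node.
For a specific process, the syntax is:
lsof -p <pid of the process>
Example 1: The following command was run on Solaris 2.7 immediately after starting WLS 8.1SP1. It shows that the Java process running the server (pid 390) has 84 file descriptors allocated, far fewer than the default hard limit on file descriptors.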
$ lsof -p 390 | wc -l
84
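You can run this command after the exception occurs to check whether the Java process has reached the maximum number of open files. This confirms that the process has run out of file descriptors.
You can then run $ lsof -p <pid> and redirect the output to a file to examine each open file. If a file that should have been closed appears in the list, you can investigate why it was not closed as expected.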
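When a file that should have been closed does show up in the lsof listing, the usual culprit on the Java side is a stream that is only closed on the happy path. A minimal sketch of the safer pattern uses try-with-resources, which closes the stream, and with it the descriptor, even if reading throws. The class name CloseFiles and the countBytes helper are invented for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical example: every call opens one descriptor and is guaranteed to release it.
final class CloseFiles {

    /** Counts the bytes in a file; the stream is closed even if read() throws. */
    static long countBytes(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            long total = 0;
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("fd-demo", ".txt");
        try {
            Files.write(tmp, "hello".getBytes(StandardCharsets.US_ASCII));
            // Opening the file thousands of times is harmless because each stream is closed.
            for (int i = 0; i < 5000; i++) {
                countBytes(tmp);
            }
            System.out.println(countBytes(tmp)); // prints 5
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The same construct works for Socket, ServerSocket and JDBC connections, since they all implement AutoCloseable.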
Below is an excerpt of lsof output:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 29733 usera cwd VDIR 176,22 4096 4300274 /home/usera/810/user_projects/mydomain
java 29733 usera txt VREG 176,22 36396 6642305 /home/usera/810/jdk141_02/bin/java
java 29733 usera txt VREG 176,22 1251192 10818087 /home/usera/810/user_projects/mydomain/myserver/.wlnotdelete/extract/myserver_uddi_uddi/jarfiles/_wl_cls_gen.jar
java 29733 usera txt VREG 176,22 511935 10074851 /home/usera/810/user_projects/mydomain/myserver/.wlnotdelete/extract/myserver_uddi_uddi/jarfiles/WEB-INF/lib/jsse39153.jar
java 29733 usera txt VREG 176,22 2305960 6000676 /home/usera/810/user_projects/mydomain/myserver/.internal/uddi.war
java 29733 usera txt VREG 176,22 1227013 1385413 /home/usera/810/weblogic81/common/eval/pointbase/lib/pbserver44.jar
java 29733 usera txt VREG 176,22 653661 69379 /home/usera/810/weblogic81/server/lib/ant/optional.jar
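lsof -h lists all available syntax and options. The latest version of the program can be found at: http://ftp.cerias.purdue.edu/pub/tools/unix/sysutils/lsof/.
A file descriptor is used for every socket connection, and lsof can also show the socket type (TCP or UDP) along with the listening address and port (in the NAME column).
Windows platforms
Handle
On WinNT or Windows 2000, the command-line tool handle reports information about handles that refer to open files, as in the example below. The tool can be applied to a specific process, and is available from: http://www.sysinternals.com/ntw2k/freeware/handle.shtml.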
C:\tmp>ps -ef | grep java
usera 1656 1428 0 10:11:41 CONIN$ 0:46 c:\Releases\WLS8.2\JDK141~1\bin\java -client -Xms32m -Xmx200m -XX:MaxPermSize=128m -Xverify:none -Dweblogic.Name=myserver -Dweblogic.ProductionModeEnabled= -Djava.security.policy="c:\Releases\WLS8.2\WEBLOG~1\server\lib\weblogic.policy" weblogic.Server
C:\tmp>handle -p java
Handle v2.10
Copyright (C) 1997-2003 Mark Russinovich
Sysinternals - www.sysinternals.com
------------------------------------------------------------------------------
java.exe pid: 1656 ABCDEF\usera
18: File C:\Releases\WLS8.2\user_projects\domains\mydomain
170: File C:\Releases\WLS8.2\jdk141_05\jre\lib\rt.jar
178: File C:\Releases\WLS8.2\jdk141_05\jre\lib\sunrsasign.jar
180: File C:\Releases\WLS8.2\jdk141_05\jre\lib\jsse.jar
188: File C:\Releases\WLS8.2\jdk141_05\jre\lib\jce.jar
190: File C:\Releases\WLS8.2\jdk141_05\jre\lib\charsets.jar
328: File C:\Releases\WLS8.2\jdk141_05\jre\lib\ext\dnsns.jar
330: File C:\Releases\WLS8.2\jdk141_05\jre\lib\ext\ldapsec.jar
338: File C:\Releases\WLS8.2\jdk141_05\jre\lib\ext\localedata.jar
340: File C:\Releases\WLS8.2\jdk141_05\jre\lib\ext\sunjce_provider.jar
348: File C:\Releases\WLS8.2\jdk141_05\lib\tools.jar
350: File C:\Releases\WLS8.2\weblogic81\server\lib\weblogic.jar
358: File C:\Releases\WLS8.2\weblogic81\server\lib\jconn2.jar
360: File C:\Releases\WLS8.2\weblogic81\server\lib\ojdbc14.jar
368: File C:\Releases\WLS8.2\weblogic81\server\lib\xmlx.jar
370: File C:\Releases\WLS8.2\weblogic81\server\lib\webservices.jar
378: File C:\Releases\WLS8.2\weblogic81\server\lib\wlcipher.jar
3e0: File C:\Releases\WLS8.2\weblogic81\server\lib\ant\ant.jar
3e8: File C:\Releases\WLS8.2\weblogic81\server\lib\EccpressoJcae.jar
3f0: File C:\Releases\WLS8.2\weblogic81\server\lib\EccpressoCore.jar
3f8: File C:\Releases\WLS8.2\weblogic81\server\lib\EccpressoAsn1.jar
400: File C:\Releases\WLS8.2\weblogic81\server\lib\jConnect.jar
408: File C:\Releases\WLS8.2\weblogic81\server\lib\ant\optional.jar
410: File C:\Releases\WLS8.2\weblogic81\server\lib\ant\jakarta-oro-2.0.7.jar
….
C:\tmp>handle -p java | wc -l
65
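This shows that WLS 8.1SP2 running on Windows uses 65 file handles.
How to set the number of file descriptors on different platforms
Linux
An administrative user can set file descriptor limits in the /etc/security/limits.conf configuration file, as in the following example.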
* soft nofile 1024
* hard nofile 4096
The system-wide file descriptor limit can also be set by adding the following three lines to the /etc/rc.d/rc.local startup script:
# Increase system-wide file descriptor limit.
echo 4096 > /proc/sys/fs/file-max
echo 16384 > /proc/sys/fs/inode-max
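Windows
On Windows operating systems, file descriptors are called file handles. On Windows 2000 Server, the limit on open file handles is set to 16,384. The number in use can be monitored in the Performance summary of Task Manager.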