Diagnosing an Application Server Hang

 

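This is actually a document from the BEA website, published in the WebLogic 8.1 era. After Oracle acquired BEA, all of the support articles were redirected to the Oracle home page, and the Google cache is gone as well. I found this copy by chance on a foreign forum while googling. Although it was written for 8.1, its methods and its way of approaching the problem still hold today. I had intended to digest it and write my own article built around real cases, but I have not run into a related problem recently, and I also felt that doing so might break the completeness of the original. So I am posting the original text, both to keep a copy on the web and to let everyone take from it what they need and form their own view.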

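From the content you will see that, besides this article, there were also EJB_RMI Server Hang, Application Dead Lock, and JDBC Causes Server Hang, but the only one still available on that forum is JDBC Causes Server Hang. If you started working with WebLogic early and saved the other two articles, or have seen them somewhere online, please leave a comment. Many thanks.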

Generic Hang

Problem Description
A server hang is suspected when:

 

  • The server does not respond to new requests.
  • Requests time out.
  • Requests take longer and longer to process (may be on the way to a hang).
  • A server crash is not usually a symptom of a hung server but may follow.
Problem Troubleshooting
Please note that not all of the following items need to be done. Some issues can be solved by following only a few of them.

Quick Links:

 

  • Why does the problem occur?
  • Potential Causes of Server Hang
  • Basic Steps
  • Known WebLogic Server Issues
  • Collecting Thread Dumps
  • Analysis of a Thread Dump

Why does the problem occur?
A server can hang for a variety of reasons (refer to Potential Causes of Server Hang). Generally, a server hangs because it lacks some resource, and that lack prevents the server from servicing requests. For example, because of a problem (such as a deadlock) or because of the volume of requests, there may be no execute threads available to do any work: all of them are blocked or busy with previous requests.

Potential Causes of Server Hang

Topic | Pattern Name | Link
RMI, RJVM responses: all threads tied up waiting for RJVM, RMI responses. | EJB_RMI Server Hang | EJB_RMI Server Hang
Application Deadlock: a thread locks resource1 then waits for the lock on resource2, while another thread locks resource2 and then waits for the lock on resource1. | Application Deadlock Causes Server Hang | Application Dead Lock
Threads are all used up, none available for new work. | Thread Usage Server Hang | TBD
Garbage Collection taking too much time. | Garbage Collection Server Hang | TBD
JSP improper settings for servlet times, e.g. PageCheckSeconds. | JSP cause Server Hang | TBD
Long Running JDBC calls or JDBC deadlocks lead to a hang. | JDBC Causes Server Hang | JDBC Causes Server Hang
JVM hang during (code optimization), looks like server hang. | Server Hang in Code Optimization | TBD
JSP compilation causes server hang under heavy load. | JSP Compilation Server Hang | TBD
SUN JVM bugs, e.g. Light weight thread library. | Sun JVM Bugs that Cause Server Hangs | TBD
Basic Steps

 


When a server is hanging, first ping the server using java weblogic.Admin t3://server:port PING. If the server can respond to the ping, it may be that the application is hanging and not the server itself.

Ensure that the server is actually hanging and not doing garbage collection. To verify, restart the server with -verbosegc turned on, and redirect stdout and stderr to one file. When the server stops responding, the log will show whether it is doing garbage collection or is really hanging. If the garbage collection is taking too long (>10 seconds), the server may miss the heartbeats that servers use to keep each other informed of the topology of the cluster.
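
The check for long collections can be automated once stdout has been redirected to a file. The following is only a sketch: it assumes the Sun 1.4-style line shape shown in the comment, and the function name, the sample log, and the 10-second default are illustrative choices. Other JVM vendors and versions print garbage collection lines differently, so the pattern would need adjusting.

```python
import re

# Assumed line shape (Sun 1.4-style -verbosegc output), for illustration only:
#   [Full GC 267628K->83769K(776768K), 12.8479984 secs]
GC_LINE = re.compile(r"\[(Full GC|GC)\b.*?([0-9]+\.[0-9]+) secs\]")


def long_gc_pauses(log_text, threshold_secs=10.0):
    """Return (kind, seconds) for every collection longer than the threshold."""
    pauses = []
    for line in log_text.splitlines():
        match = GC_LINE.search(line)
        if match and float(match.group(2)) > threshold_secs:
            pauses.append((match.group(1), float(match.group(2))))
    return pauses


# Hypothetical log excerpt; the numbers are made up for the example.
sample_log = """\
[GC 325407K->83000K(776768K), 0.2300771 secs]
[Full GC 267628K->83769K(776768K), 12.8479984 secs]
[GC 325816K->83372K(776768K), 0.2454258 secs]
"""

print(long_gc_pauses(sample_log))
```

Any entry this reports near the time the server stopped responding suggests a garbage collection pause rather than a true hang.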

WebLogic Server uses the ‘default’ thread queue or a configured application specific thread queue to service client requests. Client requests will only be handled in the default queue if no application specific thread queue is defined.  Please see Tuning WebLogic Server Applications, Tuning the Default Execute Queue Threads, and Tuning WebLogic Server Performance Parameters for more information on defining application specific thread queues.

In release 8.1, a change was made to the thread architecture in WebLogic Server.  A specific kernel thread group for internal WebLogic tasks was created.  This was found to be necessary to avoid deadlocks that occurred in earlier releases when all threads in the ‘default’ thread queue were used and none were thus available for WebLogic internal tasks.

The threads in the ‘default’ queue or the application specific thread queue (if one has been configured) are the threads to examine in the event of a server hang. The example below shows how Execute Thread ‘14’ from the ‘default’ queue appears in a thread dump while the thread is waiting for work. The latest method called by this thread is Object.wait(), and the thread is in the state “waiting on monitor”.

"ExecuteThread: '14' for queue: 'default'" daemon prio=5 tid=0x8b0ab30 nid=0x1f4 waiting on monitor [0x96af000..0x96afdc4]
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:420)
at weblogic.kernel.ExecuteThread.waitForRequest(ExecuteThread.java:94)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:118)

Threads can be in one of several states.  Please see the table below for a description of the thread states.
The format of the thread dump varies with the vendor.  Check on the vendor’s website for information regarding the format.
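
With many execute threads, it can be easier to count the idle ones than to read each stack. The sketch below assumes a Sun-style dump in which idle execute threads show a waitForRequest frame, as in the example above. The function names are illustrative, and the second thread in the sample (with its com.example frame) is invented for the example.

```python
from collections import defaultdict

PREFIX = '"ExecuteThread: '
QUEUE_MARK = "for queue: '"
IDLE_FRAME = "ExecuteThread.waitForRequest"


def queue_of(header_line):
    """Return the queue name from an execute thread header line, else None."""
    if not header_line.startswith(PREFIX) or QUEUE_MARK not in header_line:
        return None
    return header_line.split(QUEUE_MARK, 1)[1].split("'", 1)[0]


def idle_counts(dump_text):
    """Map queue name -> (idle, total) execute threads found in one dump."""
    counts = defaultdict(lambda: [0, 0])
    queue = None
    for line in dump_text.splitlines():
        line = line.strip()
        if line.startswith('"'):
            # A new thread starts here; it may or may not be an execute thread.
            queue = queue_of(line)
            if queue:
                counts[queue][1] += 1
        elif queue and IDLE_FRAME in line:
            counts[queue][0] += 1
            queue = None  # count each thread only once
    return {name: tuple(pair) for name, pair in counts.items()}


sample_dump = """\
"ExecuteThread: '14' for queue: 'default'" daemon prio=5 tid=0x8b0ab30 nid=0x1f4 waiting on monitor [0x96af000..0x96afdc4]
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:420)
at weblogic.kernel.ExecuteThread.waitForRequest(ExecuteThread.java:94)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:118)

"ExecuteThread: '13' for queue: 'default'" daemon prio=5 tid=0x8b0a000 nid=0x1f3 runnable [0x96ae000..0x96aedc4]
at com.example.SlowServlet.service(SlowServlet.java:42)
at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:197)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:170)
"""

print(idle_counts(sample_dump))
```

A queue whose idle count is zero in several consecutive dumps is a candidate for the overload check described later in this section.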

 

Below is an example of threads that may be hanging. ExecuteThread ‘9’ is waiting to lock object <dde51520>; notice the “waiting to lock <dde51520>” line in its stack trace. ExecuteThread ‘6’ is waiting to lock the same object. The third thread, ExecuteThread ‘5’, has locked object <dde51520> and is doing work. This example demonstrates why one thread dump is not enough. If the server is hanging and the locked object <dde51520> is the suspected cause, subsequent thread dumps will show whether that object was released and whether a different thread then locked it. If, after several thread dumps, the threads have not progressed and object <dde51520> has still not been released, suspect a problem with the routine(s) in the ExecuteThread ‘5’ call stack, because the lock is not being released.

"ExecuteThread: '9' for queue: 'weblogic.kernel.Default'" daemon prio=5 tid=0xf684c8 nid=0x13 waiting for monitor entry [cc2ff000..cc2ffc24]
at weblogic.cluster.MemberManager.done(MemberManager.java:306)
- waiting to lock <dde51520> (a weblogic.cluster.MemberManager)
at weblogic.cluster.MulticastManager.execute(MulticastManager.java:399)
at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:197)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:170)

 

"ExecuteThread: '6' for queue: 'weblogic.kernel.Default'" daemon prio=5 tid=0x9df020 nid=0x10 waiting for monitor entry [cc5ff000..cc5ffc24]
at weblogic.cluster.MemberManager.getRemoteMembers(MemberManager.java:396)
- waiting to lock <dde51520> (a weblogic.cluster.MemberManager)
at weblogic.cluster.ClusterService.getRemoteMembers(ClusterService.java:238)
at weblogic.servlet.internal.HttpServer.setServerList(HttpServer.java:388)
at weblogic.servlet.internal.HttpServer.clusterMembersChanged(HttpServer.java:418)
- locked <ddf32360> (a weblogic.servlet.internal.HttpServer)
at weblogic.cluster.MemberManager$2.execute(MemberManager.java:421)
at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:197)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:170)

"ExecuteThread: '5' for queue: 'weblogic.kernel.Default'" daemon prio=5 tid=0x9df020 nid=0x12 waiting for monitor entry [cc5ff000..cc5ffc24]
. . .

at weblogic.cluster.MemberManager.checkTimeouts(MemberManager.java:346)
- locked <dde51520> (a weblogic.cluster.MemberManager)
at weblogic.cluster.MulticastManager.trigger(MulticastManager.java:291)
at weblogic.time.common.internal.ScheduledTrigger.run(ScheduledTrigger.java:243)
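
Matching "waiting to lock" lines against "locked" lines by hand becomes tedious in a large dump. The sketch below is an illustration that assumes Sun 1.4-style monitor lines, and its function names are made up for the example. It groups threads by the lock address they hold or wait for, using an abbreviated copy of the three threads above as sample input.

```python
import re
from collections import defaultdict

THREAD = re.compile(r'^"([^"]+)"')
WAITING = re.compile(r"- waiting to lock <([0-9a-fx]+)>")
LOCKED = re.compile(r"- locked <([0-9a-fx]+)>")


def lock_report(dump_text):
    """Return {lock_address: {"owner": name_or_None, "waiters": [names]}}."""
    report = defaultdict(lambda: {"owner": None, "waiters": []})
    thread = None
    for line in dump_text.splitlines():
        line = line.strip()
        header = THREAD.match(line)
        if header:
            thread = header.group(1)
            continue
        if thread is None:
            continue
        waiting = WAITING.search(line)
        locked = LOCKED.search(line)
        if waiting:
            report[waiting.group(1)]["waiters"].append(thread)
        elif locked:
            report[locked.group(1)]["owner"] = thread
    return dict(report)


def contended_locks(dump_text):
    """Keep only the locks that at least one thread is waiting for."""
    return {addr: info for addr, info in lock_report(dump_text).items()
            if info["waiters"]}


# Abbreviated from the three threads shown above.
sample_dump = """\
"ExecuteThread: '9' for queue: 'weblogic.kernel.Default'" daemon prio=5 waiting for monitor entry
at weblogic.cluster.MemberManager.done(MemberManager.java:306)
- waiting to lock <dde51520> (a weblogic.cluster.MemberManager)

"ExecuteThread: '6' for queue: 'weblogic.kernel.Default'" daemon prio=5 waiting for monitor entry
at weblogic.cluster.MemberManager.getRemoteMembers(MemberManager.java:396)
- waiting to lock <dde51520> (a weblogic.cluster.MemberManager)
at weblogic.servlet.internal.HttpServer.clusterMembersChanged(HttpServer.java:418)
- locked <ddf32360> (a weblogic.servlet.internal.HttpServer)

"ExecuteThread: '5' for queue: 'weblogic.kernel.Default'" daemon prio=5 waiting for monitor entry
at weblogic.cluster.MemberManager.checkTimeouts(MemberManager.java:346)
- locked <dde51520> (a weblogic.cluster.MemberManager)
"""

for address, info in contended_locks(sample_dump).items():
    print(address, "held by", info["owner"], "with", len(info["waiters"]), "waiting")
```

Running the same report on each dump in a series shows whether the owner of a contended lock ever changes. The report only lists candidates: a thread sitting in Object.wait() can also print a "locked" line for a monitor it has temporarily released.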

 

Determine if the ‘default’ ExecuteThread queue is overloaded. Use the console to determine if any of the ExecuteThreads in the ‘default’ queue are idle. If none are idle, then the application probably needs to be configured with a larger number of ExecuteThreads. This value can be changed through the console and is stored in the config.xml file.

 

If the Execute Queue has idle threads, it is possible that not enough socket reader threads are allocated. By default, a WebLogic Server instance creates three socket reader threads upon booting. If a cluster system utilizes more than three sockets during peak periods, increase the number of socket reader threads.

The number of socket reader threads should usually be small. However, configure one thread for each WebLogic Server that acts as a client of the server instance that is hanging.

If using a JDBC connection pool, ensure that the number of JDBC connections configured for the pool equals the number of simultaneous requests, i.e., execute threads, that use the pool.

Known WebLogic Server Issues


The possibility exists that a problem with JDBC could produce deadlock. Check the version and service pack level of the server found in the beginning of the weblogic.log. Then check above the version and service pack lines for any temporary patches that have already been applied to the server classpath. The patches will tell what problems have already been addressed.

Collecting Thread Dumps


How a thread dump is taken depends on the operating system on which the hung server instance runs. Information about taking a thread dump on various operating systems can be found at http://e-docs.bea.com/wls/docs81/cluster/trouble.html#gc. Redirect both standard error and standard out to the same file: this places the thread dump in context with server information and other messages, which makes the logs more useful.

Unix Systems (Solaris, HP, AIX)
Use kill -3 <weblogic process id> to create the necessary thread dumps to diagnose a problem. Ensure this is done several times on each server, spaced about 5 to 10 seconds apart, to help diagnose deadlocks. For this to work, nohup the process when starting the server (refer to Solutions S-12292 and S-15924).
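
The repeated kill -3 can be scripted so that the spacing between dumps is consistent. This is a minimal sketch for Unix systems, not a supported tool. The function name and its parameters are invented for the example, and the send and sleep arguments exist only so the loop can be exercised without signalling a real process.

```python
import os
import signal
import time


def request_thread_dumps(pid, count=3, interval_secs=5.0,
                         send=os.kill, sleep=time.sleep):
    """Send SIGQUIT (signal 3) to `pid` `count` times, pausing between sends.

    The JVM writes each dump to its own stdout, so this function returns
    only the number of signals it sent.
    """
    sent = 0
    for attempt in range(count):
        send(pid, signal.SIGQUIT)  # same effect as: kill -3 <pid>
        sent += 1
        if attempt < count - 1:
            sleep(interval_secs)
    return sent


# Example call (the PID is hypothetical; use the server's real process id):
# request_thread_dumps(12345, count=3, interval_secs=10)
```

The dumps themselves appear wherever the server's stdout was redirected, so the script has nothing to collect.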

Windows (XP, NT)
Press <Ctrl>-<Break> in each server's command window to create the necessary thread dumps to diagnose a problem. Ensure this is done several times on each server, spaced about 5 to 10 seconds apart, to help diagnose deadlocks. On NT, type CTRL-Break in the command shell.

If you have installed WebLogic as a Windows service, you will not be able to see the messages from the JVM or WebLogic Server that are printed to standard out or standard error.  To view these messages, you must direct standard out and standard error to a file.  To do this, take the following steps:

  1. Create a backup copy of the WL_HOME/server/bin/installSvc.cmd master script.
  2. In a text editor, open the WL_HOME/server/bin/installSvc.cmd master script.
  3. In installSvc.cmd, the last command in the script invokes the beasvc utility.
  4. At the end of the beasvc command, append the option -log:"pathname"
    where pathname is the fully qualified path and filename of the file in which you want to store the server's standard out and standard error messages.
  5. The modified beasvc command will resemble the following command:
    "%WL_HOME%/server/bin/beasvc" -install
    -svcname:"%DOMAIN_NAME%_%SERVER_NAME%"
    -javahome:"%JAVA_HOME%" -execdir:"%USERDOMAIN_HOME%"
    -extrapath:"%WL_HOME%/server/bin" -password:"%WLS_PW%"
    -cmdline:%CMDLINE%
    -log:"d:/bea/user_projects/domains/myWLSdomain/myWLSserver-stdout.txt"
  6. If you started WebLogic with nohup, the log messages will show up in nohup.out.

Linux
The Linux operating system views threads differently than other operating systems. Each thread is seen by the operating system as a process. To take a thread dump on Linux, find the process id from which all the other processes were started. Use the commands:

  • To obtain the root PID, use:

    ps -efHl | grep 'java'

Use a grep argument that is a string found in the server startup command, so that it matches the server's process entries. The first PID reported will be the root process, assuming that the ps command has not been piped to another routine. Then send signal 3 to that root PID (kill -3 <root PID>), as on the other Unix systems.

  • Use the weblogic.Admin command THREAD_DUMP

Another method of getting a thread dump is to use the THREAD_DUMP admin command. This method is independent of the OS on which the server instance is running.

java weblogic.Admin -url ManagedHost:8001 -username weblogic -password weblogic THREAD_DUMP

NOTE: This command cannot be used if the server instance does not respond to a ping.

If the JVM in use is Sun’s, the thread dump goes to stdout. Sun has enhanced the thread dump format between JVM 1.3.1 and 1.4. To obtain Sun’s 1.4 style of thread dump add the following option to the java command line for starting the 1.3.1 JVM:

-XX:+JavaMonitorsInStackTrace

Analysis of a Thread Dump


The most useful tool in analyzing a server hang is a set of thread dumps. A thread dump provides information on what each of the threads is doing at a particular moment in time. A set of thread dumps (usually 3 or more taken 5 to 10 seconds apart) can help analyze the change or lack of change in each thread’s state from one thread dump to another. A hung server thread dump would typically show little change in thread states from the first to the last dump.
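
Comparing a set of dumps comes down to asking which busy threads show exactly the same stack every time. The sketch below is one way to do that, assuming each dump has already been cut out of the log into its own string. The function names are illustrative, and the thread stacks in the sample (including the com.example classes) are invented.

```python
def parse_stacks(dump_text):
    """Map thread name -> tuple of stack lines for one thread dump."""
    stacks, name, frames = {}, None, []
    for line in dump_text.splitlines():
        line = line.strip()
        if line.startswith('"'):
            if name is not None:
                stacks[name] = tuple(frames)
            # The thread name is the text between the first pair of quotes.
            name, frames = line.split('"')[1], []
        elif name is not None and line:
            frames.append(line)
    if name is not None:
        stacks[name] = tuple(frames)
    return stacks


def unchanged_threads(dumps, idle_marker="ExecuteThread.waitForRequest"):
    """Names of busy threads whose stack is identical in every dump."""
    parsed = [parse_stacks(dump) for dump in dumps]
    stuck = []
    for name, frames in parsed[0].items():
        if any(idle_marker in frame for frame in frames):
            continue  # idle threads look the same in every dump by design
        if all(other.get(name) == frames for other in parsed[1:]):
            stuck.append(name)
    return sorted(stuck)


dump_1 = """\
"ExecuteThread: '1' for queue: 'default'" daemon prio=5 runnable
at com.example.Report.build(Report.java:10)

"ExecuteThread: '2' for queue: 'default'" daemon prio=5 runnable
at com.example.Cart.total(Cart.java:55)

"ExecuteThread: '3' for queue: 'default'" daemon prio=5 waiting on monitor
at weblogic.kernel.ExecuteThread.waitForRequest(ExecuteThread.java:94)
"""

dump_2 = dump_1.replace("Cart.total(Cart.java:55)", "Cart.save(Cart.java:80)")

print(unchanged_threads([dump_1, dump_2]))
```

A thread reported here is only a candidate: a long but healthy operation can also show the same stack in dumps taken a few seconds apart, so the stack still has to be read.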

Threads can be in one of the following states:

Thread State | Description
Running or runnable thread | A runnable state means that the threads could be running or are running at that instant in time.
Suspended thread | Thread has been suspended by the JVM.
Thread waiting on a condition variable | Threads in a condition wait state can be thought of as waiting for an event to occur.
Thread waiting on a monitor lock | Monitors are used to manage access to code that should only be run by a single thread at a time.

More information on thread states can be found at http://java.sun.com/developer/onlineTraining/Programming/JDCBook/stack.html#states.

There is also a thread analysis tool at http://dev2dev.bea.com/resourcelibrary/utilitiestools/adminmgmt.jsp.
Download the tool and read the instructions at the link.

What to Look at in the Thread Dump

All requests enter the WebLogic Server through the ListenThread. If the ListenThread is gone, no work can be received and therefore no work can be done. Verify that a ListenThread exists in the thread dump. The ListenThread should be in the socketAccept method. The following example shows what the Listen Thread looks like:

"ListenThread.Default" prio=10 tid=0x00037888 nid=93 lwp_id=6888343 runnable [0x1a81b000..0x1a81b530]
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:353)
- locked <0x26d9d490> (a java.net.PlainSocketImpl)
at java.net.ServerSocket.implAccept(ServerSocket.java:439)
at java.net.ServerSocket.accept(ServerSocket.java:410)
at weblogic.socket.WeblogicServerSocket.accept(WeblogicServerSocket.java:24)
at weblogic.t3.srvr.ListenThread.accept(ListenThread.java:713)
at weblogic.t3.srvr.ListenThread.run(ListenThread.java:290)

Socket Reader Threads accept the incoming request from the Listen Thread Queue and put it on the Execute Thread Queue. If there are no socket reader threads in the thread dump, then there is a bug somewhere that is causing the socket reader thread to vanish. There should always be at least 3 socket reader threads. One socket reader thread is usually in the poll function, while the other two are available to process requests. Below are Socket Reader threads from a sample thread dump.

"ExecuteThread: '2' for queue: 'weblogic.socket.Muxer'" daemon prio=10 tid=0x00036128 nid=75 lwp_id=6888070 waiting for monitor entry [0x1b12f000..0x1b12f530]
at weblogic.socket.PosixSocketMuxer.processSockets(PosixSocketMuxer.java:92)
- waiting to lock <0x25c01198> (a java.lang.String)
at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:32)
at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:178)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:151)

 

"ExecuteThread: '1' for queue: 'weblogic.socket.Muxer'" daemon prio=10 tid=0x00035fc8 nid=74 lwp_id=6888067 runnable [0x1b1b0000..0x1b1b0530]
at weblogic.socket.PosixSocketMuxer.poll(Native Method)
at weblogic.socket.PosixSocketMuxer.processSockets(PosixSocketMuxer.java:99)
- locked <0x25c01198> (a java.lang.String)
at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:32)
at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:178)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:151)

"ExecuteThread: '0' for queue: 'weblogic.socket.Muxer'" daemon prio=10 tid=0x00035e68 nid=73 lwp_id=6888066 waiting for monitor entry [0x1b231000..0x1b231530]
at weblogic.socket.PosixSocketMuxer.processSockets(PosixSocketMuxer.java:92)
- waiting to lock <0x25c01198> (a java.lang.String)
at weblogic.socket.SocketReaderRequest.execute(SocketReaderRequest.java:32)
at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:178)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:151)
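
Both checks described above (a ListenThread sitting in socketAccept, and at least three socket reader threads) can be run over a dump in a few lines. This sketch assumes thread names like those in the samples above, with socket readers in the queue 'weblogic.socket.Muxer'. Dumps from other releases or with other muxer implementations may name these threads differently. The function name is illustrative and the sample dump is abbreviated.

```python
def listener_and_readers(dump_text):
    """Return (listen_thread_accepting, socket_reader_count) for one dump."""
    accepting = False
    readers = 0
    in_listen_thread = False
    for line in dump_text.splitlines():
        line = line.strip()
        if line.startswith('"'):
            in_listen_thread = line.startswith('"ListenThread')
            if "for queue: 'weblogic.socket.Muxer'" in line:
                readers += 1
        # Checked on every line, so a frame fused onto the header still counts.
        if in_listen_thread and "socketAccept" in line:
            accepting = True
    return accepting, readers


sample_dump = """\
"ListenThread.Default" prio=10 tid=0x00037888 nid=93 lwp_id=6888343 runnable [0x1a81b000..0x1a81b530]
at java.net.PlainSocketImpl.socketAccept(Native Method)
at weblogic.t3.srvr.ListenThread.run(ListenThread.java:290)

"ExecuteThread: '2' for queue: 'weblogic.socket.Muxer'" daemon prio=10 waiting for monitor entry
at weblogic.socket.PosixSocketMuxer.processSockets(PosixSocketMuxer.java:92)

"ExecuteThread: '1' for queue: 'weblogic.socket.Muxer'" daemon prio=10 runnable
at weblogic.socket.PosixSocketMuxer.poll(Native Method)

"ExecuteThread: '0' for queue: 'weblogic.socket.Muxer'" daemon prio=10 waiting for monitor entry
at weblogic.socket.PosixSocketMuxer.processSockets(PosixSocketMuxer.java:92)
"""

accepting, readers = listener_and_readers(sample_dump)
print("ListenThread accepting:", accepting, "| socket readers:", readers)
```

A result of False, or a reader count below three, points at the listener or the socket readers rather than at the execute queue.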

The ThreadPoolPercentSocketReaders attribute sets the maximum percentage of execute threads that are set to read messages from a java socket. The optimal value for this attribute is application-specific. The default value is 33, and the valid range is 1 to 99.

 

Allocating execute threads to act as socket reader threads increases the speed and the ability of the server to accept client requests. It is essential to balance the number of execute threads that are devoted to reading messages from a socket and those threads that perform the actual execution of tasks in the server.

In release 8.1, the socket reader threads no longer use ExecuteThreads in the ‘default’ queue. Instead, they have their own thread group; in the sample thread dump above they appear in the queue ‘weblogic.socket.Muxer’.

Next Steps
The next steps require further analysis of the thread dump. Look in the thread dump to see what each of the threads is doing at the time of the hang. This will help determine the next stage of the investigation. For example, if there are many threads involved in JSP compilation, refer to Potential Causes of Server Hang for further diagnosis and actions to test.


Note:

This article is reposted from: http://www.hashei.me/2009/08/java_generic_server_hang.html

http://blog.csdn.net/davidhsing/article/details/5854610
