There seem to be more and more posts on the forums about jobs ‘stuck’ in the Running state. I have been investigating this problem for a client recently, so I thought I would summarise some of the troubleshooting techniques I use. This post expands on the article I wrote a few years ago about agent_exec.
The problem is usually expressed in the form of ‘DA shows my job is running but I know it’s not’. First of all, DA shows a job as ‘Running’ whenever it finds a job whose a_special_app attribute is set to ‘agentexec’. Since agent_exec sets this attribute when it starts the job and clears it when the job has finished, under normal circumstances this is quite an accurate reflection of whether a job is running or not.
However, if the agent_exec process is interrupted before clearing the attribute (if the box is rebooted or the content server hangs, for instance), then the job object can be left with a_special_app = ‘agentexec’ and DA shows the job as running.
Of course, agent_exec attempts to deal with such a situation. Every time it wakes up to perform some processing it first runs a ‘garbage_collect_jobs’ routine. You won’t see much evidence of this in the logs unless you turn on agent_exec tracing (see my job scheduler article for details on how to do this). You will get the following lines when there is nothing to garbage collect:
Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] garbage_collect_jobs
Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] do_exec: execquery,s0,F, SELECT ALL r_object_id, a_last_invocation, ...
Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] do_get: getlastcoll,s0
Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] do_next: next,s0,q0
Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] do_exec: close,s0,q0
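When the trace gets long it can help to pull these lines apart programmatically. The following is a minimal Python sketch, assuming every trace line follows the layout of the sample above (timestamp, then [AGENTEXEC pid], then the message). The function names are my own and are not part of any Documentum tooling.

```python
import re
from datetime import datetime

# Layout inferred from the sample lines above; this is an assumption,
# not a documented log format.
TRACE_LINE = re.compile(
    r"^(?P<stamp>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"\[AGENTEXEC (?P<pid>\d+)\] (?P<message>.*)$"
)


def parse_trace_line(line):
    """Return (timestamp, pid, message), or None if the line does not match."""
    match = TRACE_LINE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match.group("stamp"), "%a %b %d %H:%M:%S %Y")
    return stamp, int(match.group("pid")), match.group("message")


def garbage_collect_passes(lines):
    """Return the timestamp of every garbage_collect_jobs line found."""
    passes = []
    for line in lines:
        parsed = parse_trace_line(line)
        if parsed is not None and parsed[2] == "garbage_collect_jobs":
            passes.append(parsed[0])
    return passes
```

Feeding the lines of agentexec.log to garbage_collect_passes gives a quick view of how often agent_exec is waking up.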
Basically agent_exec runs the following query:
SELECT ALL r_object_id, a_last_invocation, a_last_completion, a_special_app
FROM dm_job
WHERE ( ( (a_last_invocation IS NOT NULLDATE) AND (a_last_completion IS NULLDATE))
 OR (a_special_app = 'agentexec'))
AND (i_is_reference = 0 OR i_is_reference is NULL)
AND (i_is_replica = 0 OR i_is_replica is NULL)
If jobs are returned from this query and agent_exec cannot match a job with an existing running job, it will clean up the job object, unsetting a_special_app and setting a_last_invocation to the current time.
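To make the decision concrete, here is a small Python model of it. This is only a sketch of the logic as described above, not the real implementation, which is internal to dm_agent_exec. The Job class, the function names and the made-up object ids are all hypothetical, None stands in for NULLDATE, and the reference and replica conditions from the query are left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    r_object_id: str                  # made-up ids are used in the example
    a_last_invocation: Optional[str]  # None stands in for NULLDATE
    a_last_completion: Optional[str]  # None stands in for NULLDATE
    a_special_app: str = ""


def looks_in_progress(job):
    """Mirror the WHERE clause: invoked but never completed, or flagged."""
    invoked_not_completed = (
        job.a_last_invocation is not None and job.a_last_completion is None
    )
    return invoked_not_completed or job.a_special_app == "agentexec"


def find_dead_jobs(jobs, running_ids):
    """Jobs that look in progress but match no job known to be running."""
    return [
        job for job in jobs
        if looks_in_progress(job) and job.r_object_id not in running_ids
    ]


jobs = [
    Job("job-a", "2008-01-17 13:00", "2008-01-17 13:05"),
    Job("job-b", "2008-01-17 13:30", "2008-01-17 13:31", "agentexec"),
    Job("job-c", "2008-01-17 13:35", None, "agentexec"),
]
# Suppose only job-c is genuinely running; job-b is then the 'dead' job.
dead = find_dead_jobs(jobs, running_ids={"job-c"})
```

In this example job-b is the kind of job that DA would show as Running even though nothing is executing.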
Here is some typical trace output in the agentexec.log file when I set the dm_LogPurge a_special_app attribute to agentexec.
This output shows that this is the source of the infamous message: Detected while processing dead job dm_LogPurge: The job object indicated the job was in progress, but the job was not actually running. It is likely that the dm_agent_exec utility was stopped while the job was in progress.
Examining the agentexec trace is usually enough to figure out where the problem lies. However, in extreme cases it is useful to look at the dmcl trace for the agentexec process to troubleshoot further. In principle you can do this by setting the dmcl.ini trace_file parameter to an existing directory on the Content Server. However, this has the disadvantage of turning on tracing for all dmcl processes on the content server, i.e. all jobs and methods.
What we really want to do is isolate the agentexec process from all others, and in this section I tell you how. I present the steps along with explanations for a typical Windows server. The same principle applies to *nix servers, usually with a suitable change of folder paths.
First force the agent exec to stop. You can do this by killing the main agent_exec process repeatedly. The Content Server will detect that the agent exec has died and try to restart it; however, there is a limit to the number of times this will happen (it seems to be 5 by default). Eventually you get the following message in the content server log and the dm_agent_exec stays dead:
Thu Jan 17 13:35:37 2008 984000 [DM_SESSION_W_AGENT_EXEC_FAILURE_EXCEED]warning: "The failure limit of the agent exec program has exceeded. It will not be restarted again. Please correct the problem and restart the server."
Copy the agent_exec executable to a separate directory. Copy the program file %DM_HOME%\bin\dm_agent_exec.exe to a new directory e.g. c:\Documentum\agentexec.
Copy the dmcl.ini. Copy the main dmcl.ini file in c:\windows to c:\Documentum\agentexec. Now edit the file and add the following lines:
trace_level = 10
trace_file = c:\Documentum\agentexec
We are going to take advantage of the fact that the first place the dmcl looks for the dmcl.ini is in the current working directory.
Start the agent_exec from the command line. Use the following syntax:
dm_agent_exec -docbase_name docbase -docbase_owner dmadmin -trace_level 1
Agent exec logging and trace output will continue to appear in %DOCUMENTUM%\dba\log\agentexec\agentexec.log; however, a number of dmcl trace files will also be created in the c:\Documentum\agentexec directory. One of these (probably the largest) will be the dmcl trace for the main agent_exec process. Remember that agent_exec works by forking off a new dm_agent_exec process to manage each running job, and each of these processes will have its own dmcl trace file.
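Since the largest file is the likeliest candidate for the main agent_exec trace, a quick size listing saves some hunting. Here is a minimal Python sketch, assuming the trace directory contains only the copied executable, the copied dmcl.ini and the trace files. I have not assumed anything about how the trace files are named, so the function simply excludes the two known files; the function name and the exclude parameter are my own.

```python
from pathlib import Path

# Files we copied into the directory ourselves; everything else is
# assumed to be a dmcl trace file.
KNOWN_NON_TRACE = {"dm_agent_exec.exe", "dmcl.ini"}


def trace_files_by_size(directory, exclude=KNOWN_NON_TRACE):
    """Return (size_in_bytes, path) pairs for candidate trace files, largest first."""
    excluded = {name.lower() for name in exclude}
    candidates = [
        path for path in Path(directory).iterdir()
        if path.is_file() and path.name.lower() not in excluded
    ]
    pairs = [(path.stat().st_size, path) for path in candidates]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


# Example usage on the server (path taken from the steps above):
#   for size, path in trace_files_by_size(r"c:\Documentum\agentexec"):
#       print(size, path.name)
```

The first entry returned is the one to open first; the smaller files are most likely the per-job dm_agent_exec processes.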
When you have finished tracing the agentexec you will need to kill the command line process and restart the Content Server (if anyone knows how to force the content server to restart the agentexec after the failure limit has been reached I’d love to know).
With a clear understanding of how agent_exec works and with the trace output available it should be possible to troubleshoot and resolve just about any job scheduler related problem.
Reposted from: http://robineast.wordpress.com/2008/01/17/troubleshooting-agent_exec-garbage-collection/