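Two problems encountered on 2013-04-27: 503 errors and unbalanced CPU usage across a Couchbase cluster

(Note: neither of these problems has anything to do with Alibaba Cloud.)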

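1. 503 errors

Today, from roughly 13:00 to 13:10, the site returned 503 errors. The cause was that the number of concurrent requests at that time exceeded the Queue Length of the IIS application pool, which was still set to the IIS default of 1000 (see the image below).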

[Image 1]

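We solved the problem by raising Queue Length from 1000 to 2000 (the maximum allowed value is 65535).

We later found that Performance Monitor can be used to watch "HTTP Service Request queue" -> "Arrival Rate" as a basis for choosing the Queue Length.

For example, the figure above shows a maximum "Arrival Rate" of 400, so Queue Length should be set above 400.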
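The sizing rule above can be sketched in a few lines of Python. This is only an illustration: the function name `suggest_queue_length`, the `headroom` multiplier, and the `floor` value are assumptions made for the example, not IIS recommendations. The only figures taken from the post are the default of 1000 and the upper limit of 65535.

```python
import math

# Upper limit for Queue Length, as stated in the post.
IIS_MAX_QUEUE_LENGTH = 65535


def suggest_queue_length(arrival_rates, headroom=2.0, floor=1000):
    """Suggest a Queue Length from sampled "Arrival Rate" values.

    headroom and floor are illustrative assumptions: the peak sample is
    multiplied by headroom, never allowed below floor (the IIS default),
    and capped at the maximum value IIS accepts.
    """
    if not arrival_rates:
        raise ValueError("need at least one Arrival Rate sample")
    peak = max(arrival_rates)
    suggestion = max(floor, math.ceil(peak * headroom))
    return min(suggestion, IIS_MAX_QUEUE_LENGTH)


# Hypothetical samples read off Performance Monitor, with a peak of 400.
samples = [120, 250, 400, 310]
print(suggest_queue_length(samples))
```

With a peak of 400 the sketch stays at the floor of 1000, which satisfies the "greater than 400" rule with room to spare.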

Here is the CPU monitoring graph for one of the load-balanced web servers at that time:

[Image 2]

(The red curve is %Processor Time; the green curve is Request Execution Time.)

[Image 3]

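What abnormal condition hit this cloud server at that time? It looked as though the root cause of the 503 errors was abnormal CPU behavior on the cloud server, so we filed a ticket with Alibaba Cloud to find out more.

Update:

After careful investigation, the 503 errors turned out to be caused by the application pool crashing, and the crash in turn was caused by the Couchbase client. At that time we were adding and removing servers in the Couchbase cluster.

The evidence comes from the Windows event log: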

Exception: System.NullReferenceException
Message: Object reference not set to an instance of an object.
StackTrace:    at Hammock.RestClient.CompleteWithQuery(WebQuery query, RestRequest request, RestCallback callback, WebQueryAsyncResult result)
   at Hammock.RestClient.<>c__DisplayClass18.<BeginRequestImpl>b__15(Object sender, WebQueryResponseEventArgs args)
   at System.EventHandler`1.Invoke(Object sender, TEventArgs e)
   at Hammock.Web.WebQuery.OnQueryResponse(WebQueryResponseEventArgs args)
   at Hammock.Web.WebQuery.HandleWebException(WebException exception)
   at Hammock.Web.WebQuery.GetAsyncResponseCallback(IAsyncResult asyncResult)
   at System.Net.LazyAsyncResult.Complete(IntPtr userToken)
   at System.Threading.ExecutionContext.runTryCode(Object userData)
   at System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode code, CleanupCode backoutCode, Object userData)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean ignoreSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Net.ContextAwareResult.Complete(IntPtr userToken)
   at System.Net.HttpWebRequest.SetResponse(Exception E)
   at System.Net.ConnectionReturnResult.SetResponses(ConnectionReturnResult returnResult)
   at System.Net.Connection.CompleteConnectionWrapper(Object request, Object state)
   at System.Net.PooledStream.ConnectionCallback(Object owningObject, Exception e, Socket socket, IPAddress address)
   at System.Net.ServicePoint.ConnectSocketCallback(IAsyncResult asyncResult)
   at System.Net.LazyAsyncResult.Complete(IntPtr userToken)
   at System.Net.ContextAwareResult.Complete(IntPtr userToken)
   at System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* nativeOverlapped)
   at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* pOVERLAP)
Application: w3wp.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.NullReferenceException
Stack:
   at System.Net.ServicePoint.ConnectSocketCallback(System.IAsyncResult)
   at System.Net.LazyAsyncResult.Complete(IntPtr)
   at System.Net.ContextAwareResult.Complete(IntPtr)
   at System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
   at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
Faulting application name: w3wp.exe, version: 7.5.7601.17514, time stamp: 0x4ce7afa2
Faulting module name: unknown, version: 0.0.0.0, time stamp: 0x00000000
Exception code: 0xc0000005
Fault offset: 0x000007ff0033cbed
Faulting process id: 0x10b4
Faulting application start time: 0x01ce42fb6c5d3e18
Faulting application path: c:\windows\system32\inetsrv\w3wp.exe
Faulting module path: unknown
Report Id: 30767fd7-aef7-11e2-8bf7-e5d3e0390d57

2. Unbalanced CPU usage across the Couchbase cluster

[Image 4]

(Couchbase admin console)

(Output of the Linux top command)

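The cluster consists of two Couchbase servers, yet their CPU usage differs widely. The Couchbase version is 2.0.0.

A Google search turned up "High cpu usage in memcached process". It is a bug in Couchbase 2.0.0, and upgrading to Couchbase 2.0.1, the latest version at the time, fixes it.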

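Upgrade procedure:

1. Download the installation package on both Couchbase servers: wget http://packages.couchbase.com/releases/2.0.1/couchbase-server-enterprise_x86_64_2.0.1.rpm

2. Open the Couchbase admin console and remove one server from the cluster; for the detailed procedure, see couchbase-getting-started-upgrade-online

3. Upgrade Couchbase to 2.0.1: rpm -U couchbase-server-enterprise_x86_64_2.0.1.rpm (after upgrading, it is best to restart the couchbase service: service couchbase restart)

4. Add the upgraded Couchbase server back to the cluster.

5. Repeat the same upgrade on the other Couchbase server.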
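The ordering of the steps above (download everywhere first, then upgrade one server at a time) can be written out as a small dry-run planner. This is a sketch under assumptions: the node names `cb-node-1` and `cb-node-2` and the function `rolling_upgrade_plan` are hypothetical. It only prints the plan and runs nothing; removing and re-adding a server is still done by hand in the admin console, as in the post.

```python
# Package URL and commands are copied from the steps in the post.
PACKAGE_URL = (
    "http://packages.couchbase.com/releases/2.0.1/"
    "couchbase-server-enterprise_x86_64_2.0.1.rpm"
)
PACKAGE_FILE = PACKAGE_URL.rsplit("/", 1)[-1]


def rolling_upgrade_plan(nodes):
    """Return an ordered list of (node, kind, action) tuples.

    kind is "run" for a shell command to execute on the node, or "manual"
    for an action performed in the Couchbase admin console.
    """
    plan = []
    # Step 1: download the package on every server before touching the cluster.
    for node in nodes:
        plan.append((node, "run", f"wget {PACKAGE_URL}"))
    # Steps 2-5: take servers out one at a time so the cluster stays up.
    for node in nodes:
        plan.append((node, "manual", "remove this server from the cluster"))
        plan.append((node, "run", f"rpm -U {PACKAGE_FILE}"))
        # Restart command as written in the post.
        plan.append((node, "run", "service couchbase restart"))
        plan.append((node, "manual", "add this server back to the cluster"))
    return plan


# Hypothetical node names; this prints the plan without executing anything.
for node, kind, action in rolling_upgrade_plan(["cb-node-1", "cb-node-2"]):
    print(f"[{node}] {kind}: {action}")
```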

After the upgrade, the problem was resolved.

[Image 5]
