First, some basic operations:
Check memory usage: (Note: 1. Memory actually available = free + buffers + cached. 2. Heavy swap usage will seriously hurt application performance.)
[@yd-80-133 ~]# free -m
             total       used       free     shared    buffers     cached
Mem:         96636      96400        235          0        522      75056
-/+ buffers/cache:      20821      75814
Swap:         8189         49       8139
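The "actually available" figure can be computed from the Mem: line rather than read off by eye. A minimal sketch, assuming the old-style `free -m` layout shown above (newer versions of free print different columns); the function name `available_mb` is made up for illustration:

```python
def available_mb(free_output: str) -> int:
    """Return free + buffers + cached (MB) from old-style `free -m` output."""
    for line in free_output.splitlines():
        if line.startswith("Mem:"):
            fields = line.split()
            # Columns: Mem: total used free shared buffers cached
            free, buffers, cached = int(fields[3]), int(fields[5]), int(fields[6])
            return free + buffers + cached
    raise ValueError("no 'Mem:' line found")


sample = """\
             total       used       free     shared    buffers     cached
Mem:         96636      96400        235          0        522      75056
-/+ buffers/cache:      20821      75814
Swap:         8189         49       8139
"""

print(available_mb(sample))  # prints 75813
```

The result differs by 1 MB from the 75814 on the -/+ buffers/cache line only because free rounds each column separately.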
Check disk usage: (If the disk your application is deployed on reaches 100% usage, the application becomes unavailable.)
[@yd-81-74 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             3.9G  977M  2.8G  26% /
/dev/sda6             1.4T  194G  1.1T  16% /opt
/dev/sda3             3.9G  2.4G  1.3G  66% /var
/dev/sda5             4.9G  3.0G  1.7G  64% /usr
tmpfs                  12G   38M   12G   1% /dev/shm
10.13.81.44:/data/scribelog
                       21T  7.3T   13T  37% /opt/scribelog
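The same check can be scripted so that a nearly full disk is noticed before the application stops working. A minimal sketch using the standard library; the 90% threshold and both function names are assumptions for illustration, not values from our setup:

```python
import shutil


def usage_percent(path: str) -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path: str, threshold: float = 90.0) -> bool:
    """True when usage is at or above `threshold` percent (hypothetical limit)."""
    return usage_percent(path) >= threshold


if __name__ == "__main__":
    for mount in ("/",):
        print(mount, round(usage_percent(mount), 1), is_nearly_full(mount))
```

The number can differ slightly from the Use% column of df, which leaves blocks reserved for root out of its calculation.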
Check the system overview: (top shows a lot of information. Shift+P sorts by CPU descending, Shift+M sorts by memory descending, and 1 shows how busy each CPU is.)
top - 16:38:58 up 1019 days, 1:53, 28 users, load average: 0.77, 0.53, 0.56
Tasks: 325 total, 1 running, 324 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.7%us, 0.3%sy, 0.0%ni, 98.6%id, 0.4%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 24659996k total, 22502624k used, 2157372k free, 118628k buffers
Swap: 4192956k total, 13344k used, 4179612k free, 324068k cached
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+  COMMAND
 4033 root      18   0 1996m 975m  13m S 159.0  4.1   6:40.52 java
12336 root      18   0 2020m 1.2g  10m S   9.8  5.3   2860:34 java
 3484 root      34  19     0    0    0 S   2.0  0.0  16159:29 kipmi0
 7350 root      15   0 12868 1192  740 R   2.0  0.0   0:00.01 top
29636 smc       21   0 1092m 579m  14m S   2.0  2.4   1:20.44 java
30469 smc       21   0 1075m 708m  14m S   2.0  2.9   5:34.31 java
Check how many threads a process has:
[@yd-81-211 ~]$ ps -eLf | grep 24941 | wc -l
583
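Note that the grep in the pipeline above counts every line containing 24941, including the grep process itself, so the number can be slightly high. On Linux the kernel also reports the exact count in /proc/<pid>/status. A minimal sketch reading that file; the function names are made up for illustration:

```python
def thread_count(status_text: str) -> int:
    """Extract the 'Threads:' value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    raise ValueError("no 'Threads:' line found")


def thread_count_for_pid(pid: int) -> int:
    """Read /proc/<pid>/status (Linux only) and return the thread count."""
    with open(f"/proc/{pid}/status") as fh:
        return thread_count(fh.read())


if __name__ == "__main__":
    # Parse a hand-written fragment; 24941 would be passed to
    # thread_count_for_pid() on the real machine.
    print(thread_count("Name:\tjava\nThreads:\t583\n"))
```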
Find which process is using a port: (the PID column identifies the process)
[@yd-81-74 ~]# lsof -i tcp:8080
COMMAND   PID USER   FD   TYPE   DEVICE SIZE NODE NAME
java    30469  smc  239u  IPv4 64471973       TCP 10.13.81.74:webcache (LISTEN)
The same command can also show which process is holding a file open:
[@yd-81-130 nginx]# lsof | grep /data/log/scribelog
bash  4171 smc cwd DIR 0,23 16384 17797 /data/log/scribelog/user (10.13.81.44:/data/scribelog/)
cat   4172 smc cwd DIR 0,23 16384 17797 /data/log/scribelog/user (10.13.81.44:/data/scribelog/)
Check whether a port is being listened on: (the last column shows which process is listening on the port)
[@yd-81-74 ~]# netstat -nalp | grep 8080
tcp        0      0 10.13.81.74:8080        0.0.0.0:*        LISTEN      30469/java
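When netstat is not at hand, or the check has to run from another machine, attempting a TCP connection answers the same question from the client side. A minimal sketch; `is_listening` is an illustrative name, and it cannot tell you which process owns the port:

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, unreachable host, and so on.
        return False


if __name__ == "__main__":
    print(is_listening("127.0.0.1", 8080))
```

A False result only means the connection did not succeed; a firewall in between produces the same answer as a port nobody is listening on.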
Check whether two files are identical:
md5sum targetfile.txt > targetfile.md5
Put targetfile.md5 in the same directory as the targetfile.txt you want to check, then verify:
md5sum -c targetfile.md5
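The same comparison can be done without md5sum, for example inside a deployment script. A minimal sketch with hashlib; the function names are illustrative, and files are read in chunks so large files do not have to fit in memory:

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex MD5 digest of a file, read in chunks of `chunk_size` bytes."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a: str, path_b: str) -> bool:
    """True when both files have the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

MD5 is fine for spotting a corrupted or outdated copy, which is the use here; it is not suitable where someone might tamper with the file on purpose.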
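Below are the situations we have run into. If anything is wrong or missing, please edit this page directly.
Memory exceptions:
♦ java.lang.OutOfMemoryError: PermGen space
》Resin hot deployment, which reloads the jar files
》The permanent generation is set too small, e.g. -XX:PermSize=32m -XX:MaxPermSize=64m
》Common in test environments; it essentially never happens in production.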
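♦ java.lang.OutOfMemoryError: Java heap space
》Usually caused by loading a large amount of data from the database or cache, or by users uploading a large number of files.
》Generally, because GC is triggered first, this exception rarely appears in production as long as the code has no memory leak.
》Fix: restart, then fix the underlying problem in the code.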
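♦ java.lang.OutOfMemoryError: GC overhead limit exceeded
》This happens when GC uses the parallel collector and the GC overhead limit check is in effect (-XX:+UseGCOverheadLimit, which is on by default): the JVM throws this error when it spends almost all of its time in GC while reclaiming very little heap.
》So far we have only hit this in the Hive application; our production applications generally use CMS, and we have not seen it there.
》Fix: increase the heap size, or turn the check off with -XX:-UseGCOverheadLimit.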
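♦ Very often a memory problem does not surface as an exception: the system is already unusable before that threshold is reached. In that case you need to check memory usage proactively:
》Check GC status to see whether the JVM is busy garbage collecting (below is one example; I will paste another when we next hit a typical case):
[@tc-152-92 ~]$ jstat -gcutil 16590 3000
  S0     S1     E      O      P     YGC     YGCT    FGC      FGCT       GCT
  0.00   0.00  85.96  90.60  54.20   3336    0.781  38188  13952.038  13952.819
  0.00   0.00  91.42  90.60  54.20   3336    0.781  38189  13952.565  13953.346
  0.00   0.00  97.43  90.60  54.20   3336    0.781  38190  13952.960  13953.741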
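In the sample above the FGC counter goes up by one on every 3-second sample while the old generation (O) stays at 90.60%, so the JVM is running Full GCs back to back without freeing anything. That pattern is easy to detect automatically. A minimal sketch, assuming the jstat -gcutil column layout shown above; the 90% threshold and the function names are assumptions for illustration:

```python
def parse_gcutil(output: str):
    """Turn `jstat -gcutil` text into a list of {column: float} dicts."""
    lines = [ln.split() for ln in output.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, map(float, row))) for row in rows]


def looks_like_full_gc_thrash(samples, old_threshold: float = 90.0) -> bool:
    """True when every sample adds a Full GC yet old gen stays above the threshold."""
    if len(samples) < 2:
        return False
    fgc_rising = all(b["FGC"] > a["FGC"] for a, b in zip(samples, samples[1:]))
    old_stuck = all(s["O"] >= old_threshold for s in samples)
    return fgc_rising and old_stuck


sample = """\
S0 S1 E O P YGC YGCT FGC FGCT GCT
0.00 0.00 85.96 90.60 54.20 3336 0.781 38188 13952.038 13952.819
0.00 0.00 91.42 90.60 54.20 3336 0.781 38189 13952.565 13953.346
0.00 0.00 97.43 90.60 54.20 3336 0.781 38190 13952.960 13953.741
"""

print(looks_like_full_gc_thrash(parse_gcutil(sample)))  # prints True
```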
》Check how much memory Java objects occupy, to see whether the objects in memory look reasonable (below is just an example; I will paste another when we next hit abnormal memory usage):
[@zjm-110-88 ~]$ jmap -histo 2234 | head -10
 num     #instances         #bytes  class name
----------------------------------------------
   1:       3373503     2209452824  [C
   2:       3334031      133361240  java.lang.String
   3:           260      101301344  [Lcom.caucho.util.LruCache$CacheItem;
   4:        326846       63127704  [Ljava.lang.Object;
   5:        151274       50828064  com.wap.sohu.mobilepaper.model.NewsContent
   6:         19812       45474976  [I
   7:        110209       40197776  [B
   8:        145988       30902344  [Ljava.util.HashMap$Entry;
   9:       1846859       29549744  java.lang.Object
  10:        270121       19448712  com.wap.sohu.mobilepaper.model.xml.Image
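In this histogram `[C` (char arrays, mostly the backing storage of String objects) accounts for about 2.06 GiB on its own. When comparing several histograms it helps to parse them instead of reading them by eye. A minimal sketch, assuming the jmap -histo row layout shown above; the function names are illustrative:

```python
def parse_histo(output: str):
    """Return (class_name, instances, bytes) tuples from `jmap -histo` text."""
    entries = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows look like: "1: 3373503 2209452824 [C"
        if len(fields) >= 4 and fields[0].endswith(":") and fields[0][:-1].isdigit():
            entries.append((fields[3], int(fields[1]), int(fields[2])))
    return entries


def top_by_bytes(entries, n: int = 3):
    """The n entries that occupy the most bytes, largest first."""
    return sorted(entries, key=lambda e: e[2], reverse=True)[:n]


sample = """\
num #instances #bytes class name
----------------------------------------------
1: 3373503 2209452824 [C
2: 3334031 133361240 java.lang.String
5: 151274 50828064 com.wap.sohu.mobilepaper.model.NewsContent
"""

for name, count, size in top_by_bytes(parse_histo(sample)):
    print(f"{name}: {count} instances, {size / 2**20:.0f} MiB")
```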
♦ Commonly used JVM parameters:
-XX:MaxPermSize=512m -XX:PermSize=512m -Xss128k
-Xmx4096m -Xms4096m -Xmn1024m
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=85 -XX:+PrintGCDetails
-XX:MaxTenuringThreshold=30
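CPU exceptions:
♦ Too many running threads. Our applications have quite a few threads that run tasks asynchronously; a point in time or an event may trigger a large number of threads at once, leaving CPU resources tight.
♦ The program has become slow, for example heavy computation or frequent loop iteration.
♦ Heavy I/O, for example frequent logging or frequent network access (mysql, memcache).
♦ Too much synchronization, for example synchronized.
♦ In general we identify program anomalies by looking at the JVM stack information, mainly the java.lang.Thread.State value; BLOCKED and RUNNABLE both deserve close attention. A BLOCKED thread is always waiting for a lock, for example explicit locking in our own code, or frequent I/O through a shared resource that is guarded by a lock. RUNNABLE is normal in theory, but it may well mean that the logic is too slow (network I/O or computation; a thread waiting in a native socket read is also reported as RUNNABLE, as in the dump below) or that a piece of code is called so often that it runs for a long time, and that needs optimizing too.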
[@yd-80-133 ~]$ jstack 1344
2013-06-08 16:15:42
Full thread dump Java HotSpot(TM) 64-Bit Server VM (20.8-b03 mixed mode):
"pool-40-thread-5" prio=10 tid=0x000000005cea6800 nid=0x639c runnable [0x00000000493c5000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at com.mysql.jdbc.util.ReadAheadInputStream.fill(ReadAheadInputStream.java:114)
at com.mysql.jdbc.util.ReadAheadInputStream.readFromUnderlyingStreamIfNecessary(ReadAheadInputStream.java:161)
at com.mysql.jdbc.util.ReadAheadInputStream.read(ReadAheadInputStream.java:189)
- locked <0x000000074c115a08> (a com.mysql.jdbc.util.ReadAheadInputStream)
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3014)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3467)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3456)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3997)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2468)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2629)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2719)
- locked <0x0000000770fb3380> (a com.mysql.jdbc.JDBC4Connection)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2155)
- locked <0x0000000770fb3380> (a com.mysql.jdbc.JDBC4Connection)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2450)
- locked <0x0000000770fb3380> (a com.mysql.jdbc.JDBC4Connection)
at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2006)
- locked <0x0000000770fb3380> (a com.mysql.jdbc.JDBC4Connection)
at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1467)
- locked <0x0000000770fb3380> (a com.mysql.jdbc.JDBC4Connection)
at com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeBatch(NewProxyPreparedStatement.java:1723)
at org.springframework.jdbc.core.JdbcTemplate$4.doInPreparedStatement(JdbcTemplate.java:873)
at org.springframework.jdbc.core.JdbcTemplate$4.doInPreparedStatement(JdbcTemplate.java:1)
at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:586)
at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:614)
at org.springframework.jdbc.core.JdbcTemplate.batchUpdate(JdbcTemplate.java:858)
at com.wap.sohu.mobilepaper.dao.statistic.StatisticDao$BatchUpdateTask2.run(StatisticDao.java:285)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
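A full dump usually contains hundreds of threads, so a quick tally of java.lang.Thread.State values is a useful first look before reading individual stacks. A minimal sketch that works on the text produced by jstack; the function name and the sample dump are made up for illustration:

```python
import re
from collections import Counter

# Matches lines such as "   java.lang.Thread.State: RUNNABLE"
STATE_RE = re.compile(r"java\.lang\.Thread\.State: (\w+)")


def count_states(dump: str) -> Counter:
    """Count how many threads are in each java.lang.Thread.State."""
    return Counter(STATE_RE.findall(dump))


sample = """\
"worker-1" prio=10 tid=0x01 nid=0x10 runnable [0x01]
   java.lang.Thread.State: RUNNABLE
"worker-2" prio=10 tid=0x02 nid=0x11 waiting for monitor entry [0x02]
   java.lang.Thread.State: BLOCKED (on object monitor)
"worker-3" prio=10 tid=0x03 nid=0x12 runnable [0x03]
   java.lang.Thread.State: RUNNABLE
"""

for state, n in count_states(sample).most_common():
    print(state, n)
```

A large BLOCKED count points at lock contention; many RUNNABLE threads sitting in the same frame point at a slow call, as with the socket read in the dump above.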
♦ One way to find the threads that are consuming CPU:
1. Find the id of the java process; top shows which process is consuming the most CPU. Here is the process information:
[@tc-152-92 ~]$ ps -ef | grep 18950
smc 18950 1 90 Dec19 ? 21:22:46 java -server -Xmx768m -Xms768m -Xss128k -Xmn300m -XX:MaxPermSize=128m -XX:PermSize=128m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=85 -XX:+PrintGCDetails -XX:+UseMembar -Dhost_home=/opt/smc -Dserver_log_home=/opt/smc/log/server -Xloggc:/opt/smc/log/server/check_instance_gc.log -Dserver_ip=10.11.152.92 -Dserver_resources=/opt/smc/apps/server/server_apps/check_instance/resources/ -Dserver_name=check_instance com.wap.sohu.server.SmcApiServer 8010
2. Look at the threads consuming the most CPU at that moment: (-p specifies the java process id, -H shows all threads)
[@tc-152-92 ~]$ top -p 18950 -H
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+  COMMAND
18999 smc       16   0 1676m 1.0g  14m S 32.3  6.5 268:50.16 java
18997 smc       16   0 1676m 1.0g  14m S 30.4  6.5 267:47.97 java
18998 smc       17   0 1676m 1.0g  14m S 30.4  6.5 268:15.28 java
19000 smc       16   0 1676m 1.0g  14m S 30.4  6.5 268:06.23 java
19001 smc       15   0 1676m 1.0g  14m S  5.7  6.5  81:34.02 java
3. Save the stack information from that moment:
[@tc-152-92 ~]$ jstack 18950 > 18950.txt
4. Convert the ID of the most CPU-hungry thread above to hexadecimal:
[@tc-152-92 ~]$ python
Python 2.4.3 (#1, Jun 11 2009, 14:09:37)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-44)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> hex(18999)
'0x4a37'
>>>
5. Search the stack file for that hexadecimal thread id:
[@tc-152-92 ~]$ vim 18950.txt
"Concurrent Mark-Sweep GC Thread" prio=10 tid=0x000000004183b800 nid=0x4a39 runnable
"Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x0000000041834000 nid=0x4a35 runnable
"Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x0000000041836000 nid=0x4a36 runnable
"Gang worker#2 (Parallel CMS Threads)" prio=10 tid=0x0000000041837800 nid=0x4a37 runnable
"Gang worker#3 (Parallel CMS Threads)" prio=10 tid=0x0000000041839800 nid=0x4a38 runnable
Here we find that it is all GC threads consuming the CPU. The process information at the top shows that this java process was given a heap of under 1G (-Xmx768m), and the memory usage shown by top has already reached 1.0g, which matches.
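Steps 4 and 5 can be folded into one small script: convert the thread id from top to hexadecimal and look up the matching nid in the saved dump. A minimal sketch; the function name is illustrative, and it assumes the jstack line format shown above, where the thread name is quoted at the start of the line:

```python
import re


def find_thread_by_tid(dump: str, tid: int):
    """Return the name of the thread whose nid equals hex(tid), or None."""
    nid = hex(tid)  # e.g. 18999 -> '0x4a37'
    pattern = re.compile(
        r'^"(?P<name>[^"]+)".*\bnid=' + re.escape(nid) + r"\b",
        re.MULTILINE,
    )
    match = pattern.search(dump)
    return match.group("name") if match else None


sample = """\
"Concurrent Mark-Sweep GC Thread" prio=10 tid=0x000000004183b800 nid=0x4a39 runnable
"Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x0000000041834000 nid=0x4a35 runnable
"Gang worker#2 (Parallel CMS Threads)" prio=10 tid=0x0000000041837800 nid=0x4a37 runnable
"""

print(find_thread_by_tid(sample, 18999))  # prints Gang worker#2 (Parallel CMS Threads)
```

On the real machine the dump would be read from 18950.txt and the function called once for each PID in the top -H output.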
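Common exceptions in code:
♦ java.lang.OutOfMemoryError: unable to create new native thread
The program has started too many threads. One case is that the user that starts the program (smc) has a limit on the maximum number of threads it may run (check with ulimit -a); the other is that the code itself starts a lot of threads.
When this exception occurs, logging in to the server as that user fails with a "Resource temporarily unavailable" error. (If the process was started by root, the only option is to reboot the machine.)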
♦ Broken Pipe // TODO...
♦ Too many open files // TODO...
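♦ c3p0 connection pool exception: Attempted to use a closed or broken resource pool
<property name="breakAfterAcquireFailure" value="true"></property>
Change it to:
<property name="breakAfterAcquireFailure" value="false"></property>
If the parameter is true, a single failed attempt to acquire a database connection causes the whole data source to be declared broken and closed permanently, and the service becomes unavailable.
If the parameter is false, a failed attempt to acquire a connection throws an exception, but the data source stays valid and works normally again once a later attempt succeeds.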
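♦ Class initialization exception
java.lang.NoClassDefFoundError: Could not initialize class com.wap.sohu.mobilepaper.util.ClientUserCenterHttpUtils
> Your classpath may be wrong, or there may be a class loading order problem (jar conflict).
> The class file does not exist.
> An uncaught exception was thrown during class initialization, for example a problem in a static block or a static variable initializer.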
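Online debugging
Install: 10.10.76.79:/root/btrace-bin.tar.gz. Copy this file to the target machine, create a btrace directory, and unpack it there; it is then ready to use: tar -zxvf btrace-bin.tar.gz -C btrace/
Fix permissions: cd bin/; chmod 744 *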
import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;
import java.util.*;

@BTrace
public class Response {
    @OnMethod(clazz = "com.sohu.smc.reply.core.LocalCache", method = "getBulk", location = @Location(Kind.RETURN))
    public static void onGetBulk(String[] keys, @Return Map<String, Object> result) {
        println("=======================================");
        printArray(keys);
        println(strcat("Params keys length:", str(keys.length)));
        println(strcat("Result length:", str(size(result))));
    }

    // @OnMethod(clazz = "com.sohu.smc.reply.core.CommentCursorList", method = "getIdList", location = @Location(Kind.RETURN))
    // public static void onGetIdList(@Return List<Long> result) {
    //     println(strcat("ID LIST:", str(size(result))));
    // }
}
[@yd-80-153 btrace]$ ./bin/btrace -cp build/ 13951 samples/ThreadStart.java
starting pool-7-thread-1
starting pool-17-thread-1
starting pool-6-thread-2
starting pool-7-thread-2
starting pool-7-thread-3
starting pool-7-thread-4
starting pool-6-thread-3
starting pool-6-thread-4
starting pool-6-thread-5
starting pool-6-thread-6
starting netty-io.airlift.http.client-cli-io-boss-0
starting netty-io.airlift.http.client-cli-io-worker-0
starting netty-io.airlift.http.client-cli-io-worker-1
starting netty-io.airlift.http.client-cli-io-worker-2
starting netty-io.airlift.http.client-cli-io-worker-3
starting netty-io.airlift.http.client-cli-io-worker-4