ANR故障是Android开发中的重点难点问题,而traces.txt日志的分析则是解决ANR问题的关键所在。有很多人对于traces.txt日志中的诸多线程状态参数一知半解,不知所云。本文章对这些状态参数结合源码进行全面的解读,相信对于解决ANR问题大有裨益(本文并不提供解决ANR问题本身的思路和方法)。
执行 adb shell kill -3 pid或者当进程发生anr、native crash等故障的时候,系统会生成traces日志文件,默认输出到 /data/anr/ 目录下。traces日志主要由发生anr时的资源使用情况以及各个线程的状态组成。
----- pid 8072 at 2020-03-22 10:23:06 -----
Cmd line: com.zte.camera
Build fingerprint: 'ZTE/GEN_CN_P439S01/P439S01:9/PKQ1.180929.001/20181106.134337:userdebug/test-keys'
ABI: 'arm'
...
Heap: 41% free, 11MB/20MB; 186365 objects
...
Total memory 20MB
Max memory 512MB
Zygote space size 1584KB
...
suspend all histogram: Sum: 191us 99% C.I. 4us-62us Avg: 21.222us Max: 62us
DALVIK THREADS (24):
"Signal Catcher" daemon prio=5 tid=3 Runnable
| group="system" sCount=0 dsCount=0 flags=0 obj=0x14540088 self=0xeb6cb200
| sysTid=8078 nice=0 cgrp=default sched=0/0 handle=0xe5dcb970
| state=R schedstat=( 59094373 1072863 12 ) utm=4 stm=1 core=4 HZ=100
| stack=0xe5cd0000-0xe5cd2000 stackSize=1010KB
| held mutexes= "mutator lock"(shared held)
native: #00 pc 002d975f /system/lib/libart.so (art::DumpNativeStack(std::__1::basic_ostream>&, int, BacktraceMap*, char const*, art::ArtMethod*, void*, bool)+134)
native: #01 pc 0036e91b /system/lib/libart.so (art::Thread::DumpStack(std::__1::basic_ostream>&, bool, BacktraceMap*, bool) const+210)
native: #02 pc 0036b0d3 /system/lib/libart.so (art::Thread::Dump(std::__1::basic_ostream>&, bool, BacktraceMap*, bool) const+34)
native: #03 pc 00383d89 /system/lib/libart.so (art::DumpCheckpoint::Run(art::Thread*)+624)
native: #04 pc 0037e06f /system/lib/libart.so (art::ThreadList::RunCheckpoint(art::Closure*, art::Closure*)+314)
native: #05 pc 0037d767 /system/lib/libart.so (art::ThreadList::Dump(std::__1::basic_ostream>&, bool)+758)
native: #06 pc 0037d39f /system/lib/libart.so (art::ThreadList::DumpForSigQuit(std::__1::basic_ostream>&)+614)
native: #07 pc 00357669 /system/lib/libart.so (art::Runtime::DumpForSigQuit(std::__1::basic_ostream>&)+120)
native: #08 pc 00360689 /system/lib/libart.so (art::SignalCatcher::HandleSigQuit()+1040)
native: #09 pc 0035f81b /system/lib/libart.so (art::SignalCatcher::Run(void*)+246)
native: #10 pc 00071db1 /system/lib/libc.so (__pthread_start(void*)+22)
native: #11 pc 0001de65 /system/lib/libc.so (__start_thread+24)
(no managed stack frames)
"main" prio=5 tid=1 Native
| group="main" sCount=1 dsCount=0 flags=1 obj=0x74cb56d8 self=0xeb6ca000
| sysTid=8072 nice=-10 cgrp=default sched=0/0 handle=0xeffe9494
| state=S schedstat=( 1457862132 142332556 1571 ) utm=114 stm=31 core=4 HZ=100
| stack=0xff195000-0xff197000 stackSize=8MB
| held mutexes=
kernel: __switch_to+0x90/0xe8
kernel: futex_wait_queue_me+0xc4/0x13c
kernel: futex_wait+0xe4/0x204
kernel: do_futex+0x168/0x9a0
kernel: compat_SyS_futex+0xf0/0x174
kernel: __sys_trace+0x4c/0x4c
native: #00 pc 00019e8c /system/lib/libc.so (syscall+28)
其中以下片段为线程的各项状态参数,是本文将要解析的部分。
"main" prio=5 tid=1 Native
| group="main" sCount=1 dsCount=0 flags=1 obj=0x74cb56d8 self=0xeb6ca000
| sysTid=8072 nice=-10 cgrp=default sched=0/0 handle=0xeffe9494
| state=S schedstat=( 1457862132 142332556 1571 ) utm=114 stm=31 core=4 HZ=100
| stack=0xff195000-0xff197000 stackSize=8MB
| held mutexes=
输出线程的状态参数所涉及的核心代码为Thread::DumpState函数(代码位于/art/runtime/thread.cc文件),以下深入全面解析各行参数的涵义:
(1)"main" prio=5 tid=1 Native
线程名。"main"表示该线程为主线程;"Binder:8072_2"表示为线程与进程8072关联的Binder线程;"Signal Catcher" daemon表示为守护线程Signal Catcher。
os << '"' << *thread->tlsPtr_.name << '"';
线程优先级。通过java.lang.Thread.setPriority设置,取值范围 [1, 10],低优先级 -> 高优先级。
os << " prio=" << priority
进程内部线程id。取值一般比较小。
os << " tid=" << thread->GetThreadId();
线程状态。具体取值及意义如下图。
os << " " << thread->GetState();
(2)group="main" sCount=1 dsCount=0 flags=1 obj=0x74cb56d8 self=0xeb6ca000
线程所属的线程组。Java中使用ThreadGroup来表示线程组,它可以对一批线程进行分类管理,Java允许程序直接对线程组进行控制。默认情况下,所有的线程都属于主线程组。我们也可以给线程设置分组。
os << " | group=\"" << group_name << "\"";
线程挂起次数。
os << " sCount=" << thread->tls32_.suspend_count;
用于调试的线程挂起次数。
os << " dsCount=" << thread->tls32_.debug_suspend_count;
os << " flags=" << thread->tls32_.state_and_flags.as_struct.flags;
当前线程关联的java线程对象。
os << " obj=" << reinterpret_cast(thread->tlsPtr_.opeer);
当前线程地址。
os << " self=" << reinterpret_cast(thread) << "\n";
(3)sysTid=8072 nice=-10 cgrp=default sched=0/0 handle=0xeffe9494
线程号,Linux内核分配的id。在系统中是唯一编号的。
os << " | sysTid=" << tid;
线程的调度优先级。通过android.os.Process.setThreadPriority设置,取值范围 [-20, 19],高优先级 -> 低优先级。
os << " nice=" << getpriority(PRIO_PROCESS, tid);
程所属的进程调度组。Control groups,是Linux内核的一个功能,用来限制,控制与分离一个进程组群的资源(如CPU、内存、磁盘输入输出等)。
os << " cgrp=" << scheduler_group_name;
分别标志了线程的调度策略和优先级。
os << " sched=" << policy << "/" << sp.sched_priority;
线程的处理函数地址。
os << " handle=" << reinterpret_cast(thread->tlsPtr_.pthread_self);
(4)state=S schedstat=( 1457862132 142332556 1571 ) utm=114 stm=31 core=4 HZ=100
线程的调度状态。
os << " | state=" << native_thread_state;
CPU调度时间统计,分别表示线程在cpu上执行的时间、线程的等待时间和线程执行的时间片长度。可通过adb shell cat /proc/[pid]/task/[tid]/schedstat查看。
os << " schedstat=( " << scheduler_stats << " )";
用户态/内核态下使用CPU时间(单位是jiffies)。adb shell cat /proc/[pid]/task/[tid]/stat查看。
os << " utm=" << utime
<< " stm=" << stime;
最后执行该线程的cpu核的序号 。
os << " core=" << task_cpu;
全局变量jiffies用来记录自系统启动以来产生的中断总数。系统启动时,内核将该变量初始化为0,之后每次时钟中断处理程序都会增加该变量的值。一秒内时钟中断的次数等于Hz。Tick是HZ的倒数,表示每发生一次时钟中断所需的时间。HZ=100,那么Tick=10ms。每秒增加的jiffies数等于Hz。系统运行时间以秒为单位,等于jiffies/Hz。
os << " HZ=" << sysconf(_SC_CLK_TCK) << "\n";
(5)stack=0xff195000-0xff197000 stackSize=8MB
线程栈的地址区间。
os << " | stack=" << reinterpret_cast(thread->tlsPtr_.stack_begin) << "-"
<< reinterpret_cast(thread->tlsPtr_.stack_end);
线程栈的大小。stack地址区间的距离(0xff197000-0xff195000 = 8192KB)。Android中主线程栈大小默认为8M,子线程栈大小稍微小于1M。
os << PrettySize(thread->tlsPtr_.stack_size) << "\n";
(6)held mutexes=
线程所持有mutex类型。分为独占锁exclusive和共享锁shared两类。
os << " | held mutexes=";
for (size_t i = 0; i < kLockLevelCount; ++i) {
if (i != kMonitorLock) {
BaseMutex* mutex = thread->GetHeldMutex(static_cast(i));
if (mutex != nullptr) {
os << " \"" << mutex->GetName() << "\"";
if (mutex->IsReaderWriterMutex()) {
ReaderWriterMutex* rw_mutex = down_cast(mutex);
if (rw_mutex->GetExclusiveOwnerTid() == tid) {
os << "(exclusive held)";
} else {
os << "(shared held)";
}
}
}
}
}
以下为/art/runtime/thread.cc->Thread::DumpState函数的主要代码。如需查看完整源码,请移步http://androidxref.com/9.0.0_r3/xref/art/runtime/thread.cc。
void Thread::DumpState(std::ostream& os, const Thread* thread, pid_t tid) {
std::string group_name;
int priority;
bool is_daemon = false;
Thread* self = Thread::Current();
if (thread != nullptr) {
os << '"' << *thread->tlsPtr_.name << '"';
if (is_daemon) {
os << " daemon";
}
os << " prio=" << priority
<< " tid=" << thread->GetThreadId()
<< " " << thread->GetState();
if (thread->IsStillStarting()) {
os << " (still starting up)";
}
os << "\n";
} else {
os << '"' << ::art::GetThreadName(tid) << '"'
<< " prio=" << priority
<< " (not attached)\n";
}
if (thread != nullptr) {
MutexLock mu(self, *Locks::thread_suspend_count_lock_);
os << " | group=\"" << group_name << "\""
<< " sCount=" << thread->tls32_.suspend_count
<< " dsCount=" << thread->tls32_.debug_suspend_count
<< " flags=" << thread->tls32_.state_and_flags.as_struct.flags
<< " obj=" << reinterpret_cast(thread->tlsPtr_.opeer)
<< " self=" << reinterpret_cast(thread) << "\n";
}
os << " | sysTid=" << tid
<< " nice=" << getpriority(PRIO_PROCESS, tid)
<< " cgrp=" << scheduler_group_name;
if (thread != nullptr) {
int policy;
sched_param sp;
#if !defined(__APPLE__)
policy = sched_getscheduler(tid);
#else
CHECK_PTHREAD_CALL(pthread_getschedparam, (thread->tlsPtr_.pthread_self, &policy, &sp), __FUNCTION__);
#endif
os << " sched=" << policy << "/" << sp.sched_priority
<< " handle=" << reinterpret_cast(thread->tlsPtr_.pthread_self);
}
os << "\n";
// Grab the scheduler stats for this thread.
std::string scheduler_stats;
if (ReadFileToString(StringPrintf("/proc/self/task/%d/schedstat", tid), &scheduler_stats)) {
scheduler_stats.resize(scheduler_stats.size() - 1); // Lose the trailing '\n'.
} else {
scheduler_stats = "0 0 0";
}
char native_thread_state = '?';
int utime = 0;
int stime = 0;
int task_cpu = 0;
GetTaskStats(tid, &native_thread_state, &utime, &stime, &task_cpu);
os << " | state=" << native_thread_state
<< " schedstat=( " << scheduler_stats << " )"
<< " utm=" << utime
<< " stm=" << stime
<< " core=" << task_cpu
<< " HZ=" << sysconf(_SC_CLK_TCK) << "\n";
if (thread != nullptr) {
os << " | stack=" << reinterpret_cast(thread->tlsPtr_.stack_begin) << "-"
<< reinterpret_cast(thread->tlsPtr_.stack_end) << " stackSize="
<< PrettySize(thread->tlsPtr_.stack_size) << "\n";
// Dump the held mutexes.
os << " | held mutexes=";
for (size_t i = 0; i < kLockLevelCount; ++i) {
if (i != kMonitorLock) {
BaseMutex* mutex = thread->GetHeldMutex(static_cast(i));
if (mutex != nullptr) {
os << " \"" << mutex->GetName() << "\"";
if (mutex->IsReaderWriterMutex()) {
ReaderWriterMutex* rw_mutex = down_cast(mutex);
if (rw_mutex->GetExclusiveOwnerTid() == tid) {
os << "(exclusive held)";
} else {
os << "(shared held)";
}
}
}
}
}
os << "\n";
}
}
希望能给大家带来帮助。
如有错误,欢迎指正。
参考资料:
http://gityuan.com/2016/11/26/art-trace/
http://androidxref.com/9.0.0_r3/xref/art/runtime/thread.cc
https://www.jianshu.com/p/cf504cf98750
https://blog.csdn.net/a740169405/article/details/98178287
https://www.cnblogs.com/hebao0514/p/4510498.html
https://blog.csdn.net/l460133921/article/details/51134213
https://blog.csdn.net/u010154760/article/details/45312471