Sensor记录日志导致的亮屏慢问题分析

一. 问题描述

1.1 现象

      手机解锁卡顿

1.2 结论

    diag记录数据慢导致系统卡顿

二. 问题分析


2.1日志分析

    在大量的日志中发现Slow Looper的日志,都是DisplayPowerController的DisplayControllerHandler处理消息超时,并且有大量的dvm_lock_sample提示registerListenerImpl耗时了,亮屏过程中无论system_server还是systemui都会去注册很多sensor,由于大量异常的

registerListenerImpl慢不得不怀疑是sensor注册慢导致的:

关键日志1

11-19 20:03:43.936 1000 1683 1903 W Looper : Slow Looper PowerManagerService: Long Msg: seq=93258 plan=20:03:40.597 late=0ms wall=3340ms running=9ms runnable=6ms h=com.android.server.display.DisplayPowerController$DisplayControllerHandler w=1

 

 

关键日志2

11-19 20:41:21.495 1000 2834 18488 I dvm_lock_sample: 43,1751*,SystemSensorManager.java,183,boolean android.hardware.SystemSensorManager.registerListenerImpl(android.ha rdware.SensorEventListener, android.hardware.Sensor, int, android.os.Handler, intint),,183,boolean android.hardware.SystemSensorManager.registerListenerImpl(android.hardware.SensorEventListener, androi d.hardware.Sensor, int, android.os.Handler, intint),100">com.android.systemui,0,AsyncTask #43,1751,SystemSensorManager.java,183,boolean android.hardware.SystemSensorManager.registerListenerImpl(android.ha rdware.SensorEventListener, android.hardware.Sensor, int, android.os.Handler, intint),,183,boolean android.hardware.SystemSensorManager.registerListenerImpl(android.hardware.SensorEventListener, androi d.hardware.Sensor, int, android.os.Handler, intint),100

怀疑归怀疑,但是有没有直接的证据来证明是sensor注册慢导致的,光看这日志没法继续分析到底是哪里慢了

2.2复现分析

    在想办法复现的过程中,刚好被拉进一卡顿群,说有个卡顿的问题想找framework的同学帮忙看一下。那我就去看了一下,发现他们有一个大概率复现的方法。打开打开小爱同学(其实和小爱没关系),设置指纹解锁,然后不断的指纹解锁就能复现卡顿的问题了。

    我就用这个方法去复现,发现还真的很好复现,复现之后当然需要知道在哪里卡顿了,这时候十八般武艺可以轮番上阵了:

1.查看system_server卡住的原因

    既然是system_server卡顿,那么就在卡顿的时候抓system_server 的trace,具体命令为kill -3 pid(system_server的pid),然后我抓到的关键trace如下:

"main" prio=5 tid=1 Native

  | group="main" sCount=1 dsCount=0 flags=1 obj=0x7610b3c8 self=0x7153469c00

  | sysTid=1636 nice=-2 cgrp=default sched=0/0 handle=0x71d94d7ed0

  | state=S schedstat=( 12607677644 3411181337 21636 ) utm=905 stm=355 core=5 HZ=100

  | stack=0x7fc32bf000-0x7fc32c1000 stackSize=8192KB

  | held mutexes=

  kernel: (couldn't read /proc/self/task/1636/stack)

  native: #00 pc 00000000000c14a4  /apex/com.android.runtime/lib64/bionic/libc.so (__ioctl+4)

  native: #01 pc 000000000007bb2c  /apex/com.android.runtime/lib64/bionic/libc.so (ioctl+132)

  native: #02 pc 000000000008a338  /system/lib64/libhidlbase.so (android::hardware::IPCThreadState::transact(int, unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int)+1772)

  native: #03 pc 0000000000085824  /system/lib64/libhidlbase.so (android::hardware::BpHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+72)

  native: #04 pc 000000000000c8d8  /system/lib64/android.hardware.sensors@1.0.so (android::hardware::sensors::V1_0::BpHwSensors::_hidl_activate(android::hardware::IInterface*, android::hardware::details::HidlInstrumentor*, int, bool)+236)

  native: #05 pc 000000000001b8e4  /system/lib64/libsensorservice.so (android::SensorDevice::activateLocked(void*, intint)+784)

  native: #06 pc 000000000001ca78  /system/lib64/libsensorservice.so (android::SensorDevice::activate(void*, intint)+72)

  native: #07 pc 0000000000031530  /system/lib64/libsensorservice.so (android::SensorService::enable(android::sp const&, intlonglongint, android::String16 const&)+1972)

  native: #08 pc 0000000000026194  /system/lib64/libsensorservice.so (android::SensorService::SensorEventConnection::enableDisable(int, bool, longlongint)+216)

  native: #09 pc 00000000001a84f8  /system/lib64/libandroid_runtime.so ((anonymous namespace)::nativeEnableSensor(_JNIEnv*, _jclass*, longintintint)+116)

  at android.hardware.SystemSensorManager$BaseEventQueue.nativeEnableSensor(Native method)

  at android.hardware.SystemSensorManager$BaseEventQueue.enableSensor(SystemSensorManager.java:761)

  at android.hardware.SystemSensorManager$BaseEventQueue.addSensor(SystemSensorManager.java:686)

  at android.hardware.SystemSensorManager.registerListenerImpl(SystemSensorManager.java:192)

  - locked <0x088006b6> (a java.util.HashMap)

  at android.hardware.SensorManager.registerListener(SensorManager.java:823)

  at android.hardware.SensorManager.registerListener(SensorManager.java:730)

  at miui.util.ProximitySensorWrapper.registerListener(ProximitySensorWrapper.java:96)

  - locked <0x01fcf5a1> (a java.util.ArrayList)

  at com.android.server.policy.MiuiScreenOnProximityLock.aquire(MiuiScreenOnProximityLock.java:115)

  - locked <0x0f945324> (a com.android.server.policy.MiuiScreenOnProximityLock)

  at com.android.server.policy.BaseMiuiPhoneWindowManager.startedWakingUp(BaseMiuiPhoneWindowManager.java:424)

  at com.android.server.power.Notifier$2.run(Notifier.java:460)

  at android.os.Handler.handleCallback(Handler.java:883)

  at android.os.Handler.dispatchMessage(Handler.java:100)

  at android.os.Looper.loop(Looper.java:224)

  at com.android.server.SystemServer.run(SystemServer.java:556)

  at com.android.server.SystemServer.main(SystemServer.java:362)

  at java.lang.reflect.Method.invoke(Native method)

  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:539)

  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:928)

 

 

 

 

"PowerManagerService" prio=5 tid=34 Native

  | group="main" sCount=1 dsCount=0 flags=1 obj=0x143c1770 self=0x713aabf800

  | sysTid=1834 nice=-4 cgrp=default sched=0/0 handle=0x70d5919d50

  | state=S schedstat=( 1568883166 1516633035 9751 ) utm=101 stm=55 core=6 HZ=100

  | stack=0x70d5817000-0x70d5819000 stackSize=1039KB

  | held mutexes=

  kernel: (couldn't read /proc/self/task/1834/stack)

  native: #00 pc 000000000007067c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)

  native: #01 pc 0000000000073be8  /apex/com.android.runtime/lib64/bionic/libc.so (__futex_wait_ex(void volatile*, bool, int, bool, timespec const*)+140)

  native: #02 pc 00000000000d7a68  /apex/com.android.runtime/lib64/bionic/libc.so (NonPI::MutexLockWithTimeout(pthread_mutex_internal_t*, bool, timespec const*)+212)

  native: #03 pc 000000000002fb24  /system/lib64/libsensorservice.so (android::SensorService::createSensorEventConnection(android::String8 const&, int, android::String16 const&)+80)

  native: #04 pc 000000000001177c  /system/lib64/libsensor.so (android::SensorManager::createEventQueue(android::String8, int)+96)

  native: #05 pc 00000000001a81fc  /system/lib64/libandroid_runtime.so ((anonymous namespace)::nativeInitSensorEventQueue(_JNIEnv*, _jclass*, long, _jobject*, _jobject*, _jstring*, int)+160)

  at android.hardware.SystemSensorManager$BaseEventQueue.nativeInitBaseEventQueue(Native method)

  at android.hardware.SystemSensorManager$BaseEventQueue.(SystemSensorManager.java:666)

  at android.hardware.SystemSensorManager$SensorEventQueue.(SystemSensorManager.java:798)

  at android.hardware.SystemSensorManager.registerListenerImpl(SystemSensorManager.java:191)

  - locked <0x05548e8d> (a java.util.HashMap)

  at android.hardware.SensorManager.registerListener(SensorManager.java:823)

  at com.android.server.display.AutomaticBrightnessController.setLightSensorEnabled(AutomaticBrightnessController.java:511)

  at com.android.server.display.AutomaticBrightnessController.configure(AutomaticBrightnessController.java:367)

  at com.android.server.display.DisplayPowerController.updatePowerState(DisplayPowerController.java:976)

  at com.android.server.display.DisplayPowerController.access$600(DisplayPowerController.java:92)

  at com.android.server.display.DisplayPowerController$DisplayControllerHandler.handleMessage(DisplayPowerController.java:2027)

  at android.os.Handler.dispatchMessage(Handler.java:107)

  at android.os.Looper.loop(Looper.java:224)

  at android.os.HandlerThread.run(HandlerThread.java:67)

  at com.android.server.ServiceThread.run(ServiceThread.java:45)

我们能发现PowerMangerService在亮屏的时候去注册sensor卡住了,在等一个锁。那么我们继续看看这个锁被哪个线程拿着的,最后发现是main现场拿着这个锁。main线程拿着这个锁去注册activate 对应的sensor,那么我们就得去看看对端为什么卡住了。

 

2.查看[email protected]卡在的原因

   查看native进程的trace我们一般用的是debuggerd -b pid([email protected]进程的pid),我抓取的关键trace如下:

"HwBinder:804_1" sysTid=859

    #00 pc 00000000000c2aa8  /apex/com.android.runtime/lib64/bionic/libc.so (write+8) (BuildId: 084c8a81b8c78e19cd9a1ff6208e77cf)

    #01 pc 0000000000019eac  /vendor/lib64/libdiag.so (log_commit+68) (BuildId: 18b6916c12e26ffe8854638aa34a6050)

    #02 pc 0000000000005920  /vendor/lib64/libsnsdiaglog.so (sensors_diag_log::submit_log_packet(sensors_diag_logid, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>&)+380) (BuildId: 52c9f5d16b9d183648710f49b3da4510)

    #03 pc 000000000000531c  /vendor/lib64/libsnsdiaglog.so (sensors_diag_log::log_request_msg(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, char*)+556) (BuildId: 52c9f5d16b9d183648710f49b3da4510)

    #04 pc 00000000000444b4  /vendor/lib64/sensors.ssc.so (ssc_sensor::send_sensor_config_request()+1292) (BuildId: 761371c66dd776f38c4ae0949238b119)

    #05 pc 0000000000043ee0  /vendor/lib64/sensors.ssc.so (ssc_sensor::activate()+528) (BuildId: 761371c66dd776f38c4ae0949238b119)

    #06 pc 000000000003c0a4  /vendor/lib64/sensors.ssc.so (sensors_hal::activate(intint)+380) (BuildId: 761371c66dd776f38c4ae0949238b119)

    #07 pc 0000000000007d60  /vendor/lib64/hw/android.hardware.sensors@1.0-impl.so (android::hardware::sensors::V1_0::implementation::Sensors::activate(int, bool)+28) (BuildId: df21e23627085ff60f326db9a312a9ce)

    #08 pc 000000000000f9d4  /system/lib64/vndk-29/android.hardware.sensors@1.0.so (android::hardware::sensors::V1_0::BnHwSensors::_hidl_activate(android::hidl::base::V1_0::BnHwBase*, android::hardware::Parcel const&, android::hardware::Parcel*, std::__1::function<void (android::hardware::Parcel&)>)+204) (BuildId: 6fcc0903c06dcd887cd709c04289ca69)

    #09 pc 0000000000010c18  /system/lib64/vndk-29/android.hardware.sensors@1.0.so (android::hardware::sensors::V1_0::BnHwSensors::onTransact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+948) (BuildId: 6fcc0903c06dcd887cd709c04289ca69)

    #10 pc 0000000000084dd8  /system/lib64/vndk-sp-29/libhidlbase.so (android::hardware::BHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+72) (BuildId: 3684cb22817553ecd1e96bdb428ff263)

    #11 pc 0000000000088778  /system/lib64/vndk-sp-29/libhidlbase.so (android::hardware::IPCThreadState::getAndExecuteCommand()+1040) (BuildId: 3684cb22817553ecd1e96bdb428ff263)

    #12 pc 0000000000089990  /system/lib64/vndk-sp-29/libhidlbase.so (android::hardware::IPCThreadState::joinThreadPool(bool)+96) (BuildId: 3684cb22817553ecd1e96bdb428ff263)

    #13 pc 00000000000988d0  /system/lib64/vndk-sp-29/libhidlbase.so (android::hardware::PoolThread::threadLoop()+24) (BuildId: 3684cb22817553ecd1e96bdb428ff263)

    #14 pc 00000000000135f0  /system/lib64/vndk-sp-29/libutils.so (android::Thread::_threadLoop(void*)+328) (BuildId: ae8cf46c4124b6f522dd7de2efc17c1c)

    #15 pc 00000000000d6cb0  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+36) (BuildId: 084c8a81b8c78e19cd9a1ff6208e77cf)

    #16 pc 0000000000074eac  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) (BuildId: 084c8a81b8c78e19cd9a1ff6208e77cf)

我们用addr2line工具来找到对应的代码行号:

pzc@pzc-OptiPlex-7050:~/sd/下载/out/raphael-symbols-9.11.16-internal-8ad115968f/

out/target/product/raphael/symbols/vendor/lib64$ addr2line -f -e libdiag.so 0000

000000019eac

_ZL5writeiPKvU17pass_object_size0m

out/soong/.intermediates/bionic/libc/libc.llndk/android_arm64_armv8-a_vendor_shared/gen/include/bits/fortify/unistd.h:201

pzc@pzc-OptiPlex-7050:~/sd/下载/out/raphael-symbols-9.11.16-internal-8ad115968f/

out/target/product/raphael/symbols/vendor/lib64$ addr2line -f -e libsnsdiaglog.s

0000000000005920

_ZN16sensors_diag_log17submit_log_packetE18sensors_diag_logidRNSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEE

vendor/qcom/proprietary/commonsys-intf/sensors-see/sensors-diag-log/src/sensor_diag_log.cpp:304

我们找到对应的代码:

sensor_diag_log.cpp

263sensors_diag_log_return_code sensors_diag_log::submit_log_packet(

264    sensors_diag_logid internal_log_id,

265    string &pb_encoded_sns_diag_sensor_log)

266{

       .......

304    log_commit(log_pkt);

       .......

309  return return_code;

310}

找到log_commit定义的地方:

diag_lsm_log.c

333void log_commit (void *ptr)

334{

335  if (ptr)

336  {

337      log_header_type *phdr_ptr = NULL;

338      log_commit_last = (void *) ptr;

339    /* Set pointer to begining of diag pkt, not the log */

340      phdr_ptr = (log_header_type *)ptr;

341      ptr = ((byte *) ptr - (LOG_DIAGPKT_OFFSET+DIAG_REST_OF_DATA_POS)); /* WM7 prototyping */

342      if(-1 != diag_fd)

343      {

344       int NumberOfBytesWritten = 0;

345    if((NumberOfBytesWritten = write(diag_fd, (const void*) ptr, DIAG_REST_OF_DATA_POS+LOG_DIAGPKT_OFFSET + phdr_ptr->len)) != 0/*TODO: Check the Numberofbyteswritten against number of bytes we wanted to write?*/

346       {

347     DIAG_LOGE("Diag_LSM_log: Write failed in %s, bytes written: %d, error: %d\n",

348          __func__, NumberOfBytesWritten, errno);

349     log_commit_to_cs_fail++;

350       }

351       DiagSvc_Free(ptr, GEN_SVC_ID);

352      }

353  }

354  return;

355}

我们通过相关的日志还有trace可以定位到是write超时了,一般通过linux系统调用超时了的情况我们可以猜到是这个线程进入了D状态,那么我们继续看看write在kernel中卡在了哪里,首先我们先找到[email protected]进程,然后去看它有哪些子线程

raphael:/ # ps -ef | grep sensor                                                                                                                                                                          

system         604     1 0 16:12:41 ?     00:00:00 sscrpcd sensorspd

system         775     1 0 16:12:41 ?     00:00:37 android.hardware.sensors@1.0-service

system         802     1 0 16:12:41 ?     00:00:00 vendor.qti.hardware.sensorscalibrate@1.0-service

system         820     1 0 16:12:41 ?     00:00:11 vendor.xiaomi.hardware.citsensorservice@1.1-service

system         926     1 0 16:12:42 ?     00:00:02 sensors.qti

root         13571 19214 2 10:18:01 pts/1 00:00:00 grep sensor

raphael:/ # cd /pr

proc/             product/          product_services

raphael:/ # cd /proc/775/task/                                                                                                                                                                            

raphael:/proc/775/task # ls

10249 10251 10253 13307 13308 13309 13321 13322 13323 1842 1845 1846 1847 1848 1849 1850 5737 5738 5740 775 840 842 848

我们去复现问题,然后分别查看tid比较小的线程(一部分是hwbinder线程),然后再看对应的线程状态

cat 1850/stack && cat 5737/stack && cat 5738/stack && cat 5740/stack && cat 775/stack && cat 840/stack && cat 842/stack && cat 848/stack && ps -AT | grep " D "                                                               

[<0000000000000000>] __switch_to+0xd8/0xf8

[<0000000000000000>] diagchar_read+0x134/0x1640

[<0000000000000000>] __vfs_read+0x44/0x128

[<0000000000000000>] vfs_read+0xa0/0x138

[<0000000000000000>] SyS_read+0x54/0xb8

[<0000000000000000>] el0_svc_naked+0x34/0x38

[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8

[<0000000000000000>] futex_wait_queue_me+0xbc/0x108

[<0000000000000000>] futex_wait+0x278/0x448

[<0000000000000000>] do_futex+0x138/0x17e0

[<0000000000000000>] SyS_futex+0x130/0x1b0

[<0000000000000000>] el0_svc_naked+0x34/0x38

[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8

[<0000000000000000>] futex_wait_queue_me+0xbc/0x108

[<0000000000000000>] futex_wait+0x278/0x448

[<0000000000000000>] do_futex+0x138/0x17e0

[<0000000000000000>] SyS_futex+0x130/0x1b0

[<0000000000000000>] el0_svc_naked+0x34/0x38

[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8

[<0000000000000000>] futex_wait_queue_me+0xbc/0x108

[<0000000000000000>] futex_wait+0x278/0x448

[<0000000000000000>] do_futex+0x138/0x17e0

[<0000000000000000>] SyS_futex+0x130/0x1b0

[<0000000000000000>] el0_svc_naked+0x34/0x38

[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8

[<0000000000000000>] futex_wait_queue_me+0xbc/0x108

[<0000000000000000>] futex_wait+0x278/0x448

[<0000000000000000>] do_futex+0x138/0x17e0

[<0000000000000000>] SyS_futex+0x130/0x1b0

[<0000000000000000>] el0_svc_naked+0x34/0x38

[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8

[<0000000000000000>] futex_wait_queue_me+0xbc/0x108

[<0000000000000000>] futex_wait+0x278/0x448

[<0000000000000000>] do_futex+0x138/0x17e0

[<0000000000000000>] SyS_futex+0x130/0x1b0

[<0000000000000000>] el0_svc_naked+0x34/0x38

[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8

[<0000000000000000>] do_sigtimedwait+0x158/0x338

[<0000000000000000>] SyS_rt_sigtimedwait+0xd0/0x178

[<0000000000000000>] el0_svc_naked+0x34/0x38

[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8

[<0000000000000000>] futex_wait_queue_me+0xbc/0x108

[<0000000000000000>] futex_wait+0x278/0x448

[<0000000000000000>] do_futex+0x138/0x17e0

[<0000000000000000>] SyS_futex+0x130/0x1b0

[<0000000000000000>] el0_svc_naked+0x34/0x38

[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8

[<0000000000000000>] diagchar_write+0xcf4/0x1538

[<0000000000000000>] __vfs_write+0x44/0x130

[<0000000000000000>] vfs_write+0xc8/0x188

[<0000000000000000>] SyS_write+0x54/0xb8

[<0000000000000000>] el0_svc_naked+0x34/0x38

[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8

[<0000000000000000>] do_sigtimedwait+0x158/0x338

[<0000000000000000>] SyS_rt_sigtimedwait+0xd0/0x178

[<0000000000000000>] el0_svc_naked+0x34/0x38

[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8

[<0000000000000000>] diagchar_write+0x7a0/0x1538

[<0000000000000000>] __vfs_write+0x44/0x130

[<0000000000000000>] vfs_write+0xc8/0x188

[<0000000000000000>] SyS_write+0x54/0xb8

[<0000000000000000>] el0_svc_naked+0x34/0x38

[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8

[<0000000000000000>] futex_wait_queue_me+0xbc/0x108

[<0000000000000000>] futex_wait+0x278/0x448

[<0000000000000000>] do_futex+0x138/0x17e0

[<0000000000000000>] SyS_futex+0x130/0x1b0

[<0000000000000000>] el0_svc_naked+0x34/0x38

[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8

[<0000000000000000>] futex_wait_queue_me+0xbc/0x108

[<0000000000000000>] futex_wait+0x278/0x448

[<0000000000000000>] do_futex+0x138/0x17e0

[<0000000000000000>] SyS_futex+0x130/0x1b0

[<0000000000000000>] el0_svc_naked+0x34/0x38

[<0000000000000000>] 0xffffffffffffffff

[<0000000000000000>] __switch_to+0xd8/0xf8

[<0000000000000000>] do_sys_poll+0x5c4/0x700

[<0000000000000000>] SyS_ppoll+0x174/0x268

[<0000000000000000>] el0_svc_naked+0x34/0x38

[<0000000000000000>] 0xffffffffffffffff

system         775   775     1  476464   6812 diagchar_write      0 D sensors@1.0-ser

system         775  1842     1  476464   6812 diagchar_read       0 D sensors@1.0-ser

system         820  1548     1  197708   7136 diagchar_read       0 D citsensorservic

system         926   944     1  124940   4252 diagchar_read       0 D sensors.qti

system        1071  1079     1   50644   4660 diagchar_read       0 D cnss_diag

radio         1441  1484     1 1440532  29140 diagchar_read       0 D qcrild

radio         1456  1485     1 1298900  29172 diagchar_read       0 D qcrild

radio         2317  2325     1  182856   3584 diagchar_read       0 D ims_rtp_daemon

shell        14636 14636     1   39684   3988 diagfwd_write       0 D diag_mdlog_syst

shell        16121 16121     1   30352   2568 diagchar_open       0 D diag_mdlog_syst

从这个日志中我们能看到几个信息:

[email protected] 有两个现场进入D状态,一个现场是在write,这个已经可以解释我们系统为什么卡了(因为这里卡住了,导致system_server那边也卡住了,最终系统卡住了)

2.我们还能看到其他的几个线程都是在diagchar_read/diagchar_write的时候进入D状态,还有一个是diagchar_open,并且这个进程是shell进程启动的

查看对应的kernel日志:

[238325.121834] init: Received control message 'start' for 'diag_mdlog_stop' from pid: 16096 (start diag_mdlog_stop)

[238325.127831] init: Received control message 'start' for 'diag_mdlog_stop' from pid: 16097 (start diag_mdlog_stop)

[238325.163024] init: Received control message 'start' for 'diag_mdlog_start' from pid: 16099 (start diag_mdlog_start)

[238325.171552] init: Received control message 'start' for 'diag_mdlog_start' from pid: 16100 (start diag_mdlog_start)

[238330.007418] init: Received control message 'start' for 'diag_mdlog_stop' from pid: 16123 (start diag_mdlog_stop)

[238330.039798] init: Received control message 'start' for 'diag_mdlog_start' from pid: 16124 (start diag_mdlog_start)

[238332.430154] init: Service 'diag_mdlog_stop' (pid 16121) exited with status 0

[238332.439974] init: Service 'diag_mdlog_start' (pid 14636) exited with status 0

我们发现这里在频繁的执行“start diag_mdlog_start”,“start diag_mdlog_stop”(不太正常),这两个命令分别是抓子系统命令的日志,通过这两个关键字查找到对应的代码:

@Override

public void onEventOccur(ExceptionEvent event) {

    if (event == null) {

        Utils.logE(TAG, "event is null!");

        return;

    }

    FeatureEvent featureEvent = (FeatureEvent)event;

    if (FEATURE_SUBCLASS.contains(featureEvent.getSubClass())) {

        ThreadPool.execute(new Runnable() {

            @Override

            public void run() {

                SystemProperties.set("debug.offline_log.enable""true");

                SystemProperties.set("debug.offline_log.module""sensor");

                CaptureLogUtils.getInstance().runCommand("start""diag_mdlog_start"0);

                try {

                    Thread.sleep(50000);

                catch (InterruptedException e) {

                    Utils.logD(TAG, "captureSensorLog failed: " + e.getMessage());

                }

                SystemProperties.set("debug.offline_log.enable""false");

                SystemProperties.set("debug.offline_log.module""null");

                CaptureLogUtils.getInstance().runCommand("start""diag_mdlog_stop"0);

 

                Handler h = mManager.getWorkHandler();

                Message msg = h.obtainMessage(MQSEventManager.MSG_HANDLE_FEATURE_EVENTS, 00, event);

                h.sendMessage(msg);

            }

        });

    }

}

到了这里我们就已经大概可以猜到问题的原因了,肯定是这个重复触发抓diag日志功能引起的问题。那么我们继续分析刚才我们提到的两个问题:

[email protected] 有两个现场进入D状态,一个现场是在write,这个已经可以解释我们系统为什么卡了(因为这里卡住了,导致system_server那边也卡住了,最终系统卡住了)

首先我们通过对应的trace看write在kernel中到底卡在了哪里:

[<0000000000000000>] __switch_to+0xd8/0xf8

[<0000000000000000>] diagchar_write+0x7a0/0x1538

[<0000000000000000>] __vfs_write+0x44/0x130

[<0000000000000000>] vfs_write+0xc8/0x188

[<0000000000000000>] SyS_write+0x54/0xb8

[<0000000000000000>] el0_svc_naked+0x34/0x38

这两个trace是D状态的,那么我们下载对应的vmlinux然后用gdb看一下到底卡在了哪里:

pzc@pzc-OptiPlex-7050:~/code/J11/prebuilts/gdb/linux-x86/bin$ ./gdb

GNU gdb (GDB) 7.11

Copyright (C) 2016 Free Software Foundation, Inc.

License GPLv3+: GNU GPL version 3 or later //gnu.org/licenses/gpl.html>

This is free software: you are free to change and redistribute it.

There is NO WARRANTY, to the extent permitted by law.  Type "show copying"

and "show warranty" for details.

This GDB was configured as "x86_64-linux-gnu".

Type "show configuration" for configuration details.

For bug reporting instructions, please see:

//www.gnu.org/software/gdb/bugs/>.

Find the GDB manual and other documentation resources online at:

//www.gnu.org/software/gdb/documentation/>.

For help, type "help".

Type "apropos word" to search for commands related to "word".

(gdb) file /home/pzc/sd/vmlinux-811877d

Reading symbols from /home/pzc/sd/vmlinux-811877d...done.

(gdb) list diagchar_write+0x7a0/0x1538

Function "diagchar_write+0x7a0/0x1538" not defined.

(gdb) list diagchar_write+0x7a0

Function "diagchar_write+0x7a0" not defined.

(gdb) list *diagchar_write+0x7a0

0xffffff800863e210 is in diagchar_write (/home/work/raphael-q-dev-build/kernel/msm-4.14/drivers/char/diag/diagchar_core.c:3577).

3572    /home/work/raphael-q-dev-build/kernel/msm-4.14/drivers/char/diag/diagchar_core.c: 没有那个文件或目录.

(gdb) list *diagchar_read+0xf4

0xffffff800863c524 is in diagchar_read (/home/work/raphael-q-dev-build/kernel/msm-4.14/drivers/char/diag/diagchar_core.c:3632).

3627    in /home/work/raphael-q-dev-build/kernel/msm-4.14/drivers/char/diag/diagchar_core.c

(gdb)

我们反编译后找到write时和read时分别卡在了diagchar_core.c 3577行和diagchar_core.c 3632行,对应的代码为:

diagchar_core.c

3518static int diag_user_process_apps_data(const char __user *buf, int len,

3519                       int pkt_type)

3520{

        ......

3577    mutex_lock(&driver->apps_data_mutex);

        ......

3600    return 0;

3601}

因为光有这个日志不同好分析谁持有锁,所以我们需要触发一个ramdump再继续分析

2.3 Kernel ramdump分析

光通过日志可能不太好分析,我们再复现触发一个ramdump进行分析:

randump加载参考:RAMDUMP抓取方法

1.加载ramdump

crash vmlinux DDRCS0_0.BIN@0x0000000080000000,DDRCS0_1.BIN@0x

0000000100000000,DDRCS1_0.BIN@0x0000000180000000,DDRCS1_1.BIN@0x0000000200000000 --kaslr 0x000000074f800000

 

 

2.查看D状态进程

crash> ps | grep UN

     76      2   2  ffffffe0336c0000  UN   0.0       0      0  [kworker/u16:1]

    249      2   0  ffffffe030f40000  UN   0.0       0      0  [kworker/u16:5]

    340      2   1  ffffffe032fef000  UN   0.0       0      0  [kworker/u16:9]

    364      2   3  ffffffe02b07c000  UN   0.0       0      0  [kworker/u16:12]

   1134      1   4  ffffffdfff54f000  UN   0.1   50644   5980  cnss_diag

   1519      1   6  ffffffdfd9d08000  UN   0.4 1299028  43208  qcrild

   1600      1   4  ffffffdfc14c7000  UN   0.1  476464   9640  sensors@1.0-ser

   1624      1   6  ffffffdfc1549000  UN   0.1  197708   8860  citsensorservic

   2377      1   0  ffffffdf76f6a000  UN   0.1  182856   5872  ims_rtp_daemon

   5441      1   1  ffffffdf39246000  UN   0.1  476464   9640  HwBinder:789_1

   7756      1   3  ffffffdf79799000  UN   0.0  108400   3844  diag_mdlog_syst

   7764      1   6  ffffffdf87ad3000  UN   0.0  108400   3844  diag_mdlog_syst

 

 

3.查看HwBinder:789_1 D状态现场trace:

crash> bt 5441

PID: 5441   TASK: ffffffdf39246000  CPU: 1   COMMAND: "HwBinder:789_1"

 #0 [ffffff801913bb70] __switch_to at ffffff8757886b64

 #1 [ffffff801913bc10] __schedule at ffffff8758a56ba8

 #2 [ffffff801913bc30] schedule_preempt_disabled at ffffff8758a570d0

 #3 [ffffff801913bcb0] __mutex_lock at ffffff8758a58b88

 #4 [ffffff801913bcc0] __mutex_lock_slowpath at ffffff8758a581ac

 #5 [ffffff801913bcd0] mutex_lock at ffffff8758a58194

 #6 [ffffff801913bda0] diagchar_write at ffffff8757e3e20c

 #7 [ffffff801913be30] __vfs_write at ffffff8757a72898

 #8 [ffffff801913be70] vfs_write at ffffff8757a72c54

 #9 [ffffff801913beb0] sys_write at ffffff8757a72e20

#10 [ffffff801913bff0] el0_svc_naked at ffffff875788487c

     PC: 0000007ba1887aa8   LR: 0000007ba0cc6eb0   SP: 0000007b92c3d8e0

    X29: 0000007b92c3d900  X28: 0000007ba1321e80  X27: 0000007ba067000c

    X26: 0000007ba083a028  X25: 0000007ba1230440  X24: 0000007b92c3e020

    X23: 0000007ba0ab9118  X22: 00000000b630e2ae  X21: 0000000000000078

    X20: 0000007b9f251808  X19: 0000007b9f251800  X18: 0000007b90abe000

    X17: 0000007ba1887aa0  X16: 0000007ba0cd9930  X15: 0000c93977e960fb

    X14: 00245bdc48d5ee00  X13: 656d6f6465700912  X12: 0000007b9f23bc00

    X11: 001a00000000b630  X10: bd9711000000810d   X9: 0000000000000084

     X8: 0000000000000040   X7: e1d98509120a2606   X6: aa726574656d6f64

     X5: 0000007b9f25188c   X4: 0000007ba12277f8   X3: 0000007b9f251810

     X2: 000000000000008c   X1: 0000007b9f251800   X0: 000000000000000d

    ORIG_X0: 000000000000000d  SYSCALLNO: 40  PSTATE: 00000000

 

 

4.查看对应的持锁点对应的代码:

crash> sym ffffff8757e3e20c

ffffff8757e3e20c (t) diagchar_write+1948 /home/pzc/code/F11Q/kernel/msm-4.14/drivers/char/diag/diagchar_core.c: 3576

持锁点为:

mutex_lock(&apps_data_mutex);

 

 

5.查找谁持有了这个锁

apps_data_mutex的定义:

static struct mutex apps_data_mutex;

这是一个全局的锁,我们可以这样看谁持有了这个锁:

crash> p apps_data_mutex

apps_data_mutex = $1 = {

  owner = {

    counter = -136948846591

  },

  wait_lock = {

    {

      rlock = {

        raw_lock = {

          owner = 92,

          next = 92

        }

      }

    }

  },

  osq = {

    tail = {

      counter = 0

    }

  },

  wait_list = {

    next = 0xffffff801913bc48,

    prev = 0xffffff801913bc48

  }

}

 

我们看到owner对应的结构体: counter = -136948846591,这个就是对应的持锁的owner线程,-136948846591需要用16进制查看:

printf %x -136948846591

ffffffe01d367001

 

 

查看对应task_struct结构体:

struct task_struct ffffffe01d367000

struct task_struct {

  ......

  pid = 789,

  tgid = 789,

  ......

}

这里就是说789这个线程持有了锁,那么我们看一下789这个线程再干嘛:

crash> bt 789

PID: 789    TASK: ffffffe01d367000  CPU: 1   COMMAND: "[email protected]"

 #0 [ffffff800c92bc10] __switch_to at ffffff8757886b64

 #1 [ffffff800c92bcb0] __schedule at ffffff8758a56ba8

 #2 [ffffff800c92bcd0] schedule at ffffff8758a56ff4

 #3 [ffffff800c92bda0] diagchar_write at ffffff8757e3e760

 #4 [ffffff800c92be30] __vfs_write at ffffff8757a72898

 #5 [ffffff800c92be70] vfs_write at ffffff8757a72c54

 #6 [ffffff800c92beb0] sys_write at ffffff8757a72e20

 #7 [ffffff800c92bff0] el0_svc_naked at ffffff875788487c

     PC: 0000007ba1887aa8   LR: 0000007ba0cc6eb0   SP: 0000007fd405cff0

    X29: 0000007fd405d010  X28: 0000000000000000  X27: 0000000000000001

    X26: 0000000000000000  X25: 0000007ba1230440  X24: 0000007ba204b020

    X23: 0000007ba0ab9118  X22: 00000000b619242b  X21: 0000000000000077

    X20: 0000007b9f251008  X19: 0000007b9f251000  X18: 0000007ba2aec000

    X17: 0000007ba1887aa0  X16: 0000007ba0cd9930  X15: 0000b7fc04297d3a

    X14: 001f972848ddbe00  X13: 6d69786f72700912  X12: 0000007b9f23d400

    X11: 0008020a04220010  X10: 0108041a00000202   X9: 0000000000000083

     X8: 0000000000000040   X7: 6d6109120a2506a2   X6: 7974696d69786f72

     X5: 0000007b9f25108b   X4: 0000007ba1227777   X3: 0000007b9f251010

     X2: 000000000000008b   X1: 0000007b9f251000   X0: 000000000000000d

    ORIG_X0: 000000000000000d  SYSCALLNO: 40  PSTATE: 00000000

 

 

继续找到schedule前在干嘛:

crash> sym ffffff8757e3e760

ffffff8757e3e760 (t) diagchar_write+3312 /home/pzc/code/F11Q/kernel/msm-4.14/drivers/char/diag/diagchar_core.c: 3070

 

static int diag_process_apps_data_hdlc(unsigned char *buf, int len,

                   int pkt_type)

{

   ......

   wait_event_interruptible(driver->hdlc_wait_q,

         (data->flushed == 0));

   ......

}

因为这里是需要等待唤醒,唤醒的条件是data->flushed=0,所以我得去调查哪里设置成了1,哪里设置成了0.我加了各种日志发现正常情况下data->flushed是等于0的,但是有一个定时的任务会把它改成1,当diagfwd_mux_write_done之后才会把它改成0.

现在我们又需要去调查什么时候才会diagfwd_mux_write_done,加了了dump_backtrace看了下,发现diagchar_close的时候会diagfwd_mux_write_done。

139.272177] Call trace:

139.272206] dump_backtrace+0x0/0x268

139.272215] show_stack+0x14/0x20

139.272231] dump_stack+0xc4/0x100

139.272244] diagfwd_mux_write_done+0x1f0/0x2f8

139.272254] diag_md_close_peripheral+0x118/0x138

139.272261] diag_mux_close_peripheral+0x8c/0xa8

139.272268] diag_remove_client_entry+0x4d0/0x538

139.272275] diagchar_close+0x50/0x60

139.272287] __fput+0xbc/0x1b8

这个时候我们就得去看看为什么没有及时的调用到diagchar_close。加了很多日志发现一个可疑的卡顿点:

[  180.092586] pzc diag: In diagfwd_write, diagfwd_write usleep_range----- err = -12, retry_count=0

[  180.408500] pzc diag: In diagfwd_write, diagfwd_write usleep_range----- err = -12, retry_count=0

[  180.718953] pzc diag: In diagfwd_write, diagfwd_write usleep_range----- err = -12, retry_count=0

[  181.032199] pzc diag: In diagfwd_write, diagfwd_write usleep_range----- err = -12, retry_count=0

[  181.337216] pzc diag: In diagfwd_write, diagfwd_write usleep_range----- err = -11, retry_count=0

[  181.640242] pzc diag: In diagfwd_write, diagfwd_write usleep_range----- err = -11, retry_count=0

[  181.956562] pzc diag: In diagfwd_write, diagfwd_write usleep_range----- err = -12, retry_count=0

[  182.269410] pzc diag: In diagfwd_write, diagfwd_write usleep_range----- err = -12, retry_count=0

[  182.599262] pzc diag: In diagfwd_write, diagfwd_write usleep_range----- err = -12, retry_count=0

对应的代码:

int diagfwd_write(uint8_t peripheral, uint8_t type, void *buf, int len)

{

    

   ......

   while (retry_count < max_retries) {

      err = 0;

      err = fwd_info->p_ops->write(fwd_info->ctxt, buf, len);

      if (err && err != -ENODEV) {

         usleep_range(100000101000);

         pr_err("pzc diag: In %s, diagfwd_write usleep_range----- err = %d, retry_count=%d\n", __func__, err, retry_count);

         retry_count++;

         continue;

      }

      break;

   }

   ......

}

我们在ramdump中也找到了对应的调用栈:

crash> bt 7756

PID: 7756   TASK: ffffffdf79799000  CPU: 3   COMMAND: "diag_mdlog_syst"

 #0 [ffffff800d26b7f0] __switch_to at ffffff8757886b64

 #1 [ffffff800d26b890] __schedule at ffffff8758a56ba8

 #2 [ffffff800d26b8b0] schedule at ffffff8758a56ff4

 #3 [ffffff800d26b950] schedule_timeout at ffffff8758a5b42c

 #4 [ffffff800d26b9c0] wait_for_common at ffffff8758a57ecc

 #5 [ffffff800d26b9d0] wait_for_completion_timeout at ffffff8758a57f9c

 #6 [ffffff800d26ba90] __glink_spi_send at ffffff87586e8660

 #7 [ffffff800d26baa0] glink_spi_send at ffffff87586e83b4

 #8 [ffffff800d26bab0] rpmsg_send at ffffff87586e0038

 #9 [ffffff800d26baf0] diag_rpmsg_write at ffffff8757e4a7f8

#10 [ffffff800d26bb50] diagfwd_write at ffffff8757e451c8

#11 [ffffff800d26bbf0] diag_send_msg_mask_update at ffffff8757e59c04

#12 [ffffff800d26bc60] diag_cmd_set_all_msg_mask at ffffff8757e5c6e4

#13 [ffffff800d26bc70] diag_process_apps_masks at ffffff8757e5aae4

#14 [ffffff800d26bcd0] diag_process_apps_pkt at ffffff8757e419ec

#15 [ffffff800d26bda0] diagchar_write at ffffff8757e3e468

#16 [ffffff800d26be30] __vfs_write at ffffff8757a72898

#17 [ffffff800d26be70] vfs_write at ffffff8757a72c54

#18 [ffffff800d26beb0] sys_write at ffffff8757a72e20

#19 [ffffff800d26bff0] el0_svc_naked at ffffff875788487c

经过上面的分析我们大概知道了整个流程:

Sensor记录日志导致的亮屏慢问题分析_第1张图片

 

1.首先diagchar驱动初始化的时候有一个work会定时将flushed设置为1.

2.我们的diag_mdlog_system在记录日志时调用diagchar_write卡顿,导致不能及时调用diagchar_close将flushed设置为0,并且调用wake_up_interruptible

3.这时候system_server注册sensor时向[email protected]发起binder call去注册,注册时会调用diagchar_write,因为之前flushed已经被设置为了1所以[email protected]就会停在wait_event_interruptible这里,并进入D状态,导致系统卡住,直到diagchar_close结束

 

你可能感兴趣的:(c/c++,安卓系统,kernel)