This series records how I tracked down tricky problems encountered in day-to-day work. Please credit the source when reposting.
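Background: Android 7.1 + Linux 3.10
Symptom: after a night of suspend/resume stress testing, the system had restarted. Every machine in the test reproduced it.
I pulled the logs from the time of the failure. The kernel log shows only that, before zygote and the other system processes were killed, binder communication failed for several processes and the bootanim service was started. Nothing more can be read from it.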
02-20 02:58:07.828 I/binder ( 0): send failed reply for transaction 2785419 to 1049:1064
02-20 02:58:07.829 I/binder ( 0): send failed reply for transaction 2785424 to 1203:1274
02-20 02:58:07.829 I/binder ( 0): send failed reply for transaction 2785429 to 1720:2201
02-20 02:58:07.829 I/binder ( 0): undelivered transaction 2785440
02-20 02:58:07.830 I/alarm_release( 0): clear alarm, pending 0
02-20 02:58:07.830 I/alarm_release( 0): clear alarm, pending 0
02-20 02:58:07.831 I/binder ( 0): 1049:1049 transaction failed 29189, size 116-0
02-20 02:58:07.832 I/binder ( 0): 1049:1049 transaction failed 29189, size 168-8
02-20 02:58:07.833 I/binder ( 0): 1753:1849 transaction failed 29189, size 68-0
02-20 02:58:07.852 I/init ( 0): Starting service 'bootanim'...
02-20 02:58:07.852 I/binder ( 0): 1807:27628 transaction failed 29189, size 60-0
02-20 02:58:07.853 I/binder ( 0): 1049:1049 transaction failed 29189, size 160-0
02-20 02:58:09.718 I/init ( 0): Service 'zygote' (pid 640) killed by signal 9
02-20 02:58:09.718 I/init ( 0): Service 'zygote' (pid 640) killing any children in process group
02-20 02:58:09.719 I/init ( 0): write_file: Unable to open '/sys/android_power/request_state': No such file or directory
02-20 02:58:09.719 I/init ( 0): write_file: Unable to write to '/sys/power/state': Invalid argument
02-20 02:58:09.720 E/init ( 0): Service 'audioserver' is being killed...
02-20 02:58:09.721 E/init ( 0): Service 'cameraserver' is being killed...
02-20 02:58:09.721 E/init ( 0): Service 'media' is being killed...
02-20 02:58:09.721 E/init ( 0): Service 'netd' is being killed...
02-20 02:58:09.724 I/init ( 0): Starting service 'zygote'...
02-20 02:58:09.748 I/init ( 0): Service 'netd' (pid 650) killed by signal 9
02-20 02:58:09.748 I/init ( 0): Service 'netd' (pid 650) killing any children in process group
02-20 02:58:09.749 I/init ( 0): Service 'audioserver' (pid 641) killed by signal 9
02-20 02:58:09.749 I/init ( 0): Service 'audioserver' (pid 641) killing any children in process group
02-20 02:58:09.750 I/init ( 0): Service 'cameraserver' (pid 643) killed by signal 9
02-20 02:58:09.750 I/init ( 0): Service 'cameraserver' (pid 643) killing any children in process group
02-20 02:58:09.751 I/init ( 0): Service 'media' (pid 644) killed by signal 9
02-20 02:58:09.751 I/init ( 0): Service 'media' (pid 644) killing any children in process group
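Next, logcat. The failure happens right after wakeup: PMS (PowerManagerService in the log below) and the other system services all die. Matching the process IDs of the failed binder transactions in the kernel log confirms that communication failed because the peer processes were dead. Judging from the stack traces, these errors should not normally occur, and the message "The system died; earlier logs will point to the root cause" supports that view: this is not where the failure originated.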
02-20 02:58:05.454 I/power_screen_broadcast_done( 714): [0,17,1]
02-20 02:58:07.660 I/PowerManagerService( 714): Waking up from sleep (uid 1000)...
02-20 02:58:07.665 E/AndroidRuntime( 714): *** FATAL EXCEPTION IN SYSTEM PROCESS: PowerManagerService
02-20 02:58:07.665 E/AndroidRuntime( 714): java.lang.RuntimeException: failed to set system property
02-20 02:58:07.665 E/AndroidRuntime( 714): at android.os.SystemProperties.native_set(Native Method)
02-20 02:58:07.665 E/AndroidRuntime( 714): at android.os.SystemProperties.set(SystemProperties.java:130)
02-20 02:58:07.831 W/Binder ( 1049): Binder call failed.
02-20 02:58:07.831 W/Binder ( 1049): DeadSystemException: The system died; earlier logs will point to the root cause
02-20 02:58:07.849 I/Process ( 1203): Sending signal. PID: 1203 SIG: 9
02-20 02:58:07.851 W/Sensors ( 1049): sensorservice died [0x7955fb2c60]
02-20 02:58:07.851 W/AudioFlinger( 641): power manager service died !!!
02-20 02:58:07.853 V/NotificationListenerService[]( 1049): Unable to contact notification manager
02-20 02:58:07.853 V/NotificationListenerService[]( 1049): android.os.DeadObjectException
02-20 02:58:07.853 V/NotificationListenerService[]( 1049): at android.os.BinderProxy.transactNative(Native Method)
02-20 02:58:07.853 V/NotificationListenerService[]( 1049): at android.os.BinderProxy.transact(Binder.java:615)
02-20 02:58:07.853 V/NotificationListenerService[]( 1049): at android.app.INotificationManager$Stub$Proxy.setNotificationsShownFromListener(INotificationManager.java:1206)
02-20 02:58:07.853 V/NotificationListenerService[]( 1049): at android.service.notification.NotificationListenerService.setNotificationsShown(NotificationListenerService.java:465)
02-20 02:58:07.853 V/NotificationListenerService[]( 1049): at com.android.systemui.statusbar.BaseStatusBar.setNotificationsShown(BaseStatusBar.java:908)
02-20 02:58:07.853 V/NotificationListenerService[]( 1049): at com.android.systemui.statusbar.phone.PhoneStatusBar.logNotificationVisibilityChanges(PhoneStatusBar.java:3827)
02-20 02:58:07.853 V/NotificationListenerService[]( 1049): at com.android.systemui.statusbar.phone.PhoneStatusBar.-wrap10(PhoneStatusBar.java)
02-20 02:58:07.853 V/NotificationListenerService[]( 1049): at com.android.systemui.statusbar.phone.PhoneStatusBar$7.run(PhoneStatusBar.java:646)
02-20 02:58:07.853 V/NotificationListenerService[]( 1049): at android.os.Handler.handleCallback(Handler.java:751)
02-20 02:58:07.853 V/NotificationListenerService[]( 1049): at android.os.Handler.dispatchMessage(Handler.java:95)
02-20 02:58:07.853 V/NotificationListenerService[]( 1049): at android.os.Looper.loop(Looper.java:154)
02-20 02:58:07.853 V/NotificationListenerService[]( 1049): at android.app.ActivityThread.main(ActivityThread.java:6119)
02-20 02:58:07.853 V/NotificationListenerService[]( 1049): at java.lang.reflect.Method.invoke(Native Method)
02-20 02:58:07.853 V/NotificationListenerService[]( 1049): at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:969)
02-20 02:58:07.853 V/NotificationListenerService[]( 1049): at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:859)
02-20 02:58:07.859 I/ServiceManager( 1752): service 'servicediscovery' died
02-20 02:58:07.859 I/sf_frame_dur( 1753): [,0,0,0,0,0,0,128]
02-20 02:58:07.863 I/ServiceManager( 1752): service 'midi' died
02-20 02:58:07.865 I/ServiceManager( 1752): service 'serial' died
02-20 02:58:07.865 I/ServiceManager( 1752): service 'hardware_properties' died
02-20 02:58:07.865 I/ServiceManager( 1752): service 'backup' died
02-20 02:58:07.865 I/ServiceManager( 1752): service 'voiceinteraction' died
02-20 02:58:07.865 I/ServiceManager( 1752): service 'pax_mdb' died
02-20 02:58:07.865 I/ServiceManager( 1752): service 'telecom' died
02-20 02:58:07.865 I/ServiceManager( 1752): service 'clipboard' died
02-20 02:58:07.865 I/ServiceManager( 1752): service 'pax_traffic_impl' died
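The earliest error occurs while a suspend is being initiated: the system cannot find the gralloc library. I immediately connected over adb to check, and the shared library was there. So why could it not be found? Moreover, this suspend/resume bug appears only after the test has run for a while, so why was the library found earlier but not now? I was baffled. Logs pulled from the other failing units showed the same errors, so I decided to read the source and see under what conditions loading a shared library can fail.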
02-15 00:18:11.067 I/[Gralloc]( 1752): ion_alloc from ion_client:18 via heap type DMA(mask:16) for 3686400 Bytes cached buffer successfully, usage = 0x00000333
02-15 00:18:11.107 I/[Gralloc]( 1752): ion_alloc from ion_client:18 via heap type DMA(mask:16) for 3686400 Bytes uncached buffer successfully, usage = 0x00000f02
02-15 00:18:11.121 E/HAL ( 2113): load: module=/system/lib64/hw/gralloc.sun50iw1p1.so
02-15 00:18:11.121 E/HAL ( 2113): dlopen failed: library "/system/lib64/hw/gralloc.sun50iw1p1.so" not found
02-15 00:18:11.121 E/[Gralloc-ERROR]( 2113): int gralloc_register_buffer(const gralloc_module_t *, buffer_handle_t):189 Could not get gralloc module for handle: 0x0x7b1af1ec80
02-15 00:18:11.121 E/Gralloc1On0Adapter( 2113): gralloc0 register failed: -24
02-15 00:18:11.121 W/GraphicBufferMapper( 2113): registerBuffer(0x7b1af1ec80) failed: 5
02-15 00:18:11.121 E/GraphicBuffer( 2113): unflatten: registerBuffer failed: Unknown error -5 (5)
02-15 00:18:11.121 E/Surface ( 2113): dequeueBuffer: IGraphicBufferProducer::requestBuffer failed: 5
02-15 00:18:11.122 E/[EGL-ERROR]( 2113): void __egl_platform_dequeue_buffer(egl_surface *):1830: failed to dequeue buffer from native window 0x7b1e4e1e10; err = 5, buf = 0x0,max_allowed_dequeued_buffers 3
02-15 00:18:11.122 E/Parcel ( 2113): dup() failed in Parcel::read, i is 0, fds[i] is -1, fd_count is 1, error: Too many open files
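The implementation is in frameworks/native/libs/ui/Gralloc1.cpp. The loader searches three directories for the library: /system/lib64/hw, /vendor/lib64/hw and /odm/lib64/hw. Since the library exists, in theory the load should not fail.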
int err = hw_get_module(GRALLOC_HARDWARE_MODULE_ID, &module);

int hw_get_module_by_class(const char *class_id, const char *inst,
                           const struct hw_module_t **module)
{
    int i = 0;
    char prop[PATH_MAX] = {0};
    char path[PATH_MAX] = {0};
    char name[PATH_MAX] = {0};
    char prop_name[PATH_MAX] = {0};

    if (inst)
        snprintf(name, PATH_MAX, "%s.%s", class_id, inst);
    else
        strlcpy(name, class_id, PATH_MAX);

    /*
     * Here we rely on the fact that calling dlopen multiple times on
     * the same .so will simply increment a refcount (and not load
     * a new copy of the library).
     * We also assume that dlopen() is thread-safe.
     */

    /* First try a property specific to the class and possibly instance */
    snprintf(prop_name, sizeof(prop_name), "ro.hardware.%s", name);
    if (property_get(prop_name, prop, NULL) > 0) {
        if (hw_module_exists(path, sizeof(path), name, prop) == 0) {
            goto found;
        }
    }

    /* Loop through the configuration variants looking for a module */
    for (i=0 ; i< ...

    /* [The rest of hw_get_module_by_class() and the beginning of load() are
     * missing from the original text. The lines below are the tail of load().] */

    if (strcmp(id, hmi->id) != 0) {
        ALOGE("load: id=%s != hmi->id=%s", id, hmi->id);
        status = -EINVAL;
        goto done;
    }

    hmi->dso = handle;

    /* success */
    status = 0;

done:
    if (status != 0) {
        hmi = NULL;
        if (handle != NULL) {
            dlclose(handle);
            handle = NULL;
        }
    } else {
        ALOGV("loaded HAL id=%s path=%s hmi=%p handle=%p",
              id, path, *pHmi, handle);
    }

    *pHmi = hmi;

    return status;
}
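One way a library can exist on disk and still fail to load: checking that a path exists needs no file descriptor, but opening it does. The sketch below is not AOSP code. `exists_but_cannot_open` is a hypothetical helper, and it assumes (as in AOSP's hardware.c, to the best of my knowledge) that hw_module_exists() checks the path with access(). It lowers the process's RLIMIT_NOFILE, fills the descriptor table, and then shows access() succeeding while open() fails with EMFILE.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only, not AOSP code).
 * Exhausts this process's descriptor table, then probes `path` twice.
 * Returns  1 if access() sees the file but open() fails with EMFILE,
 *          0 if open() unexpectedly succeeds,
 *         -1 if the experiment could not be set up.
 * Every descriptor opened here is closed and the limit restored on return.
 */
int exists_but_cannot_open(const char *path)
{
    struct rlimit old_lim, new_lim;
    int held[64];
    int n = 0;
    int result = -1;

    if (getrlimit(RLIMIT_NOFILE, &old_lim) != 0)
        return -1;

    /* Lower the soft limit so the table fills up quickly. */
    new_lim = old_lim;
    new_lim.rlim_cur = 32;
    if (setrlimit(RLIMIT_NOFILE, &new_lim) != 0)
        return -1;

    /* Simulate the leak: keep opening /dev/null until open() fails. */
    while (n < 64) {
        int fd = open("/dev/null", O_RDONLY);
        if (fd < 0)
            break;
        held[n++] = fd;
    }

    if (n < 64) {
        int visible = (access(path, R_OK) == 0); /* path check, no fd needed */
        int fd = open(path, O_RDONLY);           /* needs a free descriptor  */

        if (fd >= 0) {
            close(fd);
            result = 0;
        } else if (visible && errno == EMFILE) {
            result = 1;
        }
    }

    /* Release everything and restore the original soft limit. */
    while (n > 0)
        close(held[--n]);
    setrlimit(RLIMIT_NOFILE, &old_lim);
    return result;
}
```

Calling `exists_but_cannot_open("/dev/null")` on a Linux machine should return 1. That would match the log above, where the loader reports the library as "not found" in the same millisecond that Parcel reports "Too many open files".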
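After combing through every error message, I found nothing abnormal in the suspend/resume path itself, until I noticed the "Too many open files" messages. It looked as if no device node or file could be opened any more. I ran commands to check the current number of open file handles and the handle limit, and sure enough, the handles were exhausted. That means file handles were leaking, most likely because some module was not releasing them after use. I then checked the handle count at suspend and again after resume, and it had indeed increased.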
E/Parcel ( 714): dup() failed in Parcel::read, i is 0, fds[i] is -1, fd_count is 1, error: Too many open files
Line 10178: 02-20 02:57:35.458 E/libsuspend( 714): Error opening /sys/class/ltr553/ps_mode: Too many open files
Line 10179: 02-20 02:57:35.458 E/libsuspend( 714): Error opening /sys/class/pax_tp/wakeup_enable: Too many open files
Line 10180: 02-20 02:57:35.458 E/libsuspend( 714): Error opening /sys/class/tca9539/tca9539_standby_enable: Too many open files
Line 11652: 02-20 02:58:07.668 E/DropBoxManagerService( 714): java.io.FileNotFoundException: /data/system/dropbox/drop26.tmp (Too many open files)
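The before/after comparison described above can also be done programmatically. This is a minimal sketch, assuming a Linux /proc filesystem and enough permission to read the target's /proc/<pid>/fd directory. `count_open_fds` is a name I made up for the illustration.

```c
#include <dirent.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Count the open file descriptors of process `pid` by listing
 * /proc/<pid>/fd. Returns the count, or -1 if the directory cannot be
 * opened (no such process, no permission, or the caller itself has no
 * free descriptor left for opendir()).
 */
int count_open_fds(pid_t pid)
{
    char dir_path[64];
    struct dirent *entry;
    int count = 0;
    DIR *dir;

    snprintf(dir_path, sizeof(dir_path), "/proc/%ld/fd", (long)pid);
    dir = opendir(dir_path);
    if (dir == NULL)
        return -1;

    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] != '.')    /* skip "." and ".." */
            count++;
    }
    closedir(dir);

    /* When inspecting ourselves, opendir()'s own descriptor was counted. */
    if (pid == getpid())
        count--;

    return count;
}
```

Sampling this for the system_server pid (714 in the logs above) once before each suspend and once after each resume would show the count growing cycle by cycle if a descriptor leaks on that path.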
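Reviewing the code showed that the libsuspend static library was not releasing file handles after using them, which eventually exhausted the handles. When the system then needed to open the gralloc library to allocate memory, the open failed, and the system processes died one after another.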
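I am not reproducing the actual libsuspend code here. The following is only a sketch of the general leak pattern and its fix, for a helper that writes a value to a sysfs-style node. Both function names are hypothetical.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Leaky pattern (illustration only): the descriptor is never closed,
 * so every successful call permanently costs the process one fd.
 */
int write_node_leaky(const char *path, const char *value)
{
    int fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;
    /* Returning here without close(fd) loses the descriptor. */
    return write(fd, value, strlen(value)) < 0 ? -1 : 0;
}

/*
 * Fixed pattern: the descriptor is closed on every path out of the
 * function, whether or not the write succeeded.
 */
int write_node(const char *path, const char *value)
{
    size_t len = strlen(value);
    ssize_t written;
    int fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;
    written = write(fd, value, len);
    close(fd);
    return (written == (ssize_t)len) ? 0 : -1;
}
```

With the leaky variant, a helper that runs a few times per suspend/resume cycle will use up the per-process descriptor limit after enough cycles, which is consistent with the failure showing up only after hours of stress testing.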
Notes:
cat /proc/sys/fs/file-max    show the system-wide maximum number of file handles
cat /proc/sys/fs/file-nr     show the number of file handles currently allocated system-wide
lsof -p $java_pid            list every file descriptor of the given pid with its attributes
lsof -p $java_pid | wc -l    count the entries in that pid's file descriptor table
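One caveat on the limits above: "Too many open files" is the message for EMFILE (errno 24 on Linux, matching the "gralloc0 register failed: -24" line in the logs). EMFILE means the calling process hit its own descriptor limit. The system-wide limit in file-max corresponds to ENFILE instead. So the per-process RLIMIT_NOFILE is probably the more relevant ceiling here. The sketch below reads that limit and decodes a status code from the logs. The function names are mine, chosen for the example.

```c
#include <errno.h>
#include <string.h>
#include <sys/resource.h>

/*
 * Read the calling process's soft and hard limits on open descriptors.
 * Returns 0 on success, -1 on failure.
 */
int get_fd_limits(rlim_t *soft, rlim_t *hard)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return -1;
    *soft = lim.rlim_cur;
    *hard = lim.rlim_max;
    return 0;
}

/*
 * True if `status` (as logged, e.g. -24) means the per-process
 * descriptor limit was hit, as opposed to the system-wide one (ENFILE).
 */
int is_per_process_fd_exhaustion(int status)
{
    return status == -EMFILE || status == EMFILE;
}

/* Map a status code from the logs (negative or positive) to errno text. */
const char *describe_status(int status)
{
    return strerror(status < 0 ? -status : status);
}
```

Comparing the soft limit from `get_fd_limits` with the per-process descriptor count tells you how close a process is to failing. The figures in /proc/sys/fs/file-nr can look healthy while a single process is already out of descriptors.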