Case 1       : kernel reboot - mt6580.dtsi
Symptom      :
Platform     : Android N, MTK6580
Investigation: 1. Captured the serial (UART) log and found the following:
[ 1.607970] <2>.(2)[1:swapper/0]musb-hdrc musb-hdrc.0.auto: Cannot find usb pinctrl iddig_irq_init
[ 1.609094] <2>.(2)[1:swapper/0]Unable to handle kernel paging request at virtual address fffffff9
[ 1.610245] <2>.(2)[1:swapper/0]pgd = c0004000
[ 1.610794] [fffffff9] *pgd=9fffd821, *pte=00000000, *ppte=00000000
[ 1.611581] <2>-(2)[1:swapper/0]Internal error: Oops: 17 [#1] PREEMPT SMP ARM
[ 2.612481] <2>-(2)[1:swapper/0]Non-crashing CPUs did not react to IPI
[ 2.613303] <2>-(2)[1:swapper/0]CPU: 2 PID: 1 Comm: swapper/0 Tainted: G W 3.18.35 #2
[ 2.614409] <2>-(2)[1:swapper/0]task: df060000 ti: df04a000 task.ti: df04a000
[ 2.615304] <2>-(2)[1:swapper/0]PC is at pinctrl_select_state+0x84/0x154 【// in a reboot log, the PC line shows where execution stopped】
[ 2.616140] <2>-(2)[1:swapper/0]LR is at otg_int_init+0x64/0x154
...
[ 3.000390] <2>-(2)[1:swapper/0][] (pinctrl_select_state) from [] (otg_int_init+0x64/0x154)
【// in a reboot log, this backtrace line shows which function execution stopped in】
[ 3.001803] r9:60000113 r8:60000113 r7:c1151d70 r6:c1151e30 r5:c109a408 r4:c1151db0
[ 3.003408] <2>-(2)[1:swapper/0][] (otg_int_init) from [] (mt_usb_otg_init+0x120/0x230)
[ 3.004777] r6:c1151e30 r5:c1151db0 r4:df182140
[ 3.005816] <2>-(2)[1:swapper/0][] (mt_usb_otg_init) from [] (mt_usb_init+0x1d8/0x6d0)
[ 3.007176] r6:e1700000 r5:c1151d70 r4:df182140 r3:c12ecaa0
[ 3.008417] <2>-(2)[1:swapper/0][] (mt_usb_init) from [] (musb_probe+0x2d0/0xb24)
[ 3.009720] r10:00000088 r9:e1700000 r8:de918700 r7:de915000 r6:c1151df8 r5:df182140
[ 3.011218] r4:df182000
[ 3.011861] <2>-(2)[1:swapper/0][] (musb_probe) from [] (platform_drv_probe+0x38/0x90)
[ 3.013218] r10:00000000 r9:df212a00 r8:c10501c0 r7:c10501c0 r6:fffffdfb r5:de915010
[ 3.014714] r4:ffffffed
[ 3.015361] <2>-(2)[1:swapper/0][] (platform_drv_probe) from [] (driver_probe_device+0x1d8/0x43c)
[ 3.016841] r7:c113cdbc r6:c1094378 r5:de915010 r4:c113cdb0
[ 3.018096] <2>-(2)[1:swapper/0][] (driver_probe_device) from [] (__driver_attach+0x94/0x98)
[ 3.019521] r10:00000000 r9:df212a00 r8:c0f00600 r7:00000000 r6:de915044 r5:c10501c0
[ 3.021013] r4:de915010
[ 3.021646] <2>-(2)[1:swapper/0][] (__driver_attach) from [] (bus_for_each_dev+0x68/0x9c)
[ 3.023037] r6:c03c2f08 r5:c10501c0 r4:00000000 r3:00000000
[ 3.024278] <2>-(2)[1:swapper/0][] (bus_for_each_dev) from [] (driver_attach+0x24/0x28)
[ 3.025649] r6:c1040f40 r5:df211d00 r4:c10501c0
[ 3.026690] <2>-(2)[1:swapper/0][] (driver_attach) from [] (bus_add_driver+0x15c/0x218)
[ 3.028185] <2>-(2)[1:swapper/0][] (bus_add_driver) from [] (driver_register+0x80/0x100)
[ 3.029567] r7:df04a030 r6:c0f2efb8 r5:c0f617d8 r4:c10501c0
[ 3.030840] <2>-(2)[1:swapper/0][] (driver_register) from [] (__platform_driver_register+0x5c/0x64)
[ 3.032341] r5:c0f617d8 r4:00000000
[ 3.033203] <2>-(2)[1:swapper/0][] (__platform_driver_register) from [] (musb_init+0x34/0x48)
[ 3.034790] <2>-(2)[1:swapper/0][] (musb_init) from [] (do_one_initcall+0x140/0x200)
[ 3.036128] r4:c0f617d8 r3:00000000
[ 3.037016] <2>-(2)[1:swapper/0][] (do_one_initcall) from [] (kernel_init_freeable+0x144/0x1e8)
[ 3.038474] r10:c0f6200c r9:00000141 r8:c0f00600 r7:c10dc7c0 r6:c10dc7c0 r5:c0f62000
[ 3.039968] r4:00000006
[ 3.040635] <2>-(2)[1:swapper/0][] (kernel_init_freeable) from [] (kernel_init+0x10/0x100)
[ 3.042036] r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:c0a80e4c
[ 3.043523] r4:00000000
[ 3.044170] <2>-(2)[1:swapper/0][] (kernel_init) from [] (ret_from_fork+0x14/0x34)
[ 3.045483] r4:00000000 r3:df04a000
...
[ 7.201650] Rebooting in 1 seconds..
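2. Log analysis: the reboot is caused by a fault when otg_int_init() in usb20_host.c calls pinctrl_select_state().
Further back in the log: Cannot find usb pinctrl iddig_irq_init;
At this point the cause is clear: the reboot happens because mt6580.dtsi is missing the pin property "iddig_irq_init".
3. Checked the change history (log): this property had been deleted from mt6580.dtsi
--> restored the property; OK.
Resolution   : Plan to delete it on purpose tomorrow to reproduce the fault --> after deleting it, the build failed with an error! 【open question】
Summary      :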
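The fault address in Case 1 can be read as an encoded error code. pinctrl_lookup_state() returns ERR_PTR(-ENODEV) when the named state does not exist, and on 32-bit ARM -19 is 0xffffffed. If that value is handed to pinctrl_select_state() without an IS_ERR() check, reading a field at offset 0xc lands on 0xfffffff9, the address reported in the Oops. The sketch below reproduces that arithmetic in user space. The helper names are hypothetical, and the 0xc offset is an assumption about the layout of struct pinctrl_state on this kernel (a list head and a name pointer ahead of the settings list), not something the log states.

```c
#include <stdint.h>

/* Linux reserves the top 4095 addresses for encoded errno values. */
#define MAX_ERRNO 4095u

/* User-space stand-in for the kernel's IS_ERR() on a 32-bit pointer value. */
int is_err_value32(uint32_t ptr)
{
    return ptr >= (uint32_t)(0u - MAX_ERRNO);
}

/*
 * Given a faulting address and the (assumed) offset of the field that was
 * being read, return the errno encoded in the base pointer, or 0 if the
 * base pointer does not look like an ERR_PTR() value.
 */
int errno_from_fault32(uint32_t fault_addr, uint32_t field_offset)
{
    uint32_t base = fault_addr - field_offset;

    if (!is_err_value32(base))
        return 0;
    return (int)(0u - base); /* negate the two's-complement value */
}
```

Calling errno_from_fault32(0xfffffff9u, 0xcu) yields 19, which is ENODEV. The usual guard in a driver is to test the result of pinctrl_lookup_state() with IS_ERR() and skip pinctrl_select_state() when the lookup failed.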
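Case 2       : reboot at power-on - kernel fails to start - suspected bad blocks in the kernel partition - 【replacing the display panel finally fixed it】
Symptom      : reboots while still on the boot logo screen; only one device is affected
Platform     : Android N, MTK6737
Investigation: 1. Captured the serial log: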
[7360] cmdline: console=tty0 console=ttyMT0,921600n1 root=/dev/ram vmalloc=496M androidboot.hardware=mt6735 slub_max_order=0 slub_debug=O androidboot.verifiedbootstate=green bootopt=64S3,32N2,64N2 printk.disable_uart=1 bootprof.pl_t=1809 bootprof.lk_t=3748 boot_reason=0 androidboot.serialno=0123456789ABCDEF androidboot.bootreason=power_key gpt=1
[7360] lk boot time = 3748 ms
[7360] lk boot mode = 0
[7360] lk boot reason = power_key
[7360] lk finished --> jump to linux kernel 64Bit 【// lk hands over to the kernel】
[7360]
[LK]jump to K64 0x40080000
[7360] smc jump
[ATF](0)[0.0]save kernel info
[ATF](0)[0.0]Kernel_EL2
[ATF](0)[0.0]Kernel is 64Bit
[ATF](0)[0.0]pc=0x40080000, r0=0x4e000000, r1=0x0
INFO: BL3-1: Preparing for EL3 exit to normal world, Kernel
INFO: BL3-1: Next image address = 0x40080000
INFO: BL3-1: Next image spsr = 0x3c9
[ATF](0)[0.0]el3_exit
【// the following is after the reboot】
[ATF](0)[30.162857]aee_wdt_dump: on cpu0
[ATF](0)[30.163290](0) pc: lr: sp: pstate: 600000c5
[ATF](0)[30.164452](0) x29: ffffffc000dffea0 x28: 0000004040000000 x27: ffffffc000080270
[ATF](0)[30.165430](0) x26: ffffffc000e94a00 x25: 0000000000000003 x24: 0000000704c2eee5
[ATF](0)[30.166406](0) x23: 0000000000000000 x22: ffffffc000eced08 x21: ffffffc03f738d48
[ATF](0)[30.167382](0) x20: ffffffc000eced08 x19: 0000000000000000 x18: 0000000000000070
[ATF](0)[30.168359](0) x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ATF](0)[30.169335](0) x14: 0000000000000007 x13: 0000000000000000 x12: 0000000000000078
[ATF](0)[30.170311](0) x11: 000000000000000f x10: 000000000000000c x09: 00000000ffffffff
[ATF](0)[30.171287](0) x08: 0000000000000007 x07: 0000000000035f48 x06: 00000000000006de
[ATF](0)[30.172263](0) x05: 00405f7e0099cf00 x04: 0000000000000019 x03: 0000000099d89d8a
[ATF](0)[30.173239](0) x02: 0000000000085b0c x01: ffffffc000fb4000 x00: 0000000000000000
[pmic_init] Preloader Start,MT6328 CHIP Code = 0x2820
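2. Log analysis: the reboot happens as lk hands over to the kernel. Possible cause: the flash has bad blocks and the kernel happens to sit on one
-> reformatted and re-flashed several times -> same result
3. Replaced the flash chip -> same result -> replaced it again -> result unknown (ask the project engineer)
4. In the end, replacing the display panel fixed it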
Case 3       : reboots a while after power-on (about 5 minutes) - deadlock
Symptom      :
Platform     : Android O, MTK6737
Investigation: 1. Exported mtklog and checked last_kmsg.log
// check where the PC stopped
[ 121.420261] -(2)[816:system_server]PC is at spin_bug+0x1d8/0x220
[ 121.420614] -(2)[816:system_server]LR is at spin_bug+0x1cc/0x220
[ 121.420915] -(2)[816:system_server]pc : [] lr : [] pstate: 800001c5
[ 121.421198] -(2)[816:system_server]sp : ffffffc05c0138c0
// call trace
[ 121.640932] -(2)[816:system_server][] spin_bug+0x1d8/0x220
[ 121.641310] -(2)[816:system_server][] do_raw_spin_lock+0x58/0x338
[ 121.641660] -(2)[816:system_server][] _raw_spin_lock_irqsave+0x5c/0x84
[ 121.642001] -(2)[816:system_server][] down_interruptible+0x18/0x60
[ 121.642368] -(2)[816:system_server][] mc3xxx_mutex_lock+0x18/0x2c // suspected a locking problem; removed the gsensor first to see whether it still reboots -> normal after removal
[ 121.642744] -(2)[816:system_server][] mc3xxx_suspend+0x8c/0xec
[ 121.643096] -(2)[816:system_server][] i2c_legacy_suspend+0x38/0x48
[ 121.643457] -(2)[816:system_server][] i2c_device_pm_suspend+0x34/0x38
2. This was a temporary Android 8.0 build, so it was not investigated further
Case 4       : reboot at power-on - faulty temperature-sensing NTC resistor
Symptom      :
Platform     : Android L, MTK6580
Investigation: 1. Captured the serial log:
[ 16.177681] <1>-(1)[345:thermal_manager]PC is at tspa_sysrst_set_cur_state+0x6c/0xd0
[ 16.178644] <1>-(1)[345:thermal_manager]LR is at mtk_cooling_wrapper_set_cur_state+0x190/0x4d0
// call trace - "thermal_cdev_update" suggests that temperature monitoring is responsible
[ 16.316470] <1>-(1)[345:thermal_manager][] (tspa_sysrst_set_cur_state) from [] (mtk_cooling_wrapper_set_cur_state+0x190/0x4d0)
[ 16.318097] r4:dcd28b80 r3:c05a645c
[ 16.318554] <1>-(1)[345:thermal_manager][] (mtk_cooling_wrapper_set_cur_state) from [] (thermal_cdev_update+0xa0/0x18c)
[ 16.320105] r10:dcdc4a18 r9:dcdc4b28 r8:c0dbcff8 r7:dcdc4b40 r6:dcdc4a00 r5:00000001
[ 16.321081] r4:dcdc4ae4
[ 16.321404] <1>-(1)[345:thermal_manager][] (thermal_cdev_update) from [] (backward_compatible_throttle+0x94/0xbc)
[ 16.322892] r10:00000048 r9:dc571a00 r8:00000001 r7:00000000 r6:00000000 r5:dc571b54
[ 16.323868] r4:dc9e6900
[ 16.324191] <1>-(1)[345:thermal_manager][] (backward_compatible_throttle) from [] (handle_thermal_trip+0x68/0x1e4)
[ 16.325691] r9:c0d56530 r8:dc571a18 r7:00000000 r6:00000000 r5:00000000 r4:dc571a00
[ 16.326665] <1>-(1)[345:thermal_manager][] (handle_thermal_trip) from [] (thermal_zone_device_update+0x9c/0x158)
2. kernel-3.18/drivers/misc/mediatek/thermal/common/thermal_zones/mtk_ts_pa.c
static int tspa_sysrst_set_cur_state(struct thermal_cooling_device *cdev, unsigned long state)
{
    cl_dev_sysrst_state = state;
    if (cl_dev_sysrst_state == 1) {
        ...
        *(unsigned int *)0x0 = 0xdead; // write to address 0 forces a fault; suspected cause of the reboot. Commented this line out -> no reboot
    }
    ...
}
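3. After booting, captured kernel_log from mtklog; it reports an abnormal temperature of 125 degrees Celsius (T_AP=125000; the real temperature was nowhere near that high)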
Line 18718: <7>[ 306.833625] (0)[57:kworker/0:1][name:mtk_ts_bts&][Power/BTS_Thermal] T_AP=125000
Line 18831: <7>[ 307.833641] (0)[57:kworker/0:1][name:mtk_ts_bts&][Power/BTS_Thermal] T_AP=125000
Line 18969: <7>[ 308.833733] (0)[57:kworker/0:1][name:mtk_ts_bts&][Power/BTS_Thermal] T_AP=125000
Line 19063: <7>[ 309.833632] (0)[57:kworker/0:1][name:mtk_ts_bts&][Power/BTS_Thermal] T_AP=125000
Line 19155: <7>[ 310.834030] (0)[57:kworker/0:1][name:mtk_ts_bts&][Power/BTS_Thermal] T_AP=125000
Line 19218: <7>[ 311.833597] (0)[57:kworker/0:1][name:mtk_ts_bts&][Power/BTS_Thermal] T_AP=125000
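4. Suspected a faulty temperature-sensing NTC resistor; inspection confirmed that the NTC resistor was shorted (excess solder)
Resolution   :
Summary      : the board carries several NTC resistors for temperature sensing (PA, battery, etc.); temperature is checked during boot, and the system reboots if it is too high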
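Why a shorted NTC in Case 4 reads as the maximum temperature can be shown with a small lookup sketch. An NTC's resistance falls as temperature rises, so a short (close to 0 ohms) looks like an extremely hot sensor and the conversion clamps at the hottest table entry. The table values, the function name, and the interpolation scheme below are illustrative assumptions; they are not taken from mtk_ts_bts or any MTK table.

```c
#include <stddef.h>

struct ntc_point {
    int temp_c;       /* temperature in degrees Celsius */
    long resistance;  /* NTC resistance in ohms at that temperature */
};

/* Hypothetical, coarse table: resistance falls as temperature rises. */
static const struct ntc_point ntc_table[] = {
    { -40, 188500 },
    {   0,  27000 },
    {  25,  10000 },
    {  60,   3000 },
    { 125,    530 },
};

#define NTC_TABLE_LEN (sizeof(ntc_table) / sizeof(ntc_table[0]))

/* Convert a resistance to millidegrees Celsius, clamping at the table ends. */
long ntc_resistance_to_mdegc(long resistance)
{
    size_t i;

    if (resistance >= ntc_table[0].resistance)
        return ntc_table[0].temp_c * 1000L;
    if (resistance <= ntc_table[NTC_TABLE_LEN - 1].resistance)
        return ntc_table[NTC_TABLE_LEN - 1].temp_c * 1000L;

    for (i = 1; i < NTC_TABLE_LEN; i++) {
        const struct ntc_point *cold = &ntc_table[i - 1]; /* higher resistance */
        const struct ntc_point *hot = &ntc_table[i];      /* lower resistance */

        if (resistance >= hot->resistance) {
            /* Linear interpolation between the two neighbouring points. */
            long long span_r = cold->resistance - hot->resistance;
            long long span_t = (hot->temp_c - cold->temp_c) * 1000LL;
            long long above = resistance - hot->resistance;

            return (long)(hot->temp_c * 1000LL - above * span_t / span_r);
        }
    }
    return ntc_table[NTC_TABLE_LEN - 1].temp_c * 1000L; /* not reached */
}
```

With this table, ntc_resistance_to_mdegc(0) returns 125000, the same shape of reading as the constant T_AP=125000 in the log, while 10000 ohms maps to 25000. A reading pinned at the table maximum from the first sample is therefore a hint to inspect the sensor circuit before suspecting real overheating.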
Case 5       : reboot at power-on - temperature check
Symptom      :
Platform     : Android N, MTK6737
Investigation: 1. Captured the serial log:
[ 4.864498].(4)[68:bat_thread_kthr][Power/BatMeter] [force_get_tbat] 0,108,0,0,0,60
[ 4.865459].(4)[68:bat_thread_kthr][Power/BatMeter] [oam_run_inf] 4045, 4045, 4010, 2592, 2592, 135, 135, 2, 2, 1782, 60, 16
[ 4.866864].(4)[68:bat_thread_kthr][Power/BatMeter] [oam_result_inf] 16, 16, 16, 16, 16, 0
[ 4.867902].(4)[68:bat_thread_kthr][Power/Battery] AvgVbat=(4010),bat_vol=(4010),AvgI=(0),I=(0),VChr=(359),AvgT=(60),T=(60),pre_SOC=(84),SOC=(84),ZCV=(4044)
[ 4.869657].(4)[68:bat_thread_kthr][Power/Battery] [Battery] Tbat(60)>=60, system need power down.
[ 4.871358].(4)[68:bat_thread_kthr][Power/Battery] charging_set_power_off=0
[ 4.872229].(4)[68:bat_thread_kthr]mt_power_off
2. Based on the log, disabled the over-temperature protection; the device still could not boot
kernel-3.18/drivers/power/mediatek/battery_common.c
- if(BMT_status.temperature >= 60)
+ if(0)
3. Captured the serial log again:
[ 7.837171].(7)[191:thermal_manager]Power/battery_Thermal: reset, reset, reset!!!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@*****************************************@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[ 7.839444]-(7)[191:thermal_manager]------------[ cut here ]------------
[ 7.840299]-(7)[191:thermal_manager]kernel BUG at kernel/drivers/thermal/mtk_ts_battery.c:394!
[ 7.842208]-(7)[191:thermal_manager]Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
4. Temporarily disabled the check (removed the sysrst cooler registration); the device boots
kernel-3.18/drivers/misc/mediatek/thermal/common/thermal_zones/mtk_ts_battery.c
int mtktsbattery_register_cooler(void)
{
/* cooling devices */
- cl_dev_sysrst = mtk_thermal_cooling_device_register("mtktsbattery-sysrst", NULL,
- &mtktsbattery_cooling_sysrst_ops);
return 0;
}
Resolution   :
Summary      :
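Case 6       : reboot after the device goes to sleep - caused by a deadlock in mxc400x_resume() - not investigated further
Symptom      :
Platform     : Android N, MTK6737
Investigation: 1. Checked last_kmsg in mtklog; it shows the reboot is caused by a deadlock in mxc400x_resume()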
[ 194.487798] -(0)[719:system_server]PC is at __list_add+0x40/0xe0
[ 194.488102] -(0)[719:system_server]LR is at mutex_lock_nested+0x1ac/0x678
[ 194.639589] Backtrace:
[ 194.640103] -(0)[719:system_server][] (__list_add) from [] (mutex_lock_nested+0x1ac/0x678)
[ 194.640325] r7:00000000 r6:c17b7668 r5:60070013 r4:c17b7664
[ 194.641259] -(0)[719:system_server][] (mutex_lock_nested) from [] (mxc400x_resume+0x28/0x64)
[ 194.641488] r10:c1126378 r9:c0dda8d7 r8:00000010 r7:c0773000
[ 194.642407] -(0)[719:system_server][] (mxc400x_resume) from [] (i2c_legacy_resume+0x38/0x44)
[ 194.642638] r5:c11b4164 r4:00000001
[ 194.643273] -(0)[719:system_server][] (i2c_legacy_resume) from [] (i2c_device_pm_resume+0x38/0x3c)
[ 194.643664] -(0)[719:system_server][] (i2c_device_pm_resume) from [] (dpm_run_callback+0x120/0x234)
[ 194.644051] -(0)[719:system_server][] (dpm_run_callback) from [] (device_resume+0xb4/0x198)
2. Commented out the lock in mxc400x_resume() --> no reboot
static int mxc400x_resume(struct i2c_client *client)
{
    struct mxc400x_i2c_data *obj = i2c_get_clientdata(client);
    int err = 0;
    if(obj == NULL)
    {
        GSE_ERR("null mxc400x!!\n");
        return -EINVAL;
    }
-   mutex_lock(&mxc400x_mutex);
    err = mxc400x_init_client(client, 0);
    if(err)
    {
        GSE_ERR("initialize client fail!!\n");
-       mutex_unlock(&mxc400x_mutex);
        return -EINVAL;
    }
    atomic_set(&obj->suspend, 0);
-   mutex_unlock(&mxc400x_mutex);
    return err;
}
Resolution   :
Summary      :
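Case 7       : node missing from dtsi, reboot in the bootloader stage (the log below is tagged [LK]) - add the node
Symptom      : screen stays dark
Platform     : Android O, MTK6737
Investigation: 1. Captured the serial log: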
[175] Copy DTB from 0x41f07086 to 0x4e000000(size: 0xb442)
[176] [LK] fdt setup addr:0x4e000000 status:1!!!
[176] [partition_get_index]find odmdtbo index 20
[178] Multiple ODM DTBO.
[178] ODM mdtbo_index: 0, dtbo_offset: 1024, dtbo_size: 48768
[179] [partition_get_index]find odmdtbo index 20
ata start bit at rising edge
[28] [SD0]e80
ERROR: ufdt_overlay_do_fixups():Couldn't find 'strobe' symbol in main dtb // the node "strobe" is missing
ERROR: ufdt_overlay_apply():failed to perform fixups in overlay
[189] ufdt_apply_overlay() failed!
[189] app/mt_boot/mt_boot.c:line 407 0
==> added the node to mt6735m.dts
2. Still reboots; captured the log again:
[176] [LK] fdt setup addr:0x4e000000 status:1!!!
[176] [partition_get_index]find odmdtbo index 20
[178] Multiple ODM DTBO.
[179] ODM mdtbo_index: 0, dtbo_offset: 1024, dtbo_size: 48768
[179] [partition_get_index]find odmdtbo index 20
[182] blob_len: 0x80000, overlay_len: 0xbe80
ERROR: ufdt_overlay_do_fixups():Couldn't find 'leds' symbol in main dtb // the node "leds" is missing
ERROR: ufdt_overlay_apply():failed to perform fixups in overlay
[189] ufdt_apply_overlay() failed!
[190] app/mt_boot/mt_boot.c:line 407 0
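==> added the node to mt6735m.dts
3. No more reboots
4. Analysis: the dts references these nodes by label and appends properties to them so that the dts can control the GPIO pins. If a referenced node is missing from the main dtb, parsing the dtb in the bootloader stage fails: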
&strobe {
pinctrl-names = "default", "main_strobe_oh", "main_strobe_ol", "main_strobe_flash_oh", "main_strobe_flash_ol", "sub_strobe_oh", "sub_strobe_ol", "psel_pinctrl_oh", "psel_pinctrl_ol", "charger_enable_pinctrl_oh", "charger_enable_pinctrl_ol";
pinctrl-0 = <&strobe_intpin_default>;
pinctrl-1 = <&main_strobe_oh>;
pinctrl-2 = <&main_strobe_ol>;
pinctrl-3 = <&main_strobe_flash_oh>;
5. Note: a low battery (voltage too low) also prevents the device from booting:
[911] mtk detect 3130
[985] [AUXADC] ch=0 raw=18999 data=3130
[985] [mt65xx_bat_init] check VBAT=3130 mV with 3450 mV
[986] [BATTERY] battery voltage(3130mV) <= CLV ! Can not Boot Linux Kernel !!
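The two overlay failures in steps 1 and 2 of Case 7 follow one pattern: the overlay refers to a label (first strobe, then leds) that the main dtb does not export, so the fixup step cannot resolve the reference. The sketch below models that check in user space with flat string arrays. It is not the ufdt implementation; the function names and the array representation are hypothetical and exist only to illustrate the lookup.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if `label` is among the symbols exported by the main tree. */
int base_has_symbol(const char *const *symbols, size_t n_symbols,
                    const char *label)
{
    size_t i;

    for (i = 0; i < n_symbols; i++)
        if (strcmp(symbols[i], label) == 0)
            return 1;
    return 0;
}

/*
 * Return the first label the overlay needs that the main tree does not
 * export, or NULL when every reference can be resolved.
 */
const char *first_missing_fixup(const char *const *symbols, size_t n_symbols,
                                const char *const *fixups, size_t n_fixups)
{
    size_t i;

    for (i = 0; i < n_fixups; i++)
        if (!base_has_symbol(symbols, n_symbols, fixups[i]))
            return fixups[i];
    return NULL;
}
```

Because a check like this stops at the first unresolved label, one missing node can hide another, which matches what happened here: the leds error only appeared after strobe had been added. Listing every label the overlay references and comparing it with the main dts up front avoids repeating the flash-and-boot cycle for each missing node.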
Case 8       :
Symptom      :
Platform     : Android N, MTK6737
Investigation: 1.
2.
3.
4.
Resolution   :
Summary      :
Case 9       :
Symptom      :
Platform     : Android N, MTK6737
Investigation: 1.
2.
3.
4.
Resolution   :
Summary      :