1. APSS stability
2. MPSS stability
3. TZ stability
4. RPM stability
5. ADSP stability
6. Unknown reset
1.APSS Stability
Type |
Type in detail |
Log indication |
How to debug? |
Possible reasons |
Kernel Panic |
Data abort |
Unable to handle kernel paging request at virtual address 00001098 gpd = ffffffc00007d000 … Internal error: Oops: 96000045 [#1] PREEMPT SMP … |
1. Load the dump into T32 2. v.f, check the call stack 3. d.l PC, check the current PC and why it accessed a wrong address 4. Go through the assembly code to see why the address value is wrong 5. Read the code and check critical variables to find the root cause
|
1. Determine whether it is a SW issue or a HW issue. If crashes occur randomly with varying symptoms, it is generally a HW issue: cache error, register corruption, DDR memory corruption/bit flip, etc. Actions: 1) If DDR memory is corrupted, run a DDR stress test with QMesa; 2) Check the HW PDN; 3) Raise the core voltage, disable CPR, or disable one CPU to see whether the issue disappears; 4) If only a single device is affected, RMA it;
2. If it reproduces regularly with the same symptom, it is most likely a SW bug. Follow the debug steps in the column to the left and read the code to fix the bug.
3. If it happened only once, still follow the steps to check why the panic happened. If memory/cache/register corruption is identified, keep monitoring the device. |
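The value after "Oops:" (96000045) and the "code" in the synchronous abort message (0x86000004) are exception syndrome values. The following is a minimal sketch for decoding them, assuming the standard ARMv8-A ESR_EL1 layout (exception class in bits [31:26], fault status code in bits [5:0], WnR in bit 6). The function and table names are illustrative, and only the exception classes seen in these logs are named.

```python
# Exception classes (ESR_EL1.EC) that appear in the two log examples.
EC_NAMES = {
    0x21: "instruction abort from the current exception level",
    0x25: "data abort from the current exception level",
}

# The upper four bits of the 6-bit fault status code select the fault type;
# the lower two bits give the translation table level.
FAULT_TYPES = {
    0b0000: "address size fault",
    0b0001: "translation fault",
    0b0010: "access flag fault",
    0b0011: "permission fault",
}


def decode_esr(esr):
    """Split a 32-bit syndrome value into the fields used for triage."""
    ec = (esr >> 26) & 0x3F
    fsc = esr & 0x3F
    fault = FAULT_TYPES.get(fsc >> 2)
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "not decoded in this sketch"),
        "fault": f"{fault}, level {fsc & 0x3}" if fault else "not decoded in this sketch",
        # WnR (write, not read) is only meaningful for data aborts.
        "write": bool(esr & (1 << 6)) if ec == 0x25 else None,
    }


for value in (0x96000045, 0x86000004):
    print(hex(value), decode_esr(value))
```

For 0x96000045 this reports a data abort caused by a write that hit a level 1 translation fault; for 0x86000004 it reports an instruction abort with a level 0 translation fault, which fits a jump to a garbage PC.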
Prefetch abort |
Bad mode in Synchronous Abort handler detected, code 0x86000004 … PC is at 0x30303063706920 LR is at 0x3030303063706920 … |
1. Load the dump into T32 2. d.l PC, check the current PC 3. d.dump r(sp) or d.v %y.ll r(sp), check the stack 4. Read the code and check critical variables to find the root cause; it is often stack corruption |
||
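A quick check that often confirms stack corruption in the prefetch abort case is to see whether the bad PC/LR value is really text. The sketch below assumes a little-endian 64-bit register value, as on arm64; the helper name is illustrative.

```python
def register_as_ascii(value, width=8):
    """Return the register bytes as text if every byte is printable ASCII,
    otherwise None. Bytes are taken in little-endian (memory) order."""
    raw = value.to_bytes(width, "little")
    if all(0x20 <= b < 0x7F for b in raw):
        return raw.decode("ascii")
    return None


# LR from the prefetch abort log above.
print(repr(register_as_ascii(0x3030303063706920)))
# A genuine kernel address (from the stack-protector log) is not text.
print(repr(register_as_ascii(0xFFFFFFC000263F20)))
```

The LR in the log decodes to the string " ipc0000", so the saved return address was overwritten by string data, most likely a buffer overflow on the stack.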
Stack_protector |
Kernel panic – not syncing: stack-protector: Kernel stack is corrupted in: ffffffc000263f20 |
arch/Kconfig: CONFIG_HAVE_CC_STACKPROTECTOR=y CONFIG_CC_STACKPROTECTOR=y
1. Load to T32 2. v.f check call stack 3. d.dump r(sp) 4. d.l current function
|
||
Undefined instruction |
Internal error: Oops – undefined instruction: 0 [#1] PREEMPT SMP ARM Modules linked in : …
|
1. load to T32 2. v.f 3. d.l PC 4. mmu.pt.list r(pc) 5. check why the instruction is undefined |
||
Cache Error |
SBE:
[ 26.079779] EDAC DEVICE0: CE: cache instance: cpu1 block: L1 'A53 L1 Correctable Error'
[ 26.087819] EDAC arm64: ARM64 CPU ERP: Single-bit error interrupt received on CPU 1!
[ 26.095487] EDAC arm64: Single-bit error information from CPU 1, MIDR=0x410fd031:
[ 26.102775] EDAC arm64: Cortex A53 CPU1 L1 Single-bit Error detected
[ 26.109107] EDAC arm64: CPUMERRSR value = 0x9281040183
[ 26.114056] EDAC arm64: L1 Instruction data RAM bank is 1
[ 26.119438] EDAC arm64: Repeated error count: 146
[ 26.124124] EDAC arm64: Other error count: 10
CCI Error:
[ 45.812121] EDAC arm64: CCI error interrupt received!
[ 45.817228] EDAC arm64: CCI imprecise error register: 00000000.
[ 45.823044] EDAC DEVICE0: UE: cache instance: cpu0 block: L3 'CCI Error'
[ 45.829707] Kernel panic - not syncing: EDAC cache: UE instance: cpu0 block L3 'CCI Error'
|
1. Check back trace 2. Check the access prior to these messages in dmesg and RTB logs, although they will not always be an exact indicator of what is causing a particular error.
Defconfig:
CONFIG_EDAC_CORTEX_ARM64_PANIC_ON_CE: enables panic on a correctable single-bit error. If this config is disabled, the command line argument has no effect.
CONFIG_EDAC_CORTEX_ARM64_PANIC_ON_UE: enables panic on an uncorrectable double-bit error.
CORTEX_ARM64_EDAC.PANIC_ON_CE=0: add this to the command line to disable panic on a correctable single-bit error. |
||
Non secure WD |
APSS Non Secure WD bark/bite |
In call stack:
…
wdog_bark_handler()
handle_irq_event_percpu()
…
gic_handle_irq()
el1_irq() -> exception
Some functions()
…
kthread() |
Check points: 1. Timer list 2. Work list 3. Trace events log 4. Register information
Steps: 1. v.f 2. v.v wdog_data, check wdog_data->alive_mask->bits. 3. Restore the kernel stack manually 4. Check the trace event log
Other tests: 1. Disable CPU3 and check whether the issue still reproduces; 2. Bump the APC0 voltage by 25 mV and check whether the issue still reproduces; 3. For HW issues: RMA. |
1. Excessive serial logging.
a. Serial logging is costly
b. Spends too much time with IRQ disabled
c. pr_err() in kmalloc()
2. RT thread
a. RT thread runs too long without yielding the processor
b. Fastmixer
3. Low memory
a. No idle worker
b. kthreadd can't create worker_thread
4. DDR corruption
a. Spinlock corruption
5. Online CPU unable to send ACK back to IPI.
|
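Reading wdog_data->alive_mask->bits comes down to comparing two bitmasks. The sketch below assumes that bit n of the alive mask is set once CPU n has answered the watchdog ping, and that the online CPU mask is read separately from the dump; both inputs and the function name are hypothetical, not taken from the driver source.

```python
def cpus_not_responding(alive_mask, online_mask, nr_cpus=8):
    """List online CPUs whose bit is missing from the alive mask.

    Assumption: bit n set in alive_mask means CPU n responded to the ping.
    """
    missing = online_mask & ~alive_mask
    return [cpu for cpu in range(nr_cpus) if missing & (1 << cpu)]


# Hypothetical values: four CPUs online, CPU3 never answered.
print(cpus_not_responding(alive_mask=0b0111, online_mask=0b1111, nr_cpus=4))
```

A CPU that shows up in this list is the one whose call stack and timer/work lists are worth checking first.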
Secure WD |
Secure WD |
RST_STAT.BIN stores GCC_RESET_STATUS.
< Reset registers (saved by SDI) >:
GCC_RESET_STATUS : 0x03 or 0x23
PMIC Registers (SDI) :
PON_REASON1 : 0x00
PON_WARM_RESET_REASON1 : 0x21
PON_WARM_RESET_REASON2 : 0x00
POFF_REASON1 : 0x02
POFF_REASON2 : 0x00
PON_SOFT_RESET_REASON1 : 0x00
PON_SOFT_RESET_REASON2 : 0x00
< TZ log >:
CPU |Reset Reason |Reset Count
0 |0x00000000 (TZBSP_ERR_FATAL_NONE )|0x00000000
1 |0x00000000 (TZBSP_ERR_FATAL_NONE )|0x00000000
|
How to debug: 1. SDI saves the cache and CPU context when a secure WD bite happens. 2. If all CPUs are offline, check from the RPM side. 3. If at least one CPU is online, check from the AP side: RTB logs, TZ diag logs, etc. 4. In the RTB log, check whether any action disabled a clock; if so, it is highly suspect. 5. Normally this requires effort from multiple technical teams and a series of tests. |
Crash reason: TZ failed to respond to FIQ
Possible reasons: 1. DDR lock up 2. APPS fails to wake up 3. Un-clocked register access |
Bus hang |
AHB timeout |
In TZ log or F3 messages: Fatal Error: AHB_TIMEOUT
Or Fatal Error: NOC_ERROR |
How to debug: 1. Check the TZ diag log 2. Parse the NOC error message 3. Check what the master was doing 4. Check whether the clock required by the slave was enabled 5. Check the RTB log.
TZ diag log:
ERRLOG0: cause of the NoC error (error type)
ERRLOG1: InitFlow, TargFlow, TargSubRange, Source (PID, BID, MID), SeqID parameters
ERRLOG2: RouterID for the error, when the router ID is wider than 32 bits
ERRLOG3: complete address/offset of the address for which the error was detected
|
Reason: the AHB timeout slave monitors the bus for hangs. When a hang is detected, it generates an interrupt to a processor.
NOC error handler software is integrated into the AHB timeout driver to provide complete coverage for bus hang detection.
Possible reasons: Un-clocked register access |
2.MPSS Stability
Type |
Type in detail |
Log indication |
How to debug? |
Possible reasons |
Q6/QuRT exception |
Processor exception |
ExIPC: Exception recieved tid=161 inst=c0075d2c
|
1. Recover the call stack 2. Check SSR for the detailed exception reason; the last two hex digits (the lowest byte) indicate the exception reason: 0x70, TLB miss-RW; 0x71, TLB miss-Write; 0x15, invalid packet; 0x3A, BADTRAP; … 3. d.l @ELR 4. d.dump SP 5. Read the code to find why the exception happened and why the register value is incorrect.
|
SW reasons: 1. NULL pointer; 2. Address is not correct; 3. Stack corrupted by other tasks;
HW reasons: 1. Bit flip happened. 2. Register corruption. |
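The SSR lookup in step 2 can be sketched as a small table. This assumes the cause code is the lowest 8 bits of SSR, which matches the two-digit codes listed above; only those four codes are included, and the SSR value in the example is hypothetical.

```python
# Cause codes exactly as listed in the debug steps; other codes exist.
SSR_CAUSES = {
    0x70: "TLB miss-RW",
    0x71: "TLB miss-Write",
    0x15: "invalid packet",
    0x3A: "BADTRAP",
}


def ssr_cause(ssr):
    """Return (cause code, description) from a raw SSR value."""
    code = ssr & 0xFF
    return code, SSR_CAUSES.get(code, "not listed in this sketch")


# Hypothetical SSR value read from the dump.
code, name = ssr_cause(0x80000070)
print(hex(code), name)
```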
QuRT error |
In kernel log: SFR Init: wdog or kernel error suspected..
In modem log: MODEM - ERR_PRECISE, a precise exception occurred at instruction 0x86842558 with BADVA 0x8B04BBC0
Modem - ERR_NMI, Error cause: ERR_NMI, an NMI occurred.
MODEM - ERR_TLBMISS, TLBMISS RW occurred at instruction 0xC003AE24 with BADVA 0x00000008
MODEM - ERR_ASSERT, Kernel Assert: at instruction 0xC13BF268 with BADVA 0xD1396C48 |
1. v.v QURT_error_info, check cause and cause2 (if cause = 5). 2. v.f 3. d.l PC, check the assembly. 4. Decide which points to check according to the QuRT error reason (PRECISE, NMI, TLBMISS, ASSERT, etc.). 5. Check MB. 6. Work through the evidence to find the root cause. |
SW reasons: 1. NULL pointer; 2. Address is not correct; 3. Stack corrupted by other tasks;
HW reasons: 1. Bit flip happened. 2. Register corruption. |
|
Timer error |
MODEM - timer_slaves.c:1011 No items on time slave task free cmd q on 0,Q_ele=24,Free_Q=0
|
1. v.v timer_slaves_cmd_q to see which timer queue is full/empty; 2. Check the call stacks of the 3 timer slave tasks. One task is probably blocked somewhere and cannot handle other commands; 3. Check timers_expired_slave1/2/3;
|
SW reasons: 1. Some timer callback functions did not return in time.
HW reasons: 1. Bit flip happened. 2. Register corruption. |
|
Heap error |
MODEM - memheap.c:2127 In task 0xfa, Assertion ! INTEGRITY_CHECK_ON_USED_HEADER (heap_ptr->magic_num_use
|
1. Run Heapwalker 2. Check modem_mem_heap -> incomingBlock; 3. d.dump SP 4. Check why the heap value is incorrect according to the stack and global variables; |
SW reasons: 1. A client uses the heap incorrectly, e.g. frees a block twice; 2. The input pointer is wrong and does not point to a valid heap block.
HW reasons: 1. Bit flip happened. 2. Register corruption. |
|
Stack error |
stack_protect.c:65 Stack Check Failed
This is a stack overflow issue, usually caused by a local array overflow. If the code is compiled with -fstack-protector, the stack checker reports this crash, which shows that some function has a stack overflow or stack corruption.
If a stack is corrupted/overwritten by other tasks and -fstack-protector is not enabled, the corruption can surface as other exceptions instead. |
1. v.f, check the call stack 2. d.dump SP, check the stack 3. Check the assembly in detail to see which value is wrong and whether the content on the stack is corrupted. 4. Check whether the 128 padding 0xF8 in the stack is corrupted. 5. Check whether any function makes a huge stack allocation |
SW reasons: 1. Huge stack allocation, e.g. big local variables or too many nested function calls.
Method to increase the stack size: increase the size for the task in modem_proc\build\bsp\modem_proc_img\build\modem_task_stksz.csv.
|
|
MPSS Watch Dog |
Software watchdog timeout |
dog.c:1498 Watchdog detects stalled initialization |
1. v.v dog_state_table to see which task did not pet the dog in time. 2. Check the call stack of this task for functions that may be stuck in an infinite loop; 3. If the task is waiting on a futex, check whether this futex is held by another task and whether a deadlock happened. 4. Check MB to see whether some task ran too long without petting the dog. |
1. Infinite loop in some task; 2. Deadlock; 3. Some task runs too long; |
Hardware watchdog bark |
Seldom happens. Normally a SW dog timeout or a HW WD bite occurs instead. |
|
|
|
Hardware watchdog bite |
In Dmesg: Watchdog bite received from modem software!
|
It is always difficult to debug because of insufficient logs: the cache is NOT flushed.
Normally the ETB log needs to be enabled and the issue reproduced. Qualcomm needs to be involved to solve it.
|
SW reasons: 1. Some bugs in QuRT error handler; 2. Un-clocked register access
HW reasons: 1. RMA devices; 2. PDN issue; |
|
Error fatal |
Error fatal in core modules |
Some error fatals are handled by the BSP team, e.g.:
Diag: diagcomm_sio.c:1269 Assertion 0 failed
MProc: glink_channel_migration.c:426 Assertion status == GLINK_STATUS_SUCCESS failed
EFS: fs_rmts_pm.c:962 2,0,0,Partition data not meant for this partition
|
Read code to find the error fatal reason, check call stack and input variables to see why error fatal happened. |
Normally a SW bug.
|
Error fatal in protocol stack modules |
Error fatals in the protocol stack are handled by the modem team, e.g.:
LTE crash: lte_ml1_mgr_modules.c:621:Assert stm_error_flag == STM_SUCCESS failed: LTE_ML1
GSM crash: gl1_hw_sleep_ctl.c:5028 SLEEP:Error recovery attempt FAILED. g_slept 595990 duration 440613 missed_frames 4
RF crash: rf_dispatch_snum.c:497:rf_dispatch_snum_pop_item: SNUM Node Item Exhausted (Ma |
Handled by protocol team. |
Handled by protocol team. |
3.TZ Stability
Type |
Type in detail |
Log indication |
How to debug? |
Possible reasons |
||||
Non secure WDOG |
|
In call stack:
…
wdog_bark_handler()
handle_irq_event_percpu()
scm_call()
…
gic_handle_irq()
el1_irq() -> exception
Some functions()
…
kthread()
|
1. A non-secure WD is checked from the APPS perspective first. 2. TZ is one of the possible root causes of an NS WDOG. 3. Look for scm_call in the HLOS call stack with scm_lock() not returning. 4. v.v scm_lock -> comm 5. Task.dtask, check the task state. |
SW reasons
HW reasons:
|
||||
XPU error |
MPU error |
xpu: ISR begin
XPU ERROR: Non Sec!!
xpu:>>> [1] XPU error dump, XPU id 3 (BIMC_MPU0)<<<
xpu: uErrorFlags: 00000016
xpu: HAL_XPU2_ERROR_F_CLIENT_PORT
xpu: HAL_XPU2_ERROR_F_MULTIPLE
uBusFlags: 00000521
xpu: HAL_XPU2_BUS_F_ERROR_AC
xpu: HAL_XPU2_BUS_F_APROTNS
xpu: HAL_XPU2_BUS_F_AOOO
xpu: HAL_XPU2_BUS_F_ABURST
xpu: uPhysicalAddress: 80c000f4
xpu: uMasterId: 00000000, uAVMID : 00000003
xpu: uATID : 00000000, uABID : 00000002
xpu: uAPID : 00000000, uALen : 00000000
xpu: uASize : 00000002, uAPReqPriority : 00000000
xpu: uAMemType: 00000000
|
1. Check B/P/M to identify the client: a) XPU ID – [3] is (BIMC_MPU0) b) Virtual Master ID – uAVMID [3] is (TZBSP_VMID_AP) c) Bus ID – uABID [2] d) Port ID – uAPID[0] is (Kryo) e) Master ID – uMasterId[0] is CL1 CPU 0
2. uPhysicalAddress: 80c000f4 is the address that the client was trying to access; 3. Check why this client accessed this address. |
Read 80-NV396-70_XPU ERROR ANALYSIS FOR MSM8996 for details.
Possible reasons: 1. RMA devices 2. Un-clocked register access 3. APPS/MPSS wants to access some peripherals(e.g. SPI), but RPM didn’t give it access rights. |
||||
APU and RPU error |
xpu: ISR begin
XPU ERROR: Non Sec!!
xpu:>>> [5] XPU error dump, XPU id 45 (BAM_BLSP1_DMA)<<<
xpu: uErrorFlags: 00000002
xpu: HAL_XPU2_ERROR_F_CLIENT_PORT
uBusFlags: 00080021
xpu: HAL_XPU2_BUS_F_ERROR_AC
xpu: HAL_XPU2_BUS_F_APROTNS
xpu: HAL_XPU2_BUS_F_NONSECURE_RG_MATCH
xpu: uPhysicalAddress: 00019000
xpu: uMasterId: 00000000, uAVMID : 00000003
xpu: uATID : 00000000, uABID : 00000002
xpu: uAPID : 00000000, uALen : 00000000
xpu: uASize : 00000000, uAPReqPriority : 00000000
xpu: uAMemType: 00000000
Fatal Error: XPU_VIOLATION |
1. Check B/P/M to identify the client: a) XPU ID – [45] is (BAM_BLSP1_DMA) b) Virtual Master ID – uAVMID [3] is (TZBSP_VMID_AP) c) Bus ID – uABID [2] d) Port ID – uAPID[0] is (Kryo 0) e) Master ID – uMasterId[0] is CL1 CPU 0
2. Because this is an APU, uPhysicalAddress is an offset from the BAM_BLSP1_DMA base address: 0x07544000 + 0x00019000 = 0x0755D000. The register at 0x0755D000 is BLSP1_BLSP_BAM_P_CTRL_20. 3. Check why this client accessed this address. |
Read 80-NV396-70_XPU ERROR ANALYSIS FOR MSM8996 for details.
Possible reasons: 1. RMA devices 2. Un-clocked register access 3. APPS/MPSS wants to access some peripherals(e.g. SPI), but RPM didn’t give it access rights. |
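The field extraction and the APU address calculation above can be sketched as follows. The regular expression, function names, and the shortened log excerpt are illustrative; the base address 0x07544000 is the BAM_BLSP1_DMA base quoted in the debug steps.

```python
import re

# Matches "uName: 00000000" and "uName : 00000000" pairs in the xpu dump.
FIELD_RE = re.compile(r"\b(u[A-Z][A-Za-z]*)\s*:\s*([0-9A-Fa-f]{8})\b")


def parse_xpu_fields(text):
    """Collect the hexadecimal u* fields of an XPU error dump into a dict."""
    return {name: int(value, 16) for name, value in FIELD_RE.findall(text)}


def apu_register_address(xpu_base, fields):
    """For an APU/RPU error, uPhysicalAddress is an offset from the XPU base."""
    return xpu_base + fields["uPhysicalAddress"]


log = """
xpu: uErrorFlags: 00000002
uBusFlags: 00080021
xpu: uPhysicalAddress: 00019000
xpu: uMasterId: 00000000, uAVMID : 00000003
xpu: uATID : 00000000, uABID : 00000002
xpu: uAPID : 00000000, uALen : 00000000
"""

fields = parse_xpu_fields(log)
print("VMID:", fields["uAVMID"], "BID:", fields["uABID"], "PID:", fields["uAPID"])
print("register:", hex(apu_register_address(0x07544000, fields)))
```

For an MPU error the same parser applies, but uPhysicalAddress is already the accessed address and no base is added.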
|||||
NOC error |
|
SNOC Error – ERRLOG0 = 0x80030000
SNOC Error – ERRLOG1 = 0x6a52810e
SNOC Error – ERRLOG2 = 0x00000000
SNOC Error – ERRLOG3 = 0x014c5000
SNOC Error – ERRLOG4 = 0x00000000
Fatal Error – NOC_ERROR
PCNOC Error – ERRLOG0 = 0x80030000
PCNOC Error – ERRLOG1 = 0x14225024
PCNOC Error – ERRLOG3 = 0x00005000
PCNOC Error – ERRLOG4 = 0x00000000
Fatal Error – NOC_ERROR
|
General flow: 1. Decode the route ID per the decomposition table of a specific NoC to get values for the InitFlow, TargFlow, TargSubRange, SrcId.PID, SrcId.BID, SrcId.MID, and SeqId parameters. 2. Look up master information (route ID composition of a specific NoC) from the InitFlow value. 3. Look up slave information (route ID composition of a specific NoC) from the TargFlow value. 4. Look up the source bus, port, and master by using SrcId.BID, SrcId.PID, and SrcId.MID values, respectively.
Example: CNOC Error: 0x1690D079
Source: 0x05 Destination: 0x29 MID: 0x0F BID: 0x02 PID: 0x03
MID: 0x0F
For the mss_cfg slave, the CNOC HDD gives a base address of 0xfc800000:
qhm0_rpm_M2 [0xfd000000:0xfc800000] qhs7_mss_cfg WR 1 bytes Req 81.3 ns
The offset from ERRLOG3 is 0x80040:
CNOC ERROR: ERRLOG3 = 0x00080040
According to ipcat, this address is MSS_QDSP6SS_NMI.
Check why APPS accessed MSS_QDSP6SS_NMI and caused the NOC error. |
Read 80-NV396-71_NoC Error Debug for MSM8996 User Guide for details.
Possible reasons: 1. RMA devices 2. Un-clocked register access
|
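Collecting the ERRLOG registers per NoC and turning the ERRLOG3 offset into an absolute address can be sketched as below. The regular expression and function names are illustrative, the sample lines are copied from the logs above, and the slave base address has to come from the NoC HDD as in the CNOC example. Decoding the route ID in ERRLOG1 is not attempted here because the bit layout differs per NoC.

```python
import re

# Accepts "SNOC Error - ERRLOG0 = 0x..." with "-", an en dash, or ":".
ERRLOG_RE = re.compile(
    r"(\w*NOC) Error\s*[-\u2013:]\s*ERRLOG(\d)\s*=\s*(0x[0-9A-Fa-f]+)",
    re.IGNORECASE,
)


def parse_errlogs(text):
    """Group ERRLOGn values by NoC name: {"SNOC": {0: ..., 1: ...}, ...}."""
    regs = {}
    for noc, index, value in ERRLOG_RE.findall(text):
        regs.setdefault(noc.upper(), {})[int(index)] = int(value, 16)
    return regs


def faulting_address(slave_base, errlog3):
    """Absolute address, assuming ERRLOG3 holds an offset into the slave."""
    return slave_base + errlog3


log = """
SNOC Error - ERRLOG0 = 0x80030000
SNOC Error - ERRLOG1 = 0x6a52810e
SNOC Error - ERRLOG3 = 0x014c5000
CNOC ERROR: ERRLOG3 = 0x00080040
"""

regs = parse_errlogs(log)
print(hex(regs["SNOC"][3]))
# CNOC example: mss_cfg base 0xfc800000 plus the ERRLOG3 offset.
print(hex(faulting_address(0xFC800000, regs["CNOC"][3])))
```

The resulting address is what gets looked up in ipcat to find the register name.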
4.RPM Stability
Type |
Type in detail |
Log indication |
How to debug? |
Possible reasons |
|
Bus Fault exception |
< RPM log >
145.418005: Clock: gcc_dehr_clk Requested State = Enable. Reference Count = 1
191.852022: rpm_err_fatal (lr: 0x0010c149) (ipsr: 0x00000005)
|
1. Load RPM dump to T32 2. Run rpm_restore_from_core.cmm 3. Run rpm_m3_unstack.cmm 4. Run rpm_parse_faults.cmm 5. v.f, check call stack 6. d.l PC 7. Check why it tries to access this AHB bus |
Bus error occurs when AHB interface receives an error response from a bus slave – RPM does not have access permission. |
APPS Non secure WD |
< RPM log >
9.602248: rpm_halt_exit
9.602253: rpm_abort_interrupt_received (APPS_NON_SECURE_WD_BITE) … aborting
9.602256: rpm_error_fatal (lr: 0x0010c7b) (ipsr: 0x00000049) – “unknown interrupt 73”
|
When a non-secure WD bite occurs within APPS, RPM receives a notification interrupt from HW so that it can preserve the RPM state. This is not an RPM error. Check the non-secure WD from the AP side. |
Not an error in RPM. |
|
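The ipsr value in rpm_err_fatal tells which exception the RPM core was handling. The sketch below assumes the RPM core is an ARM Cortex-M3 (as the script name rpm_m3_unstack.cmm suggests) and uses the standard Cortex-M exception numbering; the function name is illustrative.

```python
# Standard Cortex-M system exception numbers.
CORTEX_M_EXCEPTIONS = {
    0: "Thread mode (no active exception)",
    1: "Reset",
    2: "NMI",
    3: "HardFault",
    4: "MemManage",
    5: "BusFault",
    6: "UsageFault",
    11: "SVCall",
    12: "DebugMonitor",
    14: "PendSV",
    15: "SysTick",
}


def decode_ipsr(ipsr):
    """Translate an IPSR value into an exception name or external IRQ number."""
    number = ipsr & 0x1FF
    if number >= 16:
        return f"external interrupt {number - 16} (exception {number})"
    return CORTEX_M_EXCEPTIONS.get(number, f"reserved ({number})")


for value in (0x00000005, 0x00000049):
    print(hex(value), "->", decode_ipsr(value))
```

0x05 decodes to BusFault, matching the bus fault row, and 0x49 is exception 73, matching the "unknown interrupt 73" text in the APPS non-secure WD log.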
LDO setting timeout |
0x000000002085DA3D: rpm_apply_request (resource type: ldoa) (resource id: 16)
0x000000002085DB3D: rpm_apply_request (resource type: ldoa) (resource id: 16)
0x000000002085DC3D: rpm_apply_request (resource type: ldoa) (resource id: 16)
0x000000002085DD3D: rpm_apply_request (resource type: ldoa) (resource id: 16)
0x000000002085DE3D: rpm_apply_request (resource type: ldoa) (resource id: 16)
0x0000000020867021: START Apply() VS0B
0x000000002086702D: START Post-Dep() VS1039B
0x000000002086703E: rpm_err_fatal (lr: 0x00013007) (ipsr: 0x00000000)
|
1. Check the RPM log to see which master requested the LDO change and which LDO is being set. 2. v.f in RPM T32, check whether an error appears in the RPM call stack. 3. Check with the HW team for such issues.
-000|abort()
-001|pm_rpm_check_vreg_settle_status(
    | ?,
    | estimated_settling_time_us = 200 = 0xC8,
    | pwr_res = 0x00099FFC = DALPROP_StructPtrs_8996_xml[79],
    | comm_ptr = 0x00099E64 = ,
    | settling_err_en = 1 = 0x1)
    | vreg_status = 0 = 0x0
    | current_time = 545679961 = 0x20866A59
    | settle_end_time = 8718783610880 = 0x000007EE00000000
    | return of pm_pwr_is_vreg_ready_alg = PM_ERR_FLAG__SBI_OPT_ERR = 1 = 0x1
    | return of pm_rpm_check_battery_status = PM_ERR_FLAG__SBI_OPT_ERR = 1 = 0x1
    | return of pm_pwr_status_reg_dump_alg = PM_ERR_FLAG__SBI_OPT_ERR = 1 = 0x1
    | return of pm_pwr_is_vreg_ready_alg = PM_ERR_FLAG__SBI_OPT_ERR = 1 = 0x1
 ---|end of frame
|
Mostly a HW issue, such as 1. A shorted pin 2. Insufficient headroom, etc. |
|
RPM WD bark |
< RPM log >
43862.139223: rpm_process_request (master: “APPS”) (resource type: clk2) (id: 0) (full name: bimc)
…
43862.139406: Clock: gcc_ddr_dim_cfg_clk Requested State = Enable. Reference Count = 1
43862.170521: rpm_err_fatal (lr: 0xfffffff9) (ipsr: 0x00000041)
|
1. Load the RPM dump into T32 2. Run rpm_restore_from_core.cmm 3. v.f, check the call stack 4. Normally an RPM WD bark happens while RPM is waiting for HW to clear some bit, and the bit is never cleared. E.g.: waiting for GCC_BIMC_DDR_CPLL_CMD_RCGR bit 0 to clear when APPS requested a new BIMC frequency, from 547.2 MHz to 777.6 MHz. |
Mostly a HW issue. Check together with the HW team. |
|
VDD MIN |
System can’t enter VDD_MIN |
1. Triage what is preventing VDD_MIN. Conditions that must be met to enter VDD_MIN:
CXO = off
VDD_DIG = Retention level
VDD_MEM = Retention level
2. Check the npa-dump.txt log:
npa_client (name: APSS) (handle: 0x198728) (resource: 0x198548) (type: NPA_CLIENT_REQUIRED) (request: 1)
3. Check railway.txt: follow Railway.rail_state[n].voter_list_head to walk through the voters.
|
Mostly SW reasons prevent RPM from entering VDD_MIN. |
5.ADSP Stability
Handled by the Audio team; submit cases to the Qualcomm Audio ADSP team.
6.Abnormal Reset
Type |
Type in detail |
Log indication |
How to debug? |
Possible reasons |
Unknown reset |
|
In QCAP log, there is no error in APPS/Modem/TZ/RPM.
Read GCC_RESET_STATUS register from RST_STAT.BIN(0x8600760):
Bit  Field name
5    MSM_TSENSE_RESET_STATUS
4    PROC_HALT_CTI_STATUS
3    SRST_RESET_STATUS
2    MSM_TSENSE0_RESET_STATUS
1    PMIC_RESIN_RESET_STATUS
0    SECURE_WDOG_EXPIRE_RESET_STATUS |
1. Read GCC_RESET_STATUS. If it is 0, then: 2. Read PMIC warm reset reasons 1 and 2. If either is non-zero, there was a PMIC warm reset. 3. If both are 0, there was no PMIC warm reset; check the PON_REASON and POFF_REASON1 / POFF_REASON2 registers. |
It happens mostly in the following tests: throwing, drop, rolling, ESD.
Possible reasons: HW PDN, SW bug, bad sample, thermal.
|
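The register checks for an unknown reset can be sketched as a small decoder. The bit names come from the GCC_RESET_STATUS table above; the function names are illustrative, and the branch for a non-zero GCC_RESET_STATUS is an assumption that simply reports the latched bits.

```python
GCC_RESET_STATUS_BITS = {
    5: "MSM_TSENSE_RESET_STATUS",
    4: "PROC_HALT_CTI_STATUS",
    3: "SRST_RESET_STATUS",
    2: "MSM_TSENSE0_RESET_STATUS",
    1: "PMIC_RESIN_RESET_STATUS",
    0: "SECURE_WDOG_EXPIRE_RESET_STATUS",
}


def decode_gcc_reset_status(value):
    """Return the names of all set bits, lowest bit first."""
    return [name for bit, name in sorted(GCC_RESET_STATUS_BITS.items())
            if value & (1 << bit)]


def next_step(gcc_reset_status, warm_reset_reason1, warm_reset_reason2):
    """Follow the three-step check described in the debug column."""
    if gcc_reset_status:
        # Assumption: a non-zero value already names the reset source.
        return "latched: " + ", ".join(decode_gcc_reset_status(gcc_reset_status))
    if warm_reset_reason1 or warm_reset_reason2:
        return "PMIC warm reset: decode PON_WARM_RESET_REASON1/2"
    return "no PMIC warm reset: check PON_REASON and POFF_REASON1/POFF_REASON2"


# 0x03 and 0x23 are the values shown in the secure WD example.
print(decode_gcc_reset_status(0x03))
print(decode_gcc_reset_status(0x23))
print(next_step(0x00, 0x21, 0x00))
```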