Creating Chaos and Hard Faults
The best way to understand why the processor is sending you love letters (exceptions) is to see what they look like when you aren’t also frantically trying to fix your code. This talk goes over the code needed to cause (and debug) divide-by-zero faults, bus errors, stack overflows, and buffer overflows.
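On Cortex-M3/M4 parts, most of these faults leave a flag in the Configurable Fault Status Register (CFSR, at 0xE000ED28). One detail worth knowing: integer divide by zero only faults if the DIV_0_TRP bit (bit 4 of the CCR at 0xE000ED14) is set; otherwise SDIV/UDIV quietly return 0. The sketch below decodes a captured CFSR value into the first flagged cause, using the bit positions from the Arm Cortex-M3 CFSR documentation listed in the resources. The function and table names are hypothetical, and the register value is passed in as a parameter so the logic can be tested on a host; on target you would pass `*(volatile uint32_t *)0xE000ED28` or CMSIS `SCB->CFSR`.

```c
#include <stdint.h>
#include <stddef.h>

/* One CFSR cause flag. MMFSR = bits 0-7, BFSR = bits 8-15, UFSR = bits 16-31.
 * MMARVALID (bit 7) and BFARVALID (bit 15) only say whether MMFAR/BFAR hold
 * a fault address, so they are not listed as causes. Cortex-M4F adds lazy FP
 * stacking flags (bits 5 and 13), also not listed here. */
typedef struct {
    uint8_t bit;
    const char *name;
} cfsr_flag;

static const cfsr_flag cfsr_flags[] = {
    {0,  "IACCVIOL: instruction access violation (MPU)"},
    {1,  "DACCVIOL: data access violation (MPU)"},
    {3,  "MUNSTKERR: MemManage fault while unstacking on exception return"},
    {4,  "MSTKERR: MemManage fault while stacking on exception entry"},
    {8,  "IBUSERR: instruction bus error"},
    {9,  "PRECISERR: precise data bus error (check BFARVALID/BFAR)"},
    {10, "IMPRECISERR: imprecise data bus error"},
    {11, "UNSTKERR: bus fault while unstacking on exception return"},
    {12, "STKERR: bus fault while stacking on exception entry"},
    {16, "UNDEFINSTR: undefined instruction"},
    {17, "INVSTATE: invalid state (e.g. branch to an even address)"},
    {18, "INVPC: invalid PC load on exception return"},
    {19, "NOCP: coprocessor access (e.g. FPU not enabled)"},
    {24, "UNALIGNED: unaligned access"},
    {25, "DIVBYZERO: divide by zero (requires CCR.DIV_0_TRP)"},
};

/* Return a description of the lowest-numbered cause flag set in cfsr,
 * or NULL if none of the listed flags is set. */
const char *cfsr_first_cause(uint32_t cfsr)
{
    for (size_t i = 0; i < sizeof cfsr_flags / sizeof cfsr_flags[0]; i++) {
        if (cfsr & (UINT32_C(1) << cfsr_flags[i].bit)) {
            return cfsr_flags[i].name;
        }
    }
    return NULL;
}
```

Several flags can be set at once (a stacking error often accompanies another fault), so in practice you may want to report every set bit rather than only the first.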
For each one, Elecia looks at the information the Cortex-M processor provides and how to use it to determine the cause of the fault. She describes how to use that information in a hard fault handler to create small core dumps that can be retrieved after the system reboots.
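One common way to keep a small core dump across a reset is for the hard fault handler to write a few registers into RAM that the startup code does not clear, tagged with a magic number and a checksum so the next boot can tell a real record from power-on garbage. This is a minimal sketch under stated assumptions: it assumes a linker script with a `.noinit` section (the section name is an assumption; it must be excluded from the startup code's zero and copy loops), and the `crash_record` layout, magic value, and function names are all hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#if defined(__arm__)
/* Assumption: the linker script places .noinit in RAM that startup code
 * neither zeroes nor initializes, so it survives a warm reset. */
#define NOINIT __attribute__((section(".noinit")))
#else
/* Host build for testing: ordinary zero-initialized static storage. */
#define NOINIT
#endif

#define CRASH_MAGIC UINT32_C(0xC0FFEE42) /* hypothetical marker value */

typedef struct {
    uint32_t magic;    /* CRASH_MAGIC when the record is valid */
    uint32_t pc;       /* stacked PC from the exception frame */
    uint32_t lr;       /* stacked LR from the exception frame */
    uint32_t cfsr;     /* SCB->CFSR at fault time */
    uint32_t checksum; /* simple XOR check over the fields above */
} crash_record;

static crash_record g_crash NOINIT;

static uint32_t crash_checksum(const crash_record *r)
{
    return r->magic ^ r->pc ^ r->lr ^ r->cfsr ^ UINT32_C(0xA5A5A5A5);
}

/* Called from the hard fault handler with values read from the stacked frame. */
void crash_save(uint32_t pc, uint32_t lr, uint32_t cfsr)
{
    g_crash.magic = CRASH_MAGIC;
    g_crash.pc = pc;
    g_crash.lr = lr;
    g_crash.cfsr = cfsr;
    g_crash.checksum = crash_checksum(&g_crash);
}

/* Called early after boot. Copies out a valid record and then invalidates it
 * so the same crash is not reported twice. Returns false if there is none. */
bool crash_take(crash_record *out)
{
    bool valid = g_crash.magic == CRASH_MAGIC &&
                 g_crash.checksum == crash_checksum(&g_crash);
    if (valid) {
        memcpy(out, &g_crash, sizeof *out);
    }
    g_crash.magic = 0;
    return valid;
}
```

On target, the handler would call `crash_save()` and then reset (for example with CMSIS `NVIC_SystemReset()`); after boot, `crash_take()` hands the record to whatever stores it in flash or sends it off the device.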
What is the main reason a buffer overflow in a callee function can cause execution to later jump to an unrelated location when the caller returns?
Thanks Elecia, that was a great talk. The links are useful. I have the first edition of your book and am looking forward to getting the second soon.
Is there a way to dump some or all of the call stack, beyond just the PC and LR, to help work out when a particular function or method was called (similar to what the IDE shows)?
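Without unwind tables, one common heuristic is to scan the words above the faulting stack pointer and keep any that look like Thumb return addresses: odd values (bit 0 set, as BL leaves in LR) that fall inside the region holding code. It yields false positives (stale data, function pointers) and can miss frames, so treat the output as candidates to look up in the map file or with `addr2line`. The code bounds below are placeholder assumptions for a part with flash at 0x08000000, and `scan_stack_for_returns` is a hypothetical name.

```c
#include <stdint.h>
#include <stddef.h>

/* Placeholder code-region bounds: take the real ones from your linker script. */
#define CODE_START UINT32_C(0x08000000)
#define CODE_END   UINT32_C(0x08100000)

/* Scan `count` words starting at `sp` (toward higher addresses, i.e. older
 * frames on a full-descending stack) and store up to `max_out` candidate
 * return addresses in `out`, with the Thumb bit cleared. Returns how many
 * candidates were found. */
size_t scan_stack_for_returns(const uint32_t *sp, size_t count,
                              uint32_t *out, size_t max_out)
{
    size_t found = 0;
    for (size_t i = 0; i < count && found < max_out; i++) {
        uint32_t w = sp[i];
        if ((w & 1u) && w >= CODE_START && w < CODE_END) {
            out[found++] = w & ~UINT32_C(1);
        }
    }
    return found;
}
```

On target, `count` should stop at the top of the stack (for example a linker-script symbol) so the scan never reads past the stack region.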
Some of my users just take a photo or video of the device’s screen when they report a crash, so another variant of the hard fault handler (HFH) could dump the registers to the screen, then either wait a few seconds for the watchdog to trigger a restart or poll forever for a GPIO key press and then force a system restart. Including the software version in any dump would also be useful, as addresses move between versions.
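The on-screen variant can be as simple as formatting the firmware version and key registers into one short line that survives being photographed. This sketch assumes a display driver (not shown) that can print a C string, and `format_crash_line` is a hypothetical name. It uses `snprintf` because it cannot overrun the buffer; note that calling a `printf`-family function from a fault handler assumes the C library and the stack are still usable, which may not hold after a stack overflow.

```c
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <stddef.h>

/* Format a one-line crash summary for a small display, including the
 * firmware version so addresses can be matched to the right map file.
 * Returns the length snprintf wanted to write (output is truncated if
 * that is >= len). */
int format_crash_line(char *buf, size_t len, const char *version,
                      uint32_t pc, uint32_t lr, uint32_t cfsr)
{
    return snprintf(buf, len,
                    "FW %s PC=%08" PRIX32 " LR=%08" PRIX32 " CFSR=%08" PRIX32,
                    version, pc, lr, cfsr);
}
```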
I sprinkle asserts liberally, including in an overloaded operator new, to catch dynamic memory allocation failures before they cause problems later. I’ve recently switched to the safe string functions (strncpy_s, etc.), which, on a buffer overrun or similar violation, call an abort constraint handler containing an assert(false).
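The Annex K functions such as `strncpy_s` and `set_constraint_handler_s` are optional in C11, and some C libraries (glibc among them) do not provide them. Where they are missing, the same idea can be approximated with a small wrapper that reports an overrun through a hook instead of truncating silently. This is a sketch, not Annex K itself: `copy_checked`, `set_overrun_hook`, and the hook type are hypothetical names, and on target the hook would typically record the failure and `assert(false)`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical hook, called when a copy would overrun its destination. */
typedef void (*overrun_hook)(const char *what);

static overrun_hook g_overrun_hook;

void set_overrun_hook(overrun_hook hook) { g_overrun_hook = hook; }

static void report_overrun(const char *what)
{
    if (g_overrun_hook != NULL) {
        g_overrun_hook(what);
    }
}

/* Copy src into dst (capacity dst_size, including the terminator).
 * Returns 0 on success. On a violation it leaves dst as an empty string
 * (when dst is usable), calls the hook, and returns -1, similar to how
 * strncpy_s handles a constraint violation. */
int copy_checked(char *dst, size_t dst_size, const char *src)
{
    if (dst == NULL || dst_size == 0) {
        report_overrun("copy_checked: invalid destination");
        return -1;
    }
    size_t n = (src != NULL) ? strlen(src) : 0;
    if (src == NULL || n >= dst_size) {
        dst[0] = '\0';
        report_overrun("copy_checked: source missing or too long");
        return -1;
    }
    memcpy(dst, src, n + 1);
    return 0;
}
```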
If anyone’s interested, there was a talk by Jean Labrosse in a previous EOC all about the MPU. Barr Group also have a good article on how to prevent and detect stack overflow. Yours ties in well with the talk by Suraj Joseph on software safety.
Resources from this talk:
Introduction to Hard Faults:
First handler shown from FreeRTOS: Debugging and diagnosing hard faults on ARM Cortex-M CPUs
Second handler shown from Memfault: How to debug a HardFault on an ARM Cortex-M MCU | Interrupt (this is the most in-depth resource)
Code used in the demo: github.com/eleciawhite/making-embedded-systems/Ch09_Debugging/
Arm Documentation: Configurable Fault Status Register - Cortex-M3 (https://developer.arm.com/documentation/dui0552/a/cortex-m3-peripherals/system-control-block/configurable-fault-status-register)
Smashing the Stack for Fun and Profit describes how to manipulate the stack
Buried Treasure and Map Files (and Linker Files) talk and map
Hey Elecia, love your book and podcast! I know this might not be as common, but do you happen to know strategies to cause/debug dynamic memory allocation issues? Thanks so much for your time.
I was trying to fit this in as one last slide, but I don't think I'll manage, so let me reply directly. (Also, I'm plagiarizing from Ch 9 of my book.)
I have a bag of tricks I use to debug memory issues. The first is to make variables static or global to see if it is a stack issue (or an uninitialized variable issue). Note: if the problem goes away when this happens, the problem is not solved, it is hidden; you still have to fix it.
Second, you can add extra buffers (red zones) between your buffers. Fill these with known values (such as 0xdeadbeef or 0xA5A5A5A5). After running for a while, look in the red zones to see if the data has been modified, which indicates that one of the buffers went outside its boundaries. You can do something similar with stacks: fill them with known data at boot, then check how much of the stack has been used after running for a significant amount of time. The highest point the stack reaches (the edge beyond which the fill data has not been modified) is called the high-water mark and represents maximum stack usage.
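Both tricks come down to "fill with a known pattern, check later." The sketch below shows red zones around a buffer and a high-water-mark measurement on a painted stack, using plain arrays so it runs on a host; on target the stack bounds would come from linker-script symbols (an assumption, not shown). The struct layout and function names are hypothetical. The guards live inside one struct because C lays out struct members in declaration order, whereas separate variables may be placed anywhere.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define FILL_PATTERN UINT32_C(0xA5A5A5A5)
#define GUARD_WORDS 4

/* A buffer with red zones immediately before and after it. */
typedef struct {
    uint32_t guard_lo[GUARD_WORDS];
    uint8_t data[32];
    uint32_t guard_hi[GUARD_WORDS];
} guarded_buf;

void guard_init(guarded_buf *b)
{
    for (size_t i = 0; i < GUARD_WORDS; i++) {
        b->guard_lo[i] = FILL_PATTERN;
        b->guard_hi[i] = FILL_PATTERN;
    }
}

/* True if both red zones still hold the fill pattern. */
bool guard_intact(const guarded_buf *b)
{
    for (size_t i = 0; i < GUARD_WORDS; i++) {
        if (b->guard_lo[i] != FILL_PATTERN || b->guard_hi[i] != FILL_PATTERN) {
            return false;
        }
    }
    return true;
}

/* Paint a stack region with the fill pattern (done once at boot). */
void stack_paint(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        stack[i] = FILL_PATTERN;
    }
}

/* Bytes used so far on a full-descending stack. stack[0] is the lowest
 * address (the stack limit); the stack grows down toward it, so count the
 * untouched words from the bottom and report the rest as used. */
size_t stack_high_water_bytes(const uint32_t *stack, size_t words)
{
    size_t untouched = 0;
    while (untouched < words && stack[untouched] == FILL_PATTERN) {
        untouched++;
    }
    return (words - untouched) * sizeof(uint32_t);
}
```

A used word can coincidentally equal the pattern, so the measurement can be slightly low; leave margin above the measured high-water mark. For task stacks, FreeRTOS offers `uxTaskGetStackHighWaterMark()`, which uses the same fill-and-check idea and reports the minimum free space seen.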
If you are experiencing odd issues, try making your stack(s) (or heaps) larger to see if it takes longer for the problem to occur. Alternatively, does making the stack smaller cause the issue to happen sooner? Reproducing the bug is an important part of moving from impossible bug to merely very difficult.
Finally, replace heap (malloc and new) buffers with fixed buffers. Not only does this avoid using freed memory, it also eliminates problems with trying to allocate too much memory (or trying to allocate a large block when the heap is fragmented).
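When objects really do need to come and go, a related technique (a swapped-in alternative, not the plain fixed buffers described above) is a fixed-size block pool: all memory is reserved at link time, allocation and free are O(1), and the pool cannot fragment. Unlike purely static buffers, a pool can still suffer use-after-free, but exhaustion becomes an explicit NULL you can assert on. This sketch is not interrupt- or thread-safe (it assumes a single context or external locking), and the names and sizes are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_BLOCKS 8
#define BLOCK_SIZE  64 /* bytes; size this for the largest object you store */

/* A free block stores a pointer to the next free block. */
typedef union block {
    union block *next;
    uint8_t bytes[BLOCK_SIZE];
    uint64_t align64; /* 8-byte alignment, enough for doubles on typical MCUs */
} block;

static block pool_storage[POOL_BLOCKS];
static block *free_list;

/* Thread every block onto the free list. */
void pool_init(void)
{
    free_list = NULL;
    for (size_t i = 0; i < POOL_BLOCKS; i++) {
        pool_storage[i].next = free_list;
        free_list = &pool_storage[i];
    }
}

/* Returns a block, or NULL when the pool is exhausted (a good place to assert). */
void *pool_alloc(void)
{
    block *b = free_list;
    if (b != NULL) {
        free_list = b->next;
    }
    return b;
}

/* Return a block to the pool; freeing NULL is a no-op. */
void pool_free(void *p)
{
    if (p == NULL) {
        return;
    }
    block *b = p;
    b->next = free_list;
    free_list = b;
}
```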
This was a great talk, as past experience led me to expect. I enjoyed the prior presentation you alluded to, and I love the Memory Map Land graphic: I use it as the lock screen on my work PC, and a former colleague who just retired loved it too when he walked in and saw it once.