Debugging Embedded Devices at Scale: Effective Techniques for Diagnosis and Resolution
Tyler Hoffman - Memfault - EOC 2023 - Duration: 39:22
In this presentation, I’ll walk through effective approaches for detecting, diagnosing, debugging, and resolving issues in embedded firmware and devices deployed at a large scale, such as in populations of hundreds of thousands or millions. While much has been written on monitoring smaller fleets, typical strategies like onsite debugging, debugger-based diagnosis, and manual log analysis can fail when dealing with massive populations of devices.
The presentation focuses on low-level debugging techniques like fault handling and exception parsing on the device, as well as a novel approach to capturing core dumps on Cortex-M MCUs. The presentation provides guidance on collecting core dumps and diagnosing crashes and faults without a debugger. I’ll also explore the changes in developer behavior that can be implemented when core dump functionality is integrated into firmware, and dumps are received centrally.
This presentation includes content that is often overlooked by online resources, which assume that firmware works flawlessly and bugs are not introduced by developers. We know this is not the case and have developed these strategies to offer real-world solutions for new and experienced firmware engineers who are eager to tackle the challenges of debugging at scale.
Thanks for the talk.
We are implementing remote sensors via a FreeRTOS-aware endpoint board using LoRaWAN for the backhaul to a cloud UI.
We have been experimenting with Memfault, thus far uploading data via a debug console dump and posting it to the UI site. So far it's been a very positive experience.
How much data must we get to the Memfault platform to make it really useful for crash analysis? I would assume at least partial stack data for the faulting task, perhaps the restart count, etc.
Any hints for implementing at least some Memfault functionality given the limited data transfer allowed in an LPWAN environment such as LoRaWAN?
Thanks
We've written about how to minimize bandwidth usage with Memfault in this document - it should answer many of your questions! https://docs.memfault.com/docs/best_practices/low-bandwidth-devices
Thanks Tyler
What are some techniques that you use to extract the core dump from your system? It seems to me that the system may have a difficult time recording this information at the very time that you need it most (e.g. inside the assert call). Do you still trust the driver that controls the non-volatile memory at this time? Do you try saving to the ROM somewhere?
The coredump is typically saved in noinit RAM or directly to external flash with a very minimal driver. I've personally used these methods to collect coredumps from hard faults, memory faults, memory corruption, asserts, and anything else, and I would say it works 99% of the time.
You can find a bit more documentation about how Memfault recommends saving coredumps to flash at https://docs.memfault.com/docs/mcu/coredumps/. I know Zephyr has a coredump storage that it uses as well, but I can't speak to its reliability.
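To make the noinit RAM idea concrete, here is a minimal sketch of a coredump region that survives a warm reset. It is not Memfault's implementation: the `coredump_t` layout, the `coredump_*` function names, the magic value, and the `.noinit` section name (which has to match a region in your linker script) are all assumptions made for illustration. The checksum is a simple rolling sum, not a CRC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* On an ARM GCC/Clang build, place the buffer in a section that the startup
 * code does not zero. ".noinit" is an assumed name; it must exist in your
 * linker script. On a host build this is ordinary static storage. */
#if defined(__arm__)
#define NOINIT __attribute__((section(".noinit")))
#else
#define NOINIT
#endif

#define COREDUMP_MAGIC     0x45524F43u /* arbitrary marker value */
#define COREDUMP_MAX_STACK 256u

typedef struct {
    uint32_t magic;                     /* set only when a dump is present */
    uint32_t pc, lr, sp;                /* registers captured at the fault */
    uint32_t stack_len;                 /* valid bytes in stack[] */
    uint8_t  stack[COREDUMP_MAX_STACK]; /* copy of the top of the stack */
    uint32_t checksum;                  /* covers every field before it */
} coredump_t;

static NOINIT coredump_t s_coredump;

/* Rolling checksum over everything before the checksum field. */
static uint32_t coredump_checksum(const coredump_t *cd) {
    const uint8_t *p = (const uint8_t *)cd;
    uint32_t sum = 0;
    for (size_t i = 0; i < offsetof(coredump_t, checksum); i++) {
        sum = sum * 31u + p[i];
    }
    return sum;
}

/* Called from the fault or assert handler: only plain memory writes. */
void coredump_save(uint32_t pc, uint32_t lr, uint32_t sp,
                   const void *stack_top, size_t stack_bytes) {
    if (stack_bytes > COREDUMP_MAX_STACK) {
        stack_bytes = COREDUMP_MAX_STACK;
    }
    memset(&s_coredump, 0, sizeof s_coredump);
    s_coredump.magic = COREDUMP_MAGIC;
    s_coredump.pc = pc;
    s_coredump.lr = lr;
    s_coredump.sp = sp;
    s_coredump.stack_len = (uint32_t)stack_bytes;
    memcpy(s_coredump.stack, stack_top, stack_bytes);
    s_coredump.checksum = coredump_checksum(&s_coredump);
}

/* Called after reboot: random power-on RAM contents fail these checks. */
bool coredump_is_valid(void) {
    return s_coredump.magic == COREDUMP_MAGIC &&
           s_coredump.stack_len <= COREDUMP_MAX_STACK &&
           s_coredump.checksum == coredump_checksum(&s_coredump);
}

const coredump_t *coredump_get(void) {
    return coredump_is_valid() ? &s_coredump : NULL;
}

/* Invalidate the dump once it has been uploaded. */
void coredump_clear(void) {
    s_coredump.magic = 0;
}
```

The save path touches nothing but RAM, which is why it tends to keep working even when drivers or the heap are in a bad state. After reboot, the firmware checks `coredump_get()` early and queues the dump for upload before calling `coredump_clear()`.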
In terms of extracting the coredump from the system - it typically is chunked up into smaller packets and sent over BLE, Wi-Fi, LTE, or another protocol to a gateway and then sent on as a whole entity, or the chunks are forwarded directly to a server to be reassembled. Memfault's strategy of chunking data is detailed at https://docs.memfault.com/docs/mcu/data-from-firmware-to-the-cloud.
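As a rough illustration of the chunk-and-reassemble flow, the sketch below splits a buffer into transport-sized packets and rebuilds it on the receiving side. This is not Memfault's chunk format: the 4-byte offset header and the `chunk_next` / `chunk_reassemble` names are hypothetical, chosen only to show the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_HEADER_LEN 4u /* 32-bit little-endian byte offset */

/* Device side: write the next chunk of `data` into `out`, which holds at most
 * `out_cap` bytes (for example one BLE or LoRaWAN payload). `*offset` tracks
 * progress between calls. Returns bytes written, or 0 when nothing is left. */
size_t chunk_next(const uint8_t *data, size_t data_len, size_t *offset,
                  uint8_t *out, size_t out_cap) {
    if (out_cap <= CHUNK_HEADER_LEN || *offset >= data_len) {
        return 0;
    }
    size_t payload = out_cap - CHUNK_HEADER_LEN;
    if (payload > data_len - *offset) {
        payload = data_len - *offset;
    }
    uint32_t off = (uint32_t)*offset;
    out[0] = (uint8_t)(off & 0xFFu);
    out[1] = (uint8_t)((off >> 8) & 0xFFu);
    out[2] = (uint8_t)((off >> 16) & 0xFFu);
    out[3] = (uint8_t)((off >> 24) & 0xFFu);
    memcpy(out + CHUNK_HEADER_LEN, data + *offset, payload);
    *offset += payload;
    return CHUNK_HEADER_LEN + payload;
}

/* Receiver side: copy one chunk into `dst` at the offset named in its header.
 * Chunks may arrive in any order. Returns the payload bytes stored, or 0 if
 * the chunk is malformed or would not fit in `dst`. */
size_t chunk_reassemble(const uint8_t *chunk, size_t chunk_len,
                        uint8_t *dst, size_t dst_cap) {
    if (chunk_len <= CHUNK_HEADER_LEN) {
        return 0;
    }
    uint32_t off = (uint32_t)chunk[0] |
                   ((uint32_t)chunk[1] << 8) |
                   ((uint32_t)chunk[2] << 16) |
                   ((uint32_t)chunk[3] << 24);
    size_t payload = chunk_len - CHUNK_HEADER_LEN;
    if (off > dst_cap || payload > dst_cap - off) {
        return 0;
    }
    memcpy(dst + off, chunk + CHUNK_HEADER_LEN, payload);
    return payload;
}
```

Carrying the offset in every chunk means the gateway can stay stateless and simply forward packets, and a lost chunk can be resent on its own. A real protocol would also want a total length and an integrity check so the server knows when the dump is complete.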
Further information about the commercial solution (so I understood from Q&A) can be found at https://docs.memfault.com/
What I'm still curious about: Is there an official site for the coredump component, besides the code in ESP-IDF, Zephyr, and so on?
Absolutely. The Memfault documentation for coredumps can be found here: https://docs.memfault.com/docs/mcu/coredumps
Also, Memfault's firmware SDK is open-source, and can be found at: https://github.com/memfault/memfault-firmware-sdk
How do you debug an IoT-based solution where the modem stack has been developed by a vendor while the application code is under the company's control? I mean, how do you zero in on the root cause when the device fails to come up for an unknown reason (maybe hardware or software)?
That's a tough one! You definitely need to have access to the source code and be able to modify it and get back debugging information.
Further information about the commercial solution (so I understood from Q&A) can be found at https://docs.memfault.com/
What I'm still curious about: Is there an official site for the coredump component, besides the code in ESP-IDF, Zephyr, and so on?
For sure. You can find the coredump code for our SDK on GitHub at https://github.com/memfault/memfault-firmware-sdk. In our documentation, you can find it under MCU -> Subsystem Guides -> Coredumps. https://docs.memfault.com/docs/mcu/coredumps
Looking at the Memfault website: it advertises "free try" all over the place, but states nothing about subsequent pricing or models. Not even a hint of a pointer, or did I miss it?
Could you provide some insight, examples, anything ?
Thanks for sharing all of this Tyler! I really liked the idea of bundling asserts with other functions using macros. Going to see about taking advantage of that on some of the projects I'm working on now...
Awesome! It's definitely a good way to get everyone on the team to use "best-practices" without trying to encourage that during code reviews.
The other side benefit of using the same wrappers is that you can slowly build up the debugging information or tooling to decode these issues. For example, if malloc_assert fails, logging some stats about the heap when the assert is triggered could be super useful! And you don't have to add that hook all throughout the codebase, just in a single function that you then use everywhere.
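A minimal sketch of that single-function idea, under some loud assumptions: the tiny bump allocator (`heap_alloc`) stands in for a real heap so the failure is easy to reproduce, and `assert_failed` is a hypothetical hook that only prints and counts here, where real firmware would capture a coredump and reset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define ARENA_SIZE 128u

/* Stand-in heap: a fixed arena handed out front to back (assumption). */
static union { uint8_t bytes[ARENA_SIZE]; uint64_t align; } s_arena;
static size_t s_used;
static size_t s_alloc_count;
static unsigned s_assert_hits;

static void *heap_alloc(size_t size) {
    size_t aligned = (size + 7u) & ~(size_t)7u; /* round up to 8 bytes */
    if (aligned < size || aligned > ARENA_SIZE - s_used) {
        return NULL; /* overflow or out of memory */
    }
    void *p = &s_arena.bytes[s_used];
    s_used += aligned;
    s_alloc_count++;
    return p;
}

/* Hypothetical assert hook. On a device this would not return. */
static void assert_failed(const char *what, const char *file, int line) {
    fprintf(stderr, "ASSERT %s at %s:%d\n", what, file, line);
    s_assert_hits++;
}

/* The one place that knows what to log when an allocation fails. */
static void *malloc_assert_impl(size_t size, const char *file, int line) {
    void *p = heap_alloc(size);
    if (p == NULL) {
        fprintf(stderr, "heap: requested=%zu used=%zu free=%zu allocs=%zu\n",
                size, s_used, (size_t)ARENA_SIZE - s_used, s_alloc_count);
        assert_failed("malloc", file, line);
    }
    return p;
}

/* Call sites use the macro so the file and line point at the caller. */
#define malloc_assert(size) malloc_assert_impl((size), __FILE__, __LINE__)
```

Every caller writes `malloc_assert(n)` instead of `malloc(n)`. When the team later wants more context on failure (largest free block, per-task usage), it goes into `malloc_assert_impl` once and every call site benefits.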
Really interesting and some good ideas. How could your assert scenario be extended so that asserts are only applied to non-safety-critical code? I.e., if I had a web server linked to a motor controller, I don't want a web page fault causing the controller to crash, but I still want to capture the fault.
Good idea here to not let a web page crash the controller. We wanted to guard against the same thing at Pebble - a third-party application shouldn't crash the main firmware.
In cases like this, I typically use an ASSERT_LOG() function that at least logs out the line of code (PC/LR) and maybe an argument or two. The thing to be careful with here is that only the places where a full assert shouldn't take place should use this type of assert log. In most cases, the full assert should be used to raise errors/issues immediately and loudly to developers.
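One way an ASSERT_LOG() along these lines could look is sketched below. It is a guess at the shape, not the code from the talk: the ring buffer, `assert_log_record`, and `handle_request` are hypothetical names, and the PC/LR capture relies on the GCC/Clang builtin `__builtin_return_address`.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uintptr_t pc;  /* address just after the failing ASSERT_LOG site */
    uintptr_t lr;  /* return address of the function containing it */
    uint32_t  arg; /* one caller-supplied value for context */
} assert_log_entry_t;

#define ASSERT_LOG_CAPACITY 8u
static assert_log_entry_t s_assert_log[ASSERT_LOG_CAPACITY];
static unsigned s_assert_log_count;

/* Kept out of line so that its own return address is an address inside the
 * caller, which serves as an approximation of the PC at the assert site. */
__attribute__((noinline))
void assert_log_record(void *caller_lr, uint32_t arg) {
    assert_log_entry_t *e =
        &s_assert_log[s_assert_log_count % ASSERT_LOG_CAPACITY];
    e->pc = (uintptr_t)__builtin_return_address(0);
    e->lr = (uintptr_t)caller_lr;
    e->arg = arg;
    s_assert_log_count++; /* oldest entries are overwritten when full */
}

/* Records the failure and keeps running, unlike a full assert. */
#define ASSERT_LOG(cond, arg)                                     \
    do {                                                          \
        if (!(cond)) {                                            \
            assert_log_record(__builtin_return_address(0),        \
                              (uint32_t)(arg));                   \
        }                                                         \
    } while (0)

/* Hypothetical non-critical code path, e.g. a web request handler. */
int handle_request(int content_length) {
    ASSERT_LOG(content_length >= 0, content_length);
    if (content_length < 0) {
        return -1; /* reject the request, leave the rest of the system alone */
    }
    return 0;
}
```

The caller still has to handle the bad condition itself, since execution continues after the log. The recorded entries can be sent up with regular telemetry and symbolicated offline against the firmware's ELF file.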
Hello,
Thank you very much for a nice summary of using core dumps. I prefer defensive programming, as you mentioned in the presentation, and see a core dump as the last resort. In the case of a hard fault, it is the last option for solving the issue in the field. We now have a big project based on an STM32F7 MCU with FreeRTOS and without preemptive multitasking. My headache is how to collect a core dump in the case of an endless loop or any kind of deadlock. We are using the independent watchdog, which is refreshed in a task. It works well, but it gives no event before expiration. The window watchdog could be a way, since it raises an event before the restart, but it is too strict for us: refreshing it every 30-70 ms doesn't suit our application. Do you have any advice on how to collect a core dump in the case of an endless loop? In the worst case it can happen in an ISR. In a task it is bad design, but we can't depend on every developer doing their job 100%.
Thanks