Live Q&A - Debugging Embedded Devices at Scale: Effective Techniques for Diagnosis and Resolution
Tyler Hoffman - Memfault - Watch Now - EOC 2023 - Duration: 24:47
Thanks for the talk.
We are implementing remote sensors on a FreeRTOS-aware endpoint board, using LoRaWAN for the backhaul to a cloud UI.
We have been experimenting with Memfault, so far uploading data via a debug console dump and posting it to the UI site. It's been a very positive experience.
How much data must we get to the Memfault platform to make it really useful for crash analysis? I would assume at least partial stack data for the faulting task, perhaps the restart count, etc.
Any hints for implementing at least some Memfault functionality given the limited data transfer allowed in an LPWAN environment such as LoRaWAN?
Thanks
We've written about how to minimize bandwidth usage with Memfault in this document - it should answer many of your questions! https://docs.memfault.com/docs/best_practices/low-bandwidth-devices
Thanks Tyler
What are some techniques that you use to extract the core dump from your system? It seems to me that the system may have a difficult time recording this information at the very time that you need it most (e.g. inside the assert call). Do you still trust the driver that controls the non-volatile memory at this time? Do you try saving to the ROM somewhere?
The coredump is typically saved in noinit RAM or directly to external flash with a very minimal driver. I've personally used these methods to collect coredumps from hard faults, memory faults, memory corruption, asserts, and anything else, and I would say it works 99% of the time.
You can find a bit more documentation about how Memfault recommends saving coredumps to flash at https://docs.memfault.com/docs/mcu/coredumps/. I know Zephyr has a coredump storage that it uses as well, but I can't speak to its reliability.
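To illustrate the noinit RAM approach, here is a minimal sketch. It is not Memfault's or Zephyr's implementation. The `NOINIT` macro, the `.noinit` section name (which must match your linker script), the `mini_coredump_t` layout, and the function names are all assumptions made for this example. A real coredump would also capture stack and other RAM regions and use a stronger check such as a CRC.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the linker script provides a ".noinit" section that the
 * startup code does not zero. On other toolchains or on a host build the
 * macro expands to nothing so the sketch still compiles. */
#if defined(__GNUC__) && defined(__arm__)
#define NOINIT __attribute__((section(".noinit")))
#else
#define NOINIT
#endif

#define COREDUMP_MAGIC 0x434F5245u /* ASCII "CORE" */

/* Hypothetical register-only coredump record. */
typedef struct {
    uint32_t magic;    /* marks the record as written */
    uint32_t pc;       /* program counter at the fault */
    uint32_t lr;       /* link register at the fault */
    uint32_t sp;       /* stack pointer at the fault */
    uint32_t checksum; /* simple XOR check over the fields above */
} mini_coredump_t;

static mini_coredump_t s_coredump NOINIT;

static uint32_t coredump_checksum(const mini_coredump_t *cd) {
    return cd->magic ^ cd->pc ^ cd->lr ^ cd->sp;
}

/* Call from the fault or assert handler. It only writes to RAM, so it does
 * not depend on any driver, heap, or RTOS service still working. */
static void coredump_save(uint32_t pc, uint32_t lr, uint32_t sp) {
    s_coredump.pc = pc;
    s_coredump.lr = lr;
    s_coredump.sp = sp;
    s_coredump.magic = COREDUMP_MAGIC;
    s_coredump.checksum = coredump_checksum(&s_coredump);
}

/* Call early after reboot. Returns true and copies the record out if a
 * valid one survived the reset, then invalidates it so it is reported once. */
static bool coredump_take(mini_coredump_t *out) {
    if (s_coredump.magic != COREDUMP_MAGIC ||
        s_coredump.checksum != coredump_checksum(&s_coredump)) {
        return false;
    }
    *out = s_coredump;
    s_coredump.magic = 0;
    return true;
}
```

The magic value and checksum are there because noinit RAM holds arbitrary contents after a cold power-up, so the boot code needs a way to tell a real record from noise.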
In terms of extracting the coredump from the system - it typically is chunked up into smaller packets and sent over BLE, Wi-Fi, LTE, or another protocol to a gateway and then sent somewhere as a whole entity, or the chunks are forwarded directly to a server to be reassembled. Memfault's strategy of chunking data is detailed at https://docs.memfault.com/docs/mcu/data-from-firmware-to-the-cloud.
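The general chunk-and-reassemble idea can be sketched as follows. This is a generic illustration, not Memfault's chunk format or SDK API. The `chunk_t` layout, the 16-byte payload size, and the function names are assumptions chosen for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_PAYLOAD_MAX 16u /* assumed transport payload limit */

/* Hypothetical chunk: the offset lets the receiver place data correctly
 * even if chunks arrive out of order or are retransmitted. */
typedef struct {
    uint32_t offset; /* byte offset of this payload within the coredump */
    uint16_t len;    /* number of valid payload bytes */
    uint8_t payload[CHUNK_PAYLOAD_MAX];
} chunk_t;

/* Device-side state: which part of the coredump has been sent so far. */
typedef struct {
    const uint8_t *data;
    size_t total;
    size_t sent;
} chunker_t;

/* Fills `out` with the next chunk. Returns false when nothing is left. */
static bool chunker_next(chunker_t *c, chunk_t *out) {
    if (c->sent >= c->total) {
        return false;
    }
    size_t remaining = c->total - c->sent;
    size_t n = remaining < CHUNK_PAYLOAD_MAX ? remaining : CHUNK_PAYLOAD_MAX;
    out->offset = (uint32_t)c->sent;
    out->len = (uint16_t)n;
    memcpy(out->payload, c->data + c->sent, n);
    c->sent += n;
    return true;
}

/* Gateway or server side: copy one chunk into the reassembly buffer.
 * Returns false if the chunk is malformed or does not fit. */
static bool reassemble(uint8_t *dst, size_t dst_size, const chunk_t *in) {
    if (in->len > CHUNK_PAYLOAD_MAX) {
        return false;
    }
    if (in->offset > dst_size || in->len > dst_size - in->offset) {
        return false;
    }
    memcpy(dst + in->offset, in->payload, in->len);
    return true;
}
```

Whether reassembly happens on the gateway or on the server, carrying the offset in every chunk keeps the device side stateless apart from a single "bytes sent" counter.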
Further information about the commercial solution (so I understood from Q&A) can be found at https://docs.memfault.com/
What I'm still curious about: Is there an official site for the coredump component, besides the code in ESP-IDF, Zephyr, and so on?
Absolutely. The Memfault documentation for coredumps can be found here: https://docs.memfault.com/docs/mcu/coredumps
Also, Memfault's firmware SDK is open-source, and can be found at: https://github.com/memfault/memfault-firmware-sdk
How do you debug an IoT-based solution where the modem stack has been developed by a vendor while the application code is under the company's supervision? I mean, how do you zero in on the RCA when the device is not up for some uncertain reason (maybe hardware or software)?
That's a tough one! You definitely need to have access to the source code and be able to modify it and get back debugging information.
Further information about the commercial solution (so I understood from Q&A) can be found at https://docs.memfault.com/
What I'm still curious about: Is there an official site for the coredump component, besides the code in ESP-IDF, Zephyr, and so on?
For sure. You can find the coredump code for our SDK online in GitHub at https://github.com/memfault/memfault-firmware-sdk. In our documentation, you can find it under MCU -> Subsystem Guides -> Coredumps. https://docs.memfault.com/docs/mcu/coredumps
Looking at the Memfault website: it advertises "free try" all over the place, but states nothing about subsequent pricing or models. Not even a hint of a pointer, or did I miss it?
Could you provide some insight, examples, anything ?
Thanks for sharing all of this Tyler! I really liked the idea of bundling asserts with other functions using macros. Going to see about taking advantage of that on some of the projects I'm working on now...
Awesome! It's definitely a good way to get everyone on the team to use "best-practices" without trying to encourage that during code reviews.
The other side benefit of using the same wrappers is that you can slowly build up the debugging information or tooling to decode these issues. For example, if malloc_assert fails, maybe logging some stats about the heap if the assert is triggered could be super useful! And you don't have to add that hook all throughout the code-base, just in a single function and then use that function everywhere.
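As a rough sketch of that idea, a malloc_assert wrapper could look like the code below. Only the name `malloc_assert` comes from the discussion. The `heap_stats_t` structure, the `assert_failed` hook, and the use of `abort()` are assumptions for illustration. On a real target the hook would typically capture a coredump and reboot.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical heap statistics, maintained by the wrapper itself. */
typedef struct {
    size_t alloc_count;     /* successful allocations so far */
    size_t bytes_requested; /* total bytes requested so far */
    size_t largest_request; /* largest single successful request */
} heap_stats_t;

static heap_stats_t g_heap_stats;

/* Single place where heap diagnostics are emitted on failure. */
static void assert_failed(const char *file, int line, size_t requested) {
    fprintf(stderr, "malloc_assert failed at %s:%d (requested %zu bytes)\n",
            file, line, requested);
    fprintf(stderr, "heap stats: %zu allocs, %zu bytes total, largest %zu\n",
            g_heap_stats.alloc_count, g_heap_stats.bytes_requested,
            g_heap_stats.largest_request);
    abort(); /* assumption: stand-in for "save coredump and reset" */
}

static void *malloc_assert_impl(size_t size, const char *file, int line) {
    void *ptr = malloc(size);
    if (ptr == NULL) {
        assert_failed(file, line, size);
    }
    g_heap_stats.alloc_count++;
    g_heap_stats.bytes_requested += size;
    if (size > g_heap_stats.largest_request) {
        g_heap_stats.largest_request = size;
    }
    return ptr;
}

/* Callers use this macro so the failing call site is recorded. */
#define malloc_assert(size) malloc_assert_impl((size), __FILE__, __LINE__)
```

Because every caller goes through the one macro, richer diagnostics can be added to `assert_failed` later without touching any call site.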
Really interesting, with some good ideas. How could your assert scenario be extended so that asserts are only applied to non-safety-critical code? I.e., if I had a web server linked to a motor controller, I don't want a web page fault causing the controller to crash, but I still want to capture the fault.
Good idea here to not let a web page crash the controller. We wanted to protect against the same thing at Pebble - a third party application shouldn't crash the main firmware.
In cases like this, I typically use an ASSERT_LOG() function that at least logs out the line of code (PC/LR) and maybe an argument or two. The thing to be careful with here is that only the places where a full assert shouldn't take place should use this type of assert log. In most cases, the full assert should be used to raise errors/issues immediately and loudly to developers.
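A non-fatal assert of this kind might be sketched as below. Only the name `ASSERT_LOG` comes from the answer. The two-argument form, the `assert_log_record` helper, the `GET_LR` macro, and the `handle_request` example are assumptions. `__builtin_return_address(0)` is a GCC/Clang builtin that returns the caller's return address, which plays the role of LR here.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Return address of the enclosing function (the LR equivalent).
 * Falls back to 0 on compilers without the builtin. */
#if defined(__GNUC__)
#define GET_LR() ((uintptr_t)__builtin_return_address(0))
#else
#define GET_LR() ((uintptr_t)0)
#endif

static unsigned g_assert_log_count; /* how many soft asserts have fired */

/* Records the failure and returns, so execution continues. On a device
 * this could write to a log buffer that is uploaded later. */
static void assert_log_record(const char *file, int line, uintptr_t lr,
                              uint32_t arg) {
    g_assert_log_count++;
    fprintf(stderr, "ASSERT_LOG %s:%d lr=0x%" PRIxPTR " arg=%" PRIu32 "\n",
            file, line, lr, arg);
}

/* Non-fatal assert: log the location and one argument, then carry on. */
#define ASSERT_LOG(cond, arg)                                        \
    do {                                                             \
        if (!(cond)) {                                               \
            assert_log_record(__FILE__, __LINE__, GET_LR(),          \
                              (uint32_t)(arg));                      \
        }                                                            \
    } while (0)

/* Hypothetical web request handler that must not take the system down. */
static int handle_request(int content_length) {
    ASSERT_LOG(content_length >= 0, content_length);
    if (content_length < 0) {
        return -1; /* reject the request, keep everything else running */
    }
    return 0;
}
```

The caller still has to handle the error path itself after the log, which is one reason to keep this variant limited to the few places where a full assert is not acceptable.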
17:04:59 From apr to Everyone: Really interessting and enjoyable presentation, spawning wanna-have-cravings. Can you share some information about integration into typical systems?
17:06:53 From apr to Everyone: Also, maybe even more important, when looking at the memfault website: It advertises "free try" all over the place, but says exactly nothing about any prices. Not even a hint of a pointer. Could you provide some insight?
17:08:03 From Dhanashree.Vaidya to Everyone: Ability to recreate a problem is critical when you don't have access to the system having the problem. How do you go about it? How can we debug if we cannot recreate them?
17:08:38 From Andrey Vlasov to Everyone: What was the most impactful addition to the coredump feature-set?
17:10:13 From Simon Smith to Everyone: Could you describe how to create a coredump if the RTOS doesn't support them natively? I use uC/OS-III for instance. Can a custom coredump handler be called to save a file of register contents if an assert handler is called?
17:10:19 From doolittl to Everyone: Do you have any (additional) tips on convincing people of the greatness of asserts, especially in places where the there is a real bug, but code can recover from it. For example an assert would allow you to debug/fix quickly, but recovering would have less impact to an end user. How do you balance those two.
17:10:51 From apr to Everyone: Is there a connection with or some integration with DevAlert from Percepio?
17:11:11 From Gillian Minnehan (she/her) to Everyone: I have a question (I can come off mute)
17:12:51 From Andrey Vlasov to Everyone: Awesome. thanks!
17:14:27 From Chris to Everyone: If your device is not online all the time and the coredumps have to be stored locally in the meantime, how do you find power/comm issues or even page buffer bugs with your external flash?
17:14:39 From Simon Smith to Everyone: Thanks. +1 for asserts also.
17:15:10 From apr to Everyone: Is there an official site for coredump code and/or documentation?
17:15:45 From Leroy to Everyone: I'm implementing via LoRaWan, limited cloud bandwidth. Any thoughts on the minimum amount of data required to get something useful within the Memfault UI? I would assume at least a portion of the executing thread? Or am I limited to just a crash notification but no data?
17:22:58 From Rishav Vatsa to Everyone: How to debug on a IoT based solution where the modem stack has been developed by a vendor while the application code is under company’s supervision. I mean how to zero down on the RCA when the device is not up due to any uncertain reason.
17:25:36 From Gillian Minnehan (she/her) to Everyone: Thanks Tyler!
17:26:17 From Amy McNeil to Everyone: Thanks Tyler!
17:26:49 From Simon Smith to Everyone: much appreciated
17:26:59 From Leroy to Everyone: Interrupt is a great blog
17:27:04 From Raul Pando to Everyone: Great as always, thanks!
17:27:22 From Tyler Hoffman to Everyone: Thanks everyone for the great questions!
Hello,
thank you very much for a nice summary of using core dumps. I prefer defensive programming, as you mentioned in the presentation, and see the core dump as a last resort. In the case of a hard fault, it is the last option for solving an issue in the field. We have a big project based on an STM32F7 MCU with FreeRTOS, without preemptive multitasking. My headache is how to collect a core dump in the case of an endless loop or any kind of deadlock. We use the independent watchdog, which is refreshed in a task. It works well, but it gives no event before expiration. The window watchdog could be a way, since it raises an event before the restart, but it is too strict for us: refreshing it every 30-70 ms doesn't suit our application. Do you have any advice on how to collect a core dump in the case of an endless loop? In the worst case it can happen in an ISR. In a task it is bad design, but we can't depend on every developer doing his/her job 100% correctly.
Thanks