Live Q&A - Debugging Embedded Devices at Scale: Effective Techniques for Diagnosis and Resolution

Tyler Hoffman - Memfault - Watch Now - EOC 2023 - Duration: 24:47

Live Q&A with Tyler Hoffman for the theatre talk titled Debugging Embedded Devices at Scale: Effective Techniques for Diagnosis and Resolution

PetrKriz
Score: 0 | 2 years ago | no reply

Hello,
Thank you very much for a nice summary of using core dumps. I prefer defensive programming, as you mentioned in the presentation, and see a core dump as the last resort. In the case of a hard fault, it is the last option for solving the issue in the field. We have a big project now based on an STM32F7 MCU with FreeRTOS and without preemptive multitasking. My headache is how to collect a core dump in the case of an endless loop or any kind of deadlock. We are using the independent watchdog, which is serviced in a task. It works well, but it gives no event before expiration. The window watchdog could be a way, since it raises an event before the restart, but it is too strict for us: servicing it every 30-70 ms doesn't suit our application. Do you have any advice for us on how to collect a core dump in the case of an endless loop? In the worst case it can happen in an ISR. In a task it is bad design, but we can't depend on each developer doing his/her job 100%.
Thanks

LeroyLe
Score: 0 | 2 years ago | 1 reply

Thanks for the talk.
We are implementing remote sensors via a FreeRTOS-aware endpoint board using LoRaWAN for the backhaul to a cloud UI.
We have been experimenting with Memfault, thus far uploading data via a debug console dump and posting it to the UI site. So far it's been a very positive experience.
How much data must we get to the Memfault platform to really make it useful for crash analysis? I would assume at least partial stack data for the faulting task, perhaps a restart count, etc.
Any hints for implementing at least some Memfault functionality given the limited data transfer allowed within an LPWAN environment such as LoRaWAN?
Thanks

tylerSpeaker
Score: 0 | 2 years ago | 1 reply

We've written about how to minimize bandwidth usage with Memfault in this document - it should answer many of your questions! https://docs.memfault.com/docs/best_practices/low-bandwidth-devices

LeroyLe
Score: 0 | 2 years ago | no reply

Thanks Tyler

Mike.Foss
Score: 0 | 3 years ago | 1 reply

What are some techniques that you use to extract the core dump from your system? It seems to me that the system may have a difficult time recording this information at the very time that you need it most (e.g. inside the assert call). Do you still trust the driver that controls the non-volatile memory at this time? Do you try saving to the ROM somewhere?

tylerSpeaker
Score: 1 | 3 years ago | no reply

The coredump is typically saved in noinit RAM or directly to external flash with a very minimal driver. I've personally used these methods to collect coredumps from hard faults, memory faults, memory corruption, asserts, and anything else, and I would say it works 99% of the time.
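
To make the noinit RAM approach concrete, here is a minimal sketch, not Memfault's implementation. It assumes the dump lives in a RAM region that startup code does not zero, so it survives a reset, and that a magic value plus a checksum tell valid dumps apart from power-on garbage. The struct layout, `COREDUMP_MAGIC`, and all function names are hypothetical. The `.noinit` section placement is toolchain-specific and only shown as a comment so the sketch also builds on a host.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define COREDUMP_MAGIC 0x45524F43u /* hypothetical marker value */

typedef struct {
    uint32_t magic;      /* COREDUMP_MAGIC when a dump has been captured */
    uint32_t pc;         /* program counter at the fault */
    uint32_t lr;         /* link register at the fault */
    uint32_t sp;         /* stack pointer at the fault */
    uint32_t stack_len;  /* number of valid bytes in stack[] */
    uint8_t  stack[256]; /* copy of the top of the faulting stack */
    uint32_t checksum;   /* covers every field above */
} coredump_t;

/* Assumption: on target this would be placed in a region the startup code
 * leaves untouched, e.g. __attribute__((section(".noinit"))) together with
 * a matching linker script entry. */
static coredump_t g_coredump;

/* Simple rotate-and-xor checksum over everything before the checksum field. */
static uint32_t coredump_checksum(const coredump_t *cd)
{
    const uint8_t *p = (const uint8_t *)cd;
    uint32_t sum = 0;
    for (size_t i = 0; i < offsetof(coredump_t, checksum); i++) {
        sum = ((sum << 1) | (sum >> 31)) ^ p[i];
    }
    return sum;
}

/* Called from the fault or assert handler: no heap, no RTOS calls. */
void coredump_capture(uint32_t pc, uint32_t lr, uint32_t sp,
                      const void *stack, size_t len)
{
    memset(&g_coredump, 0, sizeof g_coredump);
    if (len > sizeof g_coredump.stack) {
        len = sizeof g_coredump.stack;
    }
    g_coredump.magic = COREDUMP_MAGIC;
    g_coredump.pc = pc;
    g_coredump.lr = lr;
    g_coredump.sp = sp;
    g_coredump.stack_len = (uint32_t)len;
    if (len > 0) {
        memcpy(g_coredump.stack, stack, len);
    }
    g_coredump.checksum = coredump_checksum(&g_coredump);
}

/* Called early after boot: returns the dump if one survived the reset. */
const coredump_t *coredump_get_valid(void)
{
    if (g_coredump.magic != COREDUMP_MAGIC) {
        return NULL;
    }
    if (g_coredump.checksum != coredump_checksum(&g_coredump)) {
        return NULL;
    }
    return &g_coredump;
}

/* Called once the dump has been uploaded or copied elsewhere. */
void coredump_clear(void)
{
    g_coredump.magic = 0;
}
```

The checksum matters because noinit RAM holds arbitrary contents after a cold power-on, and a magic value alone could match by accident or survive a partially written dump.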

You can find a bit more documentation about how Memfault recommends saving coredumps to flash at https://docs.memfault.com/docs/mcu/coredumps/. I know Zephyr has a coredump storage that it uses as well, but I can't speak to its reliability.

In terms of extracting the coredump from the system - it is typically chunked up into smaller packets and sent over BLE, Wi-Fi, LTE, or another protocol to a gateway and then sent on as a whole entity, or the chunks are forwarded directly to a server to be reassembled. Memfault's strategy of chunking data is detailed at https://docs.memfault.com/docs/mcu/data-from-firmware-to-the-cloud.
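
The chunk-and-reassemble idea can be sketched as follows. This is not Memfault's chunk protocol (see the linked documentation for that). It is a simplified illustration that assumes each packet carries a 4-byte little-endian offset followed by payload, so the receiver can place chunks regardless of arrival order. `chunk_next`, `chunk_reassemble`, and `CHUNK_HEADER_SIZE` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_HEADER_SIZE 4u /* hypothetical: 4-byte little-endian offset */

/* Device side: write the next packet (header + payload) into out, which must
 * hold at least mtu bytes. Returns the packet length, or 0 when nothing is
 * left to send. *offset tracks progress through the data. */
size_t chunk_next(const uint8_t *data, size_t total_len, size_t *offset,
                  uint8_t *out, size_t mtu)
{
    if (mtu <= CHUNK_HEADER_SIZE || *offset >= total_len) {
        return 0;
    }
    size_t payload = mtu - CHUNK_HEADER_SIZE;
    if (payload > total_len - *offset) {
        payload = total_len - *offset;
    }
    uint32_t off = (uint32_t)*offset;
    out[0] = (uint8_t)(off & 0xFFu);
    out[1] = (uint8_t)((off >> 8) & 0xFFu);
    out[2] = (uint8_t)((off >> 16) & 0xFFu);
    out[3] = (uint8_t)((off >> 24) & 0xFFu);
    memcpy(out + CHUNK_HEADER_SIZE, data + *offset, payload);
    *offset += payload;
    return CHUNK_HEADER_SIZE + payload;
}

/* Gateway or server side: copy one packet's payload into dest at the offset
 * named in its header. Returns the payload bytes stored, or 0 if the packet
 * is malformed or would overflow dest. */
size_t chunk_reassemble(uint8_t *dest, size_t dest_size,
                        const uint8_t *chunk, size_t chunk_len)
{
    if (chunk_len <= CHUNK_HEADER_SIZE) {
        return 0;
    }
    size_t off = (size_t)((uint32_t)chunk[0] |
                          ((uint32_t)chunk[1] << 8) |
                          ((uint32_t)chunk[2] << 16) |
                          ((uint32_t)chunk[3] << 24));
    size_t payload = chunk_len - CHUNK_HEADER_SIZE;
    if (off > dest_size || payload > dest_size - off) {
        return 0;
    }
    memcpy(dest + off, chunk + CHUNK_HEADER_SIZE, payload);
    return payload;
}
```

The device would call `chunk_next` in a loop with the transport's MTU until it returns 0, handing each packet to the BLE, Wi-Fi, or LTE send routine. A real protocol would also need to detect missing chunks and signal the end of a message, which this sketch leaves out.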

Andreas.Pretzsch.CNE
Score: 0 | 3 years ago | 1 reply

Further information about the commercial solution (so I understood from Q&A) can be found at https://docs.memfault.com/
What I'm still curious about: Is there an official site for the coredump component, besides probably the code in ESP-IDF, Zephyr, and so on?

tylerSpeaker
Score: 0 | 3 years ago | no reply

Absolutely. The Memfault documentation for coredumps can be found here: https://docs.memfault.com/docs/mcu/coredumps
Also, Memfault's firmware SDK is open-source, and can be found at: https://github.com/memfault/memfault-firmware-sdk

rvatsa
Score: 0 | 3 years ago | 1 reply

How do you debug an IoT-based solution where the modem stack has been developed by a vendor while the application code is under the company’s supervision? I mean, how do you zero in on the root cause (RCA) when the device is not up for some unknown reason (maybe hardware or software)?

tylerSpeaker
Score: 0 | 3 years ago | no reply

That's a tough one! You definitely need to have access to the source code and be able to modify it and get back debugging information.

Andreas.Pretzsch.CNE
Score: 0 | 3 years ago | 1 reply

Further information about the commercial solution (so I understood from Q&A) can be found at https://docs.memfault.com/
What I'm still curious about: Is there an official site for the coredump component, besides probably the code in ESP-IDF, Zephyr, and so on?

tylerSpeaker
Score: 0 | 3 years ago | no reply

For sure. You can find the coredump code for our SDK online in GitHub at https://github.com/memfault/memfault-firmware-sdk. In our documentation, you can find it under MCU -> Subsystem Guides -> Coredumps. https://docs.memfault.com/docs/mcu/coredumps

Andreas.Pretzsch.CNE
Score: 0 | 3 years ago | no reply

When looking at the Memfault website: It advertises "free try" all over the place, but states nothing about subsequent pricing or models. Not even a hint of a pointer, or did I miss it?
Could you provide some insight, examples, anything ?

Chase.Weimer
Score: 0 | 3 years ago | 1 reply

Thanks for sharing all of this, Tyler! I really liked the idea of bundling asserts with other functions using macros. Going to see about taking advantage of that on some of the projects I'm working on now...

tylerSpeaker
Score: 1 | 3 years ago | no reply

Awesome! It's definitely a good way to get everyone on the team to use "best-practices" without trying to encourage that during code reviews.

The other side benefit of using the same wrappers is that you can slowly build up the debugging information or tooling to decode these issues. For example, if malloc_assert fails, logging some stats about the heap at that moment could be super useful! And you don't have to add that hook all throughout the code-base, just in a single function, and then use that function everywhere.
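
A minimal sketch of such a wrapper might look like the following. The name `malloc_assert` comes from the discussion above. The heap counters, `malloc_assert_impl`, and `app_assert_fail` are hypothetical, and on a real target the fatal handler would capture a coredump and reboot rather than call `abort()`.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical counters kept by the wrapper, so they are available the
 * moment an allocation fails. */
typedef struct {
    size_t alloc_count;     /* successful allocations so far */
    size_t bytes_requested; /* total bytes handed out so far */
} heap_stats_t;

static heap_stats_t g_heap_stats;

/* Hypothetical fatal handler. Assumption: a target build would save a
 * coredump and reset here instead of aborting. */
static void app_assert_fail(const char *file, int line, const char *msg)
{
    fprintf(stderr, "ASSERT %s:%d: %s\n", file, line, msg);
    abort();
}

/* The single place where allocation failures are handled. Extra diagnostics
 * added here apply to every caller automatically. */
static void *malloc_assert_impl(size_t size, const char *file, int line)
{
    void *p = malloc(size);
    if (p == NULL) {
        fprintf(stderr, "heap: allocs=%zu bytes_requested=%zu failed_size=%zu\n",
                g_heap_stats.alloc_count, g_heap_stats.bytes_requested, size);
        app_assert_fail(file, line, "malloc failed");
    }
    g_heap_stats.alloc_count++;
    g_heap_stats.bytes_requested += size;
    return p;
}

/* Callers use this instead of malloc() and never check for NULL themselves. */
#define malloc_assert(size) malloc_assert_impl((size), __FILE__, __LINE__)
```

Because every call site goes through the macro, the file and line of the failing allocation are recorded without the caller writing any extra code.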

Hammarbytp
Score: 0 | 3 years ago | 1 reply

Really interesting and some good ideas. How could your assert scenario be extended so that asserts are only applied to non-safety-critical code? I.e., if I had a web server linked to a motor controller, I don't want a web page fault causing the controller to crash, but I still want to capture the fault.

tylerSpeaker
Score: 0 | 3 years ago | no reply

Good idea here to not let a web page crash the controller. We wanted to guard against the same thing at Pebble - a third-party application shouldn't crash the main firmware.

In cases like this, I typically use an ASSERT_LOG() function that at least logs out the line of code (PC/LR) and maybe an argument or two. The thing to be careful with here is that only the places where a full assert shouldn't take place should use this type of assert log. In most cases, the full assert should be used to raise errors/issues immediately and loudly to developers.
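
One way to sketch such a non-fatal assert is shown below. `ASSERT_LOG` is the name used above, but its signature, the ring buffer, and `assert_log_record` are hypothetical. The sketch assumes GCC or Clang, since it relies on `__builtin_return_address(0)`: evaluated at the call site it yields the caller's return address (LR), and evaluated inside the non-inlined recorder it yields an address just past the call, which approximates the PC of the failing check.

```c
#include <stddef.h>
#include <stdint.h>

/* One recorded failure: where it happened plus one caller-supplied value. */
typedef struct {
    const void *pc; /* address just after the failing check */
    const void *lr; /* return address of the function containing the check */
    uint32_t arg;   /* value supplied by the caller */
} assert_log_entry_t;

#define ASSERT_LOG_DEPTH 8u /* hypothetical ring buffer size */

static assert_log_entry_t g_assert_log[ASSERT_LOG_DEPTH];
static size_t g_assert_log_count; /* total failures recorded so far */

/* Must not be inlined, otherwise the return address would no longer point
 * at the call site of the failing check. */
static __attribute__((noinline)) void assert_log_record(const void *lr,
                                                         uint32_t arg)
{
    assert_log_entry_t *e = &g_assert_log[g_assert_log_count % ASSERT_LOG_DEPTH];
    e->pc = __builtin_return_address(0);
    e->lr = lr;
    e->arg = arg;
    g_assert_log_count++;
}

/* Records the failure and keeps running, unlike a full assert. */
#define ASSERT_LOG(cond, arg)                                        \
    do {                                                             \
        if (!(cond)) {                                               \
            assert_log_record(__builtin_return_address(0),           \
                              (uint32_t)(arg));                      \
        }                                                            \
    } while (0)
```

The recorded entries could later be shipped with regular telemetry, and the addresses resolved to file and line with the firmware's symbol file.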

17:04:59 From  apr  to  Everyone:
	Really interesting and enjoyable presentation, spawning wanna-have-cravings.
	Can you share some information about integration into typical systems?
17:06:53 From  apr  to  Everyone:
	Also, maybe even more important, when looking at the memfault website: It advertises "free try" all over the place, but says exactly nothing about any prices. Not even a hint of a pointer. Could you provide some insight?
17:08:03 From  Dhanashree.Vaidya  to  Everyone:
	Ability to recreate a problem is critical when you don't have access to the system having the problem. How do you go about it? How can we debug if we cannot recreate them?
17:08:38 From  Andrey Vlasov  to  Everyone:
	What was the most impactful addition to the coredump feature-set?
17:10:13 From  Simon Smith  to  Everyone:
	Could you describe how to create a coredump if the RTOS doesn't support them natively?  I use uC/OS-III for instance.  Can a custom coredump handler be called to save a file of register contents if an assert handler is called?
17:10:19 From  doolittl  to  Everyone:
	Do you have any (additional) tips on convincing people of the greatness of asserts, especially in places where there is a real bug, but the code can recover from it? For example, an assert would allow you to debug/fix quickly, but recovering would have less impact on an end user. How do you balance those two?
17:10:51 From  apr  to  Everyone:
	Is there a connection with or some integration with DevAlert from Percepio?
17:11:11 From  Gillian Minnehan (she/her)  to  Everyone:
	I have a question (I can come off mute)
17:12:51 From  Andrey Vlasov  to  Everyone:
	Awesome. thanks!
17:14:27 From  Chris  to  Everyone:
	If your device is not online all the time and the coredumps have to be stored locally in the meantime, how do you find power/comm issues or even page buffer bugs with your external flash?
17:14:39 From  Simon Smith  to  Everyone:
	Thanks.  +1 for asserts also.
17:15:10 From  apr  to  Everyone:
	Is there an official site for coredump code and/or documentation?
17:15:45 From  Leroy  to  Everyone:
	I'm implementing via LoRaWan, limited cloud bandwidth.
	 Any thoughts on the minimum amount of data required to get something useful within the Memfault UI? 
	I would assume at least a portion of the executing thread?
	Or am I limited to just a crash notification but no data?
17:22:58 From  Rishav Vatsa  to  Everyone:
	How do you debug an IoT-based solution where the modem stack has been developed by a vendor while the application code is under the company’s supervision? I mean, how do you zero in on the root cause (RCA) when the device is not up for some unknown reason?
17:25:36 From  Gillian Minnehan (she/her)  to  Everyone:
	Thanks Tyler!
17:26:17 From  Amy McNeil  to  Everyone:
	Thanks Tyler!
17:26:49 From  Simon Smith  to  Everyone:
	much appreciated
17:26:59 From  Leroy  to  Everyone:
	Interrupt is a great blog
17:27:04 From  Raul Pando  to  Everyone:
	Great as always, thanks!
17:27:22 From  Tyler Hoffman  to  Everyone:
	Thanks everyone for the great questions!
