Essential Device and Firmware Metrics
Tyler Hoffman - EOC 2022 - Duration: 40:42
As embedded engineers, we love data. Our desks are littered with tools that help capture tons of data, such as oscilloscopes, logic analyzers, debuggers, tracers, and power meters. However, once a device (or thousands of them) leaves our desk and is shipped to customers, all of these tools are paperweights. It's now up to the devices to report issues back to the developers.
This is where metrics come in. Throughout my career, metrics have been the most powerful and simplest way to monitor thousands to millions of devices. This talk covers what metrics are, how to capture them, and explores the seemingly infinite metrics you can capture and creative uses to help you solve real-world, elusive device issues, such as power consumption, performance, and battery life issues.
Thanks for the talk, Tyler!
How would you get more insight into metrics during offline periods? On devices with connectivity issues, things are sometimes not clear enough if metrics are aggregated.
For example, in our application we collect and send metrics every 2 hours, and our metrics are monotonic (i.e. we only increment counters, do not reset them nor aggregate them).
If a device is offline for some days, the metric upload task simply fails and that snapshot of metrics is lost. So when the device reconnects, it continues reporting its metrics, and everything that happened during the offline period is summed up in the first report uploaded after reconnection. So we have no knowledge of how the metrics evolved during the offline period.
I guess we could store the reports with timestamps until the device is online again, or average the metrics during the offline period, but what are your thoughts?
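To make the first option in this question concrete, here is a minimal sketch of a fixed-size queue that keeps timestamped reports until they can be uploaded. All names (`report_queue_t`, `REPORT_QUEUE_DEPTH`, the example counters) are hypothetical, and the queue lives in RAM only; a real device would probably need to persist it to flash to survive reboots.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REPORT_QUEUE_DEPTH 4  /* hypothetical: number of unsent reports to keep */

typedef struct {
    uint32_t timestamp_s;      /* device time when the snapshot was taken */
    uint32_t reboot_count;     /* example counters; names are illustrative */
    uint32_t wifi_disconnects;
} metric_report_t;

typedef struct {
    metric_report_t slots[REPORT_QUEUE_DEPTH];
    size_t head;   /* index of the oldest stored report */
    size_t count;  /* number of stored reports */
} report_queue_t;

/* Store a report. When the queue is full, the oldest report is overwritten. */
void report_queue_push(report_queue_t *q, const metric_report_t *r) {
    assert(q != NULL && r != NULL);
    size_t tail = (q->head + q->count) % REPORT_QUEUE_DEPTH;
    q->slots[tail] = *r;
    if (q->count < REPORT_QUEUE_DEPTH) {
        q->count++;
    } else {
        q->head = (q->head + 1) % REPORT_QUEUE_DEPTH;  /* oldest was dropped */
    }
}

/* Copy the oldest report into *out without removing it. False when empty. */
bool report_queue_peek(const report_queue_t *q, metric_report_t *out) {
    assert(q != NULL && out != NULL);
    if (q->count == 0) {
        return false;
    }
    *out = q->slots[q->head];
    return true;
}

/* Remove the oldest report; call this only after its upload succeeded. */
void report_queue_pop(report_queue_t *q) {
    assert(q != NULL);
    if (q->count > 0) {
        q->head = (q->head + 1) % REPORT_QUEUE_DEPTH;
        q->count--;
    }
}
```

On reconnect, the upload task would peek, send, and pop in a loop, stopping at the first failed send so that unsent reports stay queued.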
Great presentation as always. Extremely practical information and helpful for any engineering team.
Tyler,
Thank you for the presentation. You gave me some good food for thought.
You mention the Firmware Metrics Library being very simple and scattered everywhere. You also mentioned RTOSs at various points during the video. Pending on a mutex in order to store log data in a central location could be a very heavy call. It also presents a priority inversion risk if we're using a single mutex.
Some ideas to work around this are:
- Use a message queue type of kernel service if the RTOS offers it to post metrics to a metric receiver task
- Each task maintains its own metrics buffer and only obtains a mutex to sync metrics to a central location periodically
- Have a metric scraper task run and pull all of the individual task metric buffers. This task would run at the highest priority, or, if it runs at the lowest priority, lock the scheduler or use a critical section in order to avoid the use of mutexes. In either case, each task writing the metrics would need to use a critical section when performing the writes.
Is there a better approach than those, or one among those you've used personally and prefer?
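As an illustration of the scraper idea from the list above, here is a rough sketch in which each task writes to its own buffer and a single scraper drains those buffers into a central store. The `METRICS_ENTER_CRITICAL`/`METRICS_EXIT_CRITICAL` macros are placeholders, not a real RTOS API; they are no-ops here so the sketch builds on a host and would need to be mapped to your kernel's critical section primitives. The task and metric names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholders (assumption): map these to your RTOS critical section calls. */
#define METRICS_ENTER_CRITICAL() ((void)0)
#define METRICS_EXIT_CRITICAL()  ((void)0)

enum { METRIC_I2C_ERRORS, METRIC_FLASH_WRITES, METRIC_COUNT };
enum { TASK_SENSOR, TASK_STORAGE, TASK_COUNT };

/* One small buffer per task, so tasks never contend on a shared mutex. */
static uint32_t s_task_metrics[TASK_COUNT][METRIC_COUNT];

/* Central store; only the scraper task writes to it. */
static uint32_t s_central[METRIC_COUNT];

/* Called from any task: a short critical section around one addition. */
void metrics_add(int task, int metric, uint32_t amount) {
    assert(task >= 0 && task < TASK_COUNT);
    assert(metric >= 0 && metric < METRIC_COUNT);
    METRICS_ENTER_CRITICAL();
    s_task_metrics[task][metric] += amount;
    METRICS_EXIT_CRITICAL();
}

/* Called from the scraper task: drain each per-task value into the
 * central store, holding the critical section only per value. */
void metrics_scrape(void) {
    for (int t = 0; t < TASK_COUNT; t++) {
        for (int m = 0; m < METRIC_COUNT; m++) {
            METRICS_ENTER_CRITICAL();
            uint32_t value = s_task_metrics[t][m];
            s_task_metrics[t][m] = 0;
            METRICS_EXIT_CRITICAL();
            s_central[m] += value;
        }
    }
}

/* Read a central value; intended to be called from the scraper's context. */
uint32_t metrics_central_get(int metric) {
    assert(metric >= 0 && metric < METRIC_COUNT);
    return s_central[metric];
}
```

The trade-off in this sketch is extra RAM (one buffer per task) in exchange for very short critical sections and no mutex on the write path.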
How do you recommend gathering metrics on an embedded Linux system with a lot of processes? We're using systemd/journald for logging, which handles gathering and transmission for us (by virtue of rsyslog or syslog-ng). Are there similar off-the-shelf services we could employ, or would we need to roll our own system using named pipes or similar to feed a central uploader daemon?
Good morning Tyler, thanks for a great presentation. Do you have any thoughts about collecting data from a product once it is released to customers and how this might affect privacy issues?
Add the metrics library, and start collecting data! As far as privacy issues, it all depends on what you are collecting. If you are collecting information about task utilization, flash write statistics, stability, battery life, connectivity connection durations, etc. then there really shouldn't be an issue on privacy. If you need to track information about the environment the device is operating within, like SSID name, obfuscate it. Capture the last few characters or hash it.
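Both obfuscation options mentioned above (hashing, or keeping only the last few characters) can be sketched in a few lines. The function names are hypothetical. The hash shown is 32-bit FNV-1a, chosen only because it is tiny; it is not a cryptographic hash, so a determined party could brute-force common SSIDs. If that matters for your product, a salted cryptographic hash would be the safer assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 32-bit FNV-1a over a NUL-terminated string. Not cryptographic. */
uint32_t ssid_hash(const char *ssid) {
    assert(ssid != NULL);
    uint32_t hash = 2166136261u;           /* FNV offset basis */
    for (const unsigned char *p = (const unsigned char *)ssid; *p; p++) {
        hash ^= *p;
        hash *= 16777619u;                 /* FNV prime */
    }
    return hash;
}

/* Copy at most the last `keep` characters of `ssid` into `out`,
 * truncating to fit `out_len` and always NUL-terminating. */
void ssid_suffix(const char *ssid, size_t keep, char *out, size_t out_len) {
    assert(ssid != NULL && out != NULL && out_len > 0);
    size_t len = strlen(ssid);
    size_t n = (len < keep) ? len : keep;
    if (n >= out_len) {
        n = out_len - 1;
    }
    memcpy(out, ssid + (len - n), n);
    out[n] = '\0';
}
```

The hash lets you tell whether two reports came from the same network without storing its name; the suffix is easier to read during debugging but reveals part of the name.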
That's unfortunately exactly what I'm saying not to do. It makes collecting metrics much more difficult, especially during offline periods or between reboots. My recommendation is to change how the metrics are recorded on the device.
You can read more on why here: https://interrupt.memfault.com/blog/device-heartbeat-metrics
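As a rough illustration of recording metrics differently, assuming the recommendation is to count per fixed interval and reset at each interval boundary rather than keep ever-growing counters, a sketch could look like the following. The names (`heartbeat_t`, `heartbeat_count`, `heartbeat_collect`) are made up for this example and are not taken from any particular library; locking is omitted for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { HB_WIFI_DISCONNECTS, HB_BYTES_SENT, HB_METRIC_COUNT };

/* One interval's worth of metrics. */
typedef struct {
    uint32_t values[HB_METRIC_COUNT];
} heartbeat_t;

/* Metrics accumulated since the last interval boundary. */
static heartbeat_t s_current;

/* Record activity during the current interval. */
void heartbeat_count(int metric, uint32_t amount) {
    assert(metric >= 0 && metric < HB_METRIC_COUNT);
    s_current.values[metric] += amount;
}

/* At the end of each interval: return a snapshot and start from zero,
 * so every report describes exactly one interval. */
heartbeat_t heartbeat_collect(void) {
    heartbeat_t snapshot = s_current;
    memset(&s_current, 0, sizeof s_current);
    return snapshot;
}
```

Because each snapshot covers only its own interval, snapshots taken while offline can be stored and uploaded later without the offline period collapsing into a single summed report.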