Demystifying Embedded: Techniques for Low-level Testing and Debugging

Gillian Minnehan - EOC 2023 - Duration: 31:08

Many embedded software engineers start their careers with the expected software testing and debugging skills such as writing unit tests, leveraging a visual debugger, finding memory leaks with valgrind, and, let’s face it, using print statements. However, embedded software development requires more embedded-specific strategies that many do not get the chance to learn in school. This talk shares the best techniques for understanding what a microcontroller is doing at a low level, such as reading fault registers, looking at disassembly, counting clock cycles, and using logic analyzers -- strategies used by newbies and veterans alike. Each strategy will be accompanied by a demonstration and all necessary instructions so listeners can easily apply what they have learned to their work.

Ziemowit
Score: 0 | 3 weeks ago | no reply

Great presentation with a ton of useful examples.

arjunvinod
Score: 0 | 1 year ago | no reply

Great presentation, Gillian! One interesting bug I recently encountered was a stack corruption issue. It was caused by the bootloader copying the application image and overwriting the bootloader stack. I spent many hours racking my brain before figuring out that the issue was in the linker scripts. Do you have any thoughts or tips for catching such issues more quickly?

easyed
Score: 0 | 2 years ago | 1 reply

Incredibly useful and thorough presentation. Thanks!

GillianMSpeaker
Score: 0 | 2 years ago | no reply

Thank you for the comment! Glad you enjoyed the talk.

srikrishnachaitanya
Score: 0 | 2 years ago | 1 reply

Great talk, Gillian!
The section about the fault handlers was particularly fascinating.
Could I trouble you to point me to where I can get the SEGGER fault handler code you showed?
I'd like to experiment more with this approach. :-)

GillianMSpeaker
Score: 0 | 2 years ago | 1 reply

Srikrishna, glad you enjoyed it. You can find SEGGER's HardFault handler code at their wiki page here. You can see how I integrated it into my demo project on my GitHub here. Feel free to reach out via LinkedIn or email at gillian.minnehan@jhuapl.edu if you have any questions!

srikrishnachaitanya
Score: 0 | 2 years ago | no reply

Fantastic! :-)
Thank you very much Gillian!
I appreciate the example project very much.. It gives a nice starting point.

nathancharlesjones
Score: 3 | 2 years ago | 1 reply

Great rundown! In the section about toggling GPIO for execution information, you mentioned that one advantage is having finer granularity than that provided by the cycle count register. I understand that this is a result of the fact that, in using the cycle count register, the timing information is only represented as an integer multiple of the clock period. Don't both techniques suffer from the (possibly unknown) amount of time it takes the MCU to execute the instructions to either read from the cycle count register or to set the GPIO register, though? I had thought that it can take hundreds of ns to update a GPIO output, which might swamp any improvements in precision.

GillianMSpeaker
Score: 0 | 2 years ago | 1 reply

Nathan, good point. I think an improved statement is that both are reasonably accurate; while logic analyzers and oscilloscopes sample at a higher resolution than the cycle counter can count, there are multiple factors such that we can't definitively state one is more accurate than the other -- it depends on your setup. As you stated, without considering anything else, the logic analyzer should be more accurate; the Saleae is sampling at 500 MHz (2 ns per sample) vs the cycle counter counting at 48 MHz (20.8 ns per clock period). There is a small delay between the execution of the instruction that performs the GPIO toggle and the actual GPIO changing, since the output driver has a slew rate. That is a potential source of delay. There is also some load capacitance on the pin for the Logic Pro 8 (specs here) that I'm using (input impedance 2M Ohm with 10 pF) so the cap takes time to charge/discharge. And then yes, there is the time difference between the end of the execution of the last instruction in the code block under test and the end of the execution of the instruction that either toggles the GPIO or reads the value of the cycle count register. Plus, if we account for the Saleae and MCU clocks not being synced, then it is reasonable that the Saleae might take up to 1 sample clock before it sees the GPIO change (so 2 ns). I believe all of these factors are potentially contributing to the discrepancy shown in the demo where the toggle records a slightly longer execution time than the counter demo. Shoutout to my colleague Joe Trainor for help thinking through this one.
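The resolution figures quoted in this comment follow directly from the two clock frequencies. Below is a minimal sketch of that arithmetic in C, assuming a free-running 32-bit cycle counter. The helper names `tick_period_ps` and `elapsed_ns` are hypothetical and are not taken from the talk's demo project.

```c
#include <stdint.h>

/* Hypothetical helper: length of one clock tick in picoseconds.
 * Integer division truncates, so 48 MHz gives 20833 ps (about 20.8 ns). */
uint64_t tick_period_ps(uint64_t freq_hz)
{
    return 1000000000000ULL / freq_hz;
}

/* Hypothetical helper: elapsed time in nanoseconds between two readings
 * of a 32-bit cycle counter clocked at cpu_hz. The unsigned subtraction
 * still gives the right delta if the counter wrapped once in between. */
uint64_t elapsed_ns(uint32_t start, uint32_t end, uint64_t cpu_hz)
{
    uint32_t delta = end - start;
    return ((uint64_t)delta * 1000000000ULL) / cpu_hz;
}
```

With these helpers, `tick_period_ps(48000000)` gives 20833 ps and `tick_period_ps(500000000)` gives 2000 ps, so a cycle-counter measurement is quantized to roughly 20.8 ns steps and a 500 MHz logic analyzer capture to 2 ns steps. Neither number accounts for the instruction overhead, slew rate, or probe capacitance discussed above.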

nathancharlesjones
Score: 0 | 2 years ago | 1 reply

Thanks for that clarification!

GillianMSpeaker
Score: 0 | 2 years ago | no reply

Certainly. Thanks for the question!

SimonSmith
Score: 0 | 2 years ago | 1 reply

That's very impressive. A simple technique I use often during a debugging session is to add some temporary global bools or ints so that different code branches can be executed and values changed easily, without having to waste time rebuilding and reflashing. Another technique I've found useful when reading and writing data files that are corrupted (and uncertain if it's the reading, and/or the writing), is to compare memory contents with the file in a hex editor, such as HxD (freeware!) or even write the file contents manually using it to check the reading. HxD has also been useful in locating corruption in the disk/partition structure, e.g. if using FatFS. I'd be interested to know if anyone knows of a tool for reading flash memory directly, as I had a similar problem once when developing the driver code, but no 3rd party way of seeing what's actually written to the device, until I got all the settings right and it just worked.

RaulPando
Score: 0 | 2 years ago | 1 reply

Hi Simon,
Could you please elaborate on the first technique you mentioned? It sounds interesting, but I don't quite understand the lifetime of those variables and how (if at all) they are mutated (via a debugger?).

SimonSmith
Score: 1 | 2 years ago | 1 reply

Sure, something as simple as adding some globals (so they don't get optimised out and can be modified at any time in the watch window of the debugger), then deleting them later on. This is totally made-up:

bool doSomething = false;
int overrideSleepTime = 0;

void fn(int sleepTime)
{
    if (doSomething)
    {
        // run or don't run some extra code that wouldn't normally get called
    }

    if (overrideSleepTime > 0)
    {
        sleepTime = overrideSleepTime;
    }

    HAL_Delay((uint32_t)sleepTime);
}

RaulPando
Score: 0 | 2 years ago | no reply

Oh I see, nifty! Thank you for the clarification.

GillianMSpeaker
Score: 0 | 2 years ago | no reply

I answered a question at my Q&A about unit testing, and the challenges that embedded developers face with unit testing. I mentioned that there is a great article from Memfault on getting started here: Embedded C/C++ Unit Testing Basics. Embedded Artistry also has a great collection here: Articles with Unit Testing Tag. My 2 cents is that off-target unit testing is very important, but it is important to also know when it makes more sense to test on hardware. It should be rare, especially if you are mostly doing standard, application-level embedded software, but it happens. There are a lot of factors. I've worked on a handful of teams, and they all struck different balances with off-target unit tests, on-target unit tests, and integration tests. I'm always striving to make more of my testing off-target, but a variety of factors pull me to sticking to hardware sometimes. As always, it depends. Maybe a topic for another talk.

GillianMSpeaker
Score: 0 | 2 years ago | no reply

There was a question at my Q&A about using VS Code for an IAR project. Someone mentioned there is support for IAR in VS Code via an extension, which I am now remembering I have used before. There is a basic guide from IAR here: Using VS Code for IAR Embedded Workbench. I don't use IAR anymore, but when I first started working, I dabbled with it a little bit and got it working with VS Code. I have some instructions written up on it, but I haven't tried them in a while and I don't think I have an IAR Server License anymore. If you're interested in those instructions anyways, shoot me a message (I'm on LinkedIn or you can get in touch at gillian.minnehan@jhuapl.edu).

MattBurkett
Score: 0 | 2 years ago | 1 reply

Great talk. I had an interesting debugging experience recently.

For context, I'm using an 8-bit, 24 MHz AVR64DA64 micro from Microchip. This is my first time working with this family of micro, so maybe this is an obvious feature of the chip to others... I set the chip up as a simple bare-metal round-robin-with-interrupts system running at 1 kHz and wanted to measure the time band utilization, so I measured it both with a timer register and with a scope. What was strange was that the interrupts were firing normally, but the code was running much slower than it should. After much digging, it turned out that the peripheral clock (which was running at something like 4 MHz) not only controls the timers, ADC, USART, etc., but also controls the RAM! So even though the main clock was running at 24 MHz, the code could only execute as fast as the RAM was running. After I adjusted the peripheral clock to also be 24 MHz and rescaled everything else, my time band utilization dropped accordingly.

The tools and techniques described by the presenter really are the gold standard for making sure the processor is doing what it's supposed to be doing!

GillianMSpeaker
Score: 0 | 2 years ago | no reply

Matt, that's a fun find. I looked up that data sheet (https://www.microchip.com/en-us/product/AVR64DA64) and yep on page 89 sure enough, RAM uses CLK_PER.

Thanks for the comment! Glad you enjoyed it.

DanCogan
Score: 0 | 2 years ago | 1 reply

Thank you for a well organized and practical talk.
I tend to work more often on the hardware (fpga) side but many of the same debugging ideas apply there. I learned some new things about registers available in modern embedded processors for help with debug. This is a great slide deck and video link to keep in my back pocket for next time I'm working in embedded SW again.

GillianMSpeaker
Score: 0 | 2 years ago | no reply

Dan, thanks for tuning in. I'm glad you learned a few things to help when you get back to embedded software. Feel free to reach out (my LinkedIn is in my speaker profile) if you have additional questions when you are re-reviewing the material.

DaveK
Score: 0 | 2 years ago | 1 reply

Nice summary of debugging techniques that new engineers should find useful. Even experienced engineers may not have used all of them. I particularly enjoyed the clock counting one. This could be useful when you don't have a scope or logic analyzer at your disposal.

GillianMSpeaker
Score: 0 | 2 years ago | no reply

Glad you enjoyed it, Dave. Agreed, clock cycle counting is a good option too if you don't have a GPIO available to do the toggling. On a custom board, every GPIO might already be spoken for.

DanR
Score: 1 | 2 years ago | 1 reply

One of my favorite go-to techniques is the HW breakpoint on data access when you know what data is being corrupted (i.e. you have a gremlin that is consistent) but you don't know when it's being modified. This technique can make short work of an otherwise hard-to-discover bug.

Thank you for this survey of techniques -- they'd make an excellent training for incoming junior engineers to study.
-dan'l

GillianMSpeaker
Score: 0 | 2 years ago | no reply

Dan, agreed, hardware watchpoints are the best. For anyone reading who hasn't used them before, you can have gdb break whenever the data at a particular address is modified. I have also used that to find a corruption issue (my favorite was a buffer overflow that initially manifested as a semaphore take failing; the sem id was the piece of data getting overwritten accidentally). You set them with watch <expression>, which could be watch my_struct.item if my_struct.item was getting corrupted. Or, use the memory address directly with a cast, such as watch *(int *)0x0FE00000, if data at the address 0x0FE00000 was getting corrupted.
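To try a watchpoint on something small, here is a made-up C sketch of the kind of bug described above: a copy routine that trusts its length argument and overwrites a neighbouring semaphore id. The names `task_ctx`, `g_task`, and `set_task_name` are hypothetical, and the example assumes `sem_id` sits directly after the 8-byte `name` array with no padding, which holds on common toolchains.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical task record: sem_id is stored right after name. */
struct task_ctx {
    char     name[8];
    uint32_t sem_id;
};

struct task_ctx g_task = { "worker", 42u };

/* Deliberately buggy: len is never checked against sizeof(t->name),
 * so any len above 8 spills into sem_id. */
void set_task_name(struct task_ctx *t, const char *src, size_t len)
{
    memcpy((unsigned char *)t, src, len);
}

/* In gdb, before the bad call runs:
 *   (gdb) watch g_task.sem_id
 *   (gdb) continue
 * Execution should then stop where sem_id changes, inside the memcpy
 * reached from set_task_name, which points at the culprit. */
```

Calling `set_task_name(&g_task, "AAAAAAAAAAAA", 12)` overwrites `sem_id` with 0x41414141, and the watchpoint fires at that write rather than later, when the semaphore take fails.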
