How to avoid end of life from NAND correctable errors

Thom Denholm - TUXERA - EOC 2020 - Duration: 26:02

Abstract Questions & Comments (14)

Flash media is fabulous for most use cases, but heavy reads can cause correctable errors. Linux flash file systems actually shorten the life of the media when dealing with these errors. How does this change with multiple bits per cell, including recent QLC NAND? What other sorts of media management can help get the most lifetime out of your flash media based device?

This talk will cover these problems and their impacts in detail, from flash file systems to SSDs and other NAND flash-based media. While we can't speak to what the firmware in your devices is doing, we have excellent knowledge of what it should be doing, and we will also detail the sorts of conversations a system designer should have with their flash media vendors.


HardRealTime

Score: 1 | 5 years ago | 1 reply

Thanks for the talk! Here's hoping that you are still checking the comments. The flash memory device that I'm using has a special OTP (one-time-programmable) section separate from the main array. How does the OTP section compare to the main array? In other words, could the OTP section begin failing with bit errors just like a page in the main array?

ThomDenholmSpeaker

Score: 0 | 5 years ago | no reply

Absolutely still keeping an eye on this, glad you enjoyed it!

The use case is really important here. If the OTP section is read infrequently and unlikely to be affected by write or program disturb, then the likelihood of bit errors occurring there is correspondingly low. But the likelihood is not zero, so yes, bit errors can cause a failure there.

Presuming you want the OTP section to remain in the same flash media blocks, a special-case scrubbing could rewrite the data in-place, first erasing OTP pages. Power failure would need to be protected against - this is just as important as a BIOS update in this respect.
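
The in-place scrub described above could be sketched roughly as follows. This is a minimal illustration, not any vendor's procedure: `FakeFlash`, `needs_scrub`, `scrub_in_place`, `recover`, the spare block, and the 50% threshold are all hypothetical names and values, and the device is simulated in memory. The backup copy is one assumed way to protect against power failure during the erase and rewrite.

```python
class FakeFlash:
    """Hypothetical in-memory stand-in for a flash device (not a real driver API)."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)  # block number -> bytes, or None when erased

    def read(self, n):
        # Assumed to return data that the ECC has already corrected.
        return self.blocks[n]

    def erase(self, n):
        self.blocks[n] = None

    def program(self, n, data):
        self.blocks[n] = data


def needs_scrub(corrected_bits, ecc_strength, fraction=0.5):
    """Scrub once corrected bit errors reach an assumed fraction of ECC strength."""
    return corrected_bits >= ecc_strength * fraction


def scrub_in_place(flash, target, spare):
    """Rewrite `target` in the same block, keeping a backup in `spare` meanwhile."""
    data = flash.read(target)
    flash.erase(spare)
    flash.program(spare, data)   # step 1: backup exists before target is touched
    flash.erase(target)          # step 2: a power cut here leaves only the backup
    flash.program(target, data)  # step 3: fresh copy in the original location
    flash.erase(spare)           # step 4: an empty spare marks the scrub complete


def recover(flash, target, spare):
    """At boot: a non-empty spare means a scrub was interrupted, so restore it."""
    backup = flash.read(spare)
    if backup is not None:
        flash.erase(target)
        flash.program(target, backup)
        flash.erase(spare)
```

At startup a system would call `recover` before trusting the target block, so that an interrupted scrub is completed from the backup rather than leaving the section erased.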

ThomDenholmSpeaker

Score: 3 | 5 years ago | 1 reply

To answer the most frequently asked question, yes, I would be happy to send slides to whoever wants them. May I recommend instead the whitepaper I wrote on this topic? Available here - https://www.tuxera.com/nand-correctable-errors/

ThomDenholmSpeaker

Score: 0 | 5 years ago | no reply

Slides now uploaded and available at the link on the left.

Justin

Score: 2 | 5 years ago | 1 reply

Hi Mr. Denholm! Thanks so much for the talk! I never thought I'd see Galois fields at 10:16 (or I guess they're more commonly known as generator polynomials in field theory, never been really good at cryptography haha)

ThomDenholmSpeaker

Score: 1 | 5 years ago | no reply

You are very welcome! The math of protecting a lot of bits with a handful of additional bits is fascinating, and could easily be a separate talk. What I'd like to know more about is RAID, which is indistinguishable from magic for me at this point.
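
As a small illustration of protecting data bits with a few extra bits, here is a sketch of the classic Hamming(7,4) code, which adds 3 parity bits to 4 data bits and corrects any single flipped bit. It is a toy chosen for brevity, not the code shown in the talk; NAND controllers generally rely on much stronger codes such as BCH or LDPC. The function names are made up for this example.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error was detected."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome, read as a binary number, is the 1-based position of the bad bit.
    position = s1 + 2 * s2 + 4 * s3
    if position:
        c[position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position
```

Encoding `[1, 0, 1, 1]`, flipping any one of the seven stored bits, and decoding returns the original four bits along with the position that was repaired. Two flipped bits would be miscorrected, which is one reason real media use stronger codes.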

DrewFustini

Score: 5 | 5 years ago | 1 reply

I have an ARM single board computer running Linux with ext3 filesystems on both microSD and eMMC. Where would error correction be handled in those situations? The controller inside the eMMC chip and the microSD card?

My understanding is that eMMC is supposed to be more reliable but I am not sure why

ThomDenholmSpeaker

Score: 0 | 5 years ago | no reply

Hi Drew,
Both microSD and eMMC have NAND memory and firmware. Between those are controllers - internal on the eMMC, either internal or external on the microSD. The firmware works with the controller to detect and correct errors.
The interface for the eMMC is more robust than for the microSD, and allows for more options to control the device. One example is power loss notification, so that a design can stop writing to the media.
Additionally, most vendors of eMMC provide an additional "pseudo-SLC" mode for their MLC chips. This stores only a single bit in each cell, which halves the storage but increases the lifetime and robustness. The microSD vendors give no options like that - the interface doesn't define any.
Those are two examples of how eMMC can be more robust than microSD, but overall reliability is dependent on the design. The microSD has advantages also - it can be removed or replaced, while eMMC is fixed onto the board.
Tuxera is a member of JEDEC (which manages the eMMC specification) and is a board member of the SD Association.

DrewFustini

Score: 4 | 5 years ago | 1 reply

Does the Linux MTD driver need to know the maximum lifetime specs for a given NAND chip?

ThomDenholmSpeaker

Score: 2 | 5 years ago | no reply

Interesting, my answer to this question seems to have been "discarded" :)
My understanding of MTD is that it wouldn't do anything with the lifetime specs if it had them. It is a fairly simple interface, and doesn't do any sort of wear leveling or bad block management - those are all left to the flash driver or flash file system running on top of MTD.

DrewFustini

Score: 3 | 5 years ago | 2 replies

Does Linux send Trim commands to a SSD, or just for raw NAND chips?

ThomDenholmSpeaker

Score: 2 | 5 years ago | no reply

Related to this, if Trim commands are not used at all, a serious performance drop occurs once the system has been completely filled for the first time. I noted that in a blog post:
https://www.datalight.com/blog/2017/05/31/performance-drop-without-discards/
It seems likely that SSDs may have some form of firmware garbage collection to improve this, but they don't have the knowledge that the file system has of which data blocks are still in use.

ThomDenholmSpeaker

Score: 1 | 5 years ago | no reply

Linux Trim commands come from the file system, and each file system configures them differently. As one example, ext4 issues Trim commands only if configured to do so through the discard mount option, which is either on or off.
Another option used for better performance is an external daemon to perform bundles of discards on a timed basis - balancing the lifetime benefit of discards with the performance benefit of scheduling them.
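
To see which mounted file systems currently have the discard option enabled, one rough approach is to parse text in the format of `/proc/mounts`. The helper below is a sketch with a made-up name (`mounts_with_discard`); it only looks for the plain `discard` flag and would not match variants such as `discard=async`. For the periodic approach, the reply does not name a tool; on many Linux distributions this job is commonly done by the `fstrim` utility run from a timer.

```python
def mounts_with_discard(mounts_text):
    """Return mount points whose option list contains the plain 'discard' flag.

    `mounts_text` is expected in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mountpoint, options = fields[1], fields[3]
        if "discard" in options.split(","):
            result.append(mountpoint)
    return result
```

On a Linux system this could be called as `mounts_with_discard(open("/proc/mounts").read())`.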

ThomDenholmSpeaker

Score: 0 | 5 years ago | no reply

Hi folks, I am online and ready to answer any questions you might have!
