Flaky Flash Failure on Feather M0 Express?
Moderators: adafruit_support_bill, adafruit

Please be positive and constructive with your questions and comments.

by mzero on Tue Jul 07, 2020 5:34 pm

My many woes all boiled down to this: My Feather M0 Express has a flash chip that reads back with errors at a very low rate. Reading the whole 2MB flash results in, on average, 5 to 10 individual bit errors, each time a one read as a zero. Repeating the test on the same chip multiple times, with resets in between, produces errors in different places in different blocks, and blocks that previously read incorrectly generally read back correctly.

I created a minimal sketch for erasing, writing, and verifying test data. The same sketch ran flawlessly on the one other free Feather M0 Express board I have.
Sketch is here: https://gist.github.com/mzero/2a2f669160dc5135d296ba610cfe66e1

So... Have I just got a bad board? A bad flash chip? Bad decoupling cap?

More worrisome is that I'm building out 100 custom boards with the same basic design as the Feather M0 Express - do I need to run a flash test on each of those? How common is it to see this kind of flash failure?

mzero
 
Posts: 16
Joined: Fri Sep 13, 2019 12:44 am

Re: Flaky Flash Failure on Feather M0 Express?

by reschue on Tue Jul 07, 2020 6:11 pm

How are you reading back the FLASH data, and who/what decides if there is an error? Have you tried another, independent way of verifying the FLASH?

I wonder if it might just be a problem with the verification process - producing a false positive?

Rick

reschue
 
Posts: 75
Joined: Sun Jun 17, 2018 4:36 pm

Re: Flaky Flash Failure on Feather M0 Express?

by mzero on Tue Jul 07, 2020 6:55 pm

The chip is soldered to the Feather M0 board, so reading it out another way is difficult. But here are the ways I determined it was indeed a read failure:

First: Use the standard examples that come with the Adafruit_SPIFlash library to erase the chip, then put a FAT12 file system on the chip. Now put the msc_external_flash example from the TinyUSB library on the device, and reset. At this point it is mounted as a USB drive on my desktop. Now I can just use shell commands to write large files to it, and then copy them back - and finally use standard diff tools on the desktop to compare.

The result is that for larger files (> 100kB), there are almost always a few bits off. Ejecting the disk, letting it remount, and copying off again results in a file that has different bit errors, now reading correctly where the prior copy read wrong. For larger files, you get more errors, about 1 per megabit. The errors are always a 1 changed to a 0, and they don't seem to be in a consistent location within a block. This whole test implies that the data was written correctly (as at one time or another any given block can be read correctly).
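For anyone wanting to repeat the desktop comparison, here is a minimal sketch of counting bit flips by direction between the original file and a copy read back from the flash. The function name and the in-memory byte strings are illustrative assumptions, not part of the tooling used in this thread; in practice the two buffers would come from reading the two files.

```python
def classify_bit_errors(expected: bytes, actual: bytes):
    """Compare two equal-length buffers bit by bit.

    Returns (ones_read_as_zeros, zeros_read_as_ones, offsets), where
    offsets lists the byte positions that differ.
    """
    if len(expected) != len(actual):
        raise ValueError("buffers must be the same length")
    one_to_zero = 0
    zero_to_one = 0
    offsets = []
    for offset, (e, a) in enumerate(zip(expected, actual)):
        diff = e ^ a  # bits that differ in this byte
        if diff:
            # Differing bit that was set in the expected data: a 1 read as 0.
            one_to_zero += bin(diff & e).count("1")
            # Differing bit that is set in the read-back data: a 0 read as 1.
            zero_to_one += bin(diff & a).count("1")
            offsets.append(offset)
    return one_to_zero, zero_to_one, offsets


if __name__ == "__main__":
    # Hypothetical data: one bit dropped from 1 to 0 in the second byte.
    written = bytes([0xFF, 0xFF, 0x00])
    read_back = bytes([0xFF, 0xFB, 0x00])
    print(classify_bit_errors(written, read_back))
```

If every error really is a 1 read as a 0, the second count stays at zero across all runs.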

Second: Because the above test involves several complex libraries, I wrote the sketch listed in the gist. This relies on nothing but Adafruit_SPIFlash. This code fills the flash with 512 byte blocks that have a fixed header and a reproducible pseudo-random sequence of data, different for each block. The code can read flash blocks, and if they have the fixed header, it then verifies that all the remaining bytes match the pseudo-random sequence for that block.
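The block layout described above can be sketched roughly as follows. This is not the code from the gist: the header bytes, the embedded block index, and the xorshift32 generator are all assumptions chosen for illustration. It only shows the idea of a fixed header followed by a sequence that can be regenerated from the block number alone.

```python
import struct

BLOCK_SIZE = 512
HEADER = b"FLASHTST"  # hypothetical 8-byte marker, not the gist's actual header


def _xorshift32(state):
    """One step of a 32-bit xorshift generator (assumed PRNG, for illustration)."""
    state ^= (state << 13) & 0xFFFFFFFF
    state ^= state >> 17
    state ^= (state << 5) & 0xFFFFFFFF
    return state


def make_block(block_index):
    """Build a 512-byte block: header, block index, then pseudo-random payload."""
    # Derive a per-block seed; xorshift must never start from zero.
    state = ((block_index * 2654435761 + 1) & 0xFFFFFFFF) or 1
    words = []
    for _ in range((BLOCK_SIZE - len(HEADER) - 4) // 4):
        state = _xorshift32(state)
        words.append(state)
    payload = struct.pack("<%dI" % len(words), *words)
    return HEADER + struct.pack("<I", block_index) + payload


def verify_block(data):
    """Return None if data is not a test block, else the mismatching byte offsets."""
    if len(data) != BLOCK_SIZE or not data.startswith(HEADER):
        return None
    (block_index,) = struct.unpack_from("<I", data, len(HEADER))
    expected = make_block(block_index)
    return [i for i in range(BLOCK_SIZE) if data[i] != expected[i]]
```

Because each block carries enough information to regenerate its own expected contents, the verifier needs no knowledge of where the block sits or how it got there. One caveat of this particular layout: a bit error inside the stored block index would make the whole block appear wrong.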

The results are the same: About one in a million bits gets changed from a 1 to a 0. It is only on read, as redoing the test finds bit errors in different locations. Given that the rest of the sequence, and most blocks, all check out - I think the code is correct.
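As a quick sanity check on the "one in a million" figure, the arithmetic can be worked through using the numbers reported earlier in the thread (5 to 10 errors per full read of the 2MB chip). The sketch assumes 2MB means 2 × 1024 × 1024 bytes; the function name is made up for this example.

```python
FLASH_BYTES = 2 * 1024 * 1024  # assumed size of the 2MB chip, in bytes


def errors_per_million_bits(error_count, bytes_read=FLASH_BYTES):
    """Bit error rate expressed as errors per million bits read."""
    return error_count / (bytes_read * 8) * 1_000_000


if __name__ == "__main__":
    for n in (5, 10):
        rate = errors_per_million_bits(n)
        print(n, "errors ->", round(rate, 2), "per million bits")
```

Under that assumption, 5 to 10 errors per full read works out to roughly 0.3 to 0.6 errors per million bits, which is the same order of magnitude as the figure quoted above.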

But just to be triply sure, I duplicated the generation code in Python, created a test file, and, like above, mounted the flash via SDFat - wrote that file (full of 512 blocks of the test data) to the flash. Finally, putting the tester code back on, I ran just the verification step -- this is reading blocks written by SDFat, though it knows nothing of the file system structure. Because of the fixed header, it can identify test data blocks, and again verify them. Result is the same bit error rate and (non-)pattern.

mzero
 
Posts: 16
Joined: Fri Sep 13, 2019 12:44 am

Re: Flaky Flash Failure on Feather M0 Express?

by adafruit_support_mike on Wed Jul 08, 2020 3:14 am

It sounds like you've done an excellent job troubleshooting the issue.

Send a note containing a link to this thread and your order number to support@adafruit.com. The folks there will arrange a replacement.

adafruit_support_mike
 
Posts: 61197
Joined: Thu Feb 11, 2010 2:51 pm

Re: Flaky Flash Failure on Feather M0 Express?

by reschue on Wed Jul 08, 2020 7:50 am

Just curious - if you lower the SPI clock rate (by a factor of 2, perhaps) do the errors go away?

Rick

reschue
 
Posts: 75
Joined: Sun Jun 17, 2018 4:36 pm

Re: Flaky Flash Failure on Feather M0 Express?

by mzero on Wed Jul 08, 2020 11:57 am

It took some monkeying with the SPIFlash library, but I built a version of the tester to try different clock rates.

The SPI clock rate that the library normally chooses for this chip is 48MHz - which is the clock rate of the processor. The chip descriptor says it'll use up to 104MHz.
  • I halved the clock rate - same error rate.
  • I quartered the clock rate - no errors. ☜ ☜ ☜

This was verifying data programmed into the chip with the original clock speed. This suggests to me that the issue is in the chip's SPI generation. If, for example, the power to the chip, or decoupling cap, were just on the hairy edge of spec, then the output transistors on the chip could be a smidge slow, causing occasional errors.
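To put the three clock rates in perspective, here is a small sketch of the SCLK period and half period at each one. Treating the half period as a rough bound on how long MISO has to settle is an assumption (data driven on one clock edge and sampled on the opposite edge); the real margin depends on the flash chip's output delay and the MCU's setup time, neither of which is given in this thread.

```python
def sclk_timing_ns(freq_hz):
    """Return (period, half_period) of an SPI clock, in nanoseconds."""
    period = 1e9 / freq_hz
    return period, period / 2


if __name__ == "__main__":
    # Full speed, halved, and quartered, as tried above.
    for mhz in (48, 24, 12):
        period, half = sclk_timing_ns(mhz * 1_000_000)
        print(f"{mhz} MHz: period {period:.1f} ns, half period {half:.1f} ns")
```

At 48MHz the half period is only about 10.4 ns, growing to about 20.8 ns at 24MHz and about 41.7 ns at 12MHz, so a slightly slow output stage has far more room at the quartered rate.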

mzero
 
Posts: 16
Joined: Fri Sep 13, 2019 12:44 am

Re: Flaky Flash Failure on Feather M0 Express?

by reschue on Wed Jul 08, 2020 12:56 pm

Sounds like we're thinking along similar lines.

Given that it programs correctly at the higher clock rate, I agree with you that the problem might be more with the MISO line rather than MOSI. The only way to tell would be to get access to a good DSO and check for a signal integrity problem - like cross talk or excessive capacitive loading causing a rise-time issue.

I'm a bit surprised that it took that much reduction in clock rate to make the errors go away. With the low error rate you were seeing at "full" speed, I was thinking you were on the hairy edge of proper operation, something maybe a 10% reduction in clock rate would have fixed. Are there any other SPI slaves sitting on the MISO line? If so, I wonder if one of them can't handle the high frequency clock and might be corrupting MISO.

Rick

reschue
 
Posts: 75
Joined: Sun Jun 17, 2018 4:36 pm

Re: Flaky Flash Failure on Feather M0 Express?

by mzero on Sun Jul 12, 2020 1:40 am

A follow up for the interested:

I looked at the SPI SCLK & MISO lines on two Feather M0 Express boards:

good board:
[Attached scope capture: SDS00004.png]


bad board:
[Attached scope capture: SDS00003.png]


You can clearly see the ringing on the bad board. These traces were taken with SCLK set to 1MHz. Even with the ringing, at this speed both boards read the flash chip fine. If you bump SCLK up to 24MHz or 48MHz, however, that ringing is enough to cause the read errors I was seeing.

mzero
 
Posts: 16
Joined: Fri Sep 13, 2019 12:44 am

Re: Flaky Flash Failure on Feather M0 Express?

by reschue on Sun Jul 12, 2020 8:56 am

I give you credit for taking the time to investigate this. Good work!

Here's my take on your findings:

The ringing I see on the second screen capture doesn't look like it's enough to cause your read error. Actually, MISO looks pretty good. It's difficult to get a good picture of edge transitions as scope probes need good, short ground connections to prevent them from creating undershoot/overshoot all by themselves.

To cause the errors you're seeing (1's read as 0's), there would need to be a significant droop in the MISO line after it transitions high and at the same time SCLK is going from low to high (I'm assuming MISO is sampled by the CPU on the low-to-high transition of SCLK). I don't see any "droop" on MISO below the nominal "1" level - certainly not enough to get near the threshold region (about 1.5 V) where it could be seen as a "0".

It's hard to tell from this picture exactly what MISO looks like 10 nsec after it goes high (when it would be sampled by the full speed 48 MHz SCLK) but it doesn't look like it's coming back down very far. The basic signal integrity looks pretty good.

The next step is probably to focus on asynchronous events, like interrupts, DMA activity, etc. that might cause that one-in-a-million glitch in the transfer and drive a "1" to a "0". You'd need to capture the actual faulty transfer by triggering the DSO (by driving a GPIO pin high/low) when the verify code detects the error (running at full speed, of course). Then by looking back in the trace buffer try to isolate the exact transfer that failed knowing the data values of adjacent transfers. ....and you're going to need a faster scope.

Very tedious work, as it is with any attempt to find needles in haystacks.

Rick

reschue
 
Posts: 75
Joined: Sun Jun 17, 2018 4:36 pm
