My bizarre trip to SSD boot drives (and I gave up)

Status
Not open for further replies.

Jeff Chen

Dabbler
Joined
May 28, 2018
Messages
11
My FreeNAS complains about boot pool errors every time it scrubs itself. The checksum error count on one of the USB thumb drives always ends up at 1, and I have to go to the CLI to clear it manually every week.

So with WD Green 120GB SSDs becoming cheap enough (50 Canadian Rupees each, tax incl.), I decided why not, and snatched two of them, hoping to make a mirrored SSD boot pool.
屏幕快照 2018-07-24 00.22.19.jpg


My plan was to detach one of the USB drives, attach one SSD, let it resilver, then replace the other USB drive with the other SSD. The USB drives are only 16GB, so I would have quite a bit of space left on the SSDs, and I was gonna stripe that as swap. Beautiful, isn't it?


[Instant Failure]

And it crashed and burned.

The first SSD went in and resilvered fairly quickly. I thought it was time to replace the 2nd USB drive, only to find the checksum error count growing on the SSD. I wasn't quite clear on what I was doing and switched out the USB drive anyway. I even went ahead and created the slices for the future swap partitions. And after the 2nd SSD was in, my boot pool was crippled beyond rescue.

Reinstall time it is then. I was "smart enough" to have made a fresh config backup before switching out the first USB drive. So I downloaded the latest 11.1-U5 ISO, dd'ed it onto my other USB drive and started installing. It went fine initially, but after rebooting from the SSD, all hell broke loose. The log was filled with OSError and complaints about client.py, etc. When the WebUI was finally up, it asked me for a License Update. WHAT???
屏幕快照 2018-07-23 23.57.44.jpg



[Deep In the Mud]

So here comes the repeating process. I thought I had made a botched USB install drive, so I ran sha256 to make sure the ISO was ok, and then used Etcher so the image would be verified after being written to the USB drive. Installed again, a little bit of improvement and I got to the web UI, but errors were still being spewed out like crazy in the console. I imported the shiny new config, rebooted, only to find my system crippled. No plugin, no jail, no iSCSI target. Everything seemed to require a re-save to start working. In panic mode, I even rolled my jail dataset back to last night (no real loss), but nothing improved.
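For anyone who wants to repeat the ISO check without the sha256 command, here is a minimal Python sketch of the same idea. The function names, the file name and the digest in the usage comment are placeholders for illustration, not values from this thread; the published digest has to come from the official download page.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so a large ISO never has to fit in RAM.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, published_hex):
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return sha256_of_file(path) == published_hex.strip().lower()


# Hypothetical usage (placeholder file name and digest):
# ok = matches_published_digest("FreeNAS-11.1-U5.iso", "<digest from the download page>")
# print("ISO looks intact" if ok else "ISO does not match, download it again")
```

Note that a matching digest only says the download is intact. It says nothing about whether the image was written to the USB stick correctly, which is what the verify step in the flashing tool is for.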

After a series of failures, I decided to try the old USB drives - at least they didn't give me hell back then, only annoyances. And they proved to be trustworthy. I was able to get everything back up and running in a few moments.


[Not Giving Up - Yet]

But I would not stop there. I thought maybe it was the motherboard SATA ports, or the SATA cables, so I switched them all, but to no avail. I thought it was the SATA power cable they were sharing with my fan hub, so I rearranged and isolated it from the drives - nothing.

Then I decided to pull the USB drives and dd them onto the SSDs. My thinking was that maybe the installer was doing something wrong during the formatting process.

I was curious about my shiny new SSDs and even did some speed tests before dd'ing the data over, and they were doing 450MB/s read and 260MB/s write in a USB enclosure, connected to a USB hub - without any issue. Not bad!

The dd process was a breeze, and I got two booting SSDs within 10 minutes. Back into the FreeNAS box, they booted up fast and good, everything seemed to be running. I was like "Bingo!... Maybe?"


[Short Sweetness]

After playing with the Web UI for a while, everything was so much more responsive, and nothing seemed to be going wrong. In the beginning, I scrubbed the boot pool several times in a row, and no checksum error ever popped up. So I decided to proceed and reclaim the remaining space on those SSDs.

gpart wasn't happy with the partition table inconsistency and spat out a scary "corrupted" status, but it was easily fixed by the recover command. However, just as I was going to create the bigger partition, I noticed the infamous checksum error counts starting to grow symmetrically on the two SSDs. In a few moments, they grew to over a hundred, with the text "too many errors" next to them.
QQ20180724-003031.png


I rebooted the system and nothing seemed to be wrong in the console. I verified the installation and no error was found. However "zpool status -v" showed me some permanent errors in files and metadata. And the list kept growing slowly, around the infamous python 3.6 directory.


[A Man Should Know When To Give Up]

I simply could not tolerate my system running in this state. Who knows if the data corruption is real or not, or when it is going to make my system go haywire? So I pulled the SSDs out, and popped back in the USB drives. A false alarm per week is way better than a painful slow death at least.

My final thought is that these SSDs probably have a problem with TRIM and are deleting data they shouldn't. The consistency of the data corruption around specific files/directories could mean it's a bug in FreeNAS (FreeBSD?) or the WD SSD firmware (both very possible). Is it possible that TRIM and the freebsd-boot partition type don't mix?

And for now, these SSDs will be serving as striped L2ARC for my main storage. I do run some iSCSI targets and NFS shares so it should help me with the tiny reads at least. They have been running for hours and no checksum errors at all.

<<<<<<<<<[TL;DR]>>>>>>>>>
So here's a summary of the facts I know:
  • SSDs and USB boot drives are fine
  • Motherboard port, cable or power not cause of problem
  • USB installer verified to be correct
  • Installer formatting disk probably not the cause of problem
  • Data on boot partition of SSD corrupts spontaneously over a short span of time, in the exact same directory
  • When SSDs are used as L2ARC, nothing goes wrong for hours

As for why I wanted an SSD-based swap - my system hasn't been maxed out with RAM yet (because it is so damn expensive at the moment), and when it starts to use swap, the whole system sort of freezes up. I checked and it seems that one of my raidz2 drives is always busier than the others, maxing out at 100% whenever the load gets heavy. But that's another topic and I'm still trying to figure out what could be going wrong.
 

Attachments

  • QQ20180723-222407.png

anmnz

Patron
Joined
Feb 17, 2018
Messages
286
Sounds like the same installation corruption issue reported by @capa, @jonatkins and me, on WD Green, SanDisk SSD Plus, and Transcend (links to the forum posts below; there may be other reports I've missed too). @capa identified that all these SSDs use similar Silicon Motion controllers.

There is a bug report: https://redmine.ixsystems.com/issues/35065

It might help someone if you could determine whether the problem goes away when you disable TRIM, as described in the bug report? More data points are always better.

Previous reports:
@capa (SanDisk SSD Plus and WD Green): https://forums.freenas.org/index.php?threads/boot-disk.63829/
@jonatkins (Transcend): https://forums.freenas.org/index.php?threads/transcend-ssd-boot-disk-zfs-checksum-errors.64321/
me (SanDisk SSD Plus): https://forums.freenas.org/index.php?threads/freenas-installer-sandisk-ssd-checksum-errors.64049/
 

Jeff Chen

Dabbler
Joined
May 28, 2018
Messages
11
Thank you for the info. I'll definitely look into this and provide my test results asap.
 

brewnino

Cadet
Joined
Sep 13, 2017
Messages
8
Oh hey thanks for the warning. I was considering grabbing a cheap wd green to replace my 840 evo as a boot drive for freenas (no trouble with the 840 evo, btw)
 

Jeff Chen

Dabbler
Joined
May 28, 2018
Messages
11
I tested the SSDs as boot drives again with the loader tunable mentioned in the bug report and it worked very well for me. No more data corruption. Thank you so much!
 

anmnz

Patron
Joined
Feb 17, 2018
Messages
286
Thanks for checking and updating the bug report. It is nice to have our observations further confirmed.

To be clear, I'm not recommending running the system long-term with that tunable and I would not keep using these SSDs. But (if you don't have any other SSDs in the system) I don't really have much of an argument for that. It's your system and your call.
 

Jeff Chen

Dabbler
Joined
May 28, 2018
Messages
11
Thank you for the advice. I do have an Intel 900P 280G in my system as a SLOG/L2ARC. I will go back to my flimsy USB sticks soon.
 