SSD Throwing Alerts? How to diagnose?

Douche_Baguette · Jul 8, 2015

Hi everyone!

So, I've been running a FreeNAS server for a while now, built from my spare drives. For a long time, the SSD was sitting idle, and the 1TB drive housed my jails and a volume of storage for music, movies, data, etc. The other 2 drives were striped as a single volume for Time Machine backups.

Well, the 1TB drive started throwing errors and reporting unreadable sectors, so I figured this was my chance to rebuild the system, better. So my current setup is:

250GB Samsung 840 Evo ("SSD") as my jails storage
3x2TB drives set up in RAIDZ1 ("RAID") for single-disk redundancy and ~4TB of storage.

I have a Plex jail and a jail running an Apache web server. I have the SSD set to never sleep/standby. The hard drives sleep after an hour.

Anyway, everything is pretty much working fine, but this morning I got an email at 3AM with a critical alert:

The volume SSD (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

Weird. Then, around 2PM:

The volume SSD (ZFS) state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected.

and 5 minutes later:

The volume SSD (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

Then again at around 5, when streaming a movie with Plex, I got the same 2 emails back to back ("data corruption" and "unrecoverable error"). Even so, Plex didn't stutter and everything seems to be working fine.

If I SSH into FreeNAS and run zpool status, I get:

  pool: SSD
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0h0m with 0 errors on Wed Jul 8 17:27:46 2015
config:

        NAME                                          STATE     READ WRITE CKSUM
        SSD                                           ONLINE       0     1     0
          gptid/7ccc476e-238b-11e5-bf3b-6805ca145a40  ONLINE       0     1     0

errors: No known data errors

As you can see, I tried a scrub, and it came back with no errors, and no repairs.
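
For anyone triaging a similar alert, here is a minimal Python sketch that pulls out the config rows with non-zero READ/WRITE/CKSUM counters from pasted `zpool status` text. It is not part of FreeNAS; the function name `nonzero_error_rows` and the sample text are made up for illustration, and it assumes the usual five-column layout shown above. Counters are kept as strings because I am not assuming they are always printed as plain integers.

```python
def nonzero_error_rows(status_text):
    """Return (name, read, write, cksum) tuples for rows with any non-zero counter.

    Hypothetical helper: it only understands the NAME/STATE/READ/WRITE/CKSUM
    table that follows the "config:" line of `zpool status` output.
    """
    rows = []
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NAME":
            # Header row: the device table starts on the next line.
            in_table = True
            continue
        if fields[0] == "errors:":
            # The summary line ends the device table.
            in_table = False
        if in_table and len(fields) >= 5:
            name, read, write, cksum = fields[0], fields[2], fields[3], fields[4]
            if any(counter != "0" for counter in (read, write, cksum)):
                rows.append((name, read, write, cksum))
    return rows


# Sample text copied from the output above (whitespace collapsed).
SAMPLE = """\
config:

NAME STATE READ WRITE CKSUM
SSD ONLINE 0 1 0
gptid/7ccc476e-238b-11e5-bf3b-6805ca145a40 ONLINE 0 1 0

errors: No known data errors
"""

for row in nonzero_error_rows(SAMPLE):
    print(row)
```

Here both the pool and its single device show one WRITE error, which matches the alert even though the scrub found nothing to repair.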

What could be causing this? What else can I do to troubleshoot? The SSD is pretty new and was used as my desktop's boot drive for maybe 6 months, then sat unused until now. I REALLY doubt the drive is actually failing, but I guess it's possible.

My motherboard's SATA ports are SATA 2, so I am using a PCI Express SATA 3 card for the SSD only. Should I try taking that out of the equation and running the SSD directly to the motherboard? It is, FWIW, a "crappy gamer-grade motherboard".

Full system specs:

ASUS P6T motherboard
12GB 1600MHz triple-channel DDR3 memory
Intel Core i7 920 2.66GHz
Antec 650W power supply
No graphics card installed during normal use.

hugovsky · Jul 9, 2015

Code:

NAME                                          STATE     READ WRITE CKSUM
SSD                                           ONLINE       0     1     0
  gptid/7ccc476e-238b-11e5-bf3b-6805ca145a40  ONLINE       0     1     0

Your drive IS failing. Post the result of smartctl -a /dev/"your_ssd" between code tags.

hugovsky · Jul 9, 2015

Oh... and FreeNAS version, please.

Ericloewe · Jul 9, 2015

Douche_Baguette said:
I am using a PCI express SATA 3

Never a good idea. There literally isn't a single PCI-e SATA 3.0 controller I would trust with my data.

Douche_Baguette said:
250GB Samsung 840 Evo ("SSD") as my jails storage

The 840 Evos had some weird, nasty bugs, IIRC.

Robert Trevellyan · Jul 9, 2015

Ericloewe said:
The 840 Evos had some weird, nasty bugs, IIRC.

The 840 EVOs have an issue with read performance degradation for data that is rarely accessed.

Douche_Baguette · Jul 9, 2015

Alright guys, I have an update.

First of all, thanks for spending your time helping me!

Second, here's what happened. I went to bed last night having posted this thread, and this morning, I saw hugovsky's replies. So I went to SSH into the server, and it wouldn't connect. Weird. So I tried to access the web interface, which also didn't work. I hadn't received any emails from the server.

So I hard-powered-down the server and then powered it back up, waited probably 15-20 minutes, and it still wasn't responding via SSH or web.

So... I shut it off again, dragged it upstairs, installed the graphics card, and booted it with a monitor attached (TV, in this case). This is where it gets really weird. I took a photo on my phone where the boot process stopped: http://i.imgur.com/QlC2ifN.jpg

You can see on the bottom line, it has seemingly random characters. I won't speculate too much on what could cause that, but I will be running a full memtest86 as soon as I have a chance.

I removed the PCIe SATA 3 card and plugged the SSD into the motherboard with a different SATA cable and booted up. It booted normally without any errors (which it was doing before; so this doesn't necessarily mean anything). Anyway, here's the output of the SMART command:

Code:

[root@lingernas] ~# smartctl -a /dev/ada3
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p16 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 840 EVO 250GB
Serial Number:    S1DBNEAD713418W
LU WWN Device Id: 5 002538 850013d7b
Firmware Version: EXT0BB0Q
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Jul  9 10:48:23 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                ( 4800) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  80) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       20540
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       185
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       4
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   071   062   000    Old_age   Always       -       29
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   067   067   000    Old_age   Always       -       32484
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       60
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       2612857845

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Running FreeNAS-9.3-STABLE-201506292130

Ericloewe · Jul 9, 2015

Didn't they have some bricked drives with early firmware, too?

Douche_Baguette · Jul 9, 2015

As for the SSD, I am not an expert, but presumably I am not suffering from read performance degradation of any kind because the error(s) in question seem to be for writing data. Also, I did use this SSD on my Windows desktop for around 6 months with absolutely no issues, so although it's still possible that it's failing, it was not DOA. I'm sure I updated the firmware with Samsung Magician when I first got the drive as well.

Ericloewe · Jul 9, 2015

Douche_Baguette said:
199 CRC_Error_Count 0x003e 067 067 000 Old_age Always - 32484

Depending on the meaning of that value, it's quite likely there was/is a problem elsewhere in the SATA chain. Bad cables are typical, but crap SATA controllers can also cause CRC errors.
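
A rough way to keep an eye on that attribute is to parse the attribute table from `smartctl -a` and compare the raw CRC count before and after swapping the cable or controller. The sketch below is only an illustration: `parse_smart_attributes` and `raw_delta` are hypothetical helper names, the parsing assumes the ten-column table layout printed by this smartctl version, and it assumes (as far as I know, but check for your drive) that the raw value of attribute 199 is a cumulative count.

```python
def parse_smart_attributes(text):
    """Map attribute ID -> dict of name, normalized value, worst, threshold, raw."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[9],  # kept as text; some attributes add extra detail
        }
    return attrs


def raw_delta(before, after, attr_id=199):
    """Change in an attribute's raw count between two parsed snapshots."""
    return int(after[attr_id]["raw"]) - int(before[attr_id]["raw"])


# Three rows copied from the smartctl output earlier in the thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       20540
199 CRC_Error_Count         0x003e   067   067   000    Old_age   Always       -       32484
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       2612857845
"""

attrs = parse_smart_attributes(SAMPLE)
crc = attrs[199]
print(crc["name"], "raw =", crc["raw"], "normalized =", crc["value"])
```

If a later snapshot gives `raw_delta(before, after) == 0` under normal load, the link is probably healthy again; a count that keeps climbing points back at the cable or controller rather than at the flash.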

Douche_Baguette · Jul 9, 2015

Ericloewe said:
Depending on the meaning of that value, it's quite likely there was/is a problem elsewhere in the SATA chain. Bad cables are typical, but crap SATA controllers can also cause CRC errors.

Okay, interesting. So I might be totally fine having removed the cheap controller AND replacing the cable?

Ericloewe · Jul 9, 2015

The odd thing is that CRC errors would be mostly transparent to ZFS, barring the additional latency of retransmitting stuff (and reduced bandwidth). It could be the crap controller acting weirdly, though. Maybe trying and failing twice in a row and returning a read error to the OS, for instance.

Robert Trevellyan · Jul 9, 2015

Ericloewe said:
Didn't they have some bricked drives with early firmware, too?

I don't remember that, and I have been paying fairly close attention because I installed 840 EVOs for several clients (offered them free support for the firmware updates addressing the performance degradation issue).

Ericloewe · Jul 9, 2015

Robert Trevellyan said:
I don't remember that, and I have been paying fairly close attention because I installed 840 EVOs for several clients (offered them free support for the firmware updates addressing the performance degradation issue).

Maybe I have them confused with a different model, then.

My two 830s (128GB and 256GB) have been exemplary, as has my 850 Pro (512GB).

cyberjock · Jul 9, 2015

This thread just gives me lots of room to call "fail". :(

1. Not a recommended motherboard brand. Lesson learned with ASRock that brand *does* matter.
2. i7-920 doesn't support ECC. I have a spare system that uses that CPU (hooray for 1st gen i7s). I use it as a simple test box for FreeNAS, never stores any data except data I create from /dev/urandom, so not having ECC RAM isn't a problem. On my board (Gigabyte board) I can put in a Xeon and ECC RAM and have support for ECC if I wanted it.
3. That ugly crash with random characters really makes me think your RAM is bad (or you have some kind of major hardware failure or incompatibility), which pretty much means your zpool is probably trashed too (that's why we recommend ECC RAM without exception). I think it's more than likely RAM though.
4. EVO SSDs are, in my opinion, totally unfit for any purpose besides a low-importance desktop. They should never be used in a server. Again, in my opinion. I've always felt this way about the EVOs since those are TLC, and we keep hearing only more and more bad news with the TLC. I'm sure the current read performance issue won't be the last either. Samsung took the TLC to market way, way too early without adequate understanding of the behavior. They wanted to break even or make a profit from all the R&D of the TLC memory before MLC memory got so inexpensive and so dense that the TLC would not have been economically feasible. In my opinion, any SSD with TLC memory is unfit for any purpose except low-importance desktops.
5. Used RAIDZ1.
6. I doubt anyone here could vouch for any SATA controller that is an add-on. The go-to is the M1015 because we already know that the add-on cards are built by the lowest bidder, which should make you shudder when you are trying to build the most reliable system you can and trying to use ZFS.

Douche_Baguette · Jul 9, 2015

Cyberjock, thanks for your input.

As you probably suspect, most of the system components are leftover consumer/prosumer desktop parts. I understand that ASUS is not a "recommended" brand for a server motherboard, and I understand that ECC memory is recommended as well. However, I don't want to spend the money required on a new motherboard, processor, and memory (at the minimum) just to put myself in the "recommended hardware" category. This being the case, all of my data is also backed up offsite, including the data in the RAID array, so any catastrophic failure can only be so catastrophic. It's not the end of the world for me if the thing melts down at some point.

I've removed the add-on SATA controller from the equation, so hopefully that helps things. And like I said, I plan on running a full memtest86 test after the weird halt I got, to see if any of the memory is bad.

But my question is this... what's wrong with RAIDZ1? Other than the second-disk redundancy that RAIDZ2 offers, what difference does it make (to me)? Am I missing something? I read the post in your signature. Are you trying to say that at the manufacturer's specified average of one URE per 12TB, I have a 30% chance of not being able to recover my 4TB of data if one drive fails (assuming I had the drives full)?
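
The arithmetic behind that figure can be sketched roughly as follows. The assumptions are mine: a quoted rate of one URE per 10^14 bits read (about 12.5TB, which I take to be the "12TB" above), errors that are independent of each other, and a resilver that has to read all ~4TB from the two surviving drives. The function name `p_at_least_one_ure` is made up for illustration.

```python
import math


def p_at_least_one_ure(bytes_read, bits_per_ure=1e14):
    """Probability of hitting at least one URE while reading `bytes_read` bytes.

    Models each bit as failing independently with probability 1/bits_per_ure,
    so P = 1 - (1 - 1/bits_per_ure) ** bits. log1p/expm1 keep the tiny
    per-bit probability from being lost to floating-point rounding.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_ure))


# Resilvering a full 3x2TB RAIDZ1 after one drive fails: read ~4TB.
p = p_at_least_one_ure(4e12)
print(f"Chance of at least one URE during the rebuild: {p:.0%}")
```

Under those assumptions the result is roughly 27%, which is in line with the ~30% I was asking about. Real drives may do much better or worse than the spec sheet number, so treat it as an order-of-magnitude estimate.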

Ericloewe · Jul 9, 2015

During the long time you don't have redundancy, any single unreadable sector will result in some corruption.

RAIDZ2 is better because it can recover from disk issues even during the common degraded scenario of "One disk is down".

In other words, any situation where you don't have any leftover redundancy is dangerous. RAIDZ2 makes it very unlikely that you'll be in that situation. RAIDZ3 makes it crazy unlikely.

Douche_Baguette · Jul 9, 2015

Okay, that's basically how I was understanding it. Thanks.

cyberjock · Jul 9, 2015

Douche_Baguette said:
However, I don't want to spend the money required on a new motherboard, processor, and memory (at the minimum) just to put myself in the "recommended hardware" category.

You misunderstand the purpose of the recommended hardware. It's not so we can have a category for "recommended hardware", have a claim to fame, force people to spend money, or anything of the sort. It's to avoid people losing data and having problems that waste everyone's time (yours, mine, and the other volunteers' here on the forum), problems that we know with 100% certainty we cannot fix with software. It also avoids filling the forum with a bunch of people complaining about problems and being ignored, because the regular readers and posters will eventually decide that it's not worth being here or responding to threads. The ignore user feature works very well on this forum!

If everyone built their systems with random prosumer boards we'd have no forum members here. We used to be that way, and we had such a high turnover of our volunteers that nobody benefited. People *do* get tired of telling people not to use Realtek NICs, don't use non-ECC RAM, don't try to use Marvell SATA controllers, etc. There's a reason you don't see me post even a fraction of what I used to. If you can't be bothered to listen to our recommendations and understand why we have our recommendations, I don't want to spend even more time to tell you that you still cannot use the hardware that we've already told everyone 100 times. We spent many many hours putting together the information we have in our stickied threads. Choosing to be someone that blatantly ignored our hours of hard work spent putting together well documented threads of what does and doesn't work won't help anyone's case. You get put on the ignore list and the next time you have a problem nobody responds. Not good for you and not good for a forum to have threads that are unanswered.

I just responded to a thread yesterday that nobody, not one person, responded to. Why? Because he ignored at least a dozen different points on this forum, and now is desperately asking for help because his precious data is currently lost. I was going to ignore him until he PMed me asking for help. So I told him why I didn't respond and left it at that.

Douche_Baguette · Jul 9, 2015

cyberjock said:
If you can't be bothered to listen to our recommendations and understand why we have our recommendations, I don't want to spend even more time to tell you that you still cannot use the hardware that we've already told everyone 100 times.

cyberjock said:
Choosing to be someone that blatantly ignored our hours of hard work spent putting together well documented threads of what does and doesn't work won't help anyone's case. You get put on the ignore list and the next time you have a problem nobody responds.

...I completely understand the purpose of recommended hardware. Having everyone using the finest quality, least prone to error hardware makes the job of troubleshooters much easier because it can be "ruled out".

But it's a hilariously unrealistic expectation to think that people won't use the hardware they already have sitting around with this free software.

That having been said, if you truly believe that my issues are caused by my inferior hardware, then so be it. That's fine. If the true answer to my questions is "it's not working because your motherboard or RAM or processor or SSD isn't good enough", that's fine. Feel free to say that. If you don't even want to try to help because I'm not using exclusively server-class hardware, again, that's fine. But there was also a possibility that my problem was caused by a configuration issue or simple oversight or problem that I didn't foresee. Surely you don't mean to say "don't even ask for help unless you're exclusively using hardware on our short list"? Am I an asshole for asking for help even though I don't have a Xeon processor and ECC memory?

Edit: For what it's worth, after removing the PCIe SATA controller and connecting the SSD to the motherboard with a new SATA cable, I haven't had a single issue. So, fingers are crossed, but I think that was the issue. I ran a memtest86 cycle and there were no errors. Thanks to those who helped me troubleshoot!
