Cannot find errors on failed drives?


tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
From time to time, a disk in one of my FreeNAS boxes fails, as is to be expected. I always swap in a new drive, and everything works as it should.

Once I get the failed drive out of the FreeNAS machine, I'd like to confirm that the drive is bad. To do this, I run badblocks to try to document the inadequacies of the disk.

More often than not, several passes with badblocks reveal no errors, and SMART reports no reallocated sectors etc. on the drive. When that happens, I do not know whether I should trust the drive or not. Any suggestions? Should I suspect other hardware problems in my FreeNAS boxes?

I run badblocks like this:
sudo badblocks -p 2 -b 4096 -wsv /dev/sdb
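
For reference, after the badblocks passes I also look at the SMART attributes. A sketch of that check (attribute names vary a bit between drives, and /dev/sdb is just where the disk shows up on my test machine):

sudo smartctl -A /dev/sdb | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count'
sudo smartctl -t long /dev/sdb    # long self-test; check the result later with smartctl -a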

Most of my disks are WD60EFRX's.

Any thoughts appreciated.

Thanks,
Tobias
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
My question is universal, hence no description of the system.

To rephrase:
What do you do when drives fail in FreeNAS, but you are not able to find problems with the drive when testing it on another machine?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
My question is universal, hence no description of the system.
Despite what you think, the question is not universal, hence the questions about your system. At a minimum, what, exactly, happens when "drives fail in FreeNAS"? Different modes of apparent failure dictate different avenues of troubleshooting.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Fair enough...

A pool is degraded. An example is shown below (I can't seem to mark it up with bbcode?).

I replace the disk and complete resilvering.

I examine the drive that I removed from FreeNAS, but can't find any errors with it. Do I trust the drive?




Example of degraded pool with (what I call) a failed drive:


  pool: storage
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 49h40m with 0 errors on Sun Jun 12 17:40:45 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        storage                                         DEGRADED     0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/6ae54538-6e48-11e4-9fcd-003048fae030  ONLINE       0     0     0
            gptid/6bb22eb1-6e48-11e4-9fcd-003048fae030  ONLINE       0     0     0
            gptid/6c7f20b3-6e48-11e4-9fcd-003048fae030  ONLINE       0     0     0
            gptid/6d4be3db-6e48-11e4-9fcd-003048fae030  ONLINE       0     0     0
            gptid/6e1baabf-6e48-11e4-9fcd-003048fae030  ONLINE       0     0     0
            gptid/6eedbaff-6e48-11e4-9fcd-003048fae030  ONLINE       0     0     0
          raidz2-1                                      DEGRADED     0     0     0
            gptid/0afe4a34-a55b-11e4-8940-002590e6d5ba  ONLINE       0     0     0
            gptid/709f2974-6e48-11e4-9fcd-003048fae030  ONLINE       0     0     0
            gptid/f11a725b-fd36-11e4-9f47-002590e6d5ba  ONLINE       0     0     0
            11261189594784602852                        UNAVAIL    159   214     0  was /dev/gptid/723b3d87-6e48-11e4-9fcd-003048fae030
            gptid/2ec7ecc5-ce46-11e5-a38b-002590e6d5ba  ONLINE       0     0     0
            gptid/73d9e562-6e48-11e4-9fcd-003048fae030  ONLINE       0     0     0
          raidz2-2                                      ONLINE       0     0     0
            gptid/394d2b7e-054b-11e6-a132-002590e6d5ba  ONLINE       0     0     0
            gptid/f4198cbc-7704-11e4-a481-003048fae030  ONLINE       0     0     0
            gptid/6408052a-f42f-11e4-9f47-002590e6d5ba  ONLINE       0     0     0
            gptid/f5c7edf4-7704-11e4-a481-003048fae030  ONLINE       0     0     0
            gptid/f69f5b66-7704-11e4-a481-003048fae030  ONLINE       0     0     0
            gptid/5f1bea53-a604-11e4-8940-002590e6d5ba  ONLINE       0     0     0
          raidz2-3                                      ONLINE       0     0     0
            gptid/af6e2f74-a570-11e4-8940-002590e6d5ba  ONLINE       0     0     0
            gptid/50f707d9-7707-11e4-992f-003048fae030  ONLINE       0     0     0
            gptid/7e6b743a-2bb4-11e6-96cf-002590e6d5ba  ONLINE       0     0     0
            gptid/5342037c-7707-11e4-992f-003048fae030  ONLINE       0     0     0
            gptid/5407a022-7707-11e4-992f-003048fae030  ONLINE       0     0     0
            gptid/54b5847b-7707-11e4-992f-003048fae030  ONLINE       0     0     0
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
My question is universal, hence no description of the system.
TBH, this info is needed. No sense in going through all the troubleshooting/suggestions only to find out later that you are running something like:
  • Half the minimum recommended amount of RAM
  • Non-ECC RAM
  • A hardware RAID controller with drives passed through as individual "RAID0" devices
  • etc.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Let me rephrase my question:

When a drive has failed on FreeNAS (or any other system with ZFS), but you cannot later find anything wrong with the drive when examining it on another system, which of these would you think:

A: That is normal; ZFS can fail drives for reasons X, Y and Z. Don't worry, the disk is fine. Don't believe anything ZFS tells you. Feel free to use the disk again.
B: Your drive is on the edge. Luckily, ZFS detected it at an early stage. Even though you cannot find errors with badblocks or SMART, the disk is not to be trusted.
C: Something is probably wrong with the machine that used the disk. It could be software or hardware. I need to get to the bottom of this!
D: Ask online. Has anyone else experienced something similar?

So, what would you personally think? Have you ever experienced something like this (not being able to confirm problems with a failed drive)?

Hardware:
64 GB of ECC RAM
2 x IBM ServeRAID M1015 with LSI 9211 IT (pass-through) firmware

Thanks
 

Sakuru

Guru
Joined
Nov 20, 2015
Messages
527
but you cannot later find anything wrong with the drive when examining it on another system, which of these would you think:
C
If a drive is throwing errors, it generally doesn't just go UNAVAIL. Please tell us everything about your hardware, especially your chassis and backplane.
Also, please post the output of "smartctl -x /dev/drive_identifier" for the failed drives in code tags (Insert... -> Code).
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
This goes back to my post: there are lots of flavors of apparent failure. What's most common, at least in my experience, is SMART failures (failed tests, bad blocks, etc.). Those are in the SMART log, so there's no real issue with reproducibility.

In the example you posted, the disk got kicked offline entirely, which is a different situation. That tends to point to either a complete failure of the drive (which is clearly not the case if you plug it into a different machine and it works fine), or a problem with the communication path between the OS and the drive. My next steps in a case like that would be:
  1. Run camcontrol devlist to see if the OS sees the disk at all.
  2. If the disk isn't seen with camcontrol devlist, remove and reinstall the drive, and see if it shows up.
  3. Try to online the missing disk.
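
A rough sketch of those steps from the FreeNAS shell (the pool name and device id are just taken from the zpool status output you posted; substitute your own):

# 1. Does the OS see the disk at all?
camcontrol devlist

# 2. If it is not listed, reseat the drive (or try another bay/cable) and run camcontrol devlist again.

# 3. If the disk is visible, try to bring it back into the pool,
#    using the id zpool status shows for the UNAVAIL device.
zpool online storage 11261189594784602852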
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Thanks, everyone. I will create a new thread with more technical information, since the behavior I am seeing is not expected.

I will post again the next time I experience a drive becoming unavailable.

Thank you for your time.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I'd just add that, in any case, a new burn-in of the drive is advisable. If the drive passes that, there's not much else to do but keep it around as "working".
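
Something along these lines, for a drive that shows up as /dev/sdb on the test machine (both steps destroy whatever is on the disk, so only run them on a drive that is already out of the pool):

sudo smartctl -t long /dev/sdb          # long self-test first; wait for it to finish (smartctl -a shows progress)
sudo badblocks -b 4096 -wsv /dev/sdb    # destructive write/read pass over the whole disk
sudo smartctl -A /dev/sdb               # afterwards, look at reallocated/pending/uncorrectable counts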
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
A device going unavailable is not expected. @danb35 has a very good answer up in #9.

I see a lot of older drives occasionally develop a bad block while doing a SMART long test, and about half the time this clears up when you rewrite the block. So SMART test failures are not necessarily a reliable indicator of disk failure.
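
For anyone who wants to try that rewrite, a rough sketch (only do this on a drive that is already out of the pool; /dev/sdb and the LBA are placeholders, so use the values from your own machine and self-test log):

sudo smartctl -l selftest /dev/sdb    # note the LBA_of_first_error for the failed test

# Overwrite that one sector so the drive can remap it if it needs to.
# 123456789 is a placeholder LBA; bs should match the drive's logical sector size (usually 512).
sudo dd if=/dev/zero of=/dev/sdb bs=512 seek=123456789 count=1

sudo smartctl -t long /dev/sdb        # re-run the long test and see whether the error is gone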
 

diehard

Contributor
Joined
Mar 21, 2013
Messages
162
This used to be fairly common with some of the SAS-backplane-to-SATA interposers... I think it has gotten a lot better recently with updated LSI firmware/drivers, though.
 