Cannot find errors on failed drives?


tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
From time to time, a disk in one of my FreeNAS boxes fails, as is to be expected. I always swap in a new drive, and everything works as it should.

Once I get the failed drive out of the FreeNAS machine, I'd like to confirm that the drive is bad. To do this, I run badblocks to try to document the inadequacies of the disk.

More often than not, several passes with badblocks reveal no errors, and SMART reports no reallocated sectors etc. on the drive. When that happens, I do not know whether I should trust the drive or not. Any suggestions? Should I suspect other hardware problems in my FreeNAS boxes?

I run badblocks like this:
sudo badblocks -p 2 -b 4096 -wsv /dev/sdb
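
For reference, after the badblocks passes I also look at the SMART attributes. A sketch of that check (attribute names vary a bit between drives, and /dev/sdb is just where the disk shows up on my test machine):

sudo smartctl -A /dev/sdb | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count'
sudo smartctl -t long /dev/sdb    # long self-test; check the result later with smartctl -a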

Most of my disks are WD60EFRX's.

Any thoughts appreciated.

Thanks,
Tobias
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
My question is universal, hence no description of the system.

To rephrase:
What do you do when drives fail in FreeNAS, but you are not able to find problems with the drive when testing it on another machine?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
My question is universal, hence no description of the system.
Despite what you think, the question is not universal, hence the questions about your system. At a minimum, what, exactly, happens when "drives fail in FreeNAS"? Different modes of apparent failure dictate different avenues of troubleshooting.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Fair enough...

A pool is degraded. An example is shown below (I can't seem to mark it up with bbcode?).

I replace the disk and complete resilvering.

I examine the drive that I removed from FreeNAS, but can't find any errors with it. Do I trust the drive?




Example of degraded pool with (what I call) a failed drive:


  pool: storage
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 49h40m with 0 errors on Sun Jun 12 17:40:45 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        storage                                         DEGRADED     0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/6ae54538-6e48-11e4-9fcd-003048fae030  ONLINE       0     0     0
            gptid/6bb22eb1-6e48-11e4-9fcd-003048fae030  ONLINE       0     0     0
            gptid/6c7f20b3-6e48-11e4-9fcd-003048fae030  ONLINE       0     0     0
            gptid/6d4be3db-6e48-11e4-9fcd-003048fae030  ONLINE       0     0     0
            gptid/6e1baabf-6e48-11e4-9fcd-003048fae030  ONLINE       0     0     0
            gptid/6eedbaff-6e48-11e4-9fcd-003048fae030  ONLINE       0     0     0
          raidz2-1                                      DEGRADED     0     0     0
            gptid/0afe4a34-a55b-11e4-8940-002590e6d5ba  ONLINE       0     0     0
            gptid/709f2974-6e48-11e4-9fcd-003048fae030  ONLINE       0     0     0
            gptid/f11a725b-fd36-11e4-9f47-002590e6d5ba  ONLINE       0     0     0
            11261189594784602852                        UNAVAIL    159   214     0  was /dev/gptid/723b3d87-6e48-11e4-9fcd-003048fae030
            gptid/2ec7ecc5-ce46-11e5-a38b-002590e6d5ba  ONLINE       0     0     0
            gptid/73d9e562-6e48-11e4-9fcd-003048fae030  ONLINE       0     0     0
          raidz2-2                                      ONLINE       0     0     0
            gptid/394d2b7e-054b-11e6-a132-002590e6d5ba  ONLINE       0     0     0
            gptid/f4198cbc-7704-11e4-a481-003048fae030  ONLINE       0     0     0
            gptid/6408052a-f42f-11e4-9f47-002590e6d5ba  ONLINE       0     0     0
            gptid/f5c7edf4-7704-11e4-a481-003048fae030  ONLINE       0     0     0
            gptid/f69f5b66-7704-11e4-a481-003048fae030  ONLINE       0     0     0
            gptid/5f1bea53-a604-11e4-8940-002590e6d5ba  ONLINE       0     0     0
          raidz2-3                                      ONLINE       0     0     0
            gptid/af6e2f74-a570-11e4-8940-002590e6d5ba  ONLINE       0     0     0
            gptid/50f707d9-7707-11e4-992f-003048fae030  ONLINE       0     0     0
            gptid/7e6b743a-2bb4-11e6-96cf-002590e6d5ba  ONLINE       0     0     0
            gptid/5342037c-7707-11e4-992f-003048fae030  ONLINE       0     0     0
            gptid/5407a022-7707-11e4-992f-003048fae030  ONLINE       0     0     0
            gptid/54b5847b-7707-11e4-992f-003048fae030  ONLINE       0     0     0
 

Mirfster

Doesn't know what he's talking about
Joined
Oct 2, 2015
Messages
3,215
My question is universal, hence no description of the system.
TBH, this info is needed. No sense in going through all the troubleshooting/suggestions only to find out later that you are running something like:
  • Half the minimum recommended amount of RAM
  • Non-ECC RAM
  • A hardware RAID controller with drives passed through as individual "RAID0" devices
  • etc.
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Let me rephrase my question:

When a drive has failed on FreeNAS (or any other system with ZFS), but you cannot later find anything wrong with the drive when examining it on another system, which of these would you think:

A: That is normal; ZFS can fail drives for reasons X, Y and Z. Don't worry, the disk is fine. Don't believe anything ZFS tells you. Feel free to use the disk again.
B: Your drive is on the edge. Luckily, ZFS detected it at an early stage. Even though you cannot find errors with badblocks or SMART, the disk is not to be trusted.
C: Something is probably wrong with the machine that used the disk. It could be software or hardware. I need to get to the bottom of this!
D: Ask online. Has anyone else experienced something similar?

So, what would you personally think? Have you ever experienced something like this (not being able to confirm problems with a failed drive)?

Hardware:
64 GB of ECC RAM
2 x IBM ServeRAID M1015 with LSI 9211 IT (pass-through) firmware

Thanks
 

Sakuru

Guru
Joined
Nov 20, 2015
Messages
527
but you cannot later find anything wrong with the drive when examining it on another system, which of these would you think:
C
If a drive is throwing errors, it generally doesn't just go UNAVAIL. Please tell us everything about your hardware, especially your chassis and backplane.
Also, please post the output of "smartctl -x /dev/drive_identifier" for the failed drives in code tags (Insert... -> Code).
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
This goes back to my post: there are lots of flavors of apparent failure. What's most common, at least in my experience, is SMART failures (failed tests, bad blocks, etc.). Those are in the SMART log, so there's no real issue with reproducibility.

In the example you posted, the disk got kicked offline entirely, which is a different situation. That tends to point to either a complete failure of the drive (which is clearly not the case if you plug it into a different machine and it works fine), or a problem with the communication path between the OS and the drive. My next steps in a case like that would be:
  1. Run camcontrol devlist to see if the OS sees the disk at all.
  2. If the disk isn't seen with camcontrol devlist, remove and reinstall the drive, and see if it shows up.
  3. Try to online the missing disk.
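
A rough sketch of those steps from the FreeNAS shell (the pool name and device id are just taken from the zpool status output you posted; substitute your own):

# 1. Does the OS see the disk at all?
camcontrol devlist

# 2. If it is not listed, reseat the drive (or try another bay/cable) and run camcontrol devlist again.

# 3. If the disk is visible, try to bring it back into the pool,
#    using the id zpool status shows for the UNAVAIL device.
zpool online storage 11261189594784602852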
 

tobiasbp

Patron
Joined
Dec 2, 2015
Messages
238
Thanks, everyone. I will create a new thread with more technical information, since the behavior I am seeing is not expected.

I will post again the next time I experience a drive becoming unavailable.

Thank you for your time.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I'd just add that, in any case, a new burn-in of the drive is advisable. If the drive passes that, there's not much else to do but keep it around as "working".
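
Something along these lines, for a drive that shows up as /dev/sdb on the test machine (both steps destroy whatever is on the disk, so only run them on a drive that is already out of the pool):

sudo smartctl -t long /dev/sdb          # long self-test first; wait for it to finish (smartctl -a shows progress)
sudo badblocks -b 4096 -wsv /dev/sdb    # destructive write/read pass over the whole disk
sudo smartctl -A /dev/sdb               # afterwards, look at reallocated/pending/uncorrectable counts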
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
A device going unavailable is not expected. @danb35 has a very good answer up in #9.

I see a lot of older drives occasionally develop a bad block while doing a SMART long test, and about half the time this clears up when you rewrite the block. So SMART test failures are not necessarily a reliable indicator of disk failure.
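
For anyone who wants to try that rewrite, a rough sketch (only do this on a drive that is already out of the pool; /dev/sdb and the LBA are placeholders, so use the values from your own machine and self-test log):

sudo smartctl -l selftest /dev/sdb    # note the LBA_of_first_error for the failed test

# Overwrite that one sector so the drive can remap it if it needs to.
# 123456789 is a placeholder LBA; bs should match the drive's logical sector size (usually 512).
sudo dd if=/dev/zero of=/dev/sdb bs=512 seek=123456789 count=1

sudo smartctl -t long /dev/sdb        # re-run the long test and see whether the error is gone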
 

diehard

Contributor
Joined
Mar 21, 2013
Messages
162
This used to be fairly common with some of the SAS-backplane-to-SATA interposers... I think it has gotten a lot better recently with updated LSI firmware/drivers, though.
 