Tracking weird issue with Disks

Py7h0n · Nov 20, 2016

Had a system running for a while now (started life with 2tb disks and 9.01) - Last upgrade took it to:

U-NAS 800 case
X10SDV-4C-TLN2F board
64GB ECC Ram
LSI 9211 flashed with latest IT firmware
6TB Seagate NAS drives

Had no issues what so ever - Stable as could be!

Upgraded the 6TB disks to 10TB Seagate IronWolf disks and ever since I have been getting disks being marked as failed. Swap them with spares and the issue comes back on another disk / another slot. Randomly it is Write / Read and every now and then a Checksum.

[root@ZFS] ~# zpool status
pool: DATA1
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Nov 21 15:48:30 2016
1.05T scanned out of 26.3T at 419M/s, 20h35m to go
131G resilvered, 3.44% done
config:

NAME STATE READ WRITE CKSUM
DATA1 DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
gptid/b581e923-902e-11e6-b082-0cc47ac34350 FAULTED 9 279 0 too many errors
gptid/7c107e59-8918-11e6-a4f4-0cc47ac34350 ONLINE 0 0 0
gptid/564188b4-8f7a-11e6-b082-0cc47ac34350 ONLINE 0 0 0
gptid/6e411002-adf7-11e6-ad6a-0cc47ac34350 ONLINE 0 0 0
gptid/f3218024-af94-11e6-ad6a-0cc47ac34350 ONLINE 0 0 0 (resilvering)
gptid/db56fee2-8ea8-11e6-88a3-0cc47ac34350 ONLINE 0 0 0
gptid/c1f58b6f-8c4e-11e6-88a3-0cc47ac34350 ONLINE 0 0 0
gptid/3064b301-ad3b-11e6-ad6a-0cc47ac34350 ONLINE 0 0 0

errors: No known data errors

I've replaced SAS cables
I've replaced disks (I've made sure they are PMR, not SMR like the 8TB Archive disks)
I've swapped the LSI card to a 9300-8i along with new cables again
Smarts always come back good (quick or extended checks).
Disks always work and check out 100% in other systems (non zfs).

The only thing I have not changed is the chassis / disk backplane. Do you guys think this is the problem? Been chasing this issue for almost three weeks not and it is really annoying! I really dont think it is the disks but then again I cannot find anyone else who is running these disks with ZFS.

Any direction would be appreciated!

(Oh and I do live in New Zealand but far north where we have had no earthquakes... So not physical issues either ;) )

Pitfrr · Nov 21, 2016

Well I don't have much experience with that kind of system so I can't say anything about the chassis... but what about the SMART data of the disks? Anything suspicious there? Eventhough you're confident it's not the disks, it's worth checking out.

SweetAndLow · Nov 21, 2016

What is your power supply? They could be doing it because there isn't enough power.

Sent from my Nexus 5X using Tapatalk

Stux · Nov 21, 2016

Those disks might need a bigger PSU.

Can probably verify by using a normal big PSU with the case off.

Py7h0n · Nov 21, 2016

Thanks All.

Smarts are all good for sure (checked on 3 different systems with different tools).

PSU is a 400w unit purchased from u-nas. I will put a standard big 600w in front of it and see what happens - Honestly did not think about checking the PSU so thank you for the idea!

brando56894 · Nov 21, 2016

Might as well check the backplane to have everything covered.

gimpbully · Apr 14, 2017

Py7h0n said:
Thanks All.

Smarts are all good for sure (checked on 3 different systems with different tools).

PSU is a 400w unit purchased from u-nas. I will put a standard big 600w in front of it and see what happens - Honestly did not think about checking the PSU so thank you for the idea!

Did this end up being a power supply issue? I've started having the same strange issues with ironwolf 10TB with a case strikingly similar to that unas one.

Py7h0n · Apr 14, 2017

No.

I swapped boards, power supplies and cases without any success.

Upgrading to version 10 fixed it for me.

I now have 8 spare 10tb disks :p

gimpbully said:
Did this end up being a power supply issue? I've started having the same strange issues with ironwolf 10TB with a case strikingly similar to that unas one.

Ericloewe · Apr 14, 2017

You mean Corral?
https://forums.freenas.org/index.php?threads/important-announcement-regarding-freenas-corral.53502/

Py7h0n · Apr 14, 2017

Yes - First Corral release fixed the issue. Now running 10.0.4 and still no issues with the disks!

Ericloewe said:
You mean Corral?
https://forums.freenas.org/index.php?threads/important-announcement-regarding-freenas-corral.53502/

Ericloewe · Apr 14, 2017

Unfortunately, that's not going to be viable long-term. Good news is that 9.10.3 should behave the same way, since it's also FreeBSD 11.

Py7h0n · Apr 14, 2017

What is not viable long term? Running Corrral?

Ericloewe · Apr 14, 2017

Py7h0n said:
What is not viable long term? Running Corrral?

Yeah.

Py7h0n · Apr 14, 2017

Well I did run version 7 for a VERY long time before upgrading so I guess Corral could be the same :P.

It does 'just work' for the 10TB disks without any issues. I spent a VERY long time with the frustrating issue in this thread and it almost pushed me to Native FreeBSD so I am rather happy with Corral (Supported or not :P ).

Ericloewe · Apr 14, 2017

It's a low-level issue, so 9.10.3 is sure to solve it the same way Corral did.

Important Announcement for the TrueNAS Community.

Tracking weird issue with Disks

Py7h0n

Cadet

Pitfrr

Wizard

SweetAndLow

Sweet'NASty

Stux

MVP

Py7h0n

Cadet

brando56894

Wizard

gimpbully

Cadet

Py7h0n

Cadet

Ericloewe

Server Wrangler

Py7h0n

Cadet

Ericloewe

Server Wrangler

Py7h0n

Cadet

Ericloewe

Server Wrangler

Py7h0n

Cadet

Ericloewe

Server Wrangler

Similar threads

Important Announcement for the TrueNAS Community.

Tracking weird issue with Disks

Cadet

Wizard

Sweet'NASty

MVP

Cadet

Wizard

Cadet

Cadet

Server Wrangler

Cadet

Server Wrangler

Cadet

Server Wrangler

Cadet

Server Wrangler

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Tracking weird issue with Disks"

Similar threads