[HELP] Continuous multiple drive faults

bferrell · May 28, 2020

I have several servers, 4 of which are FreeNAS boxes. 3 of those perform great, but I have a new R720XD LFF that is causing me nothing but grief. I'm not a Linux or FreeNAS expert by any means, but I have some experience.

Here's the backstory. A few weeks ago I bought this new box, installed an HBA (LSI 9211-8i P20 IT Mode) and a couple of rear mount drives, installed FreeNAS, and added 12 10TB IronWolf Pro drives from a QNAP NAS that I was decommissioning. Those drives are a few years old, but had been performing great in the NAS. This box is being used as a TimeMachine store for my network.

Once I started using the system it immediately started throwing drive faults (Pool xxxx state is UNAVAIL: One or more devices are faulted in response to persistent errors.). The Pool status page would show some number of errors on a drive, so I would take that drive offline, replace it, let it resilver, and the pool would come back online. Then several hours later it would happen again. After the 5th or 6th drive I started to get suspicious that the drives weren't all failing simultaneously, and ordered a new HBA (same type), new cables, and a replacement backplane.

Over a few days I swapped all 3 out, and the new backplane seemed to help for a couple of days, but then the errors started to recur. At this point I decided that perhaps the drives were actually just all reaching end of life and started to replace them as such. Until today.

Now the box is telling me that one of my brand new drives has faults. I understand that this is not impossible, but after replacing 8 of the old drives I really don't believe the faults, and I'm not sure what else to check. I've attached the dmesg files, but the error doesn't tell me much.

Code:

CPU: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz (2900.07-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x206d7  Family=0x6  Model=0x2d  Stepping=7
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x1fbee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  Structured Extended Features3=0x9c000400<MD_CLEAR,IBPB,STIBP,L1DFL,SSBD>
  XSAVE Features=0x1<XSAVEOPT>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
nfsd: can't register svc name
        (pass4:mps0:0:19:0): LOG SENSE. CDB: 4d 00 0d 00 00 00 00 00 40 00 length 64 SMID 743 Aborting command 0xfffffe000219df30
mps0: Sending reset from mpssas_send_abort for target ID 19
        (da3:mps0:0:19:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 90 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
mps0: Unfreezing devq for target ID 19
(da3:mps0:0:19:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da3:mps0:0:19:0): CAM status: CCB request completed with an error
(da3:mps0:0:19:0): Retrying command
        (da6:mps0:0:24:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 1054 Aborting command 0xfffffe00021b7760
mps0: Sending reset from mpssas_send_abort for target ID 24
mps0: Unfreezing devq for target ID 24
(da6:mps0:0:24:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da6:mps0:0:24:0): CAM status: Command timeout
(da6:mps0:0:24:0): Retrying command
(da6:mps0:0:24:0): SYNCHRONIZE CACHE(10)
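
One way to see whether the errors cluster on a single disk or are spread across the controller is to tally them per target ID. Below is a rough Python sketch, not a tested diagnostic tool: the `tally_errors` name and the list of "trouble" phrases are my own assumptions, picked from the lines in the log above, and the sample is a trimmed copy of that log.

```python
import re
from collections import Counter

# Matches CAM tags such as "(da3:mps0:0:19:0)" or "(pass4:mps0:0:19:0)".
# Captures the controller name and the target ID.
TAG = re.compile(r"\((?:da|pass)\d+:(mps\d+):\d+:(\d+):\d+\)")

# Phrases treated as signs of trouble. This list is an assumption based
# only on the log excerpt in this thread, not an exhaustive catalogue.
TROUBLE = (
    "Aborting command",
    "Command timeout",
    "completed with an error",
    "terminated ioc",
)


def tally_errors(dmesg_text):
    """Count trouble lines per (controller, target ID)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = TAG.search(line)
        if match and any(phrase in line for phrase in TROUBLE):
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Trimmed copy of the dmesg lines posted above.
SAMPLE = """\
(pass4:mps0:0:19:0): LOG SENSE. CDB: 4d 00 0d 00 00 00 00 00 40 00 length 64 SMID 743 Aborting command 0xfffffe000219df30
(da3:mps0:0:19:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 90 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
(da3:mps0:0:19:0): CAM status: CCB request completed with an error
(da6:mps0:0:24:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 1054 Aborting command 0xfffffe00021b7760
(da6:mps0:0:24:0): CAM status: Command timeout
"""

if __name__ == "__main__":
    for (controller, target), n in sorted(tally_errors(SAMPLE).items()):
        print(f"{controller} target {target}: {n} trouble lines")
```

Errors spread over many targets on the same controller would point away from individual drive failures and toward something the drives have in common.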

Samuel Tai · May 28, 2020

This is quite a hefty box, and has a correspondingly large need for clean power. Check your power supplies. They may not be supplying enough power to all parts of your system, leading to voltage sags that are manifesting as these drive dropouts.

bferrell · May 28, 2020

Sam - Thanks. The data center has its own dedicated power with UPS backup, and the server has dual 1,100W supplies connected to different CyberPower PR1500LCDRT2U UPSes, each on a separate 20A circuit, so I don't think it's power related. But if it were, is there a good test for that?

Samuel Tai · May 28, 2020

Look in iDRAC for power supply status and other system status.

bferrell · May 28, 2020

Yea, I don't see anything in there about power, and all of the UPS are fed from amp-reporting PDUs, all of which are under 5 amps (and each is a 20A circuit).

HoneyBadger · May 28, 2020

Take a look at these threads about the SYNCHRONIZE CACHE errors in the 10TB IronWolf - you may be experiencing a similar issue.

SYNCHRONIZE CACHE command timeout error

Sadly, this fix does not appear to work. I lost a drive today with NCQ disabled. # camcontrol tag da6 -v (pass6:mpr0:0:6:0): dev_openings 1 (pass6:mpr0:0:6:0): dev_active 0 (pass6:mpr0:0:6:0): allocated 0 (pass6:mpr0:0:6:0): queued 0 (pass6:mpr0:0:6:0): held 0...

www.ixsystems.com

Seagate IronWolf 10TB (ST10000VN0004) vs LSI IT firmware controllers

I've been looking for the right place to post this and I believe this is it. To be brief, I've been working on a YouTube series where I'm building a 100TB ZFS based server. Not all Enterprise grade hardware (it's for at home) and not using FreeBSD/FreeNAS (I have some different requirements)...

www.ixsystems.com

The fix appears to be in an updated firmware based on the results from 10TB/8TB IronWolf users. Can you do a smartctl -x on your drive to check its current firmware level and see if there is a newer release?

bferrell · May 28, 2020

Looks like there is an EN02 available. Can I apply that from the UI? Thanks for the tip, BTW. It does look like my issue, which at least makes me feel less crazy.

Code:

root@freenas[/dev]# smartctl -x /dev/da6
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf Pro
Device Model:     ST10000NE0004-1ZF101
Serial Number:    ZA28F3ZT
LU WWN Device Id: 5 000c50 0b2516bbd
Firmware Version: EN01
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu May 28 12:07:41 2020 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Disabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled
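
With twelve drives to check, it may be easier to pull the model and firmware fields out of the `smartctl -x` output with a script. The following is a minimal Python sketch under stated assumptions: `parse_info`, `needs_update`, and `smartctl_x` are hypothetical helper names, and treating EN01 as the old firmware is taken only from this thread. It says nothing about which other models or firmware levels are affected.

```python
import subprocess


def parse_info(smartctl_text):
    """Return the 'Key: value' pairs from smartctl's information section.

    If a key appears more than once, the first value is kept.
    """
    info = {}
    for line in smartctl_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            info.setdefault(key.strip(), value.strip())
    return info


def needs_update(info, old_fw="EN01"):
    """True if the drive reports the firmware level given as old_fw.

    The EN01 default is an assumption drawn from this thread.
    """
    return info.get("Firmware Version") == old_fw


def smartctl_x(device):
    """Run `smartctl -x` on one device and return its text output.

    Has to run on the NAS itself, typically as root.
    """
    result = subprocess.run(
        ["smartctl", "-x", device], capture_output=True, text=True
    )
    return result.stdout


# Excerpt of the output posted above.
SAMPLE = """\
=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf Pro
Device Model:     ST10000NE0004-1ZF101
Serial Number:    ZA28F3ZT
Firmware Version: EN01
"""

if __name__ == "__main__":
    info = parse_info(SAMPLE)
    print(info["Device Model"], info["Firmware Version"], needs_update(info))
```

On the live system, `parse_info(smartctl_x("/dev/da6"))` would do the same for a real drive; the device name there is just the one from the command above.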

HoneyBadger · May 28, 2020

bferrell said:
Device Model: ST10000NE0004-1ZF101
...
Firmware Version: EN01

EN02 is available here and should hopefully help the issues, flagged as "Important"

https://apps1.seagate.com/downloads/certificate.html?key=1308638701702

HoneyBadger · May 28, 2020

bferrell said:
Can I apply [the new firmware] from the UI?

I don't believe so; you might need to create a bootable USB/ISO for this. I'm just hoping it's not a "Windows-Only" executable or some silliness. Disclaimer: I don't have any IronWolfs (IronWolves?) here, so I can't test out the process. I assume it's non-destructive, but especially in the case of hard drive firmware updates, make sure you have backups, and maybe update only a drive or two at a time?

bferrell · May 28, 2020

Yea, I've got a pile of the old drives that I'd like to get back into the case, so I think I'll get an external caddy, update them offline, and then put them back in. I'm also trying the other camcontrol fix. Really appreciate the pointers.

HoneyBadger · May 28, 2020

You will likely want to have the drives connected via SATA or SAS, so be aware if your external caddy is USB that the firmware might refuse to apply. If you have an eSATA port that's fine, otherwise I would say an internal solution is best even if it means grabbing another machine to use on a temporary basis.
