[HELP] Continuous multiple drive faults

bferrell

Dabbler
Joined
Dec 10, 2018
Messages
15
I have several servers, 4 of which are FreeNAS boxes. 3 of those perform great, but I have a new R720XD LFF that is causing me nothing but grief. I'm not a Linux or FreeNAS expert by any means, but I have some experience.

Here's the backstory. A few weeks ago I bought this new box, installed an HBA (LSI 9211-8i P20 IT Mode) and a couple of rear mount drives, installed FreeNAS, and added 12 10TB IronWolf Pro drives from a QNAP NAS that I was decommissioning. Those drives are a few years old, but had been performing great in the NAS. This box is being used as a TimeMachine store for my network.

Once I started using the system, it immediately started throwing drive faults (Pool xxxx state is UNAVAIL: One or more devices are faulted in response to persistent errors.). The Pool status page would show some number of errors, so I would take the drive offline, replace it, let it resilver, and the pool would come back online. Then several hours later it would happen again. After the 5th or 6th drive I started to suspect that the drives weren't really all failing simultaneously, so I ordered a new HBA (same type), new cables, and a replacement backplane.

Over a few days I swapped all 3 out, and the new backplane seemed to help for a couple of days, but then the errors started to recur. At this point I decided that perhaps the drives really were all reaching end of life and started replacing them as they faulted. Until today.

Now the box is telling me that one of my brand new drives has faults. I understand that this is not impossible, but after replacing 8 of the old drives I really don't believe the faults, and I'm not sure what else to check. I've attached the dmesg file, but the errors don't tell me much.

Code:
CPU: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz (2900.07-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x206d7  Family=0x6  Model=0x2d  Stepping=7
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x1fbee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  Structured Extended Features3=0x9c000400<MD_CLEAR,IBPB,STIBP,L1DFL,SSBD>
  XSAVE Features=0x1<XSAVEOPT>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
nfsd: can't register svc name
        (pass4:mps0:0:19:0): LOG SENSE. CDB: 4d 00 0d 00 00 00 00 00 40 00 length 64 SMID 743 Aborting command 0xfffffe000219df30
mps0: Sending reset from mpssas_send_abort for target ID 19
        (da3:mps0:0:19:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 90 terminated ioc 804b loginfo 31140000 scsi 0 state c xfer 0
mps0: Unfreezing devq for target ID 19
(da3:mps0:0:19:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da3:mps0:0:19:0): CAM status: CCB request completed with an error
(da3:mps0:0:19:0): Retrying command
        (da6:mps0:0:24:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 1054 Aborting command 0xfffffe00021b7760
mps0: Sending reset from mpssas_send_abort for target ID 24
mps0: Unfreezing devq for target ID 24
(da6:mps0:0:24:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da6:mps0:0:24:0): CAM status: Command timeout
(da6:mps0:0:24:0): Retrying command
(da6:mps0:0:24:0): SYNCHRONIZE CACHE(10)
 

Attachments

  • dmesg.txt
    31.3 KB · Views: 242
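
One way to see whether the faults follow a single drive or are spread across the bus is to tally the CAM status lines per device. The following is a minimal sketch, assuming Python 3 is available and that the log lines have the same shape as the excerpt above; the function name `tally_cam_errors` and the `SAMPLE` string are illustrative, not part of FreeNAS.

```python
import re
from collections import Counter

# Matches lines such as "(da6:mps0:0:24:0): CAM status: Command timeout".
CAM_STATUS = re.compile(r"\((\w+):\w+:\d+:(\d+):\d+\): CAM status: (.+)")


def tally_cam_errors(dmesg_text):
    """Count CAM status lines per (device, target ID, status message)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = CAM_STATUS.search(line)
        if match:
            device, target, status = match.groups()
            counts[(device, int(target), status.strip())] += 1
    return counts


# Lines copied from the dmesg excerpt in this post.
SAMPLE = """\
(da3:mps0:0:19:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da3:mps0:0:19:0): CAM status: CCB request completed with an error
(da3:mps0:0:19:0): Retrying command
(da6:mps0:0:24:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da6:mps0:0:24:0): CAM status: Command timeout
(da6:mps0:0:24:0): Retrying command
"""

if __name__ == "__main__":
    for (device, target, status), count in sorted(tally_cam_errors(SAMPLE).items()):
        print(f"{device} (target {target}): {status} x{count}")
```

To run it against the full log, pass the contents of the attached file (for example `open("dmesg.txt").read()`) instead of `SAMPLE`. If the same kind of error shows up on many different targets, that would point away from individual drive failures, though it does not by itself say what the common cause is.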

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
This is quite a hefty box, and has a correspondingly large need for clean power. Check your power supplies. They may not be supplying enough power to all parts of your system, leading to voltage sags that are manifesting as these drive dropouts.
 

bferrell

Sam - Thanks. The data center has its own dedicated power with UPS backup, and the box has dual 1,100W supplies connected to different CyberPower PR1500LCDRT2U UPSes, each on a separate 20A circuit, so I don't think it's power related. But if it were, is there a good test for that?
 

Samuel Tai

Look in iDRAC for power supply status and other system status.
 

bferrell

Yeah, I don't see anything in there about power, and all of the UPSes are fed from amp-reporting PDUs, all of which are drawing under 5 amps (and each is on a 20A circuit).
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Take a look at these threads about the SYNCHRONIZE CACHE errors in the 10TB IronWolf - you may be experiencing a similar issue.


The fix appears to be in an updated firmware based on the results from 10TB/8TB IronWolf users. Can you do a smartctl -x on your drive to check its current firmware level and see if there is a newer release?
 

bferrell

Looks like there is an EN02 available. Can I apply that from the UI? Thanks for the tip, BTW; it does look like my issue, which at least makes me feel less crazy.

Code:
root@freenas[/dev]# smartctl -x /dev/da6
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf Pro
Device Model:     ST10000NE0004-1ZF101
Serial Number:    ZA28F3ZT
LU WWN Device Id: 5 000c50 0b2516bbd
Firmware Version: EN01
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu May 28 12:07:41 2020 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Disabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled
 
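
With a dozen bays, checking the firmware level one drive at a time gets tedious. The following is a minimal sketch, assuming Python 3.7+ and that `smartctl -i` prints the same "Key: value" identity lines as the output above; the helper names (`parse_identity`, `read_identity`, `needs_update`) are hypothetical, and the "EN02" default simply reflects the release mentioned in this post.

```python
import subprocess

# Identity fields of interest, spelled as smartctl prints them.
FIELDS = ("Device Model", "Serial Number", "Firmware Version")


def parse_identity(smartctl_output):
    """Extract model, serial and firmware from smartctl identity output."""
    info = {}
    for line in smartctl_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in FIELDS:
            info[key.strip()] = value.strip()
    return info


def read_identity(device):
    """Run `smartctl -i` on one device (needs root and smartmontools)."""
    result = subprocess.run(
        ["smartctl", "-i", device],
        capture_output=True, text=True, check=False,
    )
    return parse_identity(result.stdout)


def needs_update(info, latest="EN02"):
    """True if the reported firmware differs from the assumed latest release."""
    return info.get("Firmware Version") != latest


# Lines copied from the smartctl output in this post.
SAMPLE = """\
Model Family:     Seagate IronWolf Pro
Device Model:     ST10000NE0004-1ZF101
Serial Number:    ZA28F3ZT
Firmware Version: EN01
Local Time is:    Thu May 28 12:07:41 2020 PDT
"""

if __name__ == "__main__":
    info = parse_identity(SAMPLE)
    print(info, "needs update:", needs_update(info))
```

On the box itself, something like `read_identity("/dev/da6")` in a loop over the pool's device nodes would give the same dictionary per drive; the device paths are whatever the system actually assigned, so treat any list of names as an assumption to verify first.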

HoneyBadger

Can I apply [the new firmware] from the UI?

I don't believe so; you might need to create a bootable USB/ISO for this. I'm just hoping it's not a "Windows-Only" executable or some silliness. Disclaimer: I don't have any IronWolfs (IronWolves?) here, so I can't test out the process. I assume it's non-destructive, but especially in the case of hard drive firmware updates - make sure you have backups, and maybe update only a drive or two at a time?
 

bferrell

Yeah, I've got a pile of the old drives that I'd like to get back into the case, so I think I'll get an external caddy, update them offline, and then put them back in. I'm also trying the other camcontrol fix. Really appreciate the pointers.
 

HoneyBadger

You will likely want to have the drives connected via SATA or SAS, so be aware that if your external caddy is USB, the firmware update might refuse to apply. If you have an eSATA port, that's fine; otherwise I would say an internal connection is best, even if it means grabbing another machine to use on a temporary basis.
 