Sudden pool failure

Lebesgue · May 29, 2019

Hi everyone,
returning from a work trip of a few days, I found one of my pools offline.
zpool status for the pool yields:
pool: ESXIVOL
state: UNAVAIL
status: One or more devices are faulted in response to persistent errors. There are insufficient replicas for the pool to
continue functioning.
action: Destroy and re-create the pool from a backup source. Manually marking the device
repaired using 'zpool clear' may allow some data to be recovered.
scan: resilvered 34.6G in 0 days 00:19:57 with 0 errors on Mon May 27 08:17:09 2019
config:

NAME STATE READ WRITE CKSUM
ESXIVOL UNAVAIL 0 22 0
raidz1-0 DEGRADED 0 0 0
gptid/04d61697-53d7-11e9-bdc1-5065f366e21a ONLINE 0 0 0
gptid/c1f6ae17-7ccc-11e9-95f2-5065f366e21a ONLINE 0 0 0
gptid/c779e732-682c-11e9-ace4-5065f366e21a FAULTED 3 188 0 too many errors
gptid/9a22345f-683d-11e9-b8ac-5065f366e21a ONLINE 0 0 0
raidz1-1 UNAVAIL 0 44 0
gptid/8e505027-7ccd-11e9-95f2-5065f366e21a ONLINE 0 0 0
gptid/3cd86ea5-6829-11e9-ace4-5065f366e21a FAULTED 3 144 0 too many errors
gptid/c7b0545f-67fd-11e9-ace4-5065f366e21a FAULTED 3 60 0 too many errors
gptid/4dea18b1-6f03-11e9-b9c9-5065f366e21a FAULTED 0 4 0 too many errors
raidz1-2 DEGRADED 0 0 0
gptid/dd2cd7a2-803b-11e9-95f2-5065f366e21a ONLINE 0 0 0
gptid/26137960-7ccd-11e9-95f2-5065f366e21a ONLINE 0 0 0
gptid/a5814bbe-680f-11e9-ace4-5065f366e21a ONLINE 0 0 0
gptid/fba7b20e-6838-11e9-b8ac-5065f366e21a FAULTED 3 119 0 too many errors
raidz1-3 DEGRADED 0 0 0
gptid/31f55316-6836-11e9-b8ac-5065f366e21a ONLINE 0 0 0
gptid/13124900-680c-11e9-ace4-5065f366e21a ONLINE 0 0 0
gptid/9a038ea8-6801-11e9-ace4-5065f366e21a FAULTED 3 20 0 too many errors
gptid/fe82980a-549e-11e9-bdc1-5065f366e21a ONLINE 0 0 0

errors: 2413 data errors, use '-v' for a list
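
The status above accounts for the UNAVAIL state by simple counting: a raidz1 vdev survives one failed member, and raidz1-1 has three faulted. A minimal Python sketch of that count, assuming the flattened `zpool status` layout shown in this post. The names `faulted_per_vdev` and `SAMPLE` are illustrative only and not part of any ZFS tooling.

```python
# Number of member failures each raidz level can survive.
PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}

# Excerpt copied from the zpool status output in this post (first two vdevs).
SAMPLE = """\
ESXIVOL UNAVAIL 0 22 0
raidz1-0 DEGRADED 0 0 0
gptid/04d61697-53d7-11e9-bdc1-5065f366e21a ONLINE 0 0 0
gptid/c1f6ae17-7ccc-11e9-95f2-5065f366e21a ONLINE 0 0 0
gptid/c779e732-682c-11e9-ace4-5065f366e21a FAULTED 3 188 0 too many errors
gptid/9a22345f-683d-11e9-b8ac-5065f366e21a ONLINE 0 0 0
raidz1-1 UNAVAIL 0 44 0
gptid/8e505027-7ccd-11e9-95f2-5065f366e21a ONLINE 0 0 0
gptid/3cd86ea5-6829-11e9-ace4-5065f366e21a FAULTED 3 144 0 too many errors
gptid/c7b0545f-67fd-11e9-ace4-5065f366e21a FAULTED 3 60 0 too many errors
gptid/4dea18b1-6f03-11e9-b9c9-5065f366e21a FAULTED 0 4 0 too many errors
"""

def faulted_per_vdev(status_text):
    """Return {vdev_name: (faulted_members, tolerated_failures)}."""
    result = {}
    current = None
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        name, state = fields[0], fields[1]
        kind = name.split("-")[0]
        if kind in PARITY:
            # A new raidz vdev header; following gptid lines belong to it.
            current = name
            result[current] = [0, PARITY[kind]]
        elif current and name.startswith("gptid/") and state == "FAULTED":
            result[current][0] += 1
    return {vdev: tuple(counts) for vdev, counts in result.items()}

counts = faulted_per_vdev(SAMPLE)
# vdevs with more faulted members than their parity can cover
lost = [vdev for vdev, (bad, ok) in counts.items() if bad > ok]
print(lost)  # ['raidz1-1']
```

Because a pool needs every top-level vdev, one vdev beyond its redundancy is enough to take the whole pool offline, even though the other three are only DEGRADED.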

However, SMART for the individual disks looks OK, e.g. for the 3 failed drives in the same vdev:

=== START OF INFORMATION SECTION ===
Vendor: SmrtStor
Product: DOPA0920S5xnNMRI
Revision: 3P00
Compliance: SPC-4
User Capacity: 915,954,950,144 bytes [915 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=0
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x500117310020b3ac
Serial number: FG007T45
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Wed May 29 23:22:30 2019 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 0%
Current Drive Temperature: 29 C
Drive Trip Temperature: 70 C

Manufactured in week 27 of year 2013
Specified cycle count over device lifetime: 0
Accumulated start-stop cycles: 146
Elements in grown defect list: 0

Error counter log:
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 0 0 0 0 0 9900.620 0
write: 0 0 0 0 0 11396.555 0

Non-medium error count: 0

SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Default Completed - 3592 - [- - -]

Long (extended) Self Test duration: 2880 seconds [48.0 minutes]

=== START OF INFORMATION SECTION ===
Vendor: SmrtStor
Product: DOPA0920S5xnNMRI
Revision: 3P00
Compliance: SPC-4
User Capacity: 915,954,950,144 bytes [915 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=0
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x500117310020b4f4
Serial number: FG007T6M
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Wed May 29 23:23:08 2019 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 0%
Current Drive Temperature: 30 C
Drive Trip Temperature: 70 C

Manufactured in week 27 of year 2013
Specified cycle count over device lifetime: 0
Accumulated start-stop cycles: 125
Elements in grown defect list: 0

Error counter log:
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 0 0 0 0 0 3410.039 0
write: 0 0 0 0 0 9640.794 0

Non-medium error count: 0

No self-tests have been logged

=== START OF INFORMATION SECTION ===
Vendor: SmrtStor
Product: DOPA0920S5xnNMRI
Revision: 3P00
Compliance: SPC-4
User Capacity: 915,954,950,144 bytes [915 GB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
LU is resource provisioned, LBPRZ=0
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x500117310020b6b8
Serial number: FG007TA2
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Wed May 29 23:23:31 2019 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 0%
Current Drive Temperature: 30 C
Drive Trip Temperature: 70 C

Manufactured in week 27 of year 2013
Specified cycle count over device lifetime: 0
Accumulated start-stop cycles: 117
Elements in grown defect list: 0

Error counter log:
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 0 0 0 0 0 3437.090 0
write: 0 0 0 0 0 9462.567 0

Non-medium error count: 0

No self-tests have been logged

The FreeNAS message log is full of errors like this:
May 29 02:50:50 freenas (da1:mps0:0:33:0): WRITE(10). CDB: 2a 00 6a a1 92 70 00 00 08 00
May 29 02:50:50 freenas (da1:mps0:0:33:0): CAM status: SCSI Status Error
May 29 02:50:50 freenas (da1:mps0:0:33:0): SCSI status: Check Condition
May 29 02:50:50 freenas (da1:mps0:0:33:0): SCSI sense: MEDIUM ERROR asc:31,0 (Medium format corrupted)
May 29 02:50:50 freenas (da1:mps0:0:33:0): Field Replaceable Unit: 24
May 29 02:50:50 freenas (da1:mps0:0:33:0): Actual Retry Count: 0
May 29 02:50:50 freenas (da1:mps0:0:33:0): Retrying command (per sense data)
May 29 02:50:50 freenas (da1:mps0:0:33:0): WRITE(10). CDB: 2a 00 00 40 02 70 00 00 08 00
May 29 02:50:50 freenas (da1:mps0:0:33:0): CAM status: SCSI Status Error
May 29 02:50:50 freenas (da1:mps0:0:33:0): SCSI status: Check Condition
May 29 02:50:50 freenas (da1:mps0:0:33:0): SCSI sense: MEDIUM ERROR asc:31,0 (Medium format corrupted)
May 29 02:50:50 freenas (da1:mps0:0:33:0): Field Replaceable Unit: 24
May 29 02:50:50 freenas (da1:mps0:0:33:0): Actual Retry Count: 0
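
The CDB bytes in these log lines show which blocks the failing writes targeted. A minimal Python sketch for decoding them, assuming the standard 10-byte WRITE(10) layout (opcode 0x2a, big-endian LBA in bytes 2 to 5, transfer length in bytes 7 to 8). The helper names `decode_write10` and `parse_log_line` are made up for illustration.

```python
import re

# Matches the device name and the 10 CDB bytes in a CAM WRITE(10) log line.
LINE_RE = re.compile(
    r"\((da\d+):[^)]*\): WRITE\(10\)\. CDB: ((?:[0-9a-f]{2} ?){10})"
)

def decode_write10(cdb_hex):
    """Return (lba, block_count) from a WRITE(10) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # transfer length in blocks
    return lba, blocks

def parse_log_line(line):
    """Return (device, lba, block_count), or None if the line has no CDB."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    lba, blocks = decode_write10(match.group(2))
    return match.group(1), lba, blocks

line = ("May 29 02:50:50 freenas (da1:mps0:0:33:0): "
        "WRITE(10). CDB: 2a 00 6a a1 92 70 00 00 08 00")
print(parse_log_line(line))  # ('da1', 1788973680, 8)
```

With the 512-byte logical block size reported by smartctl, each of the two logged writes covers 8 blocks, i.e. 4096 bytes.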

I have tried to look up Field Replaceable Unit 24, but in vain.
Is anyone able to provide insight into this, including how to troubleshoot and which components to check?

Regards,
Thomas

myoung · May 29, 2019

Looks like those SSDs have never had any SMART tests run. I would start with a SMART long self-test and make sure that tests are scheduled for all drives, including any other pools you have. Were you monitoring closely enough to see whether all 6 drives failed at once, or could they have been slowly failing one at a time? How are they connected?

Lebesgue · May 29, 2019

Thanks for replying.
These are SAS SSD drives and do not support the short and long tests:
smartctl -t short /dev/da4
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

Short offline self test failed [unsupported field in scsi command]

I have scheduled a daily offline test for this SSD-based pool, but am running daily short and monthly long tests on my SATA HDD-based pool, which has been rock stable for several years.

All drives failed within one hour. This pool hosts VMware VMs, for which I receive hourly notifications, and those suddenly stopped.

The SAS SSD-based pool is connected to a 16-port internal HBA, and the SATA HDD-based pool to a 16-port external HBA:

Adapter Selected is a LSI SAS: SAS2116_1(B1)

Num Ctlr FW Ver NVDATA x86-BIOS PCI Addr
----------------------------------------------------------------------------

0 SAS2116_1(B1) 20.00.07.00 14.01.00.06 07.39.02.00 00:0d:00:00
1 SAS2116_1(B1) 20.00.07.00 14.01.00.07 07.39.02.00 00:21:00:00

Finished Processing Commands Successfully.

Both HBAs were replaced some months ago in an attempt to prevent a similar error from recurring :(

myoung · May 30, 2019

That's outside my wheelhouse.
