Disc Degraded

Peque · Jan 5, 2021

Hi Forum,
I've been using FreeNAS at home and therefore decided to use TrueNAS as storage at work.

It has been running for 2 months now (a brand-new server) with 12 new disks in RAIDZ2. All disks are TOSHIBA MG07SCA12TE.

Suddenly, this Monday, it started sending mails saying that 7 devices on the server are degraded, along with the pool.
The next morning the rest of the disks were degraded too. I have trouble believing that 12 new disks have broken down at once.

I've tried taking the disks offline and forcing a replace, without any luck. Does this sound reliable? Should I change 12 disks after 2 months of running?

How should I proceed when this happens? I cannot think that all 12 disks, including the control disks, are dead at once.
And I cannot see how to get this up and running again without changing all 12 disks.

sretalla · Jan 5, 2021

What do you see from zpool status -v ?

You would want to make sure that the disks are bad and not the controller or cabling, so consider looking at your SMART data: smartctl -a /dev/daX
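
A minimal sketch of that SMART check, assuming the SAS-style field names that smartctl -a prints for SCSI drives (ATA drives label these differently). The helper name smart_summary is hypothetical and not part of smartmontools. It keeps only the lines that help separate a failing disk from a cabling or controller problem:

```shell
#!/bin/sh
# smart_summary: hypothetical helper, not part of smartmontools.
# Reads `smartctl -a` output on stdin and keeps the serial number, overall
# health, grown defect count, error counter rows and non-medium error count.
# Assumption: SCSI/SAS-style field names; ATA drives label these differently.
smart_summary() {
    grep -E 'Serial number|SMART Health Status|Elements in grown defect list|Non-medium error count|^(read|write|verify):'
}
```

For example: smartctl -a /dev/da6 | smart_summary. On FreeBSD-based TrueNAS, sysctl -n kern.disks lists the disk names if you want to loop over every drive.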

Peque · Jan 6, 2021

I'm getting the following from zpool status -v

Code:

# zpool status -v
  pool: data
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jan  5 17:01:20 2021
        19.1T scanned at 332M/s, 16.4T issued at 286M/s, 19.2T total
        2.61T resilvered, 85.71% done, 02:47:15 to go
config:

        NAME                                                STATE     READ WRITE CKSUM
        data                                                DEGRADED     0     0     0
          raidz2-0                                          DEGRADED     0     0     0
            gptid/7fadfa1b-4ddc-11eb-825b-3cecef0d2eca.eli  ONLINE       0     0    29
            gptid/205bc94a-4e5f-11eb-825b-3cecef0d2eca.eli  ONLINE       0     0    29
            gptid/815a99c0-4e5f-11eb-825b-3cecef0d2eca.eli  ONLINE       0     0    58  (resilvering)
            gptid/48c37ee5-4f16-11eb-825b-3cecef0d2eca.eli  ONLINE       0     0    58  (resilvering)
            gptid/2c2aaf7d-f98c-11ea-bd70-3cecef0d2eca.eli  DEGRADED     0     0    29  too many errors
            gptid/2f534ecc-f98c-11ea-bd70-3cecef0d2eca.eli  DEGRADED     0     0    29  too many errors
            gptid/316b340c-f98c-11ea-bd70-3cecef0d2eca.eli  DEGRADED     0     0    29  too many errors
            gptid/300ec1e1-f98c-11ea-bd70-3cecef0d2eca.eli  DEGRADED     0     0    29  too many errors
            gptid/31ac779d-f98c-11ea-bd70-3cecef0d2eca.eli  DEGRADED     0     0    29  too many errors
            gptid/321c95e4-f98c-11ea-bd70-3cecef0d2eca.eli  DEGRADED     0     0    29  too many errors
            gptid/33b2adf1-f98c-11ea-bd70-3cecef0d2eca.eli  DEGRADED     0     0    29  too many errors
            gptid/3119d081-f98c-11ea-bd70-3cecef0d2eca.eli  DEGRADED     0     0    29  too many errors

errors: Permanent errors have been detected in the following files:

        /mnt/data/veeam/Udvikling/XXXX-MYSQL-SRV01.10D2021-01-02T220032_5130.vbk

  pool: freenas-boot
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 00:00:08 with 0 errors on Fri Jan  1 03:45:08 2021
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            ada0p2    ONLINE       0     0     0
            ada1p2    ONLINE       0     0     0

errors: No known data errors

This is taken from /dev/da6, which is one of the degraded disks:

Code:

# smartctl -a /dev/da6
smartctl 7.1 2019-12-30 r5022 [FreeBSD 12.2-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               TOSHIBA
Product:              MG07SCA12TE
Revision:             0101
Compliance:           SPC-4
User Capacity:        12,000,138,625,024 bytes [12.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000039a384a783d
Serial number:        60H0A07SFJ8G
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Jan  6 09:46:49 2021 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     34 C
Drive Trip Temperature:        65 C

Manufactured in week 25 of year 2020
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  9
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  9
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     211669.679           0
write:         0        0         0         0          0      14052.920           0
verify:        0        0         0         0          0          0.009           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -    2922                 - [-   -    -]
# 2  Background short  Completed                   -    2754                 - [-   -    -]
# 3  Background short  Completed                   -    2586                 - [-   -    -]
# 4  Background short  Completed                   -    2418                 - [-   -    -]
# 5  Background short  Completed                   -    2250                 - [-   -    -]
# 6  Background short  Completed                   -    2082                 - [-   -    -]
# 7  Background short  Completed                   -    1914                 - [-   -    -]
# 8  Background short  Completed                   -    1746                 - [-   -    -]
# 9  Background short  Completed                   -    1578                 - [-   -    -]
#10  Background short  Completed                   -    1410                 - [-   -    -]
#11  Background short  Completed                   -    1242                 - [-   -    -]
#12  Background short  Completed                   -    1073                 - [-   -    -]
#13  Background short  Completed                   -     905                 - [-   -    -]
#14  Background short  Completed                   -     737                 - [-   -    -]
#15  Background short  Completed                   -     569                 - [-   -    -]

Long (extended) Self-test duration: 65535 seconds [1092.2 minutes]

All this is quite new to me, but the only error I can see from these commands is this:

errors: Permanent errors have been detected in the following files:

/mnt/data/veeam/Udvikling/XXXX-MYSQL-SRV01.10D2021-01-02T220032_5130.vbk

Etorix · Jan 6, 2021

Details of your configuration are missing, in particular the motherboard and controller(s). Is it TrueNAS on bare metal or virtualised?

The eight drives with "too many errors" look suspiciously like bad cables or a bad controller. Do you have other cables/controllers to try?
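
One way to make that pattern visible is to tally the CKSUM column per device. This is a rough sketch, assuming the leaf devices are listed as gptid/... as in the output posted above. The helper names cksum_counts and cksum_histogram are hypothetical:

```shell
#!/bin/sh
# cksum_counts: hypothetical helper. Reads `zpool status` output on stdin and
# prints "<CKSUM> <device>" for each leaf device.
# Assumption: leaf devices are listed as gptid/..., as in this pool.
cksum_counts() {
    awk '$1 ~ /^gptid\// && $5 ~ /^[0-9]+$/ { print $5, $1 }'
}

# cksum_histogram: count how many devices share each CKSUM value.
cksum_histogram() {
    cksum_counts | awk '{ print $1 }' | sort -n | uniq -c
}
```

Usage: zpool status data | cksum_histogram. For the output posted above this groups ten devices at 29 and two at 58. Identical counts across many disks are what you would expect from a shared component rather than from independent media failures.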

Peque · Jan 6, 2021

Hi and thanks for the reply.
Here is a description of our server. The hardware is:
SUPERMICRO SuperStorage Server SSG-5029P-E1CTR12L with TrueNAS-12.0-STABLE installed on bare metal, on 2 x 250GB SSDs in RAID1
The data storage is 12 x 12TB Toshiba disks in RAIDZ2

I do not have spares at the moment, but I'll try to get a new set of both!

ChrisRJ · Jan 7, 2021

Peque said:
[..] How should I proceed when this happens? I cannot think that all 12 disks, including the control disks, are dead at once.
And I cannot see how to get this up and running again without changing all 12 disks.

The probability of 12 disks going bad in such a short time frame is close to zero. It is much more likely that something is wrong with the HBA or similar.
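
As a back-of-the-envelope illustration of that point, assume (hypothetically) an annualized failure rate of 1% per disk and fully independent failures. The chance that all 12 disks fail within the same 2-day window is then the per-disk probability for that window raised to the 12th power. The helper name and the 1% figure are assumptions, not measured values for these drives:

```shell
#!/bin/sh
# joint_failure_probability AFR DAYS N
# Hypothetical helper: probability that N independent disks all fail within
# the same DAYS-long window, given an annualized failure rate AFR (0..1).
# Assumes a constant failure rate and independence; both are simplifications.
joint_failure_probability() {
    awk -v afr="$1" -v days="$2" -v n="$3" \
        'BEGIN { p = afr * days / 365; printf "%.2e\n", p ^ n }'
}

# 1% AFR (assumed), 2-day window, 12 disks
joint_failure_probability 0.01 2 12
```

With these assumed inputs the result is roughly 7e-52, so a single shared cause (HBA, backplane, cabling, power) is a far more plausible explanation than twelve independent disk failures.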

Etorix · Jan 7, 2021

While waiting for the new parts, you may check that cables are well seated and/or initiate long SMART tests on the drives (will take overnight, and some more).
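
A small sketch for starting those long tests on several drives in one go. The helper name start_long_tests and the SMARTCTL override are hypothetical; smartctl -t long itself is the standard way to start an extended self-test:

```shell
#!/bin/sh
# start_long_tests DISK...
# Hypothetical helper: start an extended (long) SMART self-test on each disk.
# SMARTCTL may be overridden, e.g. SMARTCTL=echo for a dry run.
start_long_tests() {
    for disk in "$@"; do
        ${SMARTCTL:-smartctl} -t long "/dev/$disk"
    done
}
```

Usage: start_long_tests da0 da1 da2 (and so on for each data disk). The tests run in the background on the drives; the results show up later in the self-test log, for example via smartctl -l selftest /dev/da6. The da6 output above reports a long self-test duration of 65535 seconds (about 18 hours); that is also the largest value a 16-bit field can hold, so the real duration may be longer.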

Peque · Jan 8, 2021

Hi guys.
These days nothing is easy in this Corona world ...
I've talked with Supermicro support and run their diagnostics, and the answer was:

Of course the simultaneous failure of 12 disks is indeed impossible
All your used harddrives are listed as compatible/tested on your SSG-5029P-E1CTR12L system so also a disk compatibility issue can be ruled out
After examining the provided controller logs also no errors are found
This suggests there are no hardware related errors at all

Currently the controller is configured as HBA (NON-RAID) controller (IT Firmware) and the RAID is controlled and created by the OS itself (which currently appears to have issues)

The next question for me is then:

Which way is best for the storage: the onboard RAID controller with RAID10, or TrueNAS with RAIDZ2?
Could this have been caused by upgrading from FreeNAS and upgrading the disk encryption?

Worst case, I'll go back to FreeNAS, reinstall everything, delete my data pools etc. and create a new storage pool. Since this is the primary storage pool on a brand-new server, this seems strange to me.

ChrisRJ · Jan 8, 2021

It is just like with car diagnostics these days: The fact that a vendor tells you "nothing wrong" only means that their program has not identified a situation that was pre-conceived by the program's author. In turn, this means either that a completely separate hardware failure can exist, or that the program failed to properly detect a situation it should be able to diagnose.

In other words: Such programs can be helpful, but following them blindly is not very clever. The fact that your system basically died after 2 months is a very strong indicator that you have indeed a hardware failure. It is still more or less a DOA (dead on arrival) which is not that uncommon. To me it looks like a first-level support guy that, for whatever reason, tried to deflect the issue.

As to next steps: A "RAID controller" that sits on the motherboard is not a proper RAID controller in most cases. With Supermicro your chances of having something reasonable are probably above average, but without additional details I would always strongly suggest avoiding such things. What you can do is:

go back to Supermicro and insist on proper diagnostics, an on-site technician, or whatever your contract entitles you to
connect the disks directly to the MB without going via the backplane to narrow down the issue
re-install TrueNAS or FreeNAS (I would recommend the latter for the time being) and hope that it solves the issue

Peque · Jan 9, 2021

Since this HBA is embedded onboard: which controller, supported by BSD and TrueNAS and able to handle all 12 disks, could replace it?
I did not specify a particular controller when buying the server, since I planned on using RAIDZ2 with the HBA controller.
But this controller is onboard and cannot be replaced without replacing the motherboard, so I'm thinking about getting another controller to run as HBA instead of the onboard one.

Supermicro is pretty hard to get through to, since they only go by the collected log files etc., as described above.

So I'm thinking about trying a PCI controller instead of the onboard controller.

Redcoat · Jan 9, 2021

Here's SM's information on the on-board controllers (from the spec sheet):

Have you read the postings here on the Broadcom 3008? There are a lot of positive postings - perhaps there's something helpful to you in them. The embedded version is mentioned.
