Multiple disk failures confusion

superfour

Dabbler
Joined
Jun 15, 2021
Messages
31
Hi all!

My system is 12U8.1, with 1 vdev consisting of 10 disks (RAIDZ2).

Lately I have had several warnings about failing disks, and at some point a disk showed as failed. However, after a restart it came back online and resilvered. But all the signs of failing disks are still there.

Now the system again shows 2 failed disks: one REMOVED and one UNAVAIL. Taking the s/n from the alert, I offlined the REMOVED disk and replaced it. And now the situation is this: I still have a REMOVED and an UNAVAIL disk in my vdev list... still not sure which disk is which, as they both appear only by gptid.

How do I go about figuring out what's going on here? I reckon this is critical for my array, since 2 disks have already failed.

TIA!
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
You will need to provide a lot more information to get anything but a commonplace response :smile:

In general, you will need to find the serial number (S/N) of the disk in question and match it to the disks in your machine. If your disks do not have labels with their S/N on them, this will be a less-than-pleasant exercise.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Running glabel status on the CLI gets you gptid vs. adaX or daX.

Storage > Disks gets you adaX or daX vs. Serial Number - then @ChrisRJ's comment may play...
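
For example (the adaX/daX names below are only placeholders - yours will differ):

Code:
# map gptid labels to device names (adaX / daX)
glabel status

# then read the model and serial number for a given device
smartctl -i /dev/ada0 | grep -E 'Model|Serial'

Match the gptid from zpool status to a device name via glabel status, then match that device's serial number to the label on the physical drive.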
 

superfour

Dabbler
Joined
Jun 15, 2021
Messages
31
Thank you guys!!

(Although I need to say, the fail/replace procedure is somewhat confusing IMHO. The system is now resilvering the one replaced disk, and a critical alert for a second disk came an hour after startup...)
 

superfour

Dabbler
Joined
Jun 15, 2021
Messages
31
And while I'm still at it (resilvering...), please tell me, what are the chances of this happening within one hour: :oops:

[Screenshot attachment: Untitled-1.png]
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Let's start with you telling us all about your hardware, and also as much as you can about the circumstances you referred to here:

"Lately I have had several warnings about failing disks, and at some point a disk showed as failed. However, after a restart it came back online and resilvered. But all the signs of failing disks are still there."

Have you run short or long SMART tests on any of your drives? If so, can you post the results?

What do you know about HDD or controller temperatures? Any power issues that you know about?

It would be good if you could run and report the output of zpool status -v, and also run a long SMART test on each of your drives and report the results in code tags (suggest you use tmux).
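
Something along these lines, from inside a tmux session so it survives an SSH disconnect (the device names are only examples - repeat for each of your adaX/daX data disks):

Code:
tmux

# start a long self-test on each data disk; the command returns immediately
# while the test runs inside the drive
smartctl -t long /dev/ada0
smartctl -t long /dev/ada1
# ...and so on for the remaining disks...

# once the tests have had time to finish, collect the results to post
smartctl -a /dev/ada0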
 

superfour

Dabbler
Joined
Jun 15, 2021
Messages
31
Only gets worse... o_O

Code:
# zpool status -v
  pool: Pool1000
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Sep 10 21:33:11 2022
        2.00T scanned at 450M/s, 1.40T issued at 316M/s, 4.41T total
        118G resilvered, 31.82% done, 02:46:20 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        Pool1000                                        DEGRADED     0     0 0
          raidz2-0                                      DEGRADED     0     0 0
            gptid/8db9bb46-537c-11ec-af45-001fd0a402b8  DEGRADED     0     0 1.23M  too many errors
            gptid/8dc85dd3-537c-11ec-af45-001fd0a402b8  DEGRADED     0     0 1.23M  too many errors
            gptid/8e71765a-537c-11ec-af45-001fd0a402b8  REMOVED      0     0 0
            gptid/054365a1-3137-11ed-8d00-001fd0a402b8  ONLINE       0     0 2.33M  (resilvering)
            gptid/8f0c44a0-537c-11ec-af45-001fd0a402b8  DEGRADED     0     0 1.23M  too many errors  (resilvering)
            gptid/8f11d75c-537c-11ec-af45-001fd0a402b8  DEGRADED     0     0 1.23M  too many errors  (resilvering)
            gptid/8ec03bb6-537c-11ec-af45-001fd0a402b8  REMOVED      0     0 0
            gptid/8f188c7d-537c-11ec-af45-001fd0a402b8  DEGRADED     0     0 1.23M  too many errors
            gptid/8f1c2cc5-537c-11ec-af45-001fd0a402b8  DEGRADED     0     0 1.23M  too many errors
            gptid/8f6d7dbc-537c-11ec-af45-001fd0a402b8  DEGRADED     0     0 1.23M  too many errors

errors: List of errors unavailable: pool I/O is currently suspended

  pool: freenas-boot
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 00:00:56 with 0 errors on Thu Aug 25 03:45:56 2022
config:
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Hmmm...

Tell us about the hardware, please.
 

superfour

Dabbler
Joined
Jun 15, 2021
Messages
31
It is/was my first testbed:
older Gigabyte EP45-DS3R mobo
Quad Core Q9650
8GB RAM
10x 1TB disks (various models) spread over mobo SATA plus two extra SATA controllers
SSD system disk
Corsair 650W psu

The server has been working for almost two years now and has had two faulted drives, both successfully replaced.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Details on the SATA controllers?

Are the drive problems confined to particular controller(s)?

Ambient temperatures high? Systems "cool"?
 

superfour

Dabbler
Joined
Jun 15, 2021
Messages
31
All is cool, ambient 26°C as we speak, CPU cool, all cool.
With almost all the disks having failed, I don't think it's a controller issue.
I've shut it down now, the hard way as it couldn't shut down gracefully, and will try again tomorrow.
Extra controllers from Sweex (2-port, Sil3132) and Delock (4-port, Marvell 88SE9215).
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
With almost all the disks having failed, I don't think it's a controller issue.
On the contrary! Those are exactly the symptoms that point to the drives not being the root cause. The controllers, the cables, or the PSU are more likely the reason than the drives themselves.

The ambient temperature tells you very little about the drive temperature. The drives can easily be 20 °C warmer.
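
A quick way to see what the drives themselves report (device name is just an example; the attribute is usually called Temperature_Celsius):

Code:
smartctl -a /dev/ada0 | grep -i temperature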

The SATA controllers are certainly suspects here.

You still have not provided a lot of information. What HDDs are you running? Did anything happen or change before the disks started reporting problems?

Your memory is too small for a current version of TrueNAS.

Why did you shut down the system while it was resilvering? At least that is what I understand from your writing, although strictly speaking you did not provide this information either.

I understand that you are in a bit of panic (and understandably so). But getting information feels like pulling teeth here. We want to help you! But that requires you to provide as much information as possible. Please read this as constructive feedback, driven by the wish to help.
 

superfour

Dabbler
Joined
Jun 15, 2021
Messages
31
Thank you Chris!

I understand that RAM is marginal, but it's a system for personal use. Also, it has been working smoothly for two years. I shut it down because all this havoc started taking place right after I replaced that one drive, and it was getting worse by the minute. [some panic here!]

What was happening before all this was the reason for my original post: I was getting alerts for 2 or 3 drives faulting (intermittently 2/3), and the combination of information I was seeing was confusing (which drive was UNAVAIL, REMOVED, etc.). I couldn't tell exactly which drive was faulting. Even today, I started the server and it shows 2 drives online (it was only 1 when I shut down).

Drives are
3x WD10EZEX
3x ST1000DM003
4x ST91000640NS

I shut down again and removed the 2-port controller to see what happens -- nothing much. The drive list in TrueNAS looks the same (shouldn't it show 2 fewer disks on the Disks page?) and the disks are still degraded/removed except for 2. Scanning is taking place now, and by the looks of it, it will take a couple of hours.

So, next test is to remove the other controller?

Thanks again, I appreciate!
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Yes, if you can attach all drives to the motherboard, the next step is to remove the dubious SATA controllers.
If you can't attach all drives directly, you'd need a proper HBA, say an LSI 9200. And more RAM if possible. ZFS was developed as an enterprise solution; it scales up very well and is designed to make the most of all the money that is thrown at it, but it is not so good at scaling down to small and cheap home setups.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
You really need to at least double your total RAM. Possibly ECC.
The mobo is old (it uses DDR2), so I can totally see a SATA controller issue. Could be time to upgrade.

Since the SSD is safe though, I believe it's pool-related. Maybe an issue linked with memory starvation? Purely guessing here.

Can you post the SMART data for the faulty drives?

Also: make sure you are not using the SATA IC in RAID mode.
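
For the SMART data, smartctl -a per suspect drive (example device name below), posted in code tags, is enough. As a rough - not definitive - check of how the controllers present the disks, the boot messages can help too:

Code:
smartctl -a /dev/ada2

dmesg | grep -iE 'ahci|raid'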
 

superfour

Dabbler
Joined
Jun 15, 2021
Messages
31
What I *can* do is move all the disks to another server (I'll need the 4-port controller too...) and try there. It will take some time, but I'll try it and get back with a report.

BTW, large DDR2 RAM modules are insanely expensive around here!
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Then don't spend on obsolete DDR2 RAM, but do get an HBA. You have to get rid of this 4-port SATA controller before it damages your pool beyond repair!
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
I understand that RAM is marginal, but it's a system for personal use.
The quoted statement is equivalent to many others I have seen here over the years. So the content below is really not meant to be picking on the author. Rather, I finally got motivated to write something more general (perhaps I will eventually make a resource out of this).

There are two aspects when running a system with fewer resources (RAM, CPU power, network bandwidth, etc.) than specified as the minimum by the vendor:
  1. Performance
  2. Stability
Most people think primarily (or exclusively) about performance. It is more or less the logical consequence of personal experience: if I give the system less RAM than recommended, it will slow down (sometimes dramatically). With the power we have in systems for personal use (PCs and laptops) today, there is very little software that can be affected so badly that it will actually crash or destroy data. That was pretty different 20+ years ago, when I ran a J2EE application server, an Oracle 8i database, and an Eclipse IDE on a laptop with 256 MB of RAM and a single-core CPU at 800 MHz (not to forget the 20 GB HDD at 4200 RPM). Many people have not experienced this first-hand. Lucky you! Pressing the SAVE button and literally waiting 20 minutes for the save operation (of a 100 KB XML file) to complete at 11pm is not a fun thing. If things didn't crash, that is.

The challenge with ZFS is that it was designed as an enterprise storage system. It comes from an era (development started in 2001) when there were no SSDs, and HDDs had less than 100 GB. So in those days, if you needed e.g. 100k IOPS from your SAN (NAS as we know it today didn't really exist yet), you simply had to get enough HDDs (each delivering 200-300 IOPS) to reach that number. To make things a bit more fun, the redundancy that was required as a safety measure meant that you needed not only 100k / 300 ≈ 334 disks' worth of raw IOPS, but also the corresponding number of RAID-6 arrays/LUNs, or RAIDZ2 vdevs in the context of ZFS. So we are talking about 1,000+ very expensive enterprise HDDs. This is the environment that ZFS was designed for. Yes, you could get cheaper storage systems from Sun, but in many cases it would be a 6- or 7-digit figure.

Why am I telling you all this? Because the design makes certain assumptions about the underlying hardware. Of course a lot has changed since then, and systems are way more stable and robust in many ways. But the installations that have undergone thorough vendor testing, as well as being used in production by companies and serious(!) hobbyists, have in common that they follow the hardware recommendations. So the battle-hardened knowledge about ZFS comes from systems that are conventional (one could even call them boring). Whenever you stray from that path, the risk of something bad happening increases. In other words, not all of the assumptions are necessarily double-checked. If you happen to violate one that is not, anything can happen.

This is why, when you use less RAM than the recommended minimum, there is a risk of losing data. The same goes for RAID controllers (or USB bridges), or anything else that limits ZFS's control over access to the disks. Of course there will also be performance degradation. But that is less of a problem.

I hope this is useful to some folks. Feedback is more than welcome.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
TL;DR: if you want no trouble, respecting the minimum requirements helps.

If you upgrade, consider ECC RAM.
 

superfour

Dabbler
Joined
Jun 15, 2021
Messages
31
Thank you all for your input, especially Chris. Let this thread be useful to others... In my defense, I discovered FreeNAS like many, many others, I'm sure: reliable, friendly to old hardware. I played around a lot before building 'normal' arrays. And as I said above, this one was running problem-free for two years. I'd classify that as "proven", no?

So... let me say that I still cannot comprehend how a 2-year-old system, used by one user only and never stressed, would fail in such a manner. What factor can *actually break* a disk, or break 8 disks in 60 minutes for that matter? Too little RAM!? Three controllers simultaneously having strokes?

In the meantime, I have moved the array to another 12U8 server. Decent quad-core i7-4770TE with 16 GB RAM.

The pool was imported, but shows the same picture: 2 disks online, 8 disks degraded. I've started long SMART tests.

Thanks again & I am very very open to suggestions!
 