Patrick_3000
I have two pools: an HDD pool for data and an NVMe SSD pool for VMs. Since yesterday, devices in the NVMe SSD pool have intermittently been reported as removed.
In particular, the SSD pool is a three-way mirror of 3 x 4 TB consumer-grade SSDs, which I know some people don't recommend, but they see low utilization. The pool is only 18.9% full.
The SSD pool has functioned perfectly for about a year. Yesterday, however, I got a notice from SCALE that one device had been removed by the administrator, leaving the pool unhealthy. I ordered a replacement SSD and, today, replaced the SSD that was taken offline. The pool resilvered, after which it reported as healthy, with all three SSDs functioning properly.
About an hour later, however, I got a notice from SCALE that one of the other SSDs had been removed by the administrator, this time from a different manufacturer (a Teamgroup SSD rather than the Crucial SSD that was removed yesterday). It seems to me that the chance of two SSDs from different manufacturers failing within a day of each other, after functioning properly for a year, is low. And in fact, rebooting the server brought the supposedly "failed" SSD back online.
So now I'm in the strange situation where all three disks in the three-way mirror are online, but the pool still shows a status of "not healthy." Also, unfortunately, SMART tests on SCALE do not work for NVMe SSDs, so I don't know how to diagnose the problem, if there even is a problem. Does anyone know what to do? I'm afraid I'm going to keep getting intermittent notices that devices in the pool have been removed, even if they're not failing.
The SCALE version is 23.10.2, and in case it's relevant, the CPU is a Ryzen 7 Pro 5750G and the motherboard is an ASRock Rack X570D4U-2L2T, with 128 GB of ECC RAM.
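For anyone who wants to check the pool state from a shell rather than the UI, here is a minimal sketch (not a confirmed fix) that picks out entries in `zpool status` output whose state is not ONLINE. The pool name `ssd-pool`, the device names, and the sample text are placeholders I made up for illustration. They are not output from my system.

```python
import subprocess

# Hypothetical sample of `zpool status` output; pool and device names are
# placeholders, not taken from the real system.
SAMPLE = """
  pool: ssd-pool
 state: DEGRADED
config:

    NAME           STATE     READ WRITE CKSUM
    ssd-pool       DEGRADED     0     0     0
      mirror-0     DEGRADED     0     0     0
        nvme0n1p2  ONLINE       0     0     0
        nvme1n1p2  REMOVED      0     0     0
        nvme2n1p2  ONLINE       0     0     0

errors: No known data errors
"""


def read_pool_status(pool):
    """Run `zpool status <pool>` and return its text (needs ZFS installed)."""
    result = subprocess.run(
        ["zpool", "status", pool], capture_output=True, text=True, check=True
    )
    return result.stdout


def non_online_devices(status_text):
    """Return (name, state) pairs for config rows whose state is not ONLINE."""
    results = []
    in_config = False
    for line in status_text.splitlines():
        stripped = line.strip()
        # The device table starts right after the NAME/STATE header row.
        if stripped.startswith("NAME") and "STATE" in stripped:
            in_config = True
            continue
        if not in_config:
            continue
        # A blank line (or the errors line) ends the device table.
        if not stripped or stripped.startswith("errors:"):
            in_config = False
            continue
        fields = stripped.split()
        if len(fields) >= 2 and fields[1] != "ONLINE":
            results.append((fields[0], fields[1]))
    return results
```

On a real system, `non_online_devices(read_pool_status("ssd-pool"))` would be the call to try, with the actual pool name substituted. An empty list while the pool still reports unhealthy would suggest looking at the error counters and the status text in the full output rather than at device states.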