Degraded drive x2 WD Red

Spoon

Dabbler
Joined
May 24, 2015
Messages
25
Hi All,

Got back last Saturday to find TrueNAS warning me that the pool was in a degraded state. On closer inspection, two of the four disks had been removed. Cue panic mode whilst I frantically made sure I had a backup of the data.

Whilst that was ongoing I got on with ordering some replacement drives. I've ended up with 2x 10TB WD Gold. They came with a 5-year warranty and were cheaper than any of the Red or Red Pro drives.

After backing up the data I replaced the failed drives and hit "Replace". Pool status seems OK.

Issues/Questions:

  • I haven't burnt the drives in or checked them beyond running a quick SMART test from the GUI. I've hit "Manual S.M.A.R.T. Test" and "long" for each of the drives. Will this be enough, or should I do a more comprehensive test, and if so how should I approach it? Offline the drives individually and then test?

  • I noticed the WD Gold uses significantly more power than the WD Red it replaced (spec says 10W). Would four drives be too much for the Microserver Gen 8 PSU (specs in sig)?

  • smartctl -P show /dev/ada1 gives the following, which I've never seen before. Is it just that the model isn't in the database?

Code:
root@freenas[~]# smartctl -P show /dev/ada2
smartctl 7.1 2019-12-30 r5022 [FreeBSD 12.1-STABLE amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

No presets are defined for this drive. Its identity strings:
MODEL:        WDC WD102KRYZ-01A5AB0
FIRMWARE:    01.01H01
do not match any of the known regular expressions.
Use -P showall to list all known regular expressions.


  • Of the two drives that were booted out of the array, one has the following:

Code:
Error 1 occurred at disk power-on lifetime: 25303 hours (1054 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 08 90 01 40 e0  Error: IDNF at LBA = 0x00400190 = 4194704

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 08 90 01 40 e0 00      09:09:41.315  WRITE DMA


From the Syslog I see the following:

Code:
Sep 12 15:43:59 freenas (ada3:ata3:0:1:0): RES: 51 10 c0 d2 97 20 20 01 00 08 00
Sep 12 15:43:59 freenas (ada3:ata3:0:1:0): Retrying command, 3 more tries remain
Sep 12 15:46:17 freenas (ada3:ata3:0:1:0): WRITE_DMA48. ACB: 35 00 f8 0f 43 40 78 00 00 00 08 00
Sep 12 15:46:17 freenas (ada3:ata3:0:1:0): CAM status: ATA Status Error
Sep 12 15:46:17 freenas (ada3:ata3:0:1:0): ATA status: 51 (DRDY SERV ERR), error: 10 (IDNF )
Sep 12 15:46:17 freenas (ada3:ata3:0:1:0): RES: 51 10 f8 0f 43 78 78 00 00 08 00
Sep 12 15:46:17 freenas (ada3:ata3:0:1:0): Retrying command, 3 more tries remain
Sep 12 15:47:18 freenas (ada3:ata3:0:1:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Sep 12 15:47:18 freenas (ada3:ata3:0:1:0): CAM status: Command timeout
Sep 12 15:47:18 freenas (ada3:ata3:0:1:0): Retrying command, 0 more tries remain
Sep 12 15:47:18 freenas ada2 at ata3 bus 0 scbus1 target 0 lun 0
Sep 12 15:47:18 freenas ada2: <WDC WD40EFRX-68WT0N0 82.00A82> s/n WD-WCC4E3LHE7R4 detached
Sep 12 15:47:18 freenas ada3 at ata3 bus 0 scbus1 target 1 lun 0
Sep 12 15:47:18 freenas ada3: <WDC WD40EFRX-68WT0N0 82.00A82> s/n WD-WCC4E7TZH262 detached
Sep 12 15:47:18 freenas GEOM_MIRROR: Device swap0: provider ada2p1 disconnected.
Sep 12 15:47:18 freenas GEOM_MIRROR: Device swap0: provider ada3p1 disconnected.
Sep 12 15:47:18 freenas g_access(961): provider gptid/f0df517c-65c7-11e9-89d2-000f53160620 has error 6 set
Sep 12 15:47:18 freenas g_access(961): provider ada3 has error 6 set
Sep 12 15:47:18 freenas g_access(961): provider ada2 has error 6 set
Sep 12 15:47:18 freenas (ada3:ata3:0:1:0): Periph destroyed
Sep 12 15:47:18 freenas (ada2:ata3:0:0:0): Periph destroyed
Sep 12 15:47:23 freenas 1 2020-09-12T15:47:23.872590+00:00 freenas.ransome-pearce savecore 9116 - - /dev/ada1p1: Operation not permitted
Sep 12 15:47:25 freenas syslog-ng[9221]: syslog-ng starting up; version='3.25.1'
Sep 12 15:47:25 freenas GEOM_ELI: Device mirror/swap0.eli destroyed.
Sep 12 15:47:25 freenas GEOM_MIRROR: Device swap0: provider destroyed.
Sep 12 15:47:25 freenas GEOM_MIRROR: Device swap0 destroyed.
Sep 12 15:47:25 freenas kernel: pid 1200 (syslog-ng), jid 0, uid 0: exited on signal 6 (core dumped)
Sep 12 15:47:25 freenas GEOM_ELI: Device ada1p1.eli created.
Sep 12 15:47:25 freenas GEOM_ELI: Encryption: AES-XTS 128
Sep 12 15:47:25 freenas GEOM_ELI:     Crypto: hardware
Sep 12 15:47:27 freenas 1 2020-09-12T15:47:27.842730+00:00 freenas.ransome-pearce savecore 9264 - - /dev/ada1p1: Operation not permitted


The other shows no such error. Is it possible that it was incorrectly kicked out? The drives were bought back in 2015, so I'm thinking replacement is probably a good idea anyway; I'm just unsure whether I should trust any data to the drive.

In good news, it looks like resilvering works just fine in TrueNAS-12.0-BETA2.1!

System is as follows currently:

OS: TrueNAS-12.0-BETA2.1

HP Microserver Gen 8
CPU: Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz
MEM: 16GB (2x8GB ECC)
OS Drive: 1x Intel® SSD 530 120GB
Storage: 2x 4TB WD Red, 2x 10TB WD Gold
RAID: ZFS RaidZ2
NET: Chelsio T320, 2 ports (10Gb single DAC link)
 
Joined
Oct 18, 2018
Messages
969
I haven't burnt the drives in or checked them beyond running a quick SMART test from the GUI. I've hit "Manual S.M.A.R.T. Test" and "long" for each of the drives. Will this be enough, or should I do a more comprehensive test, and if so how should I approach it? Offline the drives individually and then test?
In the ideal case you have spare drives on hand that you have already burned in, so that the replacements are known-good drives. It sounds like that was not an option for you here. In the end you have to balance risks. Burning in replacement drives while the pool is healthy is the safest option. If you have a degraded pool you likely still want to burn them in, especially if you have backups of your data and have redundancy remaining. The thing you're balancing is the risk of actually losing data: if you do not have backups, the risk goes up, and if you just throw un-burned-in drives in, they may fail, which also increases risk. I could imagine a scenario where not burning in a drive is prudent, but my guess is that those situations almost always stem from not having adequate replacements on hand before something goes wrong and not having reliable backups.


  • I noticed the WD Gold uses significantly more power than the WD Red it replaced (spec says 10W). Would four drives be too much for the Microserver Gen 8 PSU (specs in sig)?
First, feel free to put the actual specs in the post. If your sig changes, future folks reading your post cannot tell what you're talking about.

Also, if this link is to the correct machine, you're operating with a 200W supply.
With only 4 drives and minimal usage that is likely okay. 200W for the power supply isn't much, but you don't seem to be throwing much at it.
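
Rough numbers, if it helps (the only figures below from a spec sheet are the 69W CPU TDP and the 10W drive draw you quoted; the rest are assumptions):

Code:
CPU (E3-1230 V2, 69W TDP, worst case)     ~70W
4x HDD active (~10W each per spec)        ~40W
Board/RAM/SSD/Chelsio NIC (estimate)      ~30W
                                         -----
Ballpark peak                            ~140W of 200W

Spin-up draw is briefly higher (often 20-30W per 3.5" drive), but that happens at boot while the CPU is mostly idle, so you should still have headroom.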

  • smartctl -P show /dev/ada1 gives the following, which I've never seen before. Is it just that the model isn't in the database?
That looks to be correct. Use smartctl -a /dev/ada to get the full printout. Did you run a short & long test on the drive?
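
If not, from a shell (the device name is just an example; substitute each of your drives):

Code:
smartctl -t short /dev/ada2      # queue a short self-test (a couple of minutes)
smartctl -t long /dev/ada2       # queue an extended self-test (many hours on a 10TB disk)
smartctl -l selftest /dev/ada2   # review the self-test log once they complete
smartctl -a /dev/ada2            # full identity and attribute printout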

The other shows no such error. Is it possible that it was incorrectly kicked out? The drives were bought back in 2015, so I'm thinking replacement is probably a good idea anyway; I'm just unsure whether I should trust any data to the drive.
I've had a drive have issues right after an unscheduled reboot. The drive was recognized shortly after and I was able to add it back to the pool without issue; it has been humming away for over a year since with no new SMART errors. If the error above was just caused by the unscheduled reboot, the drive might be fine. If it was caused by the drive biting the dust, you won't want to use it. Short and long tests are your friend here. Check the output of the SMART tests and watch the error counts.

Also, am I reading correctly that you're using RAIDZ2 with 2x 4TB drives and 2x 10TB drives?
 

Spoon

Dabbler
Joined
May 24, 2015
Messages
25
Hi PhiloEpisteme
In the ideal case you have spare drives on hand that you have already burned in, so that the replacements are known-good drives. It sounds like that was not an option for you here. In the end you have to balance risks. Burning in replacement drives while the pool is healthy is the safest option. If you have a degraded pool you likely still want to burn them in, especially if you have backups of your data and have redundancy remaining. The thing you're balancing is the risk of actually losing data: if you do not have backups, the risk goes up, and if you just throw un-burned-in drives in, they may fail, which also increases risk. I could imagine a scenario where not burning in a drive is prudent, but my guess is that those situations almost always stem from not having adequate replacements on hand before something goes wrong and not having reliable backups.

At the time I had no spare drives and the backup hadn't been run in a while... That's been resolved. I threw the drives in without a burn-in, as the 4TB drives had all been bought together originally and I was worried about another failing before all the data had been copied. As the data is now replicated and safe, I'll do a proper burn-in on the two new drives along with any other replacements.

First, feel free to put the actual specs in the post. If your sig changes, future folks reading your post cannot tell what you're talking about.

Fixed.

Also, if this link is to the correct machine, you're operating with a 200W supply.
With only 4 drives and minimal usage that is likely okay. 200W for the power supply isn't much, but you don't seem to be throwing much at it.

Nothing serious, just an archive of project data, Time Machine for a MacBook Pro, and the Windows equivalent for a workstation. Data integrity is the main concern, along with reasonably quick access.

That looks to be correct. Use smartctl -a /dev/ada to get the full printout. Did you run a short & long test on the drive?

I've had a drive have issues right after an unscheduled reboot. The drive was recognized shortly after and I was able to add it back to the pool without issue; it has been humming away for over a year since with no new SMART errors. If the error above was just caused by the unscheduled reboot, the drive might be fine. If it was caused by the drive biting the dust, you won't want to use it. Short and long tests are your friend here. Check the output of the SMART tests and watch the error counts.

Ran both a short and a long test on the new and old drives. One of the 4TB drives that was rejected by TrueNAS is fubar for sure. The other seems to pass all tests, has no errors, and allows wiping. I might give it a try elsewhere, with caution, as I can't seem to find anything wrong with it.

Also, am I reading correctly that you're using RAIDZ2 with 2x 4TB drives and 2x 10TB drives?

Originally it was 4x 4TB in RAIDZ2. Not the most efficient, but I figured any two drives could fail and it would still be operable; that, and it was set up as a learning exercise rather than anything serious. Five years later and it's still going.

The pool is currently 51% used, so it will need upgrading soonish, but not yet. I figured I'd swap out the two 4TB drives first and then, as the pool gets closer to 80%, switch out the last two. I'm also unsure whether I should buy all one type, or stick two other HDDs, like the Exos, in so that there is some drive variety?

Many thanks for the response.
 
Joined
Oct 18, 2018
Messages
969
I'll do a proper burn-in on the two new drives along with any other replacements.
How do you plan to do the burn-in? If the new drives are already a part of your pool you can't really do the traditional burn-in.

The pool is currently 51% used, so it will need upgrading soonish, but not yet. I figured I'd swap out the two 4TB drives first and then, as the pool gets closer to 80%, switch out the last two. I'm also unsure whether I should buy all one type, or stick two other HDDs, like the Exos, in so that there is some drive variety?
I think most home use cases do not benefit from manufacturer variety (personal opinion). Did you consider using two mirror vdevs rather than one RAIDZ2 vdev?

Anyway, it sounds like you've got things on track, yes? I would recommend you have at least one replacement drive on hand and burned in. As always, keep in mind that a single pool, no matter the vdev parity, is not a replacement for a backup. For any data you really care about, make sure you keep regular backups.
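
As a sketch of what a backup cycle can look like from the shell (pool, snapshot, and host names here are hypothetical):

Code:
zfs snapshot -r tank@backup-1
zfs send -R tank@backup-1 | ssh backuphost zfs recv -Fdu backuppool
# later runs only send the delta since the previous snapshot:
zfs snapshot -r tank@backup-2
zfs send -R -i tank@backup-1 tank@backup-2 | ssh backuphost zfs recv -Fdu backuppool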
 

Spoon

Dabbler
Joined
May 24, 2015
Messages
25
How do you plan to do the burn-in? If the new drives are already a part of your pool you can't really do the traditional burn-in.

I had thought to detach the new drives individually and perform a burn-in on each drive before reattaching. The burn-in would be short SMART / conveyance / long SMART / badblocks / SMART, as per the instructions at:

I think most home use cases do not benefit from manufacturer variety (personal opinion). Did you consider using two mirror vdevs rather than one RAIDZ2 vdev?

I have considered two mirror vdevs. I picked RAIDZ2 as I liked the idea that any two HDDs could fail. With that said, the speed improvement would probably be worth it. Would the following be possible?

  1. Offline the two new drives from the RAIDZ2 pool and create a mirrored pair
  2. Copy the original pool data from the degraded RAIDZ2 pool to the new mirrored vdev
  3. Swap the shares over to the new mirrored pool
  4. Destroy the old pool, remove the old drives, and build another mirrored pair
  5. Add the second mirrored vdev to create 2x mirrored vdevs
As far as I can see, the issue with this is that I would end up with the old data primarily on one mirrored pair, so any performance gains would only apply to new data. Is this a correct reading of the situation?

Anyway, it sounds like you've got things on track, yes? I would recommend you have at least one replacement drive on hand and burned in. As always, keep in mind that a single pool, no matter the vdev parity, is not a replacement for a backup. For any data you really care about, make sure you keep regular backups.

Yes, I think it's in hand, thank you. It's my own fault really, as I've been putting off upgrading the storage despite its age because I'd have liked an all-SSD system as a replacement. As it stands that will have to wait. Prices are *almost* getting there...
 
Joined
Oct 18, 2018
Messages
969
I had thought to detach the new drives individually and perform a burn-in on each drive before reattaching. The burn-in would be short SMART / conveyance / long SMART / badblocks / SMART, as per the instructions at:
So, that is a common way to do a burn-in. Do note, though, that badblocks is destructive: running badblocks means you essentially have to resilver that drive anew when you're done. Additionally, that burn-in process will take days for a disk that size. Typically users do this before they put the disk in the pool; doing so after results in the pool doing two resilvers and adds stress.
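
For reference, the sequence you describe usually looks something like this (device name is an example, and the badblocks step wipes the disk):

Code:
smartctl -t short /dev/ada2        # baseline short self-test
smartctl -t conveyance /dev/ada2   # conveyance self-test (transport damage)
smartctl -t long /dev/ada2         # extended self-test
badblocks -ws -b 4096 /dev/ada2    # destructive 4-pattern write/verify; days on 10TB
smartctl -t long /dev/ada2         # second extended self-test
smartctl -A /dev/ada2              # compare reallocated/pending counts to baseline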

  1. Offline the two new drives from the RAIDZ2 pool and create a mirrored pair
Yes, if you mean create a new pool with a single vdev composed of those two drives as a mirror. Do note that if you do this you will have NO parity in your RAIDZ2 vdev once you offline those two disks. If anything happened to your RAIDZ2 pool while you were doing this you could lose all of your data. A backup is very important here.

  1. Copy the original pool data from the degraded RAIDZ2 pool to the new mirrored vdev
  2. Swap the shares over to the new mirrored pool
  3. Destroy the old pool, remove the old drives, and build another mirrored pair
  4. Add the second mirrored vdev to create 2x mirrored vdevs
These seem correct to me. Do note that step 3 might be better read as "destroy the pool" and then step 4 as "extend the new pool with another mirror vdev".
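
In rough command-line terms the whole dance is something like the following (pool and device names are hypothetical, and FreeNAS normally references disks by gptid, so treat this as a sketch rather than a recipe):

Code:
zpool offline tank ada2                    # RAIDZ2 now has zero parity remaining!
zpool offline tank ada4
zpool create -f newtank mirror ada2 ada4   # -f since the disks still carry old labels
zfs snapshot -r tank@migrate
zfs send -R tank@migrate | zfs recv -F newtank
zpool destroy tank
zpool add newtank mirror ada3 ada5         # extend with the second mirror vdev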

You may run into issues where FreeNAS is not happy with vdevs composed of different-sized disks. ZFS doesn't care, but FreeNAS might. If this is the case you may have to do some steps manually which, while not generally advised, is possible if done correctly. I tend to avoid that in my builds, so I cannot say for sure which situations FreeNAS gets grumpy about when mixing disk sizes.

As far as I can see, the issue with this is that I would end up with the old data primarily on one mirrored pair, so any performance gains would only apply to new data. Is this a correct reading of the situation?
I can't speak to exactly how FreeNAS stores data when you extend a pool with a new vdev. ZFS is copy-on-write though, so I would expect that as files change, ZFS will tend to distribute new writes across the entire pool. I'm sure someone else on the forums is more knowledgeable about this than I am.

I can't emphasize enough how important backups are. You're in a situation where doing the wrong thing could harm your pool and your data; keeping reliable and complete backups significantly mitigates this risk. Whatever you decide to do, be sure you understand the full set of steps before you set out. When I was first building and testing my systems I was too hasty and ruined many pools. Fortunately for me those were test pools. I never put data on my systems until I understood them well enough to know how to handle common maintenance steps.
 

Spoon

Dabbler
Joined
May 24, 2015
Messages
25
So, that is a common way to do a burn-in. Do note, though, that badblocks is destructive: running badblocks means you essentially have to resilver that drive anew when you're done. Additionally, that burn-in process will take days for a disk that size. Typically users do this before they put the disk in the pool; doing so after results in the pool doing two resilvers and adds stress.


Yes, if you mean create a new pool with a single vdev composed of those two drives as a mirror. Do note that if you do this you will have NO parity in your RAIDZ2 vdev once you offline those two disks. If anything happened to your RAIDZ2 pool while you were doing this you could lose all of your data. A backup is very important here.


These seem correct to me. Do note that step 3 might be better read as "destroy the pool" and then step 4 as "extend the new pool with another mirror vdev".

You may run into issues where FreeNAS is not happy with vdevs composed of different-sized disks. ZFS doesn't care, but FreeNAS might. If this is the case you may have to do some steps manually which, while not generally advised, is possible if done correctly. I tend to avoid that in my builds, so I cannot say for sure which situations FreeNAS gets grumpy about when mixing disk sizes.


I can't speak to exactly how FreeNAS stores data when you extend a pool with a new vdev. ZFS is copy-on-write though, so I would expect that as files change, ZFS will tend to distribute new writes across the entire pool. I'm sure someone else on the forums is more knowledgeable about this than I am.

I can't emphasize enough how important backups are. You're in a situation where doing the wrong thing could harm your pool and your data; keeping reliable and complete backups significantly mitigates this risk. Whatever you decide to do, be sure you understand the full set of steps before you set out. When I was first building and testing my systems I was too hasty and ruined many pools. Fortunately for me those were test pools. I never put data on my systems until I understood them well enough to know how to handle common maintenance steps.

Just to finish off this thread: the system was successfully shifted from RAIDZ2 with 4x 4TB HDDs to 2x mirror vdevs with 2x 4TB & 2x 10TB. Synthetic tests show a significant boost to performance in both read and write. The OS version was TrueNAS-12.0-BETA2.1 and no issues were encountered. A spare 10TB is on its way and will be burnt in and kept as a cold spare.

Many thanks for your help PhiloEpisteme!
 