SOLVED Pool degraded, one drive dropped and reappeared again and error during bi-weekly scrub

Chris Moore · Feb 22, 2019

Mannekino said:
So far I replaced the drive and replaced the power cable. Should I be looking at a possible failure of the data cable or PSU?

It can be a bad data cable, but that usually gives CRC errors.

Mannekino said:
Can I detach that drive from the HBA and plug it directly into one of the SATA ports of the motherboard?

Yes, you can plug it into a different port, and it would be a good test.

Mannekino said:
That should still work for the pool right?

Yes, ZFS is looking at the gptid, not the da#, so it doesn't care where the drive is connected as long as it is reachable.

Mannekino said:
If the problem goes away then it should either be the data cable or HBA.

Yes, if there are no errors when the drive is connected with a different cable to a different port, then it is either the cable or the HBA. I have heard of an HBA having a bad port before, but it doesn't happen often. You could test it by connecting the cable to the other port on the HBA.

Mannekino · Feb 22, 2019

There is something else strange: since last night, when I started the scrub, more space has become available in the pool, and I haven't deleted anything. Could this be related to the scrub?

Chris Moore · Feb 22, 2019

Mannekino said:
There is something else strange: since last night, when I started the scrub, more space has become available in the pool, and I haven't deleted anything. Could this be related to the scrub?

Do you have a backup? If you had unrecoverable data errors, that might have created free space.

Did you have snapshots? If you had snapshots that expired, that would have freed space in the pool.

Mannekino · Feb 22, 2019

No, I don't have backups. I'm currently working on refitting my old NAS with different drives and using that as a backup. My plan was to install FreeNAS on that server today and make a snapshot of my media and system dataset and replicate it to my old NAS.

One more question: is there a way to see in the logs if the drive went offline during the night and came back online again? There is nothing in dmesg and this is the console output:

Code:

Feb 23 00:20:00 freenas.fritz.box ZFS: vdev state changed, pool_guid=11791038853189070155 vdev_guid=14250359306952643538
Feb 23 00:20:00 freenas.fritz.box ZFS: vdev state changed, pool_guid=11791038853189070155 vdev_guid=14949667362364330100
Feb 23 00:20:00 freenas.fritz.box ZFS: vdev state changed, pool_guid=11791038853189070155 vdev_guid=2841930983946545696
Feb 23 00:20:00 freenas.fritz.box ZFS: vdev state changed, pool_guid=11791038853189070155 vdev_guid=2377689923637963606
Feb 23 02:56:28 freenas.fritz.box ZFS: vdev state changed, pool_guid=11791038853189070155 vdev_guid=2377689923637963606
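
A minimal sketch of one way to read the lines above, assuming they have been pasted into a string (the thread does not say which log file they came from). The messages only say that a vdev's state changed, not what the new state is, so the most this can show is which `vdev_guid` logged more than one change. The names `LOG`, `PATTERN` and `state_changes` are made up for this illustration.

```python
import re
from collections import defaultdict

# The console lines quoted in the post, copied verbatim.
LOG = """\
Feb 23 00:20:00 freenas.fritz.box ZFS: vdev state changed, pool_guid=11791038853189070155 vdev_guid=14250359306952643538
Feb 23 00:20:00 freenas.fritz.box ZFS: vdev state changed, pool_guid=11791038853189070155 vdev_guid=14949667362364330100
Feb 23 00:20:00 freenas.fritz.box ZFS: vdev state changed, pool_guid=11791038853189070155 vdev_guid=2841930983946545696
Feb 23 00:20:00 freenas.fritz.box ZFS: vdev state changed, pool_guid=11791038853189070155 vdev_guid=2377689923637963606
Feb 23 02:56:28 freenas.fritz.box ZFS: vdev state changed, pool_guid=11791038853189070155 vdev_guid=2377689923637963606
"""

# Matches only the exact message format shown above (an assumption about the log).
PATTERN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s+\S+\s+ZFS: vdev state changed, "
    r"pool_guid=(?P<pool>\d+) vdev_guid=(?P<vdev>\d+)$"
)


def state_changes(log_text):
    """Map each vdev_guid to the timestamps at which a state change was logged."""
    changes = defaultdict(list)
    for line in log_text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            changes[match.group("vdev")].append(match.group("stamp"))
    return dict(changes)


changes = state_changes(LOG)
# Keep only the vdevs that logged more than one state change.
repeat_offenders = {guid: stamps for guid, stamps in changes.items() if len(stamps) > 1}
for guid, stamps in repeat_offenders.items():
    print(guid, stamps)
```

With the five lines above, only one guid appears twice, with its second entry at 02:56:28. That is consistent with a single drive changing state during the night, but the log by itself does not prove the drive went offline and came back.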

I'm going to shut down the NAS now and attach the drive that gives errors directly to the motherboard.

One thing I can't get out of my mind is that PCI exhaust fan. Previously I thought that maybe the 7V mod could be the cause (far-fetched, but you never know), so I detached that fan earlier this week. While it was detached (I completely removed it from the case) I didn't have any issues. I then decided to buy a small fan controller for it, along with a Molex to 3-pin adapter cable, so that I could plug it directly into the one remaining fan header I had on the motherboard, and now the problem with this disk is back. This system has run without faults since September of last year. It seems so incredibly improbable that this could be in any way related.

This is how the fan is connected right now

Mannekino · Feb 22, 2019

Well this is even more worrying

I attached the drive that gave issues directly to the motherboard and now the second drive in line (this would be P2 of the SAS to SATA breakout cable) is not showing up.

Chris Moore · Feb 22, 2019

Mannekino said:
Well this is even more worrying

I attached the drive that gave issues directly to the motherboard and now the second drive in line (this would be P2 of the SAS to SATA breakout cable) is not showing up.

In an effort to protect your data, I would suggest moving all the data drives onto the black SATA ports on the bottom edge of the system board.
There is either something wrong with the SAS controller or something wrong with the data cable.

EDIT:

The bottom two slots on that board are only x4 in x8 physical slots. You should use one of the top two slots, which are x8 electrical and physical.

Mannekino · Feb 22, 2019

I did not know that, thank you so much for this information.

I had a spare SAS to SATA breakout cable. I moved the card to the second slot and replaced the data cable.

I've ordered two new SAS to SATA breakout cables and 5 SATA cables.

What exactly are those errors I experienced last night, the "Checksum" and "Write" errors? Does this mean some of my data is corrupted now, or did the scrub take care of that?

Edit

The pool is healthy again. I'm going to do another scrub to see if it finishes without problems.

I'm also going to order two new SAS to SATA breakout cables and five SATA cables.

I'm not seeing any errors during boot, but there is this small part after importing the ZFS pools. Is there anything weird about these messages?

Chris Moore · Feb 22, 2019

The part where it says "best uberblock found, condensing txg": txg is transaction group.
It looks like it condensed several transaction groups.

It looks like some transactions are being discarded. That would indicate a potential for data loss.

Mannekino · Feb 23, 2019

OK, I'm currently working on my old FreeNAS server so I can replicate snapshots to that server.

I'm running a scrub now on the pool on my current FreeNAS server. After I'm done installing my old FreeNAS server, I'm first going to run a long self-test on the drives in that server. If both tests complete successfully, I'm going to make a snapshot and replicate it.

Mannekino · Feb 23, 2019

So the scrub is running right now and about 60% done and it's scrubbing much faster for some reason. I don't know if that is due to the fact that my previous scrub was a couple days ago while normally I do that bi-weekly.

I am getting write and checksum errors again, but on a different drive (different cable). This time the pool didn't become degraded like it did earlier this week during a scrub; it still shows as healthy. Is there any conclusion I can draw from this? Could the current scrub be repairing a fault from earlier?

I bought another Dell PERC H200 from someone and a bunch of new SATA cables. I have a spare power supply already.

What I did yesterday and today was

Replaced the power cable yesterday
Moved the HBA controller from slot 4 to slot 2
Replaced the data cable (not with a new one, but with a spare I had)
Put the data cable in Port B of the controller (was in port A)

How should I proceed? I should be getting my new parts Tuesday. Replace the controller first or try and swap the PSU today?

Code:

  pool: data
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Sat Feb 23 12:34:44 2019
    5.65T scanned at 630M/s, 4.34T issued at 484M/s, 7.09T total
    16K repaired, 61.15% done, 0 days 01:39:34 to go
config:

    NAME                                            STATE     READ WRITE CKSUM
    data                                            ONLINE       0     0     0
      raidz1-0                                      ONLINE       0     0     0
        gptid/a18b468c-b464-11e8-9fd5-0025907457e1  ONLINE       0     0     0
        gptid/a2b7dae8-b464-11e8-9fd5-0025907457e1  ONLINE       0     0     0
        gptid/a3f67a48-b464-11e8-9fd5-0025907457e1  ONLINE       0     0     4
        gptid/22e2c68d-3409-11e9-b90e-0025907457e1  ONLINE       0     0     0

errors: No known data errors

Mannekino · Feb 23, 2019

The scrub finished with 0 errors but repaired some.

Code:

  pool: data
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 20K in 0 days 04:32:50 with 0 errors on Sat Feb 23 17:07:34 2019
config:

    NAME                                            STATE     READ WRITE CKSUM
    data                                            ONLINE       0     0     0
      raidz1-0                                      ONLINE       0     0     0
        gptid/a18b468c-b464-11e8-9fd5-0025907457e1  ONLINE       0     0     0
        gptid/a2b7dae8-b464-11e8-9fd5-0025907457e1  ONLINE       0     0     0
        gptid/a3f67a48-b464-11e8-9fd5-0025907457e1  ONLINE       0     0     5
        gptid/22e2c68d-3409-11e9-b90e-0025907457e1  ONLINE       0     0     0

errors: No known data errors
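
A minimal sketch for picking out the affected member from output like the above, assuming the `zpool status` text has been captured into a string. It only handles plain integer counters as shown here; `STATUS` and `devices_with_errors` are names invented for this example.

```python
# Excerpt of the second status output from the post, copied verbatim.
STATUS = """\
  scan: scrub repaired 20K in 0 days 04:32:50 with 0 errors on Sat Feb 23 17:07:34 2019
config:

    NAME                                            STATE     READ WRITE CKSUM
    data                                            ONLINE       0     0     0
      raidz1-0                                      ONLINE       0     0     0
        gptid/a18b468c-b464-11e8-9fd5-0025907457e1  ONLINE       0     0     0
        gptid/a2b7dae8-b464-11e8-9fd5-0025907457e1  ONLINE       0     0     0
        gptid/a3f67a48-b464-11e8-9fd5-0025907457e1  ONLINE       0     0     5
        gptid/22e2c68d-3409-11e9-b90e-0025907457e1  ONLINE       0     0     0

errors: No known data errors
"""


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any nonzero counter."""
    flagged = {}
    for line in status_text.splitlines():
        fields = line.split()
        # A device row has exactly five columns, the last three being integers.
        if len(fields) != 5 or not all(f.isdigit() for f in fields[2:]):
            continue
        name, _state, read, write, cksum = fields
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged[name] = counts
    return flagged


for name, (read, write, cksum) in devices_with_errors(STATUS).items():
    print(f"{name}: read={read} write={write} cksum={cksum}")
```

In both outputs the only nonzero counter is CKSUM on the third member (4 during the scrub, 5 after it), while the pool stays ONLINE and reports no known data errors.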

Then I rebooted the server and got the same behavior again: one disk was missing. I realized that when I replaced the data cable I changed the order of the drives. It's again the drive attached to P1 of the data cable. So there must be something wrong with the HBA controller, right? Different port on the controller (from A to B), different data cable, and still issues with the drive attached to P1.

Another thing I just realized: I recently bought a new PC with a very expensive and feature-rich motherboard that came with a bunch of SATA cables. I'm going to pull out my HBA controller now and attach all the drives to the 4 SATA ports on the motherboard.

Mannekino · Feb 24, 2019

Another update.

I've attached the drives to the four SATA ports that were available on my motherboard. So everything is attached to those now (the two SSDs and four HDDs). I haven't had any errors so far. I did a scrub yesterday and it completed successfully without any Write or Checksum errors.

So I think it's safe to say my Dell PERC H200 is broken.

It was always the drive attached to P1 connector of the SAS to SATA breakout cable
I've swapped the drive with a new one
I've replaced the data cable
I've swapped the controller from slot 4 to slot 2 on the motherboard
I've replaced the power cable for the drives
I've attached the data cable to port B on the controller also

I did a couple of reboots after the scrub yesterday, because previously a reboot would trigger the drive not showing up once the system was done booting. It didn't happen anymore. I've also created a snapshot of my data and system datasets and replicated them to my old NAS, which took about 18 hours.

I am still getting that "best uberblock found, condensing txg" message during boot, no idea what to do about that. Not getting any other errors.

pschatz100 · Feb 24, 2019

These intermittent problems are always frustrating. Hopefully, you got it fixed.

Last year, I had a problem with a bad Molex to SATA power adapter that I was using for one of my drives. There was no visual damage that I could see, but when I replaced the adapter my problem disappeared. Before replacing the power adapter, I tried data cables, moving disks around, even replacing the power supply. At the end of the day, it was a silly power adapter!!

Redcoat · Feb 24, 2019

Mannekino said:
Can I detach that drive from the HBA and plug it directly into one of the SATA ports of the motherboard?

Yes, that's what I suggested earlier - (not exactly but same intention).

Mannekino · Feb 25, 2019

I'm not marking this as solved yet. I received my replacement Dell PERC H200 controller today, but I haven't received the new SAS to SATA breakout cable yet. So far everything seems well. When I have everything, I will test the new controller.

One more comment I would like to make. A couple days ago I had a discussion with someone in the IRC channel regarding poor performance while copying a large file from dataset to dataset but inside the same pool (with a single VDev containing my four 4TB WD Reds).

I don't have the performance results of that test but it copied a ~16 GB file with 30 MB/sec average speed which I found to be very slow. Copying the same file to another dataset on another pool (with two mirrored SSDs on the target VDev) I got about 130 MB/sec. I just tried the same test again and the speeds were much better.

Copying a file to another dataset inside the same pool with a single VDev consisting of four 4TB WD Reds:

Code:

# pv <bigfile>.mkv > /mnt/data/backups/test.mkv
15.6GiB 0:03:36 [73.9MiB/s] [====================================================================================>] 100%

Copying the same file to a dataset in a different pool with a single VDev consisting of two mirrored SSDs:

Code:

# pv <bigfile>.mkv > /mnt/system/homes/<user>/test.mkv
15.6GiB 0:01:01 [ 261MiB/s] [====================================================================================>] 100%
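
A small sketch of the arithmetic behind the two `pv` lines, assuming the size and elapsed time it prints are used as-is (15.6 GiB is already rounded, so the results can differ slightly from the rates `pv` shows). Note that `pv` reports MiB/s, while the earlier 30 and 130 figures were quoted in MB/sec. The function names are hypothetical.

```python
def average_mib_per_s(size_gib, elapsed):
    """Average throughput in MiB/s for size_gib GiB copied in elapsed 'H:MM:SS'."""
    hours, minutes, seconds = (int(part) for part in elapsed.split(":"))
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return size_gib * 1024 / total_seconds


def mib_to_mb(mib_per_s):
    """Convert MiB/s (2**20 bytes) to MB/s (10**6 bytes)."""
    return mib_per_s * 2**20 / 10**6


# Values taken from the two pv runs in the post.
hdd = average_mib_per_s(15.6, "0:03:36")  # raidz1 of four WD Reds
ssd = average_mib_per_s(15.6, "0:01:01")  # mirrored SSDs

print(f"HDD pool: {hdd:.1f} MiB/s ({mib_to_mb(hdd):.1f} MB/s)")
print(f"SSD pool: {ssd:.1f} MiB/s ({mib_to_mb(ssd):.1f} MB/s)")
print(f"SSD/HDD ratio: {ssd / hdd:.1f}x")
```

Both results land close to the 73.9 MiB/s and 261 MiB/s that `pv` printed, so the reported rates are averages over the whole copy.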

@Chris Moore
Could the poor performance (30 MB/sec compared to 74 MB/sec average) also be due to the problematic HBA controller?

Mannekino · Mar 3, 2019

I'm marking the thread as solved. Earlier this week I received a replacement Dell PERC H200. I've put it in my FreeNAS server and attached the drives to it with a new data cable also. Haven't had any issues so far. Did two scrubs and multiple reboots. Disk hasn't dropped and no new errors in the pool.

Final conclusion: HBA controller broken. Maybe I will try to re-flash that controller in the near future and test it in another system.
