ZFS pool degraded, unsure of cause. SMART reports no errors. Hoping for suggestions on what to do.

Status
Not open for further replies.

boerbiet

Cadet
Joined
Aug 23, 2015
Messages
4
Hello,

I apologise for the lengthy post in advance.

The past weekend I ran into some trouble with my FreeNAS server. I'm not really an expert, just a home user with somewhat above-average knowledge of computers and Linux / Unix. I've been wandering around the web all day trying to get a grasp on what's going on in my server, and before spending days more on that, I was hoping a post here could save me at least some time via an insightful suggestion from a more knowledgeable member.

I use an ASRock C2550D4I board with 16GB of ECC RAM and six WD RED 4TB disks in RAID-Z2. The past weekend, my server started sending me emails reporting SMART errors on the disks (all disks on the same Marvell controller), specifically "Read SMART Error Log Failed" and "Read SMART Self-Test Log Failed". I proceeded to scan the drives using smartctl, but nothing turned up and everything looked perfectly fine, so I ignored it for a bit.
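The per-drive checks described here can be sketched roughly as follows (a hedged sketch: smartctl ships with smartmontools, and ada0 through ada5 are the FreeBSD device names assumed from this thread; adjust to your system):

```shell
# Sketch: re-run the checks behind the SMART alert emails.
# ada0..ada5 are assumed FreeBSD ATA device names.
check_smart() {
  for dev in "$@"; do
    echo "=== /dev/$dev ==="
    # -H: overall health assessment; -l error: the error log the emails complained about
    smartctl -H -l error "/dev/$dev" || echo "query failed for $dev"
  done
}
check_smart ada0 ada1 ada2 ada3 ada4 ada5
```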

During the night, however, more serious emails arrived, first saying "The volume nirvana (ZFS) state is DEGRADED", followed by "The volume nirvana (ZFS) state is UNAVAIL". Over the weekend, I found that one of the controllers on the motherboard had a firmware update available, so I installed that along with the BIOS update that was still pending. Booting the system showed me the volume again, so things were looking up.

However, it didn't take long for the volume to become degraded again. A zpool status revealed:

Code:
        NAME                                            STATE     READ WRITE CKSUM
        nirvana                                         DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/c26093b1-f91e-11e3-82f1-d05099264e54  ONLINE       0     0     0
            gptid/c34d8533-f91e-11e3-82f1-d05099264e54  ONLINE       0     0     0
            gptid/c437e8d4-f91e-11e3-82f1-d05099264e54  ONLINE       0     0     0
            gptid/c5272d0f-f91e-11e3-82f1-d05099264e54  ONLINE       0     0     0
            gptid/c6184f5b-f91e-11e3-82f1-d05099264e54  ONLINE       0     0     0
            gptid/c702c33f-f91e-11e3-82f1-d05099264e54  DEGRADED     0     0   273  too many errors  (repairing)


So that looks like a dying drive, right? SMART, however, still claims everything is perfectly fine. At the same time, dmesg and my server's console output are going crazy with the following messages (continuously):

Code:
(ada5:ahcich5:0:0:0): READ_DMA48. ACB: 25 00 38 dd fe 40 a6 01 00 00 00 01
(ada5:ahcich5:0:0:0): CAM status: Command timeout
(ada5:ahcich5:0:0:0): Retrying command
(ada5:ahcich5:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada5:ahcich5:0:0:0): CAM status: Command timeout
(ada5:ahcich5:0:0:0): Error 5, Retries exhausted
(ada5:ahcich5:0:0:0): READ_DMA48. ACB: 25 00 38 dd fe 40 a6 01 00 00 00 01
(ada5:ahcich5:0:0:0): CAM status: ATA Status Error
(ada5:ahcich5:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 04 (ABRT )
(ada5:ahcich5:0:0:0): RES: 51 04 00 00 00 00 00 00 00 00 00
(ada5:ahcich5:0:0:0): Error 5, Retries exhausted
(ada5:ahcich5:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 38 e0 fe 40 a6 01 00 01 00 00
(ada5:ahcich5:0:0:0): CAM status: ATA Status Error
(ada5:ahcich5:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada5:ahcich5:0:0:0): RES: 41 40 cf e0 fe 40 a6 01 00 00 00
(ada5:ahcich5:0:0:0): Retrying command


Due to the timeouts, the scrub currently in progress says it will take over 12 more hours to complete (and climbing, probably because of the timeouts above?), even though it is at 99.07%.

What it boils down to is that I am wondering which actions I should take:
  • Accept the drive is broken and file an RMA (though I wonder if I have any chance of success, with no SMART errors showing).
  • Point towards the controller, since before the FW upgrade it also cried foul of ada2, 3 and 4's SMART error logs and it seems to be timing out all the time on ada5.
  • Wait for the scrub to complete, however long that takes, then perform a long SMART test on the drive and check those results first.
  • Disable AHCI in the BIOS and try with that? (someone suggested this, though I have no real idea if it could make a difference as I don't know much about AHCI)
  • Should I even continue the scrub? It has made 0.02% progress in 2 hours and almost seems to be stuck.
  • Something else?
Since I already pasted so much information, I'll post the SMART output for the drive as well:

Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       17749
  3 Spin_Up_Time            0x0027   188   178   021    Pre-fail  Always       -       7566
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10294
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       12
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       212
194 Temperature_Celsius     0x0022   109   096   000    Old_age   Always       -       43
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       5
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
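The attributes most worth watching in a table like this are the reallocation and pending counters. A small sketch that pulls just those out of `smartctl -A`-style output (the sample rows are the ones from this post):

```shell
# Sketch: filter the SMART attributes that most often predict drive death.
# Input format matches `smartctl -A`; sample rows are from the post above.
flag_smart_attrs() {
  awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ {
    print $2 " = " $NF
  }'
}
flag_smart_attrs <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       5
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
EOF
```

Here the nonzero Current_Pending_Sector (5) is the one red flag: sectors the drive could not read but has not yet reallocated.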


Any suggestions and tips are very much welcomed!

Best regards,
boerbiet
 
Don't use the Marvell controller(s); they have known issues.
 

danb35

Hall of Famer
+1. Also, the drive is really running too hot, and has a few pending sectors. Neither of these should be causing the drive to drop out of the pool, but they aren't good things. Since you didn't post the full SMART output: are you running regular SMART tests? What does the SMART test log show?
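(The self-test history comes from `smartctl -l selftest`. A hedged sketch of spotting entries that didn't complete cleanly; the sample log lines below are hypothetical, modeled on smartctl's output format:)

```shell
# Sketch: scan `smartctl -l selftest` style output for failed entries.
# The two sample entries are hypothetical, modeled on smartctl's format.
flag_failed() {
  grep -i 'failure' || echo "no failed self-tests logged"
}
flag_failed <<'EOF'
# 1  Short offline       Completed: read failure       90%     10294    27725392
# 2  Extended offline    Completed without error       00%     10100    -
EOF
```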
 

SweetAndLow

Sweet'NASty
Marvell controllers have timeout issues; don't use them.
 

Nick2253

Wizard
As others have pointed out, use the Intel controller on your board, not the Marvell controller.

A common worry I see is that the Intel controller only has two SATA 3 ports (in addition to the four SATA 2 ports), whereas the Marvell controller has six SATA 3 ports. However, SATA 2 is more than capable of delivering all the speed your HDDs can produce. Unless you are using SSDs or high-RPM drives, SATA 3 is unnecessary. If you have other devices in your system (like a CD drive), use the Marvell controller for those.

To move the drives, shut down the server, move the drives to the Intel controller, and reboot. FreeNAS should automatically recognize everything. I would do this first before trying anything else. Then go ahead and press on with the scrub.
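The check after rebooting can be sketched as a quick scan of `zpool status` output; ZFS tracks pool members by gptid, so the pool should import unchanged regardless of which controller the disks hang off (the sample lines below are trimmed from this thread):

```shell
# Sketch: verify no vdev is in a bad state after moving the drives.
# Feed it `zpool status` output; succeeds only if nothing is
# DEGRADED/UNAVAIL/FAULTED/OFFLINE/REMOVED.
all_online() {
  ! grep -Eq 'DEGRADED|UNAVAIL|FAULTED|OFFLINE|REMOVED'
}
if all_online <<'EOF'
    nirvana       ONLINE  0  0  0
      raidz2-0    ONLINE  0  0  0
EOF
then
  echo "pool healthy: safe to start the scrub (zpool scrub nirvana)"
else
  echo "pool needs attention before scrubbing"
fi
```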
 

boerbiet

Cadet
Thank you very much for your replies!

When purchasing the board, the Marvell controller was the only small worry I had, but I didn't know it could cause these kinds of situations. I am at work right now, and with our strict proxy settings I am unable to access the server from here, so I cannot provide any details on the SMART test logs. Drive temperature may indeed be a bit of a concern... the case isn't all that big, and we've been having a hotter-than-average summer, so the room the machine is in has been at 30+ degrees Celsius most of the time these past few weeks. I'll monitor it going forward.

On the Intel controller: I did opt to use four of the Marvell ports for the pool because they are SATA 3. I did not know how little impact SATA 2 would have on speed / performance, and the rookie in me went with the faster ports. I do have a seventh disk in the system, the small SSD it boots off, so I will move all pool disks to the Intel controller and the SSD to a Marvell port.

The good news is that the scrub finished while I was sleeping. I'll start a new one after I've switched the ports and booted the system.

Thank you again for your insights!
 

INCSlayer

Contributor
If you're using the Marvell SE9230 ports, make sure you have the latest firmware:
http://www.asrockrack.com/support/ipmi.asp#Marvell9230

I have the C2750D4I version of the board and have (so far, though I am replacing it because of the Marvell controller) not had any issues, and according to ASRock the update should help with stability.

Obviously this shouldn't affect your drives or anything, but as always, make sure you have backups before you do anything that can affect your data.
 

boerbiet

Cadet
One of the actions I performed yesterday after the volume went offline was upgrading that firmware, along with the BIOS. I don't know if it had any effect, but only ada5 reported errors after that, whereas all of the drives did before. The machine had been running without any problems for over a year, though, so a stretch of trouble-free operation is obviously no guarantee :smile:.
 

boerbiet

Cadet
Okay, when I got back from work Tuesday I moved all the drives to the Intel controller and the boot drive to the Marvell controller. FreeNAS recognised the volume and I started a scrub, which has just completed. The final CKSUM count before I switched controllers was 1.06k.

Code:
  scan: scrub repaired 65.9M in 40h32m with 0 errors on Thu Aug 27 09:24:30 2015
config:

    NAME                                            STATE     READ WRITE CKSUM
    nirvana                                         ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
       gptid/c26093b1-f91e-11e3-82f1-d05099264e54  ONLINE       0     0     0
       gptid/c34d8533-f91e-11e3-82f1-d05099264e54  ONLINE       0     0     0
       gptid/c437e8d4-f91e-11e3-82f1-d05099264e54  ONLINE       0     0     0
       gptid/c5272d0f-f91e-11e3-82f1-d05099264e54  ONLINE       0     0     0
       gptid/c6184f5b-f91e-11e3-82f1-d05099264e54  ONLINE       0     0     0
       gptid/c702c33f-f91e-11e3-82f1-d05099264e54  ONLINE       0     0     8

errors: No known data errors


So the CKSUM count went down from 1060 to 8, which is nice. The Current_Pending_Sector count in the SMART report of the disk that showed problems, however, went up from 5 to 7, so I am inclined to file for an RMA on it anyway. I may monitor it for the next few days to see if the count goes up any further, which will mean a definite RMA.

*edit* I filed for RMA after my short SMART test ended up with "Completed: read failure".
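The monitoring step can be as simple as comparing two readings of the raw value; a sketch (the 5 and 7 are the numbers reported in this post, and a live reading would come from `smartctl -A /dev/ada5`):

```shell
# Sketch: watch Current_Pending_Sector between scrubs; a rising raw value
# is the usual RMA trigger. Values below are the ones from this post;
# in practice the current reading would come from `smartctl -A /dev/ada5`.
pending() {
  awk '$2 == "Current_Pending_Sector" { print $NF }'
}
before=5
after=$(printf '%s\n' '197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 7' | pending)
if [ "$after" -gt "$before" ]; then
  echo "pending sectors rising: $before -> $after (RMA the drive)"
else
  echo "pending sectors stable at $after"
fi
```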
 