Multiple Disk Errors

clarknova

Explorer
Joined
Sep 22, 2015
Messages
66
This is quite the mess, so I apologise in advance if it's confusing. It's also possible that I get some steps out of order as I attempt to recount what has been done to this point.

Freenas 11.2 U3 (U7)
Supermicro 847-12 (Dell R630)
Supermicro 4U disk shelf SAS

We had Freenas 11.2 U3 running on the Supermicro hardware for a total of 8U and 60+ 4TB SATA disks. All disks were allocated to a single ZFS pool in multiple raidz2 vdevs. I believe it was 8 disks per vdev.

The initial problem was that the server began resetting spontaneously. It has two power supplies and was able to run with either of them removed, so I thought it was unlikely that a bad power supply was causing the resets. I updated Freenas to 11.2 U7, but the resets continued to occur. We were seeing some reported errors on 3 disks, so I replaced them and started to resilver one of them.

As the resilver was in progress Freenas began resetting more frequently, to the point it would no longer stay up long enough to import the zfs pool. The two attached photos of the console were taken during this time. Some of the visible errors include:

(probe0:mps0:0:54:0): CAM status: CCB request completed with an error
panic: I/O to pool 'pool1' appears to be hung on vdev guid 4910667192425213778 at 'dev/gptid/7f....'

Freenas was able to start and apparently run stable if we disconnected the SAS drive shelf, so we removed the zfs pool and were able to determine that one of the enclosures on the attached shelf was not reliably detecting the attached disks (all of the disks on the back side of the shelf).

We installed a SAS HBA (LSI 9200-8e) and attached the drive shelf to that instead of the on-board Supermicro SAS ports. In this configuration Freenas could see all of the disks but would still reset if we tried to import the pool.

We replaced the 4U Supermicro head with a Dell R630 and added a second 4U Supermicro dumb shelf to accommodate the disks that came out of the SM head. We placed the LSI HBA into the Dell and chained the two SM shelves to it. Now Freenas will boot fine if the shelves are disconnected, but once the shelves are connected the disk errors come non-stop. The errors are reported on many, perhaps all, attached disks. If I attempt to boot Freenas with the shelves connected, the errors appear on the console non-stop and Freenas appears to never start fully. I am not able to connect to the web UI or ssh. The iDRAC console shows the string of errors but never displays the expected Freenas console. The attached screen capture of the iDRAC console shows this situation.
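For anyone wanting to check whether errors like these really come from every disk or only from a few targets, here is a minimal Python sketch that tallies CAM status lines per controller and target. It assumes the console output has been saved as text and that the lines follow the "(periph:sim:bus:target:lun): CAM status: ..." shape seen in the error quoted in this post; the function name tally_cam_errors and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Matches CAM lines shaped like:
# "(probe0:mps0:0:54:0): CAM status: CCB request completed with an error"
CAM_LINE = re.compile(
    r"\((?P<periph>\w+):(?P<sim>\w+):(?P<bus>\d+):(?P<target>\d+):(?P<lun>\d+)\):\s*"
    r"CAM status:\s*(?P<status>.+)"
)


def tally_cam_errors(lines):
    """Count CAM status lines per (controller, target) pair."""
    counts = Counter()
    for line in lines:
        match = CAM_LINE.search(line)
        if match:
            key = (match.group("sim"), int(match.group("target")))
            counts[key] += 1
    return counts


# Illustrative sample only; real input would be the saved console log.
sample = [
    "(probe0:mps0:0:54:0): CAM status: CCB request completed with an error",
    "(probe0:mps0:0:55:0): CAM status: CCB request completed with an error",
    "unrelated console line",
]

for (sim, target), count in sorted(tally_cam_errors(sample).items()):
    print(f"{sim} target {target}: {count} error line(s)")
```

If nearly every target behind one controller shows up in the tally, that suggests something shared (HBA, cable, expander) rather than the individual disks, though the tally alone cannot prove it.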

At some point we also replaced the original dumb SM shelf with another one, but the result did not change.

At this point I believe I have 1 disk that is mid-resilvered and 2 disks that have been replaced with new disks that have not been resilvered. There should be enough redundancy in place that the pool data is still intact, unless something has been damaged as we've attempted to troubleshoot this, but I don't believe that to be the case.
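The redundancy reasoning can be sketched as below. A raidz2 vdev tolerates up to two missing or unusable members, so whether three affected disks are survivable depends on how they are spread across vdevs. The vdev names and counts here are hypothetical, not this pool's actual layout, and the sketch treats a mid-resilver disk as unusable, which is the pessimistic assumption.

```python
# Number of member disks each raidz level can lose per vdev.
RAIDZ_PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


def pool_survives(missing_per_vdev, level="raidz2"):
    """Return True if no vdev has lost more disks than its parity allows."""
    parity = RAIDZ_PARITY[level]
    return all(missing <= parity for missing in missing_per_vdev.values())


# Hypothetical layouts: three affected disks spread out vs. all in one vdev.
spread_out = {"raidz2-0": 1, "raidz2-1": 1, "raidz2-2": 1}
same_vdev = {"raidz2-0": 3}

print(pool_survives(spread_out))  # True
print(pool_survives(same_vdev))   # False
```

Losing any single vdev loses the pool, which is why the check is per vdev rather than a total across the pool.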

We've now replaced all of the involved hardware except the disks. I don't know why we are seeing errors on most or all disks when the pool was apparently operating normally until suddenly nothing was working, unless many disks were suddenly damaged by my actions or a hardware failure.

1. What has caused this failure?
2. What is the recommended action to try to get this pool back?

edit: added attachments
more error text for searchability: "CAM status: CCB request completed with an error"
"syslog-ng[5445]: I/O error occurred while writing; fd='23', error='Message too long (40)'"

Attachments

  • 20191219_142306.jpg
  • 20191219_143627.jpg
  • Screenshot at 2020-01-02 10-32-06.png

clarknova

After approximately 24 hours of uptime this is what the console looks like. It takes about 2 seconds for these messages to scroll up the entire screen, so roughly 20-25 lines per second.
 

Attachments

  • Screenshot at 2020-01-03 08-18-29.png

clarknova

I replaced the HBA with a new one and the errors are gone now. Always fun troubleshooting when you replace one bad part with another bad part. :P
 