Recurring Disk Errors

Status
Not open for further replies.

negabinary

Dabbler
Joined
Jun 24, 2016
Messages
11
I'm beginning to think that my server is haunted.

About a year ago, I had a problem where drives would randomly start showing read/write/cksum errors and then become UNAVAIL (all as reported by zpool status).

In trying to figure out where these errors were coming from, I replaced two of the four drives, added cooling, replaced the SATA cables, and switched to a SAS9211-8I SATA controller (from motherboard SATA). For unrelated reasons, I also upgraded the CPU. I wasn't sure which one of these things did the trick, but the system showed no disk errors whatsoever for the next nine months (up the whole time, never rebooted), and I thought the issue was finally solved.

However, I had to power cycle the server last week, and now the problem seems to be back. Initially, one drive started throwing write and checksum errors, then it went UNAVAIL. Then another drive started throwing errors and also went UNAVAIL. Then a third drive started throwing errors. This is the same pattern I was dealing with last year. I shut down the server and replaced the first drive with a brand-new spare.

Four days later, the brand-new drive has now gone UNAVAIL with 5 write errors, and I'm about to scrap this whole machine and give up.

These drives sit around 30 °C, even under load, and system load is fairly light in general. No SMART errors on the drives.

Specs are as follows:
  • Supermicro X10SLL-F-O
  • Xeon E3-1271v3
  • 32GB Crucial DDR3L ECC RAM
  • 4x WD Red 3TB
  • SAS9211-8I SATA Controller
  • FreeNAS 9.10-Stable
System log output from last night is attached.

Has anyone ever experienced something like this before? I'm willing to try anything at this point.
 

Attachments

  • 170503log.txt
    93.5 KB

negabinary

Dabbler
Joined
Jun 24, 2016
Messages
11
Have you checked/replaced the PSU?

Thanks for the reply.

The voltages all look OK to me, and the system is attached to a UPS, so the line power should be relatively consistent. Is there something else I can do to check the health of the PSU?

1.05V PCH    1.041 Volts
1.2V BMC     1.251 Volts
12V         11.744 Volts
3.3V AUX     3.265 Volts
3.3VCC       3.214 Volts
5V Dual      4.973 Volts
5VCC         4.973 Volts
VBAT         3.000 Volts
Vcpu         1.845 Volts
VDIMM        1.470 Volts
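
One rough way to sanity-check these readings is to compare the main rails against their nominal values. The sketch below assumes the ±5% tolerance that the ATX specification allows on the +12V, +5V, and +3.3V rails. The mapping of sensor names to nominal voltages is an assumption, and the function names are made up for illustration.

```python
# Readings copied from the sensor list above. The nominal values and the
# +/-5% band are assumptions (ATX tolerance for the main positive rails).
READINGS = {
    "12V": (12.0, 11.744),
    "5VCC": (5.0, 4.973),
    "5V Dual": (5.0, 4.973),
    "3.3VCC": (3.3, 3.214),
    "3.3V AUX": (3.3, 3.265),
}

TOLERANCE = 0.05  # assumed +/-5% band


def deviation(nominal, measured):
    """Return the signed fractional deviation from the nominal voltage."""
    return (measured - nominal) / nominal


def check_rails(readings, tolerance=TOLERANCE):
    """Map each rail name to (deviation, within_tolerance)."""
    report = {}
    for rail, (nominal, measured) in readings.items():
        dev = deviation(nominal, measured)
        report[rail] = (dev, abs(dev) <= tolerance)
    return report


if __name__ == "__main__":
    for rail, (dev, ok) in check_rails(READINGS).items():
        status = "within" if ok else "OUTSIDE"
        print(f"{rail:9s} {dev:+.2%}  {status} +/-5%")
```

By this check all five rails sit within about 3% of nominal. These are idle snapshots from the board sensors, though, so they would not necessarily reveal a supply that sags briefly under load, for example while several drives spin up at once.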
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Perhaps try changing the port that drive connects to and see if problem stays with drive or with the port.
 