Drive passed long SMART test but shows multiple checksum errors

Z300M · Jun 28, 2013

FreeNAS 8.3.1-p2

I have four Seagate ST32000641AS drives in RaidZ1 (plus an identical one as a spare) mounted in an iStarUSA 5-drive cage. One of these drives has started showing CKSUM errors after almost every copy operation. I ran the long SMART test, and the drive passed. I have swapped drives in the cage, so the problem drive is now connected by a different cable to a different motherboard port (thus ruling out the cable and the motherboard as possible problems), but the checksum errors continue with that specific drive -- and only that one.

The drive is still in warranty, but with no errors showing up in the SMART test, Seagate will not replace it. (In any case, based on past experience, Seagate would replace it with a refurbished current-production model, the ST2000DM001, which receives lower user ratings on NewEgg.)

What could be causing this problem?

jgreco · Jun 29, 2013

You haven't described your system. Are you using ECC memory/CPU/MB? Is your system running warm? It is possible for the data to be getting corrupted within the server itself, and the drive to be doing everything right. ECC memory significantly reduces the likelihood of that, as does making sure your server is running in a comfortable and appropriate environment.

I'll concede that your testing appears to make those possibilities unlikely, but heat in particular can also cause problems for the embedded controller on the drive. You may have one that's marginal. You're really best off swapping in the spare for now, then pulling the troublesome drive and running Seagate SeaTools on it until it hopefully spots a problem and pops a result code out, at which point Seagate probably won't question it.

Seagate's replacement policy is not quite as you describe; they will make a reasonable effort to replace your product with a refurbished unit of the same model. However, since the refurb process takes some substantial time, it is possible for them to run out of a given model, in which case they will often substitute a product that they consider "of equal or better value."

If I can be blunt, worrying about a slight ratings difference between the legacy 2TB and the DM001 model (both four stars! but the legacy has 62% 5 star compared to the DM's 47% 5 star) is a silly thing. The competence of NewEgg reviewers is not sufficiently consistent for that to be meaningful in a way you can rely on. It can actually be kind of fun to rummage through the one-stars, I find myself going "Idiot... moron...." as I read some NewEgg 1-stars. Try it. You'll run across stuff like "My coffee cup won't sit flat on top of it" or "I can't figure out how to insert the disk into my floppy drive" mixed in with some of the more-reasonable complaints.

Z300M · Jun 29, 2013

jgreco said:
You haven't described your system. Are you using ECC memory/CPU/MB? Is your system running warm? It is possible for the data to be getting corrupted within the server itself, and the drive to be doing everything right. ECC memory significantly reduces the likelihood of that, as does making sure your server is running in a comfortable and appropriate environment.

Non-ECC. It's an Asus F1A75-V Pro motherboard, AMD A4 3400 CPU, 16GB (2 sets of 2x4GB) G.Skill F3-12800CL9D-8GBRL RAM. No overclocking.

I'll concede that your testing appears to make those possibilities unlikely, but heat in particular can also cause problems for the embedded controller on the drive. You may have one that's marginal. You're really best off swapping in the spare for now, then pulling the troublesome drive and running Seagate SeaTools on it until it hopefully spots a problem and pops a result code out, at which point Seagate probably won't question it.

SMART shows the drive temperature at 36 degrees C.

I'll do the drive swap later.

Seagate's replacement policy is not quite as you describe; they will make a reasonable effort to replace your product with a refurbished unit of the same model. However, since the refurb process takes some substantial time, it is possible for them to run out of a given model, in which case they will often substitute a product that they consider "of equal or better value."

I was basing my claim on what happened when I RMA'd another ST32000641AS. It was a refurbished ST2000DM001 that they sent, and at that time the NewEgg reports on the latter drive were far worse -- and worse than the more recent ones.

If I can be blunt, worrying about a slight ratings difference between the legacy 2TB and the DM001 model (both four stars! but the legacy has 62% 5 star compared to the DM's 47% 5 star) is a silly thing. The competence of NewEgg reviewers is not sufficiently consistent for that to be meaningful in a way you can rely on. It can actually be kind of fun to rummage through the one-stars, I find myself going "Idiot... moron...." as I read some NewEgg 1-stars. Try it. You'll run across stuff like "My coffee cup won't sit flat on top of it" or "I can't figure out how to insert the disk into my floppy drive" mixed in with some of the more-reasonable complaints.

I agree that some of the NewEgg customer comments and ratings are ludicrous -- just like some of the ones on Amazon.

jgreco · Jun 29, 2013

Z300M said:
Non-ECC. It's an Asus F1A75-V Pro motherboard, AMD A4 3400 CPU, 16GB (2 sets of 2x4GB) G.Skill F3-12800CL9D-8GBRL RAM. No overclocking.

I would suggest running memtest86 just to be sure.

Hopefully the Seagate tool can locate a problem.

Z300M · Jun 29, 2013

jgreco said:
I would suggest running memtest86 just to be sure.

Hopefully the Seagate tool can locate a problem.

I haven't run memtest86 yet.

The first two runs of the Long Test in SeaTools for DOS quit at 5%, with gobbledegook on the screen and a high-pitched beep from the machine (the NAS machine), but I could not track it down; neither the fan speed nor the temperature warning lights on the drive cage changed from their usual green.

Same thing at 2% on the third run.

I now have it hooked up to a Win7 machine in an eSATA docking station. I think it's telling me that it's going to take 6hr 15min.

jgreco · Jun 29, 2013

Well, that suggests trouble. :(

Z300M · Jun 30, 2013

jgreco said:
Well, that suggests trouble. :(

BUT the long test on the Win7 machine with SeaTools for Windows completed without error.

And memtest86 ran on the FreeNAS machine without error.

Could it be that this particular drive has decided that it does not like something in the FreeNAS machine? Perhaps I should try changing the SATA speed to 3Gb/sec., even though it ran at 6Gb/sec. for almost two years.

jgreco · Jun 30, 2013

Suspect all the hardware, especially the power supply, etc. Sometimes the stuff does not last forever.

Z300M · Jun 30, 2013

jgreco said:
Suspect all the hardware, especially the power supply, etc. Sometimes the stuff does not last forever.

The power supply is a few years old, but voltages are within spec., even with many drives doing their thing simultaneously.

I replaced the old drive with the spare, but while the Detach operation on the old one was still in progress the machine hit a page fault and dropped to a db> prompt at the console. Now that I have restarted the machine, the GUI shows only three of the four drives online, plus a Replace button for the fourth -- BUT the only drive it offers as a candidate is the one that had been replaced before: the "new fourth drive" (the former spare) is nowhere to be seen in the GUI.

cyberjock · Jun 30, 2013

It sounds almost like you had a 1 disk failure and a second disk had some corruption. This is why RAIDZ1 isn't recommended.

Do you do scrubs regularly?

Do you use a UPS to ensure the system shuts down cleanly on a loss of power?

Did you disable sync writes?

Long term, it's looking like you will need to back up your zpool, then destroy and recreate it to resolve your issues permanently.

At this point my questions are trying to determine how you got corruption, not whether you have corruption. It seems that corruption is pretty much a guaranteed fact right now. What's important is that you figure out how/why you got the corruption so it doesn't happen again.

Edit: it is possible (but very unlikely) that your PSU is bad even if the voltages are good. As a person who has built voltage regulator circuits from bare components, I can tell you that you can have issues that don't manifest themselves unless the diodes get hot enough, etc. I'd put the PSU down as a "back burner" consideration and continue troubleshooting elsewhere. What brand/model PSU do you have?

Z300M · Jun 30, 2013

cyberjock said:
It sounds almost like you had a 1 disk failure and a second disk had some corruption. This is why RAIDZ1 isn't recommended.

I'm in the process of trying to redo this whole system in Raidz2. I have most of the data backed up already.

Do you do scrubs regularly?

From time to time. The most recent was last week.

Do you use a UPS to ensure the system shuts down cleanly on a loss of power?

Yes.

Did you disable sync writes?

I didn't ENable them. Are they enabled by default?

Long term, it's looking like you will need to back up your zpool, then destroy and recreate it to resolve your issues permanently.

At this point my questions are trying to determine how you got corruption, not whether you have corruption. It seems that corruption is pretty much a guaranteed fact right now. What's important is that you figure out how/why you got the corruption so it doesn't happen again.

Edit: it is possible (but very unlikely) that your PSU is bad even if the voltages are good. As a person who has built voltage regulator circuits from bare components, I can tell you that you can have issues that don't manifest themselves unless the diodes get hot enough, etc. I'd put the PSU down as a "back burner" consideration and continue troubleshooting elsewhere. What brand/model PSU do you have?

The PSU is about seven years old -- an Antec Neo HE550. The UPS to which it's connected has a power output metering function, and I don't think I've ever seen it go over 100W, even with all drives spinning.

cyberjock · Jun 30, 2013

Sync writes are enabled by default. Disabling them can give a substantial increase in performance, but at a significant cost: with sync writes disabled, zpool corruption can occur if the system experiences an unplanned shutdown or kernel panic.

Eh, 7 years is longer than I trust a PSU for. Typically after about 5 years I won't trust a PSU in a new expensive build. It's really hard to say when a PSU will go bad. Some might only last a year or two, others more than a decade. I get very wary of old PSUs because if they overvolt your hardware, they can destroy every component in your computer at the same time. That makes for an expensive fix.

Your "suspect" hard drive may be suffering from silent corruption. You could use a program like Hard Disk Sentinel to do a write, read, verify test to see if the hard drive has problems. Unfortunately, you are kind of on your own to find the cause of your errors. From what I've read (and there's little documentation explaining this), write and read errors usually come from your hard drive telling the system that the write or read failed, while checksum errors basically mean that the hard drive read data without reporting a failure, but by the time the data got to the CPU to be processed it was corrupted. If you hadn't been using ZFS you would have corrupt data. It could be the hard drive, it could be the SATA cable, it could be flaky voltage getting to the hard drive, it could be that the data was stored in a bad location in RAM, and there are many more places. Basically the entire data path between the CPU and the hard drive platter is suspect. Some of these you have ruled out already, but you are kind of on your own to figure out where the problem hardware is and get rid of it. :(
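To see which device is collecting checksum errors as opposed to read or write errors, the per-device counters printed by zpool status can be filtered. This is only a rough sketch: it assumes the usual NAME / STATE / READ / WRITE / CKSUM column layout, and the function name cksum_errors and the pool name "tank" are made up for illustration.

```shell
#!/bin/sh
# Reads `zpool status` text on stdin and prints "device count" for every
# row whose CKSUM column (5th field) is non-zero.
# Assumption: rows look like "NAME STATE READ WRITE CKSUM".
cksum_errors() {
    awk 'NF >= 5 && $3 ~ /^[0-9]/ && $4 ~ /^[0-9]/ && $5 ~ /^[0-9]/ && $5 != "0" { print $1, $5 }'
}

# Usage ("tank" is a placeholder pool name):
#   zpool status tank | cksum_errors
```

A drive that shows up here with zeros in its READ and WRITE columns is the pattern described above: the drive reported success, but ZFS found that the data did not match its checksum.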

It's possible that if you do a write, read, verify test you may trigger the hard drive logic to realize that something is wrong. I gave up on Seagate years ago because of flaky firmware issues. You might be in that same bucket.

Good luck!

Z300M · Jul 6, 2013

cyberjock said:
Sync writes are enabled by default. Disabling them can give a substantial increase in performance, but at a significant cost: with sync writes disabled, zpool corruption can occur if the system experiences an unplanned shutdown or kernel panic.

Eh, 7 years is longer than I trust a PSU for. Typically after about 5 years I won't trust a PSU in a new expensive build. It's really hard to say when a PSU will go bad. Some might only last a year or two, others more than a decade. I get very wary of old PSUs because if they overvolt your hardware, they can destroy every component in your computer at the same time. That makes for an expensive fix.

Your "suspect" hard drive may be suffering from silent corruption. You could use a program like Hard Disk Sentinel to do a write, read, verify test to see if the hard drive has problems. Unfortunately, you are kind of on your own to find the cause of your errors. From what I've read (and there's little documentation explaining this), write and read errors usually come from your hard drive telling the system that the write or read failed, while checksum errors basically mean that the hard drive read data without reporting a failure, but by the time the data got to the CPU to be processed it was corrupted. If you hadn't been using ZFS you would have corrupt data. It could be the hard drive, it could be the SATA cable, it could be flaky voltage getting to the hard drive, it could be that the data was stored in a bad location in RAM, and there are many more places. Basically the entire data path between the CPU and the hard drive platter is suspect. Some of these you have ruled out already, but you are kind of on your own to figure out where the problem hardware is and get rid of it. :(

It's possible that if you do a write, read, verify test you may trigger the hard drive logic to realize that something is wrong. I gave up on Seagate years ago because of flaky firmware issues. You might be in that same bucket.

Good luck!

A multimeter showed that the old PSU was within spec. but only just, and a PSU tester showed the 5V line to be low, so I bought a new one. But the problem with that particular drive remains -- and now I see that the drive from time to time simply quits responding, disappears even from the BIOS, and does not show up again until I cycle the power (easy to do because there is a power button for each drive in the cage). Yet still a long smartctl test shows no errors.

I have now successfully replaced the flaky drive by the spare. At least, I hope it is successful: zpool status shows the pool as healthy now, but copying a large quantity of data to it will be the test.

I bought HD Sentinel, and I'm glad I had found a 60%-off coupon, because I assumed that the "portable" version was going to allow me to create a bootable CD or thumb drive that I could use on the NAS machine itself; I did not realize that even the "portable" version is still a Windows-only program, meaning that any drive I want to test will have to be installed in (or at least connected to) a Windows machine.

cyberjock · Jul 6, 2013

Z300M said:
A multimeter showed that the old PSU was within spec. but only just, and a PSU tester showed the 5V line to be low, so I bought a new one. But the problem with that particular drive remains -- and now I see that the drive from time to time simply quits responding, disappears even from the BIOS, and does not show up again until I cycle the power (easy to do because there is a power button for each drive in the cage). Yet still a long smartctl test shows no errors.

Eh, cause for concern, but probably not the smoking gun.

Z300M said:
I have now successfully replaced the flaky drive by the spare. At least, I hope it is successful: zpool status shows the pool as healthy now, but copying a large quantity of data to it will be the test.

Good choice.

Z300M said:
I bought HD Sentinel, and I'm glad I had found a 60%-off coupon, because I assumed that the "portable" version was going to allow me to create a bootable CD or thumb drive that I could use on the NAS machine itself; I did not realize that even the "portable" version is still a Windows-only program, meaning that any drive I want to test will have to be installed in (or at least connected to) a Windows machine.

Yeah. I really wish there was a way to write a test pattern to the hard drive with the dd command, then read it back and compare the results with what should have been written. Unfortunately I haven't been able to find one. There's a "like" for anyone that provides such a command for FreeBSD or Linux :D

paleoN · Jul 6, 2013

cyberjock said:
Yeah. I really wish there was a way to write a test pattern to the hard drive with the dd command, then read it back and compare the results with what should have been written.

Code:

badblocks -svw -b 4096 -t 0xFF -t 0x00 -t 0xFF /dev/adaX

It's also been included on GParted Live for some time. Feel free to skip the like.
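For a dd-only check along the lines asked for above, something like the following might do. It is a minimal sketch, not a replacement for badblocks: it writes a 0xFF pattern over the first COUNT MiB of the target, reads it back and compares with cmp. The function name verify_pattern is invented for this example, and the test is DESTRUCTIVE for whatever it is pointed at.

```shell
#!/bin/sh
# DESTRUCTIVE: overwrites the first COUNT MiB of TARGET with 0xFF bytes.
# verify_pattern TARGET COUNT
#   prints "OK" and returns 0 if the data read back matches,
#   prints "MISMATCH" and returns 1 otherwise.
verify_pattern() {
    target=$1
    count=$2
    ref=$(mktemp) || return 2

    # Build a reference file of COUNT MiB of 0xFF bytes.
    dd if=/dev/zero bs=1048576 count="$count" 2>/dev/null |
        LC_ALL=C tr '\000' '\377' > "$ref"

    # Write the pattern to the target without truncating it.
    dd if="$ref" of="$target" bs=1048576 conv=notrunc 2>/dev/null

    # Read the same region back and compare it with the reference.
    if dd if="$target" bs=1048576 count="$count" 2>/dev/null | cmp -s - "$ref"; then
        echo OK
        rc=0
    else
        echo MISMATCH
        rc=1
    fi
    rm -f "$ref"
    return $rc
}

# Usage (device name is a placeholder; this wipes the start of the disk):
#   verify_pattern /dev/adaX 1024
```

One caveat: depending on the OS, the read-back may be served from cache rather than from the platter, so a pass here is weaker evidence than a full badblocks run.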

Z300M · Jul 6, 2013

cyberjock said:
Eh, cause for concern, but probably not the smoking gun.

I now have the misbehaving drive hooked up to a Windows 7 machine via the eSATA docking station and have let HD Sentinel have at it. So far, under Overview I see "Problems occurred between the communication of the disk and the host 162 times" and under Alerts "Failure Predicted" followed by "Current Pending Sector Count" and "Reallocated Sector Count" -- but a short SMART test showed no errors. It's still performing yet another long test.

But if I can't get SMART to register some significant error (which even a long test hasn't so far), how do I get Seagate to replace it under warranty? Perhaps I have to give it a really strenuous workout and hope that it registers a more serious error.

cyberjock · Jul 7, 2013

If you are having communication errors those should show up in your SMART data. My first guess would be under UDMA_CRC_Error_Count.

Edit: I'm not sure I would have used eSATA for the test. I've had problems with eSATA in the past. It can be implemented several different ways and, depending on your motherboard manufacturer (assuming it's onboard), the quality can vary from "excellent" to "very poor".
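To check that attribute from the FreeNAS shell, the attribute table from smartctl -A can be filtered. A small sketch, assuming the standard ten-column attribute table with the raw value in the last column; the function name crc_error_count is just for illustration.

```shell
#!/bin/sh
# Reads `smartctl -A` output on stdin and prints the raw value of
# UDMA_CRC_Error_Count (10th column of the attribute table).
crc_error_count() {
    awk '$2 == "UDMA_CRC_Error_Count" { print $10 }'
}

# Usage (device name is a placeholder):
#   smartctl -A /dev/adaX | crc_error_count
```

A raw value that keeps climbing between checks points at the link (cable, port, dock) rather than the platters.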
