Interpreting Badblocks Output

NannerMuffin · Feb 6, 2015

Good Morning All,

I just finished my new build and completed several days' worth of CPU and memory testing. Two passes of short and long S.M.A.R.T. tests completed without errors.

Following qwertymodo's "[How To] Hard Drive Burn-In Testing" guide, I started a badblocks test last night. Two of my disks are showing massive "compare?" errors.

Code:

[admin@freenas /]$ sudo badblocks -ws /dev/ada0
Password:
Testing with pattern 0xaa: set_o_direct: Inappropriate ioctl for device
done
22.49% done, 10:31:41 elapsed. (0/0/46871888 errors)

[admin@freenas /]$ sudo badblocks -ws /dev/ada1
Password:
Testing with pattern 0xaa: set_o_direct: Inappropriate ioctl for device
done
Reading and comparing:  52.04% done, 10:31:19 elapsed. (0/0/0 errors)

[admin@freenas /]$ sudo badblocks -ws /dev/ada2
Password:
Testing with pattern 0xaa: set_o_direct: Inappropriate ioctl for device
done
Reading and comparing:  51.89% done, 10:30:51 elapsed. (0/0/0 errors)

[admin@freenas /]$ sudo badblocks -ws /dev/ada3
Password:
Testing with pattern 0xaa: set_o_direct: Inappropriate ioctl for device
done
Reading and comparing:  49.78% done, 10:30:17 elapsed. (0/0/0 errors)

[admin@freenas /]$ sudo badblocks -ws /dev/ada4
Password:
Testing with pattern 0xaa: set_o_direct: Inappropriate ioctl for device
done
Reading and comparing:  47.66% done, 10:29:10 elapsed. (0/0/0 errors)

[admin@freenas /]$ sudo badblocks -ws /dev/ada5
Password:
Testing with pattern 0xaa: set_o_direct: Inappropriate ioctl for device
done
50.08%  50.28% done, 10:28:28 elapsed. (0/0/5328896 errors)

I need to RMA drives ada0 & ada5, correct? I'll continue with the final long S.M.A.R.T. test per qwertymodo's guide, but I was wondering if a badblocks test will qualify for an RMA?
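For anyone reading the progress lines above: going by the badblocks source, the three numbers in parentheses are read errors / write errors / corruption (compare) errors, so ada0 and ada5 are reporting compare mismatches only. Below is a minimal Python sketch for pulling those counters out of a progress line. It assumes the line format shown in the output above, and `parse_progress` is a hypothetical helper name, not part of badblocks.

```python
import re

# Matches the tail of a badblocks progress line such as
# "22.49% done, 10:31:41 elapsed. (0/0/46871888 errors)".
PROGRESS_RE = re.compile(
    r"(?P<pct>\d+(?:\.\d+)?)% done, (?P<elapsed>\d+:\d{2}:\d{2}) elapsed\. "
    r"\((?P<read>\d+)/(?P<write>\d+)/(?P<compare>\d+) errors\)"
)


def parse_progress(line):
    """Return the counters from one progress line, or None if it does not match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    return {
        "percent": float(match.group("pct")),
        "elapsed": match.group("elapsed"),
        "read_errors": int(match.group("read")),
        "write_errors": int(match.group("write")),
        "compare_errors": int(match.group("compare")),
    }


if __name__ == "__main__":
    print(parse_progress("22.49% done, 10:31:41 elapsed. (0/0/46871888 errors)"))
```

Because the pattern is searched for rather than anchored, it also copes with lines where the terminal left a stale percentage in front of the current one.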

Build Specs:

MotherBoard: Supermicro X10SLH-F
CPU: Intel Core i3-4370
RAM: 2x 8GB Samsung 1600 MHz ECC M391B1G73QH0-YK0 (Supermicro X10SLH-F Certified)
HDD: 6x 3TB WD Reds WD30EFRX (RAIDZ2)
PSU: Seasonic SSR-360GP
UPS: CyberPower CP1500PFCLCD

Ericloewe · Feb 6, 2015

NannerMuffin said:
...I was wondering if a badblocks test will qualify for an RMA?

No, but errors logged by it should also show up in SMART.

NannerMuffin · Feb 6, 2015

Thanks.

Has anyone encountered errors on a badblocks test that was not reported on a S.M.A.R.T. test? Any reports of false positive errors, or is badblocks a bullet-proof test? Furthermore, is a short/long/conveyance S.M.A.R.T. test bullet-proof for determining the health of a disk?

Ericloewe · Feb 6, 2015

NannerMuffin said:
Thanks.

Has anyone encountered errors on a badblocks test that was not reported on a S.M.A.R.T. test?

It's possible, but unlikely (and I haven't seen that happen).

NannerMuffin said:
Any reports of false positive errors, or is badblocks a bullet-proof test? Furthermore, is a short/long/conveyance S.M.A.R.T. test bullet-proof for determining the health of a disk?

Badblocks can't distinguish between "that operation failed because of a bad block" and "that operation failed because of a random bitflip caused by a cosmic ray". The latter doesn't have much of a bearing on the drive's life. SMART should show the sector as uncorrectable in the first case, and in the latter case just log an increased Read Error Rate or similar. That said, nothing is really bulletproof - but SMART helps a lot.
Think of badblocks more like a stress test, not an integrity check.

cyberjock · Feb 6, 2015

Ericloewe said:
Badblocks can't distinguish between "that operation failed because of a bad block" and "that operation failed because of a random bitflip caused by a cosmic ray". The latter doesn't have much of a bearing on the drive's life. SMART should show the sector as uncorrectable in the first case, and in the latter case just log an increased Read Error Rate or similar. That said, nothing is really bulletproof - but SMART helps a lot.
Think of badblocks more like a stress test, not an integrity check.

This isn't correct. ;)

Badblocks simply does a write/read/verify test. Nothing else. If "what goes out isn't what comes in" then it reports an error. In essence, the data in that location (or the data in transit at that moment) was not accurate.

Hard drives have their own internal ECC. For 4k drives this is an additional 100 bytes. That typical 4k sector you think stores data is actually 4211 bytes. The rest is for other overhead internal to the drive and is unusable disk space for you. If there was a single bit-flip from cosmic radiation on the platter itself, the disk would read the error and correct it using its own internal ECC. You, as the end user, would be totally unaware of this correction (and believe it or not, this correction happens far more often than you'd expect). Some Seagate disks report the errors as a SMART parameter, and the error rate increasing rapidly can be an indicator of a disk that is beginning to fail.

Anyway, if the disk can't correct the error because too many bytes are corrupted (or the ECC happens to still match and you have "silent corruption") then you get various ranges of errors from nothing at all to silent corruption. I won't hypothesize what those scenarios are because the disks themselves and how they handle the error condition can change from firmware revision to firmware revision.

To get back on-topic, badblocks simply does a write/read/verify test. If the test pattern isn't read back you'll get a reported error in badblocks. It's your job as the administrator to determine what the cause was and to take action (if any). If you were to be doing the read test on a disk and you unplugged it, you'd get a crapload of read errors. Hopefully you'd know better and not do that, or you'd know you need to plug the disk back in and perform the test again.
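The write/read/verify idea can be sketched in a few lines of Python. This is only an illustration against an in-memory buffer, not how badblocks is implemented: the 4096-byte block size, the helper names, and the simulated corruption are all assumptions made for the example.

```python
import io

BLOCK_SIZE = 4096  # assumed block size for this sketch


def write_pattern(dev, pattern, num_blocks):
    """Fill num_blocks blocks of dev with a single repeated byte."""
    dev.seek(0)
    block = bytes([pattern]) * BLOCK_SIZE
    for _ in range(num_blocks):
        dev.write(block)


def verify_pattern(dev, pattern, num_blocks):
    """Read every block back and return the indices that do not match."""
    dev.seek(0)
    expected = bytes([pattern]) * BLOCK_SIZE
    mismatched = []
    for index in range(num_blocks):
        if dev.read(BLOCK_SIZE) != expected:
            mismatched.append(index)
    return mismatched


if __name__ == "__main__":
    dev = io.BytesIO()  # stand-in for a disk; nothing real is touched
    write_pattern(dev, 0xAA, 8)

    # Simulate "what comes back is not what went out" in block 3.
    dev.seek(3 * BLOCK_SIZE + 10)
    dev.write(b"\x00")

    print(verify_pattern(dev, 0xAA, 8))  # [3]
```

Note that the check has no idea why block 3 came back different. A bad sector, a flaky cable, or something else writing to the device would all look the same here, which is why working out the cause is left to the administrator.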

Ericloewe · Feb 6, 2015

cyberjock said:
This isn't correct. ;)

Badblocks simply does a write/read/verify test. Nothing else. If "what goes out isn't what comes in" then it reports an error. In essence, the data in that location (or the data in transit at that moment) was not accurate.

Hard drives have their own internal ECC. For 4k drives this is an additional 100 bytes. That typical 4k sector you think stores data is actually 4211 bytes. The rest is for other overhead internal to the drive and is unusable disk space for you. If there was a single bit-flip from cosmic radiation on the platter itself, the disk would read the error and correct it using its own internal ECC. You, as the end user, would be totally unaware of this correction (and believe it or not, this correction happens far more often than you'd expect). Some Seagate disks report the errors as a SMART parameter, and the error rate increasing rapidly can be an indicator of a disk that is beginning to fail.

Anyway, if the disk can't correct the error because too many bytes are corrupted (or the ECC happens to still match and you have "silent corruption") then you get various ranges of errors from nothing at all to silent corruption. I won't hypothesize what those scenarios are because the disks themselves and how they handle the error condition can change from firmware revision to firmware revision.

To get back on-topic, badblocks simply does a write/read/verify test. If the test pattern isn't read back you'll get a reported error in badblocks. It's your job as the administrator to determine what the cause was and to take action (if any). If you were to be doing the read test on a disk and you unplugged it, you'd get a crapload of read errors. Hopefully you'd know better and not do that, or you'd know you need to plug the disk back in and perform the test again.

Poor choice of words, I'll admit - the pedant in me is taking a vacation :p

I'll subscribe to your much more accurate explanation.

NannerMuffin · Feb 6, 2015

cyberjock said:
If you were to be doing the read test on a disk and you unplugged it, you'd get a crapload of read errors. Hopefully you'd know better and not do that, or you'd know you need to plug the disk back in and perform the test again.

No disks were unplugged during the test, so 46 million errors on ada0 & 5 million on ada5 is very concerning. These are brand-spanking-new disks handled with TLC (at least by me :))!

Badblocks phase #2 kicked off a couple of hours ago and the errors haven't increased. The phase #1 test had all errors reported by the time it reached 35%. If the errors were generated by truly bad sectors, I should expect the error count to double by the time phase #2 reaches 35%, right? Should I wait for all 4 phases to complete, or should I kill the tests on ada0 & ada5 and run a long S.M.A.R.T. test instead?

cyberjock · Feb 6, 2015

The SMART test is only a read test. It may or may not even check the CRCs of the sectors, so it's definitely not as thorough a test as badblocks.

NannerMuffin · Feb 6, 2015

Phase #2 just surpassed 50% on ada5.

No increase in errors from phase #1.

Code:

51.41% done, 17:53:53 elapsed. (0/0/5328896 errors)

NannerMuffin · Feb 6, 2015

And now phase #2 just surpassed 22% on ada0.

No increase in errors from phase #1.

Code:

24.74% done, 18:33:26 elapsed. (0/0/46871888 errors)

What gives? Gremlins in my HDD?

cyberjock · Feb 6, 2015

So it probably wasn't the disk. But that begs the question "what was it?"

NannerMuffin · Feb 6, 2015

Still no errors incrementing, and it is on phase 3 of the badblocks test.

cyberjock said:
So it probably wasn't the disk. But that begs the question "what was it?"

I agree that it is probably not the drives. The system was not moved, touched, breathed on, or even looked at for the duration of testing, so that is a lot of cosmic rays!

I am going to let all 4 phases of badblocks complete and run a long S.M.A.R.T. test.

Stay tuned...

NannerMuffin · Feb 7, 2015

Oh boy, I am learning quickly about badblocks...

I did not realize the standard badblocks test is comprised of four 2-part phases. The first part of each phase is the write test, and the second part is the read & verify test. Each phase simply uses a different bit pattern.
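The phase structure described above can be laid out as a simple schedule. This is only an illustrative sketch: the four patterns are the ones the badblocks man page lists for -w (0xaa, 0x55, 0xff, 0x00), and `schedule` is a made-up helper name, not something badblocks exposes.

```python
# Patterns as listed in the badblocks man page for the -w write-mode test.
PATTERNS = [0xAA, 0x55, 0xFF, 0x00]


def schedule(patterns=PATTERNS):
    """Return the ordered (phase, pattern, step) tuples of a full -w run."""
    steps = []
    for phase, pattern in enumerate(patterns, start=1):
        label = f"0x{pattern:02x}"
        steps.append((phase, label, "write"))
        steps.append((phase, label, "read and compare"))
    return steps


if __name__ == "__main__":
    for phase, pattern, step in schedule():
        print(f"phase {phase}: pattern {pattern}: {step}")
```

That gives eight full passes over the disk. The compare-error counter can only grow during a "read and compare" step, so a flat counter during the write half of a phase says nothing about the disk either way.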

NannerMuffin said:
And now phase #2 just surpassed 22% on ada0.
No increase in errors from phase #1.

Code:
24.74% done, 18:33:26 elapsed. (0/0/46871888 errors)

What gives? Gremlins in my HDD?

This part of the phase was only the write test. The second read & verify test completed overnight and the errors doubled on ada0.

Code:

16.60% done, 34:24:34 elapsed. (0/0/95172957 errors)

NannerMuffin said:
Phase #2 just surpassed 50% on ada5.
No increase in errors from phase #1.

Code:
51.41% done, 17:53:53 elapsed. (0/0/5328896 errors)

However, this is not the case on ada5. No increase in errors.

Code:

74.80% done, 34:21:02 elapsed. (0/0/5328896 errors)

Time to kill the badblocks tests on ada0 & ada5 to switch to long S.M.A.R.T. tests.

Ericloewe · Feb 7, 2015

NannerMuffin said:
Oh boy, I am learning quickly about badblocks...

I did not realize the standard badblocks test is comprised of four 2-part phases. The first part of each phase is the write test, and the second part is the read & verify test. Each phase simply uses a different bit pattern.

This part of the phase was only the write test. The second read & verify test completed overnight and the errors doubled on ada0.

Code:
16.60% done, 34:24:34 elapsed. (0/0/95172957 errors)

However, this is not the case on ada5. No increase in errors.

Code:
74.80% done, 34:21:02 elapsed. (0/0/5328896 errors)

Time to kill the badblocks tests on ada0 & ada5 to switch to long S.M.A.R.T. tests.

Ah, yes. The "This thing takes *how* long???" phase everyone goes through the first time they run badblocks.

cyberjock · Feb 7, 2015

Hey, when I did badblocks on my 6TB drives it took something like 30 hours per pass!

NannerMuffin · Feb 7, 2015

A long S.M.A.R.T. test completed without errors on ALL drives. What gives.....?

Is this a first? Massive error count on badblocks, but a clean bill of health from S.M.A.R.T.

What am I missing?

Code:

[admin@freenas] /% sudo smartctl -l selftest /dev/ada0
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p8 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error  00%  77  -
# 2  Conveyance offline  Completed without error  00%  70  -
# 3  Short offline  Completed without error  00%  70  -
# 4  Conveyance offline  Completed without error  00%  67  -
# 5  Extended offline  Completed without error  00%  28  -
# 6  Short offline  Completed without error  00%  8  -
# 7  Extended offline  Completed without error  00%  7  -
# 8  Short offline  Completed without error  00%  0  -

[admin@freenas] /% sudo smartctl -l selftest /dev/ada1
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p8 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error  00%  77  -
# 2  Conveyance offline  Completed without error  00%  70  -
# 3  Short offline  Completed without error  00%  70  -
# 4  Conveyance offline  Completed without error  00%  67  -
# 5  Extended offline  Completed without error  00%  28  -
# 6  Short offline  Completed without error  00%  8  -
# 7  Extended offline  Completed without error  00%  7  -
# 8  Short offline  Completed without error  00%  0  -

[admin@freenas] /% sudo smartctl -l selftest /dev/ada2
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p8 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error  00%  77  -
# 2  Conveyance offline  Completed without error  00%  70  -
# 3  Short offline  Completed without error  00%  70  -
# 4  Conveyance offline  Completed without error  00%  67  -
# 5  Extended offline  Completed without error  00%  28  -
# 6  Short offline  Completed without error  00%  8  -
# 7  Extended offline  Completed without error  00%  7  -
# 8  Short offline  Completed without error  00%  0  -

[admin@freenas] /% sudo smartctl -l selftest /dev/ada3
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p8 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error  00%  77  -
# 2  Conveyance offline  Completed without error  00%  70  -
# 3  Short offline  Completed without error  00%  70  -
# 4  Conveyance offline  Completed without error  00%  67  -
# 5  Extended offline  Completed without error  00%  28  -
# 6  Short offline  Completed without error  00%  8  -
# 7  Extended offline  Completed without error  00%  7  -
# 8  Short offline  Completed without error  00%  0  -

[admin@freenas] /% sudo smartctl -l selftest /dev/ada4
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p8 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error  00%  77  -
# 2  Conveyance offline  Completed without error  00%  70  -
# 3  Short offline  Completed without error  00%  70  -
# 4  Conveyance offline  Completed without error  00%  67  -
# 5  Extended offline  Completed without error  00%  28  -
# 6  Short offline  Completed without error  00%  8  -
# 7  Extended offline  Completed without error  00%  7  -
# 8  Short offline  Completed without error  00%  0  -

[admin@freenas] /% sudo smartctl -l selftest /dev/ada5
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p8 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error  00%  77  -
# 2  Conveyance offline  Completed without error  00%  70  -
# 3  Short offline  Completed without error  00%  70  -
# 4  Conveyance offline  Completed without error  00%  67  -
# 5  Extended offline  Completed without error  00%  28  -
# 6  Short offline  Completed without error  00%  8  -
# 7  Extended offline  Completed without error  00%  7  -
# 8  Short offline  Completed without error  00%  0  -
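With six drives, checking the self-test log by eye gets tedious. Here is a rough Python sketch that parses the `smartctl -l selftest` table and reports whether every entry passed. It assumes the columns are separated by two or more spaces, as in the output above; the helper names are hypothetical, and the sample text is copied from the ada0 log.

```python
import re

PASS_STATUS = "Completed without error"


def parse_selftest_log(text):
    """Return one dict per '# N ...' row of a smartctl self-test log."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("#"):
            continue  # skip banner, header and blank lines
        fields = re.split(r"\s{2,}", line)
        if len(fields) != 6:
            continue  # row does not have the expected six columns
        num, description, status, remaining, hours, lba = fields
        entries.append({
            "num": num,
            "test": description,
            "status": status,
            "remaining": remaining,
            "lifetime_hours": int(hours),
            "lba_of_first_error": lba,
        })
    return entries


def all_passed(entries):
    """True only if there is at least one entry and every entry passed."""
    return bool(entries) and all(e["status"] == PASS_STATUS for e in entries)


SAMPLE = """\
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error  00%  77  -
# 2  Conveyance offline  Completed without error  00%  70  -
# 3  Short offline  Completed without error  00%  70  -
"""

if __name__ == "__main__":
    log = parse_selftest_log(SAMPLE)
    print(len(log), all_passed(log))  # 3 True
```

A clean result here only says the drive's own self-tests passed. It does not cover the cable, the controller port, or anything else between the platters and the host.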

cyberjock · Feb 7, 2015

No, that just means the problem isn't your disks....

That's why I said earlier

cyberjock said:
So it probably wasn't the disk. But that begs the question "what was it?"

Basically you have no data to conclusively prove what the problem was. So you can trust the server and assume it was a fluke, or you can choose not to trust the server and run more tests.

I'm not one to believe in flukes as a course of business.

NickB · Feb 7, 2015

I recently ran into a very similar issue with that same drive (WD Red 3 TB drives) that I bought in early January. One was DOA (just clicked). Another reported lots of errors when FreeNAS was doing the initial post.

I used the WD drive diagnostic tool and the drive passed. I reviewed the SMART info and everything looked good.

When I started doing data copies to FreeNAS, the drive would get read errors and was disconnected. After a reboot, the drive came back, but would fail again during a scrub.

I switched the SATA port with a drive that was working, and the drive was still disconnected. Switched it back again to be sure and same thing.

SMART still didn't show any issues. Long story short, I did an RMA and haven't had any issues since.

nick779 · Feb 10, 2015

Just throwing this out there, you don't have a pool on those disks yet, correct?
I had the same issue you did and found out that it was the .system writing to the disks behind my back. I destroyed the pool and reran badblocks and the compare errors totally went away.

NannerMuffin · Feb 10, 2015

nick779 said:
Just throwing this out there, you don't have a pool on those disks yet, correct?

Correct. There is no pool on the disks. These are brand new disks that have not been used in a pool yet.

cyberjock said:
No, that just means the problem isn't your disks....

That's why I said earlier

Basically you have no data to conclusively prove what the problem was. So you can trust the server and assume it was a fluke, or you can choose not to trust the server and run more tests.

To rule out cabling or controller port issues on ada0, I flipped the SATA ports on ada0 & ada1. Badblocks continued to throw errors on the same disk. I RMA'd the disk, flipped the ports back to their original state, and there are no longer badblocks errors after 30+ hours of testing. Seems the disk was the issue. I had assumed this was the case originally, since the number of errors from test 1 (pattern 0xaa) doubled after test 2 (pattern 0x55).

I repeated the same steps for ada5. Flipped the SATA ports on ada4 & ada5. No errors were thrown! Flipped the ports back to their original state and still no errors after 30+ hours of testing! Woot! It must have been a poorly seated cable. I was less suspicious of a bad disk since the number of errors from test 1 (pattern 0xaa) did not increase after test 2 (pattern 0x55) on the original test.
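The "did the count double?" reasoning can be written down as a quick check on the cumulative compare-error counts. This is a rough heuristic sketch, not a diagnostic: it treats the two readings posted for each disk as end-of-pass totals (an assumption, since the later ada0 reading was taken partway through a pass), and the 25% tolerance and function names are arbitrary choices for illustration.

```python
def per_pass_errors(cumulative):
    """Turn cumulative error totals into the number of new errors per pass."""
    deltas = []
    previous = 0
    for count in cumulative:
        deltas.append(count - previous)
        previous = count
    return deltas


def looks_persistent(cumulative, tolerance=0.25):
    """True if each later pass added roughly as many errors as the first one."""
    deltas = per_pass_errors(cumulative)
    if len(deltas) < 2 or deltas[0] == 0:
        return False
    first = deltas[0]
    return all(abs(d - first) <= tolerance * first for d in deltas[1:])


if __name__ == "__main__":
    ada0 = [46871888, 95172957]  # totals posted after pattern 0xaa and 0x55
    ada5 = [5328896, 5328896]

    print(per_pass_errors(ada0))  # [46871888, 48301069]
    print(per_pass_errors(ada5))  # [5328896, 0]
    print(looks_persistent(ada0), looks_persistent(ada5))  # True False
```

Errors that repeat at a similar rate on every pattern point at something persistent, which matched the disk that was RMA'd. A burst that never recurs is more consistent with a one-off event such as a poorly seated cable.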
