Faulty disk working fine in Windows

kaspernordmark

Dabbler
Joined
Jan 5, 2015
Messages
23
I had a 3 TB drive fail on me some years ago.
For some reason I kept it, and when I put it in my Windows 10 box it was working just fine. It passes all the S.M.A.R.T. checks I run, and every other check too.

How could it be that TrueNAS (FreeNAS back then) marked the drive as faulty?
 

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,471
ZFS is pretty aggressive at failing out disks. Due to its CoW nature, it's constantly verifying the checksums of blocks as they are read and looking for signs of corruption. It's possible for drives to start returning bad blocks without failing SMART tests. Without knowing more details, it's hard to say in this specific situation.
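
For anyone following along, those per-device error counters are easy to check from the command line; a minimal example (the pool name is just the one from later in this thread):

Code:
# show only pools that are exhibiting problems
zpool status -x

# full status for a single pool, including per-device READ/WRITE/CKSUM
# counters and, with -v, the names of any files with known errors
zpool status -v Filmer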
 

kaspernordmark

Dabbler
Joined
Jan 5, 2015
Messages
23
ZFS is pretty aggressive at failing out disks. Due to its CoW nature, it's constantly verifying the checksums of blocks as they are read and looking for signs of corruption. It's possible for drives to start returning bad blocks without failing SMART tests. Without knowing more details, it's hard to say in this specific situation.

Ah okay, I was thinking something along those lines.
I'm currently looking through my email alerts to see if I can find any details.

"The volume Filmer state is DEGRADED: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected."


Code:
  pool: Filmer
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0 days 04:34:18 with 0 errors on Mon Nov 19 07:34:26 2018
config:

        NAME                                            STATE     READ WRITE CKSUM
        Filmer                                          DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/fa81c4a1-3641-11e6-be91-90e6ba2e5486  ONLINE       0     0     0
            gptid/fb3b190c-3641-11e6-be91-90e6ba2e5486  ONLINE       0     0     0
            gptid/fbdb72f9-3641-11e6-be91-90e6ba2e5486  DEGRADED     0     0   318  too many errors

errors: No known data errors

-- End of daily output --



So it appears it was checksum errors. I think I did that zpool clear, but the errors came back, and that's when I replaced the drive.
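
For reference, the two actions the status output suggests look roughly like this; the new disk's gptid is just a placeholder, not from my system:

Code:
# reset the error counters and keep using the disk
zpool clear Filmer

# ...or swap the flaky disk for a new one (second gptid is hypothetical)
zpool replace Filmer gptid/fbdb72f9-3641-11e6-be91-90e6ba2e5486 gptid/<new-disk>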
 

Kris Moore

SVP of Engineering
Administrator
Moderator
iXsystems
Joined
Nov 12, 2015
Messages
1,471
Yea, you did the right thing in that case. That's just ZFS warning you that it's getting faulty blocks from the disk, so I'd treat it as suspect unless you can confirm it was something else, like a bad SATA cable or similar.
 

kaspernordmark

Dabbler
Joined
Jan 5, 2015
Messages
23
Yea, you did the right thing in that case. That's just ZFS warning you that it's getting faulty blocks from the disk, so I'd treat it as suspect unless you can confirm it was something else, like a bad SATA cable or similar.

Nah, it was the drive. Everything went back to normal after I swapped it out. I ended up going for 8 TB drives while I was at it.
I just find it weird that no tools in Windows can find anything wrong.


I also found this:


Code:
pool: Filmer
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Wed Dec  5 03:00:14 2018
        78.8G scanned at 1.71G/s, 696K issued at 15.1K/s, 7.27T total
        1.25M repaired, 0.00% done, no estimated completion time
config:

        NAME                                            STATE     READ WRITE CKSUM
        Filmer                                          DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/fa81c4a1-3641-11e6-be91-90e6ba2e5486  ONLINE       0     0     0
            gptid/fb3b190c-3641-11e6-be91-90e6ba2e5486  ONLINE       0     0     0
            gptid/fbdb72f9-3641-11e6-be91-90e6ba2e5486  DEGRADED     0     0   717  too many errors  (repairing)

errors: No known data errors



Code:
scan: scrub repaired 133G in 0 days 05:40:33 with 0 errors on Wed Dec  5 08:40:47 2018
config:

        NAME                                            STATE     READ WRITE CKSUM
        Filmer                                          DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/fa81c4a1-3641-11e6-be91-90e6ba2e5486  ONLINE       0     0     0
            gptid/fb3b190c-3641-11e6-be91-90e6ba2e5486  ONLINE       0     0     0
            gptid/fbdb72f9-3641-11e6-be91-90e6ba2e5486  DEGRADED     0     0 2.14M  too many errors
 

kaspernordmark

Dabbler
Joined
Jan 5, 2015
Messages
23
I was able to find this:
When a drive writes a sector to the platters, it doesn't just write the bits the same way they're stored in RAM - it uses an encoding to make sure there aren't any overly long runs of the same bit, and it adds ECC codes that allow it to repair errors affecting a few bits and to detect errors affecting more than a few bits.

When the drive reads the sector, it checks these ECC codes and repairs the data if necessary and possible. What happens next depends on the circumstances and on the drive's firmware, which is influenced by the drive's designation.

  • If a sector can be read and has no ECC problems, it's passed to the OS
  • If a sector can be repaired easily, the repaired version may be written to disk, read back, and verified, to determine if the error was a random one (cosmic rays ...) or if there is a systematic error with the media
  • If the drive determines there is an error with the media, it reallocates the sector
  • If a sector can be neither read nor corrected after a few read attempts, on a drive that's designated as a RAID drive, the drive will give up, reallocate the sector, and tell the controller there was a problem. It relies on the RAID controller to reconstruct the sector from the other RAID members, and write it back to the failed drive, which then stores it in the reallocated sector that hopefully doesn't have the problem.
  • If a sector can't be read or corrected on a desktop drive, the drive will do a lot more attempts to read it. Depending on the quality of the drive, this might involve repositioning the head, checking if there are any bits that flip when read repeatedly, checking which bits are the weakest, and a few other things. If any of these attempts succeed, the drive will reallocate the sector and write back the repaired data.
(This is one of the main differences between drives that are sold as "Desktop", "NAS/RAID" or "Video surveillance" drives. A RAID drive can just give up quickly and make the controller repair the sector to avoid latency on the user side. A desktop drive will retry again and again, because having the user wait a few seconds is probably better than telling them the data is lost. And a Video drive values constant data rate more than error recovery, as a damaged frame typically won't even be noticed.)
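
As an aside, on drives that support it you can query or set this "give up quickly" behaviour (SCT Error Recovery Control, often called TLER) with smartctl; a sketch, assuming the disk shows up as ada0:

Code:
# show the current SCT ERC read/write timeouts (in tenths of a second)
smartctl -l scterc /dev/ada0

# set both timeouts to 7 seconds, a typical value for RAID/NAS use
smartctl -l scterc,70,70 /dev/ada0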

Anyway, the drive will know if there has been bit rot and will typically recover from it; if it can't, it will tell the controller, which will in turn tell the driver, which will tell the OS. Then it's up to the OS to present this error to the user and act on it. This is why cybernard says

I have never witnessed a single bit error myself, but I have seen plenty of hard drives where entire sectors have failed.
the drive will know there's something wrong with the sector, but it doesn't know which bits have failed. (One single bit that has failed will always be caught by ECC).

Please note that chkdsk and automatically repairing filesystems do not address repairing data within files. They are targeted at corruption within the structure of the filesystem, like a file size that differs between the directory entry and the number of allocated blocks. The self-healing feature of NTFS will detect structural damage and prevent it from affecting your data further, but it will not repair any data that is already damaged.

There are, of course, other reasons why data may become damaged. For example, bad RAM on a controller may alter data before it's even sent to the drive; in that case, no mechanism on the drive will detect or repair the corruption, and this is one way the structure of a filesystem can be damaged. Other reasons include plain software bugs, a power failure while writing to the disk (although this is addressed by filesystem journaling), or bad filesystem drivers (the NTFS driver on Linux defaulted to read-only for a long time, because NTFS was reverse engineered, not documented, and the developers didn't trust their own code).

I saw this scenario once, where an application saved all its files to two different servers in two different data centres, to keep a working copy of the data under all circumstances. After a few months, we noticed that on one of the copies, about 0.1% of all files didn't match the MD5 sum the application stored in its database. It turned out to be a faulty fibre cable between the server and the SAN.
These other reasons are why some filesystems, like ZFS, keep additional checksum information to detect errors. They're designed to protect you from a lot more things that can go wrong than just bit rot.
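
To tie that back to this thread: you can see the checksum setting on a ZFS dataset and kick off the full verification pass (a scrub) that caught my errors; both commands are standard, the pool name is mine:

Code:
# show which checksum algorithm each dataset uses (fletcher4 by default)
zfs get checksum Filmer

# read and verify every allocated block in the pool against its checksum
zpool scrub Filmer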

Yet another reason I'm so glad I went with TrueNAS and ZFS.
 

kaspernordmark

Dabbler
Joined
Jan 5, 2015
Messages
23
Good find on that write-up, and yes, TrueNAS and ZFS are amazing ;)
Absolutely ;)

Seagate's own drive check tool was actually able to detect a problem after running a long test. However, the warranty expired in 2016, so I guess this was one of my first drives, dating back to my first FreeNAS build in 2014. It survived 34,099 hours, though.
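
For the record, the same kind of long test (and the power-on-hours counter) can be driven from the TrueNAS shell with smartctl; the device name here is just an example:

Code:
# start an extended (long) SMART self-test
smartctl -t long /dev/ada0

# after it completes, read the test log and attributes such as
# Power_On_Hours and Reallocated_Sector_Ct
smartctl -a /dev/ada0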

Anyway, I'm gonna use the repair function, then use the drive for my Steam library and see how much longer it lasts :wink:

[Attachment: Segate drive failure.png]

[Attachment: crystal disk info segate drive failure.png]
 

kaspernordmark

Dabbler
Joined
Jan 5, 2015
Messages
23
You've got a running 3TB Seagate drive? Talk about a rare specimen; those drives were well-known for their failure rates.

Not only do I have one, I have four running in a raidz1. They were by far the cheapest per gigabyte back in 2014-2015, so that's why I went for those. One of them was replaced under warranty, and this one failed back in December 2018, but I guess the others won't last much longer.

(Don't diss my 250 GB Maxtor IDE drive :wink: It's a workhorse running all my plugins.)

[Attachment: TrueNAS drive ID.png]
 