Disk Failure or other issue?

Matt84 · Jan 28, 2017

I have a strange disk issue with a single disk volume where it is showing as degraded but the SMART stats aren't reporting any errors.

My system consists of:

Lian Li PC-Q26
Seasonic SS-660XP2
SUPERMICRO MBD-X10SDV-TLN4F-O
SAMSUNG 32GB 288-Pin DDR4 SDRAM Registered DDR4 2133 (M393A4K40BB0-CPB0) x2
LSI LSI00301 (9207-8i) PCI-Express 3.0 x8 Low Profile SATA / SAS Host Controller Card
WD40EFRX (4 disks in RAIDz1 - imported from my old N36L NAS4Free box that has been retired - 2 attached to LSI HBA and 2 attached to Intel SATA)
ST8000AS0002 (6 disks in a RAIDz2 - attached to LSI HBA)
SanDisk Ultra Fit USB (single drive FreeNAS Boot)
WD7500BPVT (1 disk - temporary storage, VBox jail, Plex jail, nothing I can't recreate - the disk with the issue - attached to Intel SATA)

SMART statistics are:

Code:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  0
  3 Spin_Up_Time  0x0027  183  151  021  Pre-fail  Always  -  1816
  4 Start_Stop_Count  0x0032  091  091  000  Old_age  Always  -  9388
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x002e  200  200  000  Old_age  Always  -  0
  9 Power_On_Hours  0x0032  095  095  000  Old_age  Always  -  4060
 10 Spin_Retry_Count  0x0032  100  100  000  Old_age  Always  -  0
 11 Calibration_Retry_Count 0x0032  100  100  000  Old_age  Always  -  0
 12 Power_Cycle_Count  0x0032  099  099  000  Old_age  Always  -  1461
191 G-Sense_Error_Rate  0x0032  001  001  000  Old_age  Always  -  261377
192 Power-Off_Retract_Count 0x0032  200  200  000  Old_age  Always  -  50
193 Load_Cycle_Count  0x0032  138  138  000  Old_age  Always  -  186554
194 Temperature_Celsius  0x0022  113  103  000  Old_age  Always  -  34
196 Reallocated_Event_Count 0x0032  200  200  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  100  253  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x0008  200  200  000  Old_age  Offline  -  0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error  00%  4057  -

The latest results from a scrub on the volume are:

Code:

[root@freenas] ~# zpool status -v vm
  pool: vm
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
  corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
  entire pool from backup.
  see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 1h29m with 9 errors on Sun Jan 29 14:31:33 2017
config:

  NAME  STATE  READ WRITE CKSUM
  vm  DEGRADED  0  0 7.17K
  gptid/b058e8a8-86fc-11e6-893a-0cc47aca83c4  DEGRADED  0  0 14.3K  too many errors

errors: Permanent errors have been detected in the following files:

  /var/db/system/rrd-7f4d67ae16c94917b949456bb9f364ad/localhost/df-mnt-vm/df_complex-used.rrd
  /mnt/vm/jails/VirtualMachines/usr/home/vbox/VirtualBox VMs/SQL Server/SQL Server.vdi
  /mnt/vm/jails/VirtualMachines/usr/home/vbox/VirtualBox VMs/Windows 10 Evaluation/Windows 10 Evaluation.vdi
  /mnt/vm/jails/VirtualMachines/usr/home/vbox/VirtualBox VMs/GitLab/GitLab.vdi
  /mnt/vm/jails/VirtualMachines/usr/home/vbox/VirtualBox VMs/Visual Studio 2017 Test/Visual Studio 2017 Test.vdi

I want to stress that I don't care about the data on this drive as I can easily re-create it. I know a single disk volume is bad for this exact reason. I didn't have the money for an extra two decent SSDs to use as my VM store at the time I built this machine, so I used a spare laptop disk that hadn't really been used much, as it was swapped out for an SSD years ago. I have offline backups of my other two volumes.

What I am trying to figure out is what is causing the corruption. When I have had scrub errors in the past, I also had SMART stats showing reallocated sectors and pending sectors, but in this case from what I can tell the SMART stats are fine, and the drive passed the long test. My memory is ECC and the BIOS is not reporting any errors. Could it be the Intel SATA controller? Two of my 4TB drives are also attached to it and they aren't manifesting any errors. I'm now suspecting the SATA cable; is there any way to prove / disprove this other than changing the cable and assuming it is fixed until the issue comes back?

My case has room for 10 3.5" drives and a single 2.5" drive. I'm almost in a position where I can replace this drive and I am thinking of getting a 500GB 850 EVO SATA and a 500GB 850 EVO M.2 as my board does have an M.2 slot. Would there be any issues in mirroring a SATA drive and an M.2 drive considering they are of the same generation?

Any ideas would be appreciated.

joeschmuck · Jan 29, 2017

I don't think anyone would be able to diagnose this problem with 100% certainty but I suspect that the system was powered down incorrectly and the files never closed properly.

I think you already know that you have issues with using a single laptop hard drive in this capacity. What I see when I look at the SMART data is a drive that has experienced a lot of G-force issues and it's spinning up and down an awful lot.
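
As a rough illustration of that reading of the SMART data, here is a minimal Python sketch that parses a few rows copied from the table earlier in the thread and lists attributes whose normalized VALUE has dropped very low. The cut-off of 50 and the names `parse_smart_table` and `low_value_attrs` are assumptions made up for this example; smartctl itself defines no such cut-off.

```python
from typing import List, NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized VALUE column
    worst: int   # normalized WORST column
    thresh: int  # vendor THRESH column
    raw: str     # RAW_VALUE kept as text, since its format varies by vendor


# A few rows copied from the smartctl output earlier in the thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
  9 Power_On_Hours  0x0032  095  095  000  Old_age  Always  -  4060
191 G-Sense_Error_Rate  0x0032  001  001  000  Old_age  Always  -  261377
193 Load_Cycle_Count  0x0032  138  138  000  Old_age  Always  -  186554
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
"""


def parse_smart_table(text: str) -> List[SmartAttr]:
    """Parse rows laid out as: ID, name, flag, value, worst, thresh,
    type, updated, when_failed, raw."""
    attrs = []
    for line in text.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a 10-column attribute row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs.append(
            SmartAttr(
                attr_id=int(fields[0]),
                name=fields[1],
                value=int(fields[3]),
                worst=int(fields[4]),
                thresh=int(fields[5]),
                raw=fields[9],
            )
        )
    return attrs


def low_value_attrs(attrs: List[SmartAttr], cutoff: int = 50) -> List[str]:
    """Return names of attributes whose normalized value is below the
    (arbitrary, assumed) cutoff."""
    return [a.name for a in attrs if a.value < cutoff]


flagged = low_value_attrs(parse_smart_table(SAMPLE))
print(flagged)
```

With the rows above, only G-Sense_Error_Rate falls under the assumed cut-off. Its THRESH is 000, so the normalized value can sit at 001 without WHEN_FAILED ever being set, which may be why the SMART summary still looks clean.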

Could it be the SATA cable? It could be, and that is the first thing to replace; however, I suspect it's the drive since it was not designed for this type of use.

I can't speak to the M.2 slot question except to caution you to ensure that the M.2 slot is a true free port and that when you use it that it doesn't disable one of the SATA motherboard ports. Read the user manual.

rs225 · Jan 29, 2017

Run a memory test, then check/swap cables. Next, consider whether your power supply is under spec, or failing, or whether too many devices are on a single branch. Finally, suspect the drive.

For testing, you can set copies=2 on a special test dataset and see if the problem becomes less severe. Or, upgrade the vm pool to mirrors, and see if errors hit both drives, which would help confirm a system problem.
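
A minimal sketch of the copies=2 test, written as a dry run in Python that only prints the commands rather than executing them. The dataset name `copies-test` and the helper `copies_test_commands` are hypothetical; the pool name `vm` comes from the output earlier in the thread. Keep in mind that `copies` only applies to data written after the property is set, so test data would need to be copied into the new dataset before scrubbing.

```python
import shlex
from typing import List


def copies_test_commands(pool: str, dataset: str) -> List[List[str]]:
    """Build (but do not run) the commands for a copies=2 test dataset."""
    target = f"{pool}/{dataset}"
    return [
        # Create a dataset that stores two copies of every block written to it.
        ["zfs", "create", "-o", "copies=2", target],
        # After copying some test data in, scrub the pool...
        ["zpool", "scrub", pool],
        # ...and see whether checksum errors were repaired or are permanent.
        ["zpool", "status", "-v", pool],
    ]


commands = copies_test_commands("vm", "copies-test")
for argv in commands:
    print(shlex.join(argv))
```

If errors in that dataset show up as repaired rather than permanent, that suggests individual bad reads rather than everything being corrupted on the way to the disk, but it is a hint, not proof.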

DrKK · Jan 29, 2017

I agree with @joeschmuck. I am very troubled by attribute #191 essentially being pinned. Also, note the drive has experienced approximately 200,000 load cycles. That's quite a lot, and an *insane* amount for a drive with only 4000-ish hours of service. I have some laptop drives here with many hundreds of thousands of load cycles, but they've got a lot more hours on them, and I consider them "scratch" drives, i.e., not suitable for actually storing data that I give a crap about, and I certainly wouldn't put them into service in a file server.
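
To put a number on that, here is a quick back-of-the-envelope calculation in Python using the raw values from the SMART output earlier in the thread. It assumes Power_On_Hours counts all powered time, which can vary by vendor, so treat the result as a rough figure.

```python
# Raw values copied from the SMART output earlier in the thread.
power_on_hours = 4060
load_cycle_count = 186554
start_stop_count = 9388

# Average head load/unload cycles per powered-on hour.
cycles_per_hour = load_cycle_count / power_on_hours

# Average time between head parks, in seconds.
seconds_between_cycles = 3600 / cycles_per_hour

# Average spindle start/stop events per powered-on hour.
start_stops_per_hour = start_stop_count / power_on_hours

print(f"load cycles per power-on hour: {cycles_per_hour:.1f}")
print(f"seconds between load cycles:   {seconds_between_cycles:.0f}")
print(f"start/stops per power-on hour: {start_stops_per_hour:.1f}")
```

That works out to roughly 46 load cycles per hour, or a head park about every 78 seconds on average.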

I guess what I am saying is, your energy to figure out what's going on here is probably wasted, since you shouldn't be using this drive in a NAS anyway, even for data you don't care about.

Matt84 · Jan 30, 2017

Thanks guys, I've come to the same conclusion that this drive / cable is not worth the effort. I checked my motherboard manual and the M.2 slot shares SATA-0, but that is okay as I have 6 SATA ports and only 3 of them are in use. I have two Samsung 850 EVOs, one SATA and one M.2, on the way which I will mirror. I will switch out the SATA cable also just to be sure.

The system has been up for ages so I doubt the issue was caused by a power failure, as the previous fortnightly scrub picked up no errors. The load cycle count reminds me of the WD Green drives I used in my NAS years ago which required IDLE3 adjustment to prevent this. I thought only the Green drives were affected, but after googling my drive model it seems the WD Blue laptop drives also exhibit this behavior. I have no idea about the G-shock sensor, as the drive is secured to the bottom of the case in the 2.5" slot and the server does not get moved when powered on. The only thing I can think of here is that the damage was done whilst the drive was still in the laptop it came with.

Let this be a lesson to anyone wanting to use cheap laptop drives in a NAS. If you are storing anything you want to keep, just don't!

Matt84 · Feb 4, 2017

The two Samsung 850s have arrived and are installed and working well. Running my VMs off of the SSD mirror is so much faster. A reinstall of my Plex and VirtualBox jails didn't take too long so everything is now fine.

It turned out to be the drive after all. I ran tests on it in a USB dock in another PC: IO errors everywhere.

Thanks for your help guys.
