Degraded Volume, discuss before I replace disk

JayG30 · Jul 28, 2015

Hello,

Today I noticed that one of my FreeNAS servers was in a degraded state. I found out a bit late, it seems, because my email moved the alert messages to "clutter" (sigh). Anyway, I'm just trying to determine whether anyone sees something other than a disk issue.

When I logged into the web GUI (and initially in zpool status on the CLI), that disk was showing a few hundred write errors.

dmesg showed:

Code:

		(da5:mps0:0:13:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 555 command timeout cm 0xffffff8000b02718 ccb 0xfffffe004101f000
		(noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xffffff8000b02718
		(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 93 50 00 00 40 00 length 32768 SMID 337 terminated ioc 804b scsi 0 state c xfer 0
		(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 93 10 00 00 40 00 length 32768 SMID 363 terminated ioc 804b scsi 0 state c xfer 0
		(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 92 d0 00 00 40 00 length 32768 SMID 841 terminated ioc 804b scsi 0 state c xfer 0
		(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 92 90 00 00 40 00 length 32768 SMID 220 terminated ioc 804b scsi 0 state c xfer 0
		(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 92 50 00 00 40 00 length 32768 SMID 748 terminated ioc 804b scsi 0 state c xfer 0
		(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 92 10 00 00 40 00 length 32768 SMID 321 terminated ioc 804b scsi 0 state c xfer 0
		(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 91 d0 00 00 40 00 length 32768 SMID 515 terminated ioc 804b scsi 0 state c xfer 0
		(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 91 90 00 00 40 00 length 32768 SMID 745 terminated ioc 804b scsi 0 state c xfer 0
		(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 91 50 00 00 40 00 length 32768 SMID 868 terminated ioc 804b scsi 0 state c xfer 0
		(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 8e a0 00 00 40 00 length 32768 SMID 632 terminated ioc 804b scsi 0 state c xfer 0
		(da5:mps0:0:13:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 466 terminated ioc 804b scsi 0 state c xfer 0
mps0: IOCStatus = 0x4b while resetting device 0xf
(da5:mps0:0:13:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da5:mps0:0:13:0): CAM status: Command timeout
(da5:mps0:0:13:0): Retrying command
da5 at mps0 bus 0 scbus0 target 13 lun 0
da5: <ATA TOSHIBA MG03ACA3 FL1A> s/n 53K7K7JPF detached
(da5:mps0:0:13:0): Periph destroyed
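
For anyone reading the log above, here is a minimal Python sketch that decodes the WRITE(10) CDB bytes into a starting block address and a transfer size. The helper name `decode_rw10` is made up for illustration, and the 512-byte logical block size is an assumption, although it agrees with the "length 32768" field in these log lines.

```python
def decode_rw10(cdb_hex: str, block_size: int = 512) -> dict:
    """Decode a 10-byte SCSI READ(10)/WRITE(10) CDB given as space-separated hex."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    opcode = cdb[0]
    if opcode not in (0x28, 0x2A):
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    # Bytes 2-5 hold the logical block address, big-endian.
    lba = int.from_bytes(cdb[2:6], "big")
    # Bytes 7-8 hold the transfer length in blocks, big-endian.
    blocks = int.from_bytes(cdb[7:9], "big")
    return {
        "op": "WRITE(10)" if opcode == 0x2A else "READ(10)",
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * block_size,  # assumes 512-byte logical blocks by default
    }


if __name__ == "__main__":
    # First WRITE(10) line from the dmesg output above.
    print(decode_rw10("2a 00 10 74 93 50 00 00 40 00"))
```

Each failed write in the log is 0x40 blocks, and the addresses sit close together, which fits a run of queued writes being terminated when the controller reset the device.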

The volume initially showed the disk as unavailable:

Code:

[root@freenas] ~# zpool status -v store
  pool: store
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 0h26m with 0 errors on Sun Jul 19 00:26:39 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        store                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/1c383e96-d315-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
            gptid/90b50eaf-d315-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
            gptid/284a6fc3-d316-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
            gptid/c66e0391-d317-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
            gptid/14a02475-d318-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
            559548462891584750                          UNAVAIL      3   246     0  was /dev/gptid/5178ef38-d319-11e4-98c7-0cc47a335ac4
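
If you want to catch this without relying on email alerts, a rough Python sketch like the one below can pick the unhealthy rows out of saved `zpool status` text. The function name `unhealthy_vdevs` is hypothetical, and the parser assumes plain integer error counters as in the output above; it has not been tested against other zpool output variants (for example, abbreviated counters such as "1.2K").

```python
import re

# One config row: indented name, upper-case state, three integer counters,
# and an optional trailing note such as "was /dev/gptid/...".
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)(?:\s+(.*))?$")


def unhealthy_vdevs(status_text: str) -> list:
    """Return rows whose state is not ONLINE or whose error counters are nonzero."""
    found = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header lines, the column header row, blank lines
        name, state, read, write, cksum, note = m.groups()
        counters = (int(read), int(write), int(cksum))
        if state != "ONLINE" or any(counters):
            found.append({
                "name": name,
                "state": state,
                "read": counters[0],
                "write": counters[1],
                "cksum": counters[2],
                "note": note or "",
            })
    return found


if __name__ == "__main__":
    import sys
    # Example use: zpool status -v store | python this_script.py
    for row in unhealthy_vdevs(sys.stdin.read()):
        print(row)
```

On the output above, this would report the pool, the raidz2-0 vdev, and the UNAVAIL member with its 3 read and 246 write errors.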

I rebooted the server, but nothing changed.

So I had someone on site remove the disk for me, first to get the S/N and second to see if I could online it and have it rebuild itself. After removing it, the status of the disk changed to "removed". Subsequent reboots of the server have made the volume show as "resilvering", but the disk never came online, even after trying to force it online with the zpool online command.

The disk is now back to showing "unavailable" as shown above.
EDIT: it seems the disk is now back to showing "removed". It seems to be unavailable on initial reboot while it resilvers and then goes to "removed".

Furthermore, I can't even see the disk in smartctl. It just seems like it is being removed, per the dmesg shown above: "(da5:mps0:0:13:0): Periph destroyed". I had hoped to check the smartctl readings, but can't since the disk isn't showing up at all.

My gut says the disk went bad. I filed an RMA for it and will go down to check it tomorrow. But perhaps someone might have an idea.

JayG30 · Jul 28, 2015

Some more information that I'm sure people will be looking for.
The disks are all the same make/model.
The server was put together in late March/early April.
The server was stress tested using the scripts jgreco posted in these forums somewhere. Had no issues.
This is the first real issue I've had with it.

Specifications:

Code:

Case: SuperMicro CSE-826E16-R1200LPB
Backplane: BPN-SAS2-826EL1
Motherboard: SUPERMICRO MBD-X10SL7-F-O
HBA: onboard LSI 2308 (firmware P16, as recommended by FreeNAS)
CPU: Intel Xeon E3-1231V3
RAM: Crucial CT2KIT102472BD160B (2 x 8GB)
HDD: 6 x Toshiba MG03ACA300 3TB Enterprise SATA
Norco SFF8087 reverse breakout cable

The errors at the top of dmesg look a lot like the ones in THIS thread.

Robert Trevellyan · Jul 28, 2015

JayG30 said:
(da5:mps0:0:13:0): Periph destroyed

This is the message you would get if you pulled the plug on that drive; therefore, my guess is the drive is essentially dead.

DrKK · Jul 28, 2015

Yes, the drive in question has joined the choir invisible and needs to be replaced, sir.

anodos · Jul 29, 2015

DrKK said:
Yes, the drive in question has joined the choir invisible and needs to be replaced, sir.

It's probably pining for the fjords.

Robert Trevellyan · Jul 29, 2015

It is an ex-drive.

JayG30 · Jul 29, 2015

lol glad you guys had some fun with your replies. :)
I figured it had gone bad. Surprisingly I've had very few bad drives over the years.
I figured I'd just check if anyone had anything to add in case it might be a backplane, expander, etc. issue.
I'm going down now to check it out and if necessary remove the drive and prepare for RMA shipment.
It's in a RAIDZ2 and used for 2nd tier backups currently, so I'm not too concerned, to be honest.

JayG30 · Jul 29, 2015

Well, I took the disk out of the server and plugged it into my laptop using one of those multi-format USB adapter things I have.
As the disk started up I heard a LOT of noise (and it doesn't go away), which obviously was the first bad sign. It spun up, but since it was originally set up in FreeNAS, Windows can't see it. I loaded up magic partition to see if I could wipe it out. The disk shows only ~746GB (it's a 3TB disk). Obviously something is wrong. But I was able to format that 746GB of space and use CrystalDiskInfo to check the SMART info.
Health Status: Caution
Reallocated Sectors Count: 100 100 50
Current Pending Sector Count: 100 100 0

So it looks like the disk went bad. Only 6553 hours, it seems; not good.
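
A hedged side note on the ~746GB figure: it may come from the USB adapter or the partitioning tool truncating the sector count to 32 bits, rather than from the failure itself. That is an assumption, not something confirmed in this thread, but the arithmetic lines up with the sector count and the 801.5 figure CrystalDiskInfo reports. A small Python sketch (the helper name and the 512-byte sector size are assumptions):

```python
SECTOR_BYTES = 512        # assumed logical sector size
SECTORS = 5860533168      # "# of Sectors" from the CrystalDiskInfo report


def visible_bytes_32bit(total_sectors: int, sector_bytes: int = SECTOR_BYTES) -> int:
    """Capacity seen if the sector count is truncated to 32 bits."""
    return (total_sectors % 2**32) * sector_bytes


if __name__ == "__main__":
    full = SECTORS * SECTOR_BYTES
    wrapped = visible_bytes_32bit(SECTORS)
    print(f"full capacity:    {full} bytes")
    print(f"32-bit truncated: {wrapped} bytes")
    print(f"in GiB:           {wrapped / 2**30:.2f}")
```

The truncated size works out to roughly 801.5 GB, or about 746.5 GiB, which is what Windows would label as "746GB". The SMART data is still what says the disk is unhealthy.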

Code:

----------------------------------------------------------------------------
CrystalDiskInfo 6.5.2 (C) 2008-2015 hiyohiyo
                                Crystal Dew World : http://crystalmark.info/
----------------------------------------------------------------------------
-- Disk List ---------------------------------------------------------------
(2) TOSHIBA MG03ACA300 : 3000.5 GB [1/X/X, jm1] (V=152D, P=2338)
----------------------------------------------------------------------------
(2) TOSHIBA MG03ACA300
----------------------------------------------------------------------------
       Enclosure : TOSHIBA MG03ACA300 USB Device (V=152D, P=2338, jm1)
           Model : TOSHIBA MG03ACA300
        Firmware : FL1A
   Serial Number : 53K7K7JPF
       Disk Size : 3000.5 GB (8.4/137.4/3000.5/801.5)
     Buffer Size : Unknown
     Queue Depth : 32
    # of Sectors : 5860533168
   Rotation Rate : 7200 RPM
       Interface : USB (Serial ATA)
   Major Version : ATA8-ACS
   Minor Version : ----
   Transfer Mode : SATA/150 | SATA/600
  Power On Hours : 6553 hours
  Power On Count : 45 count
     Temperature : 37 C (98 F)
   Health Status : Caution
        Features : S.M.A.R.T., APM, 48bit LBA, NCQ
       APM Level : 0080h [ON]
       AAM Level : ----

-- S.M.A.R.T. --------------------------------------------------------------
ID Cur Wor Thr RawValues(6) Attribute Name
01 _99 _99 _50 000000000000 Read Error Rate
02 100 100 _50 000000000000 Throughput Performance
03 100 100 __1 000000002E6B Spin-Up Time
04 100 100 __0 00000000002D Start/Stop Count
05 100 100 _50 000000001466 Reallocated Sectors Count
07 100 _99 _50 000000000000 Seek Error Rate
08 100 100 _50 000000000000 Seek Time Performance
09 _84 _84 __0 000000001999 Power-On Hours
0A 100 100 _30 000000000000 Spin Retry Count
0C 100 100 __0 00000000002D Power Cycle Count
BF 100 100 __0 000000000002 G-Sense Error Rate
C0 100 100 __0 000000000024 Power-off Retract Count
C1 100 100 __0 00000000003E Load/Unload Cycle Count
C2 100 100 __0 002D000F0025 Temperature
C4 100 100 __0 0000000002BD Reallocation Event Count
C5 100 100 __0 000000000013 Current Pending Sector Count
C6 100 100 __0 000000000000 Uncorrectable Sector Count
C7 200 200 __0 000000000000 UltraDMA CRC Error Count
DC 100 100 __0 000000000000 Disk Shift
DE _84 _84 __0 000000001983 Loaded Hours
DF 100 100 __0 000000000000 Load/Unload Retry Count
E0 100 100 __0 000000000000 Load Friction
E2 100 100 __0 000000000069 Load 'In'-time
F0 __1 __1 __1 000000000010 Head Flying Hours

DrKK · Jul 29, 2015

You know, you can get smartctl for Windows. "smartmontools" or something, it's called. All of us have it on our Windows boxes just for cases like this :)

0x1466 reallocated sectors, with more pending, is definitely way past the pining for the fjords stage. You definitely have the ghost ship spoken of in the Rime of the Ancient Mariner, with its gossamer sails, approaching with Life and Life-In-Death playing dice on the forecastle.

Ericloewe · Jul 31, 2015

DrKK said:
definitely way past the pining for the fjords stage.

Be sure to save that one for the next time someone shows up with a dead WD Blue. :D
