Degraded Volume, discuss before I replace disk

Status
Not open for further replies.

JayG30

Contributor
Joined
Jun 26, 2013
Messages
158
Hello,

Today I noticed one of my FreeNAS servers was in a degraded state. I found out a bit late, it seems, because my email moved the alert messages to "clutter" (sigh). Anyway, I'm just trying to determine whether anyone sees something other than a disk issue.

When I logged into the web GUI (and initially in zpool status on the CLI), that disk showed a few hundred write errors.

dmesg showed:

Code:
(da5:mps0:0:13:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 555 command timeout cm 0xffffff8000b02718 ccb 0xfffffe004101f000
(noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xffffff8000b02718
(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 93 50 00 00 40 00 length 32768 SMID 337 terminated ioc 804b scsi 0 state c xfer 0
(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 93 10 00 00 40 00 length 32768 SMID 363 terminated ioc 804b scsi 0 state c xfer 0
(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 92 d0 00 00 40 00 length 32768 SMID 841 terminated ioc 804b scsi 0 state c xfer 0
(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 92 90 00 00 40 00 length 32768 SMID 220 terminated ioc 804b scsi 0 state c xfer 0
(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 92 50 00 00 40 00 length 32768 SMID 748 terminated ioc 804b scsi 0 state c xfer 0
(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 92 10 00 00 40 00 length 32768 SMID 321 terminated ioc 804b scsi 0 state c xfer 0
(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 91 d0 00 00 40 00 length 32768 SMID 515 terminated ioc 804b scsi 0 state c xfer 0
(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 91 90 00 00 40 00 length 32768 SMID 745 terminated ioc 804b scsi 0 state c xfer 0
(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 91 50 00 00 40 00 length 32768 SMID 868 terminated ioc 804b scsi 0 state c xfer 0
(da5:mps0:0:13:0): WRITE(10). CDB: 2a 00 10 74 8e a0 00 00 40 00 length 32768 SMID 632 terminated ioc 804b scsi 0 state c xfer 0
(da5:mps0:0:13:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 466 terminated ioc 804b scsi 0 state c xfer 0
mps0: IOCStatus = 0x4b while resetting device 0xf
(da5:mps0:0:13:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da5:mps0:0:13:0): CAM status: Command timeout
(da5:mps0:0:13:0): Retrying command
da5 at mps0 bus 0 scbus0 target 13 lun 0
da5: <ATA TOSHIBA MG03ACA3 FL1A> s/n 53K7K7JPF detached
(da5:mps0:0:13:0): Periph destroyed

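For anyone reading the dmesg lines above, the CDB bytes can be decoded by hand. Here is a rough Python sketch (the helper name decode_cdb10 is made up for illustration). It assumes the standard 10-byte SCSI layout: opcode in byte 0, big-endian LBA in bytes 2-5, block count in bytes 7-8. It also assumes 512-byte logical sectors, which is consistent with the "length 32768" the driver prints.

```python
# Sketch only: decode a 10-byte SCSI CDB as printed by the mps driver in dmesg.
# Assumption: 512-byte logical sectors.
SECTOR_BYTES = 512

# Only the two opcodes that appear in the log above.
OPCODES = {
    0x2A: "WRITE(10)",
    0x35: "SYNCHRONIZE CACHE(10)",
}


def decode_cdb10(cdb_hex):
    """Decode a space-separated hex string such as '2a 00 10 74 93 50 00 00 40 00'."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB, got %d bytes" % len(cdb))
    blocks = int.from_bytes(cdb[7:9], "big")
    return {
        "opcode": cdb[0],
        "name": OPCODES.get(cdb[0], "unknown"),
        "lba": int.from_bytes(cdb[2:6], "big"),
        "blocks": blocks,
        "bytes": blocks * SECTOR_BYTES,
    }


if __name__ == "__main__":
    for cdb in ("2a 00 10 74 93 50 00 00 40 00",
                "2a 00 10 74 93 10 00 00 40 00"):
        info = decode_cdb10(cdb)
        print(info["name"], hex(info["lba"]), info["blocks"], info["bytes"])
```

Each failed WRITE(10) is 0x40 blocks long, and most of the neighbouring LBAs in the log differ by exactly 0x40, so these look like contiguous 32 KiB writes from one stream that all failed when the device was reset.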

The volume initially showed the disk as unavailable:
Code:
[root@freenas] ~# zpool status -v store
  pool: store
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 0h26m with 0 errors on Sun Jul 19 00:26:39 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        store                                           DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/1c383e96-d315-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
            gptid/90b50eaf-d315-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
            gptid/284a6fc3-d316-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
            gptid/c66e0391-d317-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
            gptid/14a02475-d318-11e4-98c7-0cc47a335ac4  ONLINE       0     0     0
            559548462891584750                          UNAVAIL      3   246     0  was /dev/gptid/5178ef38-d319-11e4-98c7-0cc47a335ac4

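If you would rather not rely on email alerts landing in the right folder, the config table from zpool status is easy to scan with a script. A minimal Python sketch follows; find_unhealthy is a hypothetical helper, and the sample text is the table above with the whitespace collapsed. The counters are kept as strings because larger counts may be printed in abbreviated form.

```python
# Sketch only: pick out entries in a `zpool status` config table that are not
# ONLINE or that have non-zero error counters.
def find_unhealthy(status_text):
    """Return (name, state, read, write, cksum) tuples for every unclean entry."""
    problems = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NAME":      # header row of the config table
            in_config = True
            continue
        if fields[0] == "errors:":   # trailing summary line, table is over
            break
        if not in_config or len(fields) < 5:
            continue
        name, state, read, write, cksum = fields[:5]
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            problems.append((name, state, read, write, cksum))
    return problems


SAMPLE = """\
config:

    NAME STATE READ WRITE CKSUM
    store DEGRADED 0 0 0
      raidz2-0 DEGRADED 0 0 0
        gptid/1c383e96-d315-11e4-98c7-0cc47a335ac4 ONLINE 0 0 0
        gptid/90b50eaf-d315-11e4-98c7-0cc47a335ac4 ONLINE 0 0 0
        559548462891584750 UNAVAIL 3 246 0 was /dev/gptid/5178ef38-d319-11e4-98c7-0cc47a335ac4
"""

if __name__ == "__main__":
    for entry in find_unhealthy(SAMPLE):
        print(*entry)
```

On the sample this reports the pool, the raidz2 vdev, and the missing member, since a degraded child marks its parents as DEGRADED too.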

I rebooted the server, but nothing changed.

So I had someone on site remove the disk for me, first to get the S/N and second to see if I could online it and have it rebuild itself. After it was removed, the status of the disk changed to "removed". Subsequent reboots of the server have made the volume show as "resilvering", but the disk never came online, even after trying to force it online with the zpool online command.

The disk is now back to showing "unavailable" as shown above.
EDIT: it seems the disk is now back to showing "removed". It seems to be unavailable on initial reboot while it resilvers and then goes to "removed".

Furthermore, I can't even see the disk in smartctl. It just seems like it is being removed, per the dmesg shown above: "(da5:mps0:0:13:0): Periph destroyed". I had hoped to check the smartctl readings, but can't since the disk isn't showing up at all.

My gut says the disk went bad. I filed an RMA for it and will go down to check it tomorrow. But perhaps someone might have an idea.
 

JayG30

Contributor
Joined
Jun 26, 2013
Messages
158
Some more information that I'm sure people will be looking for.
The disks are all the same make/model.
The server was put together in late March/early April.
The server was stress tested using the scripts jgreco posted somewhere in these forums. It had no issues.
This is the first real issue I've had with it.

Specifications:
Code:
Case: SuperMicro CSE-826E16-R1200LPB
Backplane: BPN-SAS2-826EL1
Motherboard: SUPERMICRO MBD-X10SL7-F-O
HBA: onboard LSI 2308 (firmware P16, as recommended by FreeNAS)
CPU: Intel Xeon E3-1231V3
RAM: Crucial CT2KIT102472BD160B (2 x 8GB)
HDD: 6 x Toshiba MG03ACA300 3TB Enterprise SATA
Norco SFF8087 reverse breakout cable


The errors at the top of dmesg look a lot like the ones in THIS thread.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
Yes, the drive in question has joined the choir invisible, and needs to be replaced, sir.
 

JayG30

Contributor
Joined
Jun 26, 2013
Messages
158
lol glad you guys had some fun with your replies. :)
I figured it had gone bad. Surprisingly I've had very few bad drives over the years.
I figured I'd just check if anyone had anything to add in case it might be a backplane, expander, etc. issue.
I'm going down now to check it out and, if necessary, remove the drive and prepare it for RMA shipment.
It's in a RAIDZ2 and currently used for 2nd tier backups, so I'm not too concerned, to be honest.
 

JayG30

Contributor
Joined
Jun 26, 2013
Messages
158
Well, I took the disk out of the server and plugged it into my laptop using one of those multi-format USB adapters I have.
As the disk started up I heard a LOT of noise (and it doesn't go away), which obviously was the first bad sign. It came up, but since it was originally set up in FreeNAS, Windows can't read it. I loaded up magic partition to see if I could wipe it. The disk shows only ~746GB (it's a 3TB disk). Obviously something is wrong. But I was able to format that 746GB of space and use CrystalDiskInfo to check the SMART info.
Health Status: Caution
Reallocated Sectors Count: 100 100 50
Current Pending Sector Count: 100 100 0

So it looks like the disk went bad. Only 6553 power-on hours, it seems; not good.

Code:
----------------------------------------------------------------------------
CrystalDiskInfo 6.5.2 (C) 2008-2015 hiyohiyo
                                Crystal Dew World : http://crystalmark.info/
----------------------------------------------------------------------------
-- Disk List ---------------------------------------------------------------
(2) TOSHIBA MG03ACA300 : 3000.5 GB [1/X/X, jm1] (V=152D, P=2338)
----------------------------------------------------------------------------
(2) TOSHIBA MG03ACA300
----------------------------------------------------------------------------
       Enclosure : TOSHIBA MG03ACA300 USB Device (V=152D, P=2338, jm1)
           Model : TOSHIBA MG03ACA300
        Firmware : FL1A
   Serial Number : 53K7K7JPF
       Disk Size : 3000.5 GB (8.4/137.4/3000.5/801.5)
     Buffer Size : Unknown
     Queue Depth : 32
    # of Sectors : 5860533168
   Rotation Rate : 7200 RPM
       Interface : USB (Serial ATA)
   Major Version : ATA8-ACS
   Minor Version : ----
   Transfer Mode : SATA/150 | SATA/600
  Power On Hours : 6553 hours
  Power On Count : 45 count
     Temperature : 37 C (98 F)
   Health Status : Caution
        Features : S.M.A.R.T., APM, 48bit LBA, NCQ
       APM Level : 0080h [ON]
       AAM Level : ----

-- S.M.A.R.T. --------------------------------------------------------------
ID Cur Wor Thr RawValues(6) Attribute Name
01 _99 _99 _50 000000000000 Read Error Rate
02 100 100 _50 000000000000 Throughput Performance
03 100 100 __1 000000002E6B Spin-Up Time
04 100 100 __0 00000000002D Start/Stop Count
05 100 100 _50 000000001466 Reallocated Sectors Count
07 100 _99 _50 000000000000 Seek Error Rate
08 100 100 _50 000000000000 Seek Time Performance
09 _84 _84 __0 000000001999 Power-On Hours
0A 100 100 _30 000000000000 Spin Retry Count
0C 100 100 __0 00000000002D Power Cycle Count
BF 100 100 __0 000000000002 G-Sense Error Rate
C0 100 100 __0 000000000024 Power-off Retract Count
C1 100 100 __0 00000000003E Load/Unload Cycle Count
C2 100 100 __0 002D000F0025 Temperature
C4 100 100 __0 0000000002BD Reallocation Event Count
C5 100 100 __0 000000000013 Current Pending Sector Count
C6 100 100 __0 000000000000 Uncorrectable Sector Count
C7 200 200 __0 000000000000 UltraDMA CRC Error Count
DC 100 100 __0 000000000000 Disk Shift
DE _84 _84 __0 000000001983 Loaded Hours
DF 100 100 __0 000000000000 Load/Unload Retry Count
E0 100 100 __0 000000000000 Load Friction
E2 100 100 __0 000000000069 Load 'In'-time
F0 __1 __1 __1 000000000010 Head Flying Hours

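A side note on the ~746GB figure: it may come from the USB adapter rather than from the failure itself. This is an assumption, not something confirmed in the thread, but the numbers line up exactly if the bridge only handles 32-bit sector addresses and the sector count wraps at 2^32. A quick Python check using the "# of Sectors" value from the dump above, assuming 512-byte logical sectors:

```python
# Sketch only: check whether a 32-bit LBA wraparound explains the short capacity.
# Assumptions: 512-byte logical sectors, and a USB bridge limited to 2**32 sectors.
SECTOR_BYTES = 512
TOTAL_SECTORS = 5860533168           # "# of Sectors" from the CrystalDiskInfo dump

visible_sectors = TOTAL_SECTORS % 2**32
visible_bytes = visible_sectors * SECTOR_BYTES

full_gb = TOTAL_SECTORS * SECTOR_BYTES / 1e9   # decimal gigabytes
visible_gb = visible_bytes / 1e9               # decimal gigabytes
visible_gib = visible_bytes / 2**30            # binary gibibytes, as Windows reports

if __name__ == "__main__":
    print("full size:    %.1f GB" % full_gb)
    print("visible size: %.1f GB (%.1f GiB)" % (visible_gb, visible_gib))
```

The wrapped size comes out at roughly 801.5 GB, which is about 746 GiB, matching both the last figure in the "Disk Size" line and the size the partition tool showed. So the short capacity by itself is probably not evidence of damage; the SMART raw values are.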
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
You know, you can get smartctl for Windows. It's part of smartmontools. All of us have it on our Windows boxes just for cases like this :)

0x1466 reallocated sectors, with more pending, is definitely way past the pining for the fjords stage. You definitely have the ghost ship spoken of in the Rime of the Ancient Mariner, with its gossamer sails, approaching with Life and Life-In-Death playing dice on the forecastle.
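For reference, the raw values in the CrystalDiskInfo dump are hexadecimal. A short Python sketch (parse_row is a hypothetical helper; the rows are copied from the dump earlier in the thread) converts them to decimal:

```python
# Sketch only: convert CrystalDiskInfo S.M.A.R.T. rows to decimal values.
# Row format assumed: ID Cur Wor Thr RawValues(6) Attribute Name
SMART_ROWS = """\
05 100 100 _50 000000001466 Reallocated Sectors Count
09 _84 _84 __0 000000001999 Power-On Hours
0C 100 100 __0 00000000002D Power Cycle Count
C4 100 100 __0 0000000002BD Reallocation Event Count
C5 100 100 __0 000000000013 Current Pending Sector Count
"""


def parse_row(row):
    """Split one row; the ID and raw value are hex, and '_' is only padding."""
    attr_id, cur, worst, thresh, raw, name = row.split(None, 5)
    return {
        "id": int(attr_id, 16),
        "current": int(cur.lstrip("_")),
        "worst": int(worst.lstrip("_")),
        "threshold": int(thresh.lstrip("_")),
        "raw": int(raw, 16),
        "name": name.strip(),
    }


decoded = {}
for line in SMART_ROWS.splitlines():
    row = parse_row(line)
    decoded[row["name"]] = row["raw"]

if __name__ == "__main__":
    for name, raw in decoded.items():
        print("%-30s %d" % (name, raw))
```

So 0x1466 is 5222 reallocated sectors and 0x13 is 19 still pending, while 0x1999 is the 6553 power-on hours reported in the summary. The normalized "100 100" columns have not moved yet, which is why the raw column is the one to read here.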
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
DrKK said: "definitely way past the pining for the fjords stage."
Be sure to save that one for the next time someone shows up with a dead WD Blue. :D
 