Hi all, I've been running three FN boxes for a couple of years in a small-business setting without much to worry about. Now (finally) I'm having trouble with one of them. Luckily it's not a mission-critical one, just a NAS used to back up VMs and the main FN file server (snapshot replications every 3 hours).
The box with which I'm having trouble is:
Server: IBM System X3200 M2 (old one... I know)
Build: FreeNAS-11.0-U4 (54848d13b)
Platform: Intel(R) Xeon(R) CPU E3110 @ 3.00GHz
Memory: 8152MB
Network: not sure; if you feel it's relevant, please let me know and I'll try to figure it out
HBA: flashed IBM 1015
Disks: 7 SATA WD 3TB Red and an Intel SSD (with capacitor) as a SLOG device (otherwise writing from VMware through NFS is painfully slow)
FN is configured with one volume (pool) made up of a single vdev, a 6-disk RAID-Z2; the remaining HD is a spare. For the boot volume I use a pair of mirrored USB sticks.
I hope I haven't forgotten anything meaningful in the hardware/software description; if I did, just let me know.
I got a couple of alarms regarding pending sectors and decided to replace one disk with the spare drive, and that's when everything started to get complicated. Resilvering started; I thought it was going to be an easy job (I was naive, to say the least...) and forgot to turn the backup off. At 00:00 the backup started (while the resilver was still running) and hit errors on different disks; by 1 am I had 2 FAULTED disks. I stopped the backup, shut the box down, got to the office early and physically replaced one of the FAULTED disks (first I turned the box on and did a detach, then turned it off again, replaced the disk and turned it back on).
The resilver stopped and restarted at least once (I don't know why), and I'm seeing different messages depending on where I look, so I'm a bit confused:
- In the Alert System status (the LED-like indicator at the top right, currently red and CRITICAL) I see messages about unreadable sectors on da0, da3 and da4, as well as a message about the volume status, which says it is online but could be degraded:
Feb 12 09:16:24 freenas2 smartd[7144]: Device: /dev/da4 [SAT], 3 Currently unreadable (pending) sectors
Feb 12 09:16:25 freenas2 smartd[7144]: Device: /dev/da3 [SAT], 2 Currently unreadable (pending) sectors
Feb. 12, 2019, 9:15 a.m. - The volume FN2-Z2 state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
- On the other hand, there are errors in the console log that are not reflected in the System Alerts, and they look like this:
Feb 12 09:16:24 freenas2 smartd[7144]: Device: /dev/da4 [SAT], 3 Currently unreadable (pending) sectors
Feb 12 09:16:25 freenas2 smartd[7144]: Device: /dev/da3 [SAT], 2 Currently unreadable (pending) sectors
Feb 12 09:46:26 freenas2 smartd[7158]: Device: /dev/da0 [SAT], 1 Currently unreadable (pending) sectors
Feb 12 09:51:03 freenas2 (da4:mps0:0:5:0): READ(10). CDB: 28 00 6b c1 27 a0 00 01 00 00
Feb 12 09:51:03 freenas2 (da4:mps0:0:5:0): CAM status: CCB request completed with an error
Feb 12 09:51:03 freenas2 (da4:mps0:0:5:0): Retrying command
Feb 12 09:51:03 freenas2 (da4:mps0:0:5:0): READ(10). CDB: 28 00 6b c1 27 18 00 00 88 00
Feb 12 09:51:03 freenas2 (da4:mps0:0:5:0): CAM status: SCSI Status Error
Feb 12 09:51:03 freenas2 (da4:mps0:0:5:0): SCSI status: Check Condition
Feb 12 09:51:03 freenas2 (da4:mps0:0:5:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Feb 12 09:51:03 freenas2 (da4:mps0:0:5:0): Info: 0x6bc12750
Feb 12 09:51:04 freenas2 (da4:mps0:0:5:0): Error 5, Unretryable error
...
Feb 12 10:31:18 freenas2 (da6:mps0:0:9:0): READ(10). CDB: 28 00 05 fa 35 d8 00 01 00 00
Feb 12 10:31:18 freenas2 (da6:mps0:0:9:0): CAM status: SCSI Status Error
Feb 12 10:31:18 freenas2 (da6:mps0:0:9:0): SCSI status: Check Condition
Feb 12 10:31:18 freenas2 (da6:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Feb 12 10:31:18 freenas2 (da6:mps0:0:9:0): Info: 0x5fa3690
Feb 12 10:31:18 freenas2 (da6:mps0:0:9:0): Error 5, Unretryable error
...
Feb 12 10:40:31 freenas2 (da0:mps0:0:0:0): READ(10). CDB: 28 00 06 7c 9f 70 00 00 28 00
Feb 12 10:40:31 freenas2 (da0:mps0:0:0:0): CAM status: CCB request completed with an error
Feb 12 10:40:31 freenas2 (da0:mps0:0:0:0): Retrying command
Feb 12 10:40:31 freenas2 (da0:mps0:0:0:0): READ(10). CDB: 28 00 06 7e 88 e0 00 00 28 00
Feb 12 10:40:31 freenas2 (da0:mps0:0:0:0): CAM status: SCSI Status Error
Feb 12 10:40:31 freenas2 (da0:mps0:0:0:0): SCSI status: Check Condition
Feb 12 10:40:31 freenas2 (da0:mps0:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Feb 12 10:40:31 freenas2 (da0:mps0:0:0:0): Info: 0x67e88f8
Feb 12 10:40:31 freenas2 (da0:mps0:0:0:0): Error 5, Unretryable error
...
Feb 12 10:41:18 freenas2 (da2:mps0:0:2:0): READ(10). CDB: 28 00 06 f7 26 60 00 01 00 00
Feb 12 10:41:18 freenas2 (da2:mps0:0:2:0): CAM status: CCB request completed with an error
Feb 12 10:41:18 freenas2 (da2:mps0:0:2:0): Retrying command
Feb 12 10:41:18 freenas2 (da2:mps0:0:2:0): READ(10). CDB: 28 00 06 f7 25 70 00 00 f0 00
Feb 12 10:41:18 freenas2 (da2:mps0:0:2:0): CAM status: SCSI Status Error
Feb 12 10:41:18 freenas2 (da2:mps0:0:2:0): SCSI status: Check Condition
Feb 12 10:41:18 freenas2 (da2:mps0:0:2:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Feb 12 10:41:18 freenas2 (da2:mps0:0:2:0): Info: 0x6f725c0
Feb 12 10:41:18 freenas2 (da2:mps0:0:2:0): Error 5, Unretryable error
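In case it helps anyone reading the log above: the READ(10) CDBs can be decoded to see which block range each failed read covered. This is only a rough sketch. It relies on the standard READ(10) layout (opcode 0x28, a 4-byte big-endian LBA in bytes 2-5, a 2-byte transfer length in bytes 7-8), and it assumes that the "Info:" value on each MEDIUM ERROR is the LBA of the failing block. The function names `decode_read10` and `failing_offset` are made up for illustration; the sample values are copied from the log.

```python
def decode_read10(cdb_hex):
    """Return (start_lba, block_count) for a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    start_lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    block_count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return start_lba, block_count


def failing_offset(cdb_hex, info_lba):
    """Offset of info_lba inside the read, or None if it lies outside it.

    Assumption: info_lba (the "Info:" field) is the LBA that could not be read.
    """
    start, count = decode_read10(cdb_hex)
    if not start <= info_lba < start + count:
        return None
    return info_lba - start


# (device, last READ(10) CDB before the MEDIUM ERROR, "Info:" value), taken from the log
SAMPLES = [
    ("da4", "28 00 6b c1 27 18 00 00 88 00", 0x6BC12750),
    ("da6", "28 00 05 fa 35 d8 00 01 00 00", 0x5FA3690),
    ("da0", "28 00 06 7e 88 e0 00 00 28 00", 0x67E88F8),
    ("da2", "28 00 06 f7 25 70 00 00 f0 00", 0x6F725C0),
]

for dev, cdb, info in SAMPLES:
    start, count = decode_read10(cdb)
    print(f"{dev}: read of {count} blocks from LBA {start:#x}, "
          f"failing block at offset {failing_offset(cdb, info)}")
```

If that assumption holds, every "Info:" value in the log falls inside the range of the read that failed, on four different disks (da0, da2, da4 and da6).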
After one of the errors shown above, the resilver restarted out of nowhere (it was at 10.9% and went back to 0% after what looked, in the reports, like a period of disk inactivity).
Even though I feel totally desperate and frustrated, the volume is showing an ONLINE status... (it was DEGRADED not long ago, and I don't know why it went back to ONLINE even though the resilver hasn't finished).
I've been reading, trying to make sense of the situation, but I'm still not sure whether I should be panicking or not.
These are the smartctl -a outputs for each WD Red device, in a file (/dev/da5 is the SLOG drive; /dev/da7 is the newly installed replacement disk and has no SMART tests on it yet; every other disk runs a short test weekly and a long test monthly).
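Alongside the attached SMART output, the smartd lines from the console log can be tallied per device to keep track of the latest pending-sector count for each disk. A minimal sketch, assuming the smartd message format stays exactly as in the lines quoted in this post; `latest_pending` and `PENDING_RE` are hypothetical names of my own, not part of any tool.

```python
import re

# smartd lines copied from the console log in this post
LOG = """\
Feb 12 09:16:24 freenas2 smartd[7144]: Device: /dev/da4 [SAT], 3 Currently unreadable (pending) sectors
Feb 12 09:16:25 freenas2 smartd[7144]: Device: /dev/da3 [SAT], 2 Currently unreadable (pending) sectors
Feb 12 09:46:26 freenas2 smartd[7158]: Device: /dev/da0 [SAT], 1 Currently unreadable (pending) sectors
"""

# Assumed message format: "Device: /dev/daN [SAT], <count> Currently unreadable (pending) sectors"
PENDING_RE = re.compile(
    r"Device: (/dev/\w+) \[SAT\], (\d+) Currently unreadable \(pending\) sectors"
)


def latest_pending(log_text):
    """Map each device to the most recent pending-sector count seen in the log."""
    latest = {}
    for line in log_text.splitlines():
        match = PENDING_RE.search(line)
        if match:
            # Later lines overwrite earlier ones, so the last report wins.
            latest[match.group(1)] = int(match.group(2))
    return latest


for device, count in sorted(latest_pending(LOG).items()):
    print(f"{device}: {count} pending sector(s)")
```

Running it over the full log from time to time would show whether the counts are growing while the resilver runs.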
Thanks a lot and sorry if the post is a bit too long.
(corrected a typo)