Problems while resilvering

Matute · Feb 12, 2019

Hi all, I've been running 3 boxes with FN for a couple of years without much to worry about in a small business scenario. Now (finally) I'm having some trouble with one of the boxes. Luckily it's not a mission-critical one, just a NAS used to back up VMs and the main FN file server (snapshot replications every 3 hours).

The box with which I'm having trouble is:
Server: IBM System X3200 M2 (old one... I know)
Build: FreeNAS-11.0-U4 (54848d13b)
Platform: Intel(R) Xeon(R) CPU E3110 @ 3.00GHz
Memory: 8152MB
Network: not sure it's relevant; if you feel it is, please let me know and I'll try to figure it out
HBA: flashed IBM 1015
Disks: 7 SATA WD 3TB Red and an Intel SSD (with capacitor) as a SLOG device (otherwise writing from VMWare through NFS is painfully slow)

FN is configured with one pool made up of a single vdev, a 6-HD RAID-Z2; the other HD is a spare. Also, for the boot volume I use a couple of mirrored USBs.

I hope I haven't forgotten anything meaningful in the hardware/software description; if I did, just let me know.

I got a couple of alarms about pending sectors and decided to replace one disk with the spare drive; that's when everything started to get complicated. Resilvering started. I thought it was going to be an easy job (I was very naive, to say the least...) and forgot to turn the backup off. At 00:00 the backup started (while resilvering) and I got errors on different disks; by 1 am I had 2 FAULTED disks. I stopped the backup, shut the box down, got to the office early and physically replaced one of the FAULTED disks (first I turned the box on, did a detach, turned it off again, replaced the disk and turned it back on).

Resilver stopped and restarted at least once (I don't know why) and I'm seeing different messages depending on where I look, so I'm a bit confused:
- In the Alert System status (top right LED-like indicator, red and CRITICAL) I see messages about unreadable sectors on da0, da3 and da4, as well as a message about the volume status, which says it is online but could be degraded:

Feb 12 09:16:24 freenas2 smartd[7144]: Device: /dev/da4 [SAT], 3 Currently unreadable (pending) sectors
Feb 12 09:16:25 freenas2 smartd[7144]: Device: /dev/da3 [SAT], 2 Currently unreadable (pending) sectors
Feb. 12, 2019, 9:15 a.m. - The volume FN2-Z2 state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

- On the other hand there are errors on the console log which are not reflected in the System Alerts, and they look like this:

Feb 12 09:16:24 freenas2 smartd[7144]: Device: /dev/da4 [SAT], 3 Currently unreadable (pending) sectors
Feb 12 09:16:25 freenas2 smartd[7144]: Device: /dev/da3 [SAT], 2 Currently unreadable (pending) sectors
Feb 12 09:46:26 freenas2 smartd[7158]: Device: /dev/da0 [SAT], 1 Currently unreadable (pending) sectors

Feb 12 09:51:03 freenas2 (da4:mps0:0:5:0): READ(10). CDB: 28 00 6b c1 27 a0 00 01 00 00
Feb 12 09:51:03 freenas2 (da4:mps0:0:5:0): CAM status: CCB request completed with an error
Feb 12 09:51:03 freenas2 (da4:mps0:0:5:0): Retrying command
Feb 12 09:51:03 freenas2 (da4:mps0:0:5:0): READ(10). CDB: 28 00 6b c1 27 18 00 00 88 00
Feb 12 09:51:03 freenas2 (da4:mps0:0:5:0): CAM status: SCSI Status Error
Feb 12 09:51:03 freenas2 (da4:mps0:0:5:0): SCSI status: Check Condition
Feb 12 09:51:03 freenas2 (da4:mps0:0:5:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Feb 12 09:51:03 freenas2 (da4:mps0:0:5:0): Info: 0x6bc12750
Feb 12 09:51:04 freenas2 (da4:mps0:0:5:0): Error 5, Unretryable error
...
Feb 12 10:31:18 freenas2 (da6:mps0:0:9:0): READ(10). CDB: 28 00 05 fa 35 d8 00 01 00 00
Feb 12 10:31:18 freenas2 (da6:mps0:0:9:0): CAM status: SCSI Status Error
Feb 12 10:31:18 freenas2 (da6:mps0:0:9:0): SCSI status: Check Condition
Feb 12 10:31:18 freenas2 (da6:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Feb 12 10:31:18 freenas2 (da6:mps0:0:9:0): Info: 0x5fa3690
Feb 12 10:31:18 freenas2 (da6:mps0:0:9:0): Error 5, Unretryable error
...
Feb 12 10:40:31 freenas2 (da0:mps0:0:0:0): READ(10). CDB: 28 00 06 7c 9f 70 00 00 28 00
Feb 12 10:40:31 freenas2 (da0:mps0:0:0:0): CAM status: CCB request completed with an error
Feb 12 10:40:31 freenas2 (da0:mps0:0:0:0): Retrying command
Feb 12 10:40:31 freenas2 (da0:mps0:0:0:0): READ(10). CDB: 28 00 06 7e 88 e0 00 00 28 00
Feb 12 10:40:31 freenas2 (da0:mps0:0:0:0): CAM status: SCSI Status Error
Feb 12 10:40:31 freenas2 (da0:mps0:0:0:0): SCSI status: Check Condition
Feb 12 10:40:31 freenas2 (da0:mps0:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Feb 12 10:40:31 freenas2 (da0:mps0:0:0:0): Info: 0x67e88f8
Feb 12 10:40:31 freenas2 (da0:mps0:0:0:0): Error 5, Unretryable error
...
Feb 12 10:41:18 freenas2 (da2:mps0:0:2:0): READ(10). CDB: 28 00 06 f7 26 60 00 01 00 00
Feb 12 10:41:18 freenas2 (da2:mps0:0:2:0): CAM status: CCB request completed with an error
Feb 12 10:41:18 freenas2 (da2:mps0:0:2:0): Retrying command
Feb 12 10:41:18 freenas2 (da2:mps0:0:2:0): READ(10). CDB: 28 00 06 f7 25 70 00 00 f0 00
Feb 12 10:41:18 freenas2 (da2:mps0:0:2:0): CAM status: SCSI Status Error
Feb 12 10:41:18 freenas2 (da2:mps0:0:2:0): SCSI status: Check Condition
Feb 12 10:41:18 freenas2 (da2:mps0:0:2:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Feb 12 10:41:18 freenas2 (da2:mps0:0:2:0): Info: 0x6f725c0
Feb 12 10:41:18 freenas2 (da2:mps0:0:2:0): Error 5, Unretryable error

After one of the errors shown above, the resilver process restarted out of nowhere (it was at 10.9% and went back to 0% after what looked in the reports like a period of disk inactivity).
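To see which disks the errors cluster on, the sense lines in the console log can be tallied per device. This is only a minimal Python sketch: it assumes the log keeps the `(daN:mps0:...)` prefix shown above, and the name `tally_sense_errors` is made up for illustration.

```python
import re
from collections import Counter

# Matches the CAM sense lines shown above, e.g.
# "(da4:mps0:0:5:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)"
SENSE_RE = re.compile(r"\((da\d+):[^)]*\): SCSI sense: ([A-Z ]+?) asc:")

def tally_sense_errors(log_text):
    """Count SCSI sense errors per (device, sense key) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = SENSE_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts

# Four sense lines copied from the console log above.
sample = """\
Feb 12 09:51:03 freenas2 (da4:mps0:0:5:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Feb 12 10:31:18 freenas2 (da6:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Feb 12 10:40:31 freenas2 (da0:mps0:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Feb 12 10:41:18 freenas2 (da2:mps0:0:2:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
"""

for (device, sense), n in sorted(tally_sense_errors(sample).items()):
    print(f"{device}: {n} x {sense}")
```

Fed the whole console log instead of this four-line sample, the same tally would show whether one disk dominates or the errors are spread over several.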

Even though I feel totally desperate and frustrated, the volume is showing an ONLINE status... (it was DEGRADED not long ago; I don't know why it went back to ONLINE even though the resilver hasn't finished).

I've been reading, trying to sort out the scenario, but I'm still not sure whether I should be panicking or not.

These are the smartctl -a outputs for each WD Red device, in a file (/dev/da5 is the SLOG drive; /dev/da7 is the newly replaced disk and has no SMART tests on it yet; every other disk runs a short test weekly and a long test monthly).

Thanks a lot and sorry if the post is a bit too long.
(corrected a typo)

Matute · Feb 12, 2019

Update: I found something, maybe it's obvious... It was almost 30% into the resilver process when it suddenly restarted. It's connected with this message I got in the log:

Code:

Feb 12 13:14:47 freenas2 (da3:mps0:0:3:0): WRITE(10). CDB: 2a 00 24 0e 57 98 00 00 30 00 
Feb 12 13:14:47 freenas2 (da3:mps0:0:3:0): CAM status: SCSI Status Error
Feb 12 13:14:47 freenas2 (da3:mps0:0:3:0): SCSI status: Check Condition
Feb 12 13:14:47 freenas2 (da3:mps0:0:3:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
Feb 12 13:14:47 freenas2 (da3:mps0:0:3:0): Info: 0x240e5798
Feb 12 13:14:47 freenas2 (da3:mps0:0:3:0): Error 22, Unretryable error
Feb 12 13:14:53 freenas2 ZFS: vdev state changed, pool_guid=2621212020598560952 vdev_guid=16582672447477589141

After that message, resilvering restarted. zpool status prior to the error and after the error:
BEFORE:

Code:

zpool status FN2-Z2
  pool: FN2-Z2
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Feb 12 10:31:25 2019
        2.54T scanned out of 8.81T at 277M/s, 6h35m to go
        439G resilvered, 28.87% done
config:

    NAME                                            STATE     READ WRITE CKSUM
    FN2-Z2                                          ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/076d154f-250f-11e7-987e-00215e46bcea  ONLINE       0     0     0  (resilvering)
        gptid/082f7aae-250f-11e7-987e-00215e46bcea  ONLINE       0     0     0
        gptid/08f29b1d-250f-11e7-987e-00215e46bcea  ONLINE       0     0     0  (resilvering)
        gptid/55a65f54-6a8c-11e8-997d-00215e46bcea  ONLINE       0     1     0  (resilvering)
        gptid/89d627ab-2e43-11e9-9c39-00215e46bcea  ONLINE       0     0     0  (resilvering)
        gptid/0b24ab56-250f-11e7-987e-00215e46bcea  ONLINE       0     0     0
    logs
      gptid/6c7cafab-31dc-11e7-922b-00215e46bcea    ONLINE       0     0     0
    spares
      gptid/eaaf51cf-2ebf-11e9-a92d-00215e46bcea    AVAIL   

errors: No known data errors

AFTER:

Code:

zpool status FN2-Z2
  pool: FN2-Z2
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Feb 12 13:15:00 2019
        9.50G scanned out of 8.81T at 62.4M/s, 41h6m to go
        1.52G resilvered, 0.11% done
config:

    NAME                                            STATE     READ WRITE CKSUM
    FN2-Z2                                          ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/076d154f-250f-11e7-987e-00215e46bcea  ONLINE       0     0     0
        gptid/082f7aae-250f-11e7-987e-00215e46bcea  ONLINE       0     0     0
        gptid/08f29b1d-250f-11e7-987e-00215e46bcea  ONLINE       0     0     0
        gptid/55a65f54-6a8c-11e8-997d-00215e46bcea  ONLINE       0     1     0  (resilvering)
        gptid/89d627ab-2e43-11e9-9c39-00215e46bcea  ONLINE       0     1     0  (resilvering)
        gptid/0b24ab56-250f-11e7-987e-00215e46bcea  ONLINE       0     0     0
    logs
      gptid/6c7cafab-31dc-11e7-922b-00215e46bcea    ONLINE       0     0     0
    spares
      gptid/eaaf51cf-2ebf-11e9-a92d-00215e46bcea    AVAIL   

errors: No known data errors

Is this a standard behaviour?
Thanks.

Chris Moore · Feb 12, 2019

Matute said:
Build: FreeNAS-11.0-U4 (54848d13b)

Later, not right now, you might want to upgrade to 11.1-U7, which is the latest stable version. Don't go to 11.2 yet because there are still too many bugs in that release.

Matute said:
for the boot volume I use a couple of mirrored USBs.

Are you keeping a backup of the config.db file? USB boot media can fail without warning.

Matute said:
Is this a standard behaviour?

Not really. I bet these drives are very old.
How many drives are you trying to replace at the same time?
At this point, you need to let the resilver finish. On a RAIDz2 pool, you can legitimately replace two drives at the same time, but it leaves you with no redundancy. The errors you are showing us are 'just' bad sectors, so you should be fine to let one drive resilver, then replace the next, then replace the next. Which ones are you replacing now?
I would have gone with:
First: Feb 12 09:16:24 freenas2 smartd[7144]: Device: /dev/da4 [SAT], 3 Currently unreadable (pending) sectors
Second: Feb 12 09:16:25 freenas2 smartd[7144]: Device: /dev/da3 [SAT], 2 Currently unreadable (pending) sectors
Come back and pick up the third one later.
Didn't you have a hot-spare in the system?
Did it ever do anything?

Chris Moore · Feb 12, 2019

You are getting errors on several drives, so it is probably time to look at replacing them all, one at a time.

Matute · Feb 12, 2019

Chris, thanks a lot for your reply. I'm keeping a copy of the config even though I have a mirrored USB for booting (just in case...)
I'm trying to replace only one at a time, otherwise it would be like having a Z1 which I thought was highly risky with 3TB HDs or am I wrong?
Disks are not "so" old, about 1.8 years, or 15700 hours. All of them were spent in a sort of data center (temperature control, running 24x7, UPS, etc.). Since they have a 3-year warranty, I assume they are not so old.
Regarding the hot spare, it automatically went live when I took one of the faulty disks offline.
My plan now is to hope for the resilver to finish, and then start replacing the other disks. The one thing that worries me is that every now and then I get an error (in the log) such as:

Code:

Feb 12 15:17:32 freenas2 (da6:mps0:0:9:0): READ(10). CDB: 28 00 05 fa 38 d8 00 01 00 00 
Feb 12 15:17:32 freenas2 (da6:mps0:0:9:0): CAM status: SCSI Status Error
Feb 12 15:17:32 freenas2 (da6:mps0:0:9:0): SCSI status: Check Condition
Feb 12 15:17:32 freenas2 (da6:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Feb 12 15:17:32 freenas2 (da6:mps0:0:9:0): Info: 0x5fa3960
Feb 12 15:17:32 freenas2 (da6:mps0:0:9:0): Error 5, Unretryable error
Feb 12 15:17:38 freenas2 ZFS: vdev state changed, pool_guid=2621212020598560952 vdev_guid=2858672781549027275
Feb 12 15:17:51 freenas2     (da6:mps0:0:9:0): READ(10). CDB: 28 00 05 fa 62 a8 00 01 00 00 length 131072 SMID 676 terminated ioc 804b scsi 0 state 0 xfer 0
Feb 12 15:17:51 freenas2 (da6:mps0:0:9:0): READ(10). CDB: 28 00 05 fa 62 a8 00 01 00 00 
Feb 12 15:17:51 freenas2 (da6:mps0:0:9:0): CAM status: CCB request completed with an error
Feb 12 15:17:51 freenas2 (da6:mps0:0:9:0): Retrying command
Feb 12 15:17:51 freenas2 (da6:mps0:0:9:0): READ(10). CDB: 28 00 05 fa 61 a8 00 01 00 00 
Feb 12 15:17:51 freenas2 (da6:mps0:0:9:0): CAM status: SCSI Status Error
Feb 12 15:17:51 freenas2 (da6:mps0:0:9:0): SCSI status: Check Condition
Feb 12 15:17:51 freenas2 (da6:mps0:0:9:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Feb 12 15:17:51 freenas2 (da6:mps0:0:9:0): Info: 0x5fa61e0
Feb 12 15:17:51 freenas2 (da6:mps0:0:9:0): Error 5, Unretryable error
Feb 12 15:17:51 freenas2 ZFS: vdev state changed, pool_guid=2621212020598560952 vdev_guid=2858672781549027275

and the resilver restarts.
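One way to check that the restarts really follow the disk errors is to pair each "vdev state changed" line with the last device error logged shortly before it. A rough Python sketch, assuming the syslog format shown above; the function name, the 60-second window and the year (syslog stamps carry none) are all assumptions made for illustration:

```python
import re
from datetime import datetime

ERROR_RE = re.compile(r"^(\w{3} +\d+ [\d:]{8}) \S+ +\((da\d+):.*Unretryable error")
VDEV_RE = re.compile(r"^(\w{3} +\d+ [\d:]{8}) \S+ ZFS: vdev state changed")

def parse_time(stamp, year=2019):
    # Syslog stamps have no year; 2019 is assumed from the post dates.
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

def errors_before_state_changes(log_text, window_seconds=60):
    """Pair each 'vdev state changed' line with the last device error
    logged at most window_seconds before it."""
    last_error = None  # (timestamp, device)
    pairs = []
    for line in log_text.splitlines():
        m = ERROR_RE.match(line)
        if m:
            last_error = (parse_time(m.group(1)), m.group(2))
            continue
        m = VDEV_RE.match(line)
        if m and last_error:
            delay = (parse_time(m.group(1)) - last_error[0]).total_seconds()
            if 0 <= delay <= window_seconds:
                pairs.append((last_error[1], int(delay)))
    return pairs

# Lines copied from the log excerpt above.
sample = """\
Feb 12 15:17:32 freenas2 (da6:mps0:0:9:0): Error 5, Unretryable error
Feb 12 15:17:38 freenas2 ZFS: vdev state changed, pool_guid=2621212020598560952 vdev_guid=2858672781549027275
Feb 12 15:17:51 freenas2 (da6:mps0:0:9:0): Error 5, Unretryable error
Feb 12 15:17:51 freenas2 ZFS: vdev state changed, pool_guid=2621212020598560952 vdev_guid=2858672781549027275
"""

for device, delay in errors_before_state_changes(sample):
    print(f"vdev state change {delay}s after an unretryable error on {device}")
```

This only shows that the two events sit close together in time; it does not prove that the error caused the state change.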

One thing I would love to know is whether everything that was written before the restart is then rewritten or is kept for good.

Thanks again for the support.

Chris Moore · Feb 12, 2019

Matute said:
like having a Z1 which I thought was highly risky with 3TB HDs or am I wrong?

Certainly, RAIDz1 is not something I use for storage that is important.

Matute said:
Disks are not "so" old, about 1.8 years, or 15700 hours.

That isn't very old but it does depend on exactly what model we are talking about. These are the WDC WD30EFRX drives that should have a 3 year warranty. They are young enough that they should still be in warranty, but I was looking at the errors and I would say that they need to go back. This must have been a bad batch or something.
You might want to set up a monitoring script to send reports because some of these errors should have triggered a replacement before now.

Github repository for FreeNAS scripts, including disk burnin
https://forums.freenas.org/index.ph...for-freenas-scripts-including-disk-burnin.28/

You don't want to wait until FreeNAS decides the drive is failed. You have to look at these numbers and decide for yourself when a drive is not performing up to standard.

You have four drives that are exhibiting a non-zero number for this element:

Code:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       4

On a Western Digital drive, when this is not zero, there is a problem.

The next thing is this:

Code:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 196 Reallocated_Event_Count 0x0032   103   103   000    Old_age   Always       -       97

As soon as a drive has a reallocated sector, it should be sent back for warranty replacement and you have a couple that are non-zero values.

The next thing is this:

Code:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       4

If you have pending sectors, it is probably a good idea to be proactive about sending those back for warranty replacement too and you have a couple that are non-zero values there also.

Basically, these drives were left running with little monitoring until FreeNAS just couldn't take it any more, and the least little bit of stress, trying to rebuild the pool while also doing a backup, caused more errors.
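The three checks above can be automated against the attribute table that smartctl prints. A minimal Python sketch, assuming the ten-column layout shown in the excerpts; `flag_smart_attributes` is a made-up name, and treating any non-zero raw value as a warning is the rule of thumb from this post, not a vendor specification:

```python
# Attribute IDs called out above: 196 Reallocated_Event_Count,
# 197 Current_Pending_Sector, 200 Multi_Zone_Error_Rate.
WATCHED_IDS = {196, 197, 200}

def flag_smart_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes
    whose raw value is non-zero."""
    flagged = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged

# Rows copied from the excerpts above.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
196 Reallocated_Event_Count 0x0032   103   103   000    Old_age   Always       -       97
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       4
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       4
"""

for name, raw in flag_smart_attributes(sample).items():
    print(f"WARNING: {name} raw value is {raw}")
```

A script like this could run from cron against each disk's smartctl output and mail the result, so a non-zero value gets noticed well before the pool degrades.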

Chris Moore · Feb 12, 2019

PS. See this:

Code:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     15724         -
# 2  Extended offline    Completed: read failure       90%     15707         105818168
# 3  Short offline       Completed: read failure       10%     15556         105809112
# 4  Short offline       Completed without error       00%     15388         -
# 5  Short offline       Completed without error       00%     15220         -
# 6  Short offline       Completed without error       00%     15052         -
# 7  Extended offline    Completed: read failure       90%     15035         4343656
# 8  Short offline       Completed without error       00%     14884         -
# 9  Short offline       Completed without error       00%     14717         -
#10  Short offline       Completed without error       00%     14549         -
#11  Short offline       Completed without error       00%     14381         -
#12  Short offline       Completed: read failure       10%     14213         105812152
#13  Extended offline    Completed without error       00%     14204         -
#14  Short offline       Completed without error       00%     14045         -
#15  Short offline       Completed without error       00%     13877         -
#16  Short offline       Completed without error       00%     13711         -
#17  Short offline       Completed without error       00%     13544         -
#18  Extended offline    Completed: read failure       90%     13527         105818600
#19  Short offline       Completed without error       00%     13376         -
#20  Short offline       Completed without error       00%     13208         -
#21  Short offline       Completed without error       00%     13040         -
1 of 5 failed self-tests are outdated by newer successful extended offline self-test #13

If you have a drive that fails a single self test, it should immediately be replaced. This drive has been failing for at least three months and nobody was looking and that isn't the only one...
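The "at least three months" figure can be read straight off the LifeTime(hours) column. A small Python sketch that pulls failed tests out of the self-test log, assuming the row layout shown above (`parse_self_tests` is a hypothetical helper name):

```python
import re

# One row of the self-test table, e.g.
# "# 2  Extended offline    Completed: read failure       90%     15707         105818168"
ROW_RE = re.compile(
    r"^#\s*(\d+)\s+(Short offline|Extended offline)\s+"
    r"(Completed.*?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)

def parse_self_tests(log_text):
    """Parse self-test rows into dicts; 'failed' is True for any status
    other than 'Completed without error'."""
    tests = []
    for line in log_text.splitlines():
        m = ROW_RE.match(line.strip())
        if not m:
            continue
        status = m.group(3)
        tests.append({
            "num": int(m.group(1)),
            "kind": m.group(2),
            "status": status,
            "hours": int(m.group(5)),
            "failed": status != "Completed without error",
        })
    return tests

# Rows copied from the log above.
sample = """\
# 1  Short offline       Completed without error       00%     15724         -
# 2  Extended offline    Completed: read failure       90%     15707         105818168
#13  Extended offline    Completed without error       00%     14204         -
#18  Extended offline    Completed: read failure       90%     13527         105818600
"""

tests = parse_self_tests(sample)
failures = [t for t in tests if t["failed"]]
newest = max(t["hours"] for t in tests)
oldest_failure = min(t["hours"] for t in failures)
print(f"{len(failures)} failed tests, first one {newest - oldest_failure} power-on hours ago")
```

For this drive the gap between test #18 and test #1 is 2197 power-on hours, which on a 24x7 box is a little over 91 days.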

Chris Moore · Feb 12, 2019

I just finished going through, in detail, that text file you attached with the SMART results from the drives. There are FOUR drives in there that have failed their self test at least once. All of those should have been replaced the first time they failed a self test. It is clear that nobody is minding this server. If you have other servers, they may well be in similar condition. Please check them all because this has been on its last legs for months and nobody noticed.

Matute · Feb 12, 2019

Chris you're absolutely right. Not being one of the "important" boxes I let this one keep on warning me with not much action taken. I had so many warnings that I thought it was "usual" or something because I couldn't believe I had something like 5 out of 7 HDs wrong. Now (and you helped me a lot with the smart test analysis) I know I was wrong. They're in fact WD30EFRX so I'll return them as soon as I finish rebuilding everything.
Once again: thanks a lot!
BTW do you know if once the resilver writes a block it is reused or rather rewritten if it restarts? (sorry it seems a tongue twister)

Chris Moore · Feb 12, 2019

Matute said:
BTW do you know if once the resilver writes a block it is reused or rather rewritten if it restarts? (sorry it seems a tongue twister)

The resilver and scrub both share some code and the process must start at the "top" and navigate the entire data-structure. So, if it is interrupted, it will start again at the beginning, but it doesn't need to write data that already has been written. It is working to make the replacement disk match the condition it is supposed to have.

Chris Moore · Feb 12, 2019

Matute said:
I couldn't believe I had something like 5 out of 7 HDs wrong

That is why I am thinking you must have gotten a bad batch of drives. That is a super high failure rate. I set up a server two years ago with 60 drives and I have only needed to replace five drives in the whole time the system has been running. Five out of 60 is not so bad, but five out of 7 is really a lot of bad drives. I am pretty shocked.

Matute · Feb 13, 2019

Chris Moore said:
The resilver and scrub both share some code and the process must start at the "top" and navigate the entire data-structure. So, if it is interrupted, it will start again at the beginning, but it doesn't need to write data that already has been written. It is working to make the replacement disk match the condition it is supposed to have.

Chris, I'm confused about the resilvering process. I took a while to answer because I wanted to double-check my findings. There's a post in which I read that resilvering can survive restarts and even crashes, losing just a couple of minutes of what has been processed:

Ericloewe said:
In the near future, optimizations to resilver/scrub will mean that up to a few minutes of scrubbing will be lost.

Some more comments on that thread stating more or less the same.

My resilver process stopped (automagically) after errors on another disk and restarted (restarted, not resumed) 4 times before the offending drive was (automatically) set as FAULT by FreeNAS. Each time, the disk that was being resilvered in the first place (the new one) started all over again and, if I can trust the reports screen, data was written to it again every time. It's a 3TB HD, and I already see 3.26TB written to it. By the way, my volume is half full, and I think every disk must hold around 1.7 TB of data or less, so I feel it's actually restarting the resilver and ignoring everything that has been done so far (can it be that the error is naughty enough to trigger that?) If I knew that was the case I think I would have taken the other disk offline manually. Am I interpreting something wrong?
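The numbers behind that suspicion can be checked with a quick calculation. This is only a back-of-the-envelope Python sketch: it assumes the 8.81T in zpool status is raw allocated space in TiB spread evenly over the 6 disks of the RAID-Z2 vdev, and that the reports screen counts decimal TB. Neither assumption is confirmed anywhere in this thread.

```python
TIB = 2**40   # zpool prints binary units ("T" read as TiB)
TB = 10**12   # assumption: the reports screen uses decimal TB

pool_allocated_tib = 8.81      # "scanned out of 8.81T" in zpool status
disks_in_vdev = 6              # 6-disk RAID-Z2, parity included in raw space
written_to_new_disk_tb = 3.26  # figure read from the reports screen

# Expected share of the pool that one member disk has to hold.
per_disk_tb = pool_allocated_tib * TIB / TB / disks_in_vdev
# How many times that share has already been written to the new disk.
passes = written_to_new_disk_tb / per_disk_tb

print(f"expected per disk: {per_disk_tb:.2f} TB")
print(f"written so far:    {passes:.1f}x that amount")
```

Under those assumptions the new disk should end up holding roughly 1.6 TB, so 3.26TB written is about twice what a single clean pass would need, which fits the impression that data is being written again after each restart.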

What I see in the console previous to that kind of behaviour is something like this:

Code:

Feb 12 09:18:16 freenas2 (da6:mps0:0:9:0): WRITE(10). CDB: 2a 00 06 ec 41 50 00 00 10 00 
Feb 12 09:18:16 freenas2 (da6:mps0:0:9:0): CAM status: SCSI Status Error
Feb 12 09:18:16 freenas2 (da6:mps0:0:9:0): SCSI status: Check Condition
Feb 12 09:18:16 freenas2 (da6:mps0:0:9:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
Feb 12 09:18:16 freenas2 (da6:mps0:0:9:0): Info: 0x6ec4150
Feb 12 09:18:16 freenas2 (da6:mps0:0:9:0): Error 22, Unretryable error
Feb 12 09:18:41 freenas2 ZFS: vdev state changed, pool_guid=2621212020598560952 vdev_guid=2858672781549027275
Feb 12 09:42:08 freenas2 (da0:mps0:0:0:0): READ(10). CDB: 28 00 06 7e 88 a0 00 00 f8 00 
Feb 12 09:42:08 freenas2 (da0:mps0:0:0:0): CAM status: SCSI Status Error
Feb 12 09:42:08 freenas2 (da0:mps0:0:0:0): SCSI status: Check Condition
Feb 12 09:42:08 freenas2 (da0:mps0:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Feb 12 09:42:08 freenas2 (da0:mps0:0:0:0): Info: 0x67e88f8
Feb 12 09:42:08 freenas2 (da0:mps0:0:0:0): Error 5, Unretryable error
Feb 12 09:42:11 freenas2 (da0:mps0:0:0:0): READ(10). CDB: 28 00 06 7e 89 c8 00 01 00 00 
Feb 12 09:42:11 freenas2 (da0:mps0:0:0:0): CAM status: SCSI Status Error
Feb 12 09:42:11 freenas2 (da0:mps0:0:0:0): SCSI status: Check Condition
Feb 12 09:42:11 freenas2 (da0:mps0:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Feb 12 09:42:11 freenas2 (da0:mps0:0:0:0): Info: 0x67e89d0
Feb 12 09:42:11 freenas2 (da0:mps0:0:0:0): Error 5, Unretryable error
Feb 12 09:42:12 freenas2 ZFS: vdev state changed, pool_guid=2621212020598560952 vdev_guid=1326641505610154317

Thanks again!

Chris Moore · Feb 13, 2019

Matute said:
(can it be that the error is naughty enough to trigger that?)

Judging by what I saw in the SMART data that you posted, yes, there were four drives that have read failure test results. So it is easily possible that those drives are not going to work correctly.
In your position, I would have kept the original drives in the pool and tried to do an "in-place" replacement so the system could still read from the failing drive while trying to construct the data on the replacement drive.
Unfortunately, in the situation you are in now, your pool may not be recoverable. This is a backup server, if I understand correctly, so that may not be catastrophic, but more of a learning opportunity.

Matute said:
If I knew that was the case I think I would have taken the other disk offline manually.

I don't think taking disks offline is going to help your situation. Your pool only has two drives of redundancy, so if it loses a third drive, that would usually mean it is unrecoverable.

Matute · Feb 14, 2019

Chris, once again, thanks for answering. Just to clarify: I left every disk in the pool but one, the one I took out to put in a new one.
Trying to be clearer than I was:
The box is isolated from the network, and I made it as "read only" as I could.
da7 was the new disk; of course it was being written to (resilvering).
da6 (part of the old Z2, full of errors) was being read to resilver da7 but was also being corrected (resilvered), and every now and then something was written there.
5 times the resilver failed and restarted (I assume) because of naughty errors on da6.
The last time (358 write errors, 6500 sectors reallocated) FreeNAS finally set it as FAULT.
After that the resilver ended fine (the 6th time it started).

I think that I would have been better off taking da6 offline.

After FN set da6 as FAULT, as I was saying, everything went smoothly and the resilver ended fine. I was running (much to my discomfort) without redundancy, but the only thing I lost was a 4GB file (a small backup file) which the resilver flagged as a data error. I checked with zpool status -v, deleted it, and the next resilver was totally smooth, with zero errors. Now I'm replacing the other drives that were showing bad SMART test results, but I'm doing this with a free slot in my server and using replace (in other words, leaving the old disk online). I think I won this time, just for a thin thin thin margin. Thanks a lot for your assistance.

Chris Moore · Feb 16, 2019

Matute said:
I think I won this time, just for a thin thin thin margin.

I am happy that it worked out for you. Keep an eye on those hard drives. They are out to ruin your day.
