barbierimc
Dabbler · Joined: Jun 25, 2016 · Messages: 22
Some quick background: FreeNAS 11.1-U4 is installed in an ESXi 6.5 VM with 20 GB of (ECC) RAM allocated. For what it's worth, the main system board is a Supermicro X11SSM-F. The VM runs on a datastore backed by a mirrored pair of SSDs on an M1115 card in RAID 1.
For my ZFS pool I'm using a second M1115 controller flashed to IT mode, in pass-through, with six WD Reds in RAID-Z2. I'm not using encryption. An Intel S3500 is connected as a SLOG. In the GUI I've set swap to 0, and I run a dedicated swap file. All had been going well for two years, and I recently moved this over from a 9.10 installation.
A few days ago I got SMART errors on one drive, still under warranty. I offlined the drive, shut down the system, and removed the offending drive. Powered back up, and everything ran well in degraded mode. So far, so good.
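For anyone following along, the offline step (which I did through the GUI) boils down to something like this on the CLI; the gptid here is a placeholder, since the failed disk's label no longer appears in my zpool status output below:
Code:
# Take the failing member offline before pulling it (placeholder gptid)
zpool offline tank gptid/<failed-disk-gptid>
# Power down cleanly before physically removing the drive
shutdown -p now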
Today I wanted to install the new drive. I researched the topic and (maybe stupidly) decided to connect the drive hot. Maybe this was a mistake, maybe not.
Immediately before connecting the drive I ran zpool status, then ran it again straight after connecting it. Right away I had about 50 write errors (if I recall correctly), spread across two disks.
zpool status -v reported an error in .system/samba4 and several errors on an NFS dataset used as an ESXi datastore containing some low-criticality VMs (which were running at the time). A data error in those would not be catastrophic, but it's also not desirable.
At this stage I tried to replace the failed disk with the new one (through the GUI), but the operation would not complete: it reported that pool I/O was currently suspended.
Eventually, I ran zpool clear, which cleared all the errors. I then tried the replace operation again; it needed to be forced, as it reported a previous ZFS label on the disk (I presume from the first attempt), and it is now resilvering without complaint.
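For reference, the rough CLI equivalent of what I ended up doing is below. The old gptid is a placeholder, and the GUI also takes care of partitioning and swap, so this is only an approximation of what it actually does; the new gptid is taken from my zpool status output further down:
Code:
# Clear the error counters and un-suspend pool I/O
zpool clear tank
# Force the replace despite the stale ZFS label left by the first attempt
zpool replace -f tank gptid/<old-disk-gptid> gptid/2b5af577-3974-11e8-94a0-000c29e7aafe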
dmesg is filled with this type of error:
Code:
nfsd: can't register svc name
(da2:mps0:0:7:0): WRITE(16). CDB: 8a 00 00 00 00 01 54 17 c0 e8 00 00 00 08 00 00 length 4096 SMID 255 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 0
(da1:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 01 54 17 c0 f0 00 00 00 08 00 00 length 4096 SMID 1010 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 0
(da2:mps0:0:7:0): WRITE(16). CDB: 8a 00 00 00 00 01 54 17 c0 e8 00 00 00 08 00 00
(da2:mps0:0:7:0): CAM status: CCB request completed with an error
(da2:mps0:0:7:0): Retrying command
(da1:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 01 54 17 c0 f0 00 00 00 08 00 00
(da1:mps0:0:5:0): CAM status: CCB request completed with an error
(da1:mps0:0:5:0): Retrying command
(da2:mps0:0:7:0): WRITE(16). CDB: 8a 00 00 00 00 01 54 17 c0 e8 00 00 00 08 00 00
(da2:mps0:0:7:0): CAM status: SCSI Status Error
(da2:mps0:0:7:0): SCSI status: Check Condition
(da2:mps0:0:7:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da2:mps0:0:7:0): Retrying command (per sense data)
(da1:mps0:0:5:0): WRITE(16). CDB: 8a 00 00 00 00 01 54 17 c0 f0 00 00 00 08 00 00
(da1:mps0:0:5:0): CAM status: SCSI Status Error
(da1:mps0:0:5:0): SCSI status: Check Condition
(da1:mps0:0:5:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
(da1:mps0:0:5:0): Retrying command (per sense data)
(da2:mps0:0:7:0): WRITE(16). CDB: 8a 00 00 00 00 01 54 17 c0 e8 00 00 00 08 00 00
(da2:mps0:0:7:0): CAM status: SCSI Status Error
(da2:mps0:0:7:0): SCSI status: Check Condition
(da2:mps0:0:7:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
It feels like the M1115 dropped out momentarily when the drive was connected.
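If anyone wants to check the same thing on their own box, the HBA's firmware and driver pairing can be inspected as below; this is purely diagnostic, and assumes the sas2flash utility for LSI SAS2 cards is available (I believe it ships with FreeNAS):
Code:
# Show the mps driver and firmware versions logged at boot
dmesg | grep -i mps0
# List the controller's firmware and BIOS versions
sas2flash -list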
Interestingly, this is at the tail of dmesg (da7 is the new drive):
Code:
mps0: SAS Address for SATA device = 48634b4fb3cbc687
mps0: SAS Address from SATA device = 48634b4fb3cbc687
da7 at mps0 bus 0 scbus33 target 19 lun 0
da7: <ATA WDC WD60EFRX-68L 0A82> Fixed Direct Access SPC-4 SCSI device
da7: Serial Number WD-WX21D18E0FPL
da7: 600.000MB/s transfers
da7: Command Queueing enabled
da7: 5723166MB (11721045168 512 byte sectors)
da7: quirks=0x8<4K>
GEOM: da7: the primary GPT table is corrupt or invalid.
GEOM: da7: using the secondary instead -- recovery strongly advised.
I'm not sure whether the primary GPT table corruption is a symptom or a cause of this problem; maybe the disk itself is faulty? (I admit I didn't test it offline beforehand - slap.) Any advice on recovering from this error would be appreciated.
Except for the GPT table corruption, everything seems fine. I have backups of critical data, so that's not a major concern, but I won't be repeating this exercise: the time (and anxiety) spent analysing it wasn't worth the downtime saved by not shutting the system down. Lesson learned.
However, I'm interested in knowing:
- Did I misunderstand something about the hot-plug capability of this system?
- Did I miss a critical step?
- If the errors have cleared (with zpool clear) and a scrub returns no errors, is it likely that I have actually suffered any data loss or corruption?
- Aside from the obvious (test the disk offline first, shut down to replace the disk), should I have done anything differently?
- What do I need to do about the GPT table corruption? (My tentative plan is sketched just below.)
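To make the last two questions concrete, this is what I'm planning to run once the resilver completes; my understanding is that gpart recover rebuilds the damaged primary table from the intact secondary copy, but I'd welcome confirmation:
Code:
# After the resilver finishes: verify the whole pool end to end
zpool scrub tank
zpool status -v tank
# Rebuild the corrupt primary GPT on the new disk from the secondary copy
gpart recover da7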
zpool status currently looks like this:
Code:
  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2       ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Apr  6 18:27:11 2018
        3.54T scanned at 378M/s, 2.88T issued at 307M/s, 8.51T total
        490G resilvered, 33.79% done, 0 days 05:21:01 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/73fb9b36-3943-11e6-817c-000c29eb91f4  ONLINE       0     0     0
            gptid/2b5af577-3974-11e8-94a0-000c29e7aafe  ONLINE       0     0     0  (resilvering)
            gptid/75d83151-3943-11e6-817c-000c29eb91f4  ONLINE       0     0     0
            gptid/76c8c0c3-3943-11e6-817c-000c29eb91f4  ONLINE       0     0     0
            gptid/77baab81-3943-11e6-817c-000c29eb91f4  ONLINE       0     0     0
            gptid/78a6b026-3943-11e6-817c-000c29eb91f4  ONLINE       0     0     0
        logs
          gptid/8ce632ec-4832-11e6-af5b-000c29eb91f4    ONLINE       0     0     0

errors: No known data errors
gpart list da7 reports:
Code:
Geom name: da7
modified: false
state: OK
fwheads: 255
fwsectors: 63
last: 11721045127
first: 40
entries: 128
scheme: GPT
Providers:
1. Name: da7p1
   Mediasize: 6001175035904 (5.5T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2
   rawuuid: 2b5af577-3974-11e8-94a0-000c29e7aafe
   rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
   label: (null)
   length: 6001175035904
   offset: 65536
   type: freebsd-zfs
   index: 1
   end: 11721045119
   start: 128
Consumers:
1. Name: da7
   Mediasize: 6001175126016 (5.5T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e3