I just had a disk drop off my main RAIDZ2 pool and I need some guidance to troubleshoot/recover.
I am running FreeNAS-8.3.1-RELEASE-x64 (r13452), and am getting ready to upgrade.
Disk Configuration:
/dev/ada0,1,2,3 - TANK - RAIDZ2
/dev/ada4,5 - SCRATCH - RAIDZ
/dev/ada6 - attached to a hot swap disk caddy for backup.
I have been using UFS over USB for backup, but since this isn't compatible (and not the most reliable), I am in the process of creating ZFS backups before I do the migration. I plan on moving to a larger case, adding memory, and doing a processor upgrade just prior to the system upgrade.
I just successfully completed a full rsync backup of "TANK" to a single-disk ZFS pool I created on the removable disk on /dev/ada6.
I attached the disk, created a single disk pool BACKUP01, rsync'd the data on TANK to it, and then scrubbed BACKUP01 successfully.
BACKUP01 was still attached/spinning when the automatic scrub on TANK started.
I just happened to do a zpool status and I noticed that /dev/ada0 had DROPPED OFF from pool TANK. (The command smartctl -a /dev/ada0 would not work, and ls /dev/ad* showed it was gone.)
I did a check of /var/log/messages, and the only thing I could find relating to this error was:
Code:
(ada0:ahcich2:0:0:0): lost device
Here's the output from zpool status TANK:
Code:
  pool: TANK
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub in progress since Sun Aug 20 00:00:01 2017
        3.97T scanned out of 14.1T at 421M/s, 6h59m to go
        0 repaired, 28.15% done
config:
        NAME                                            STATE     READ WRITE CKSUM
        TANK                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/73fc17c0-1916-11e5-b2b5-c860006d863b  ONLINE       7  117K     0
            gptid/bcd021f7-dbd8-11e4-adc9-c860006d863b  ONLINE       0     0     0
            gptid/fcd3d3cb-f747-11e4-95cc-c860006d863b  ONLINE       0     0     0
            gptid/76e6e5f1-0f39-11e5-8bc3-c860006d863b  ONLINE       0     0     0
errors: No known data errors
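To pick the affected member out of output like this without reading every row, a small filter along these lines could be used. This is only a sketch: `flag_errors` is a name I made up, and it assumes the usual NAME / STATE / READ / WRITE / CKSUM column layout shown above.

```shell
# flag_errors: read `zpool status` output on stdin and print every row of
# the config table whose READ, WRITE or CKSUM counter is not zero.
# Hypothetical helper; assumes the NAME STATE READ WRITE CKSUM columns.
flag_errors() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # The "errors:" line marks the end of it.
        $1 == "errors:"               { in_config = 0 }
        # Counters are compared as strings so values such as 117K also count.
        in_config && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}

# Usage (on the NAS itself):
#   zpool status TANK | flag_errors
```

For the status above, the only row with non-zero counters is gptid/73fc17c0-1916-11e5-b2b5-c860006d863b (7 read, 117K write).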
Here's the output from the nightly security message:
Code:
freenas.WORKGROUP changes in mounted filesystems:
--- /var/log/mount.today 2017-08-18 03:01:01.000000000 -0400
+++ /tmp/security.JumAse6s 2017-08-20 03:01:02.000000000 -0400
@@ -4,7 +4,6 @@
 /dev/ufs/FreeNASs1a / ufs ro 1 1
 /dev/ufs/FreeNASs4 /data ufs rw,noatime 2 2
 /dev/ufs/USBDRIVE /mnt/USBDRIVE ufs rw 2 2
-BACKUP01 /mnt/BACKUP01 zfs rw,nfsv4acls 0 0
 SCRATCH /mnt/SCRATCH zfs rw,nfsv4acls 0 0
 TANK /mnt/TANK zfs rw,noatime,nfsv4acls 0 0
 TANK/barry /mnt/TANK/barry zfs rw,noatime,nfsv4acls 0 0
Checking for uids of 0:
root 0
Checking for passwordless accounts:
Checking login.conf permissions:
Checking for ports with mismatched checksums:
freenas.WORKGROUP kernel log messages:
+++ /tmp/security.vxG5TZCL 2017-08-20 03:01:03.000000000 -0400
+ugen2.3: <Logitech> at usbus2
+ukbd0: <USB Keyboard> on usbus2
+kbd2 at ukbd0
+uhid0: <USB Keyboard> on usbus2
+ugen2.3: <Logitech> at usbus2 (disconnected)
+ukbd0: at uhub4, port 3, addr 3 (disconnected)
+uhid0: at uhub4, port 3, addr 3 (disconnected)
+ugen2.3: <Logitech> at usbus2
+ukbd0: <USB Keyboard> on usbus2
+kbd2 at ukbd0
+uhid0: <USB Keyboard> on usbus2
+ugen2.3: <Logitech> at usbus2 (disconnected)
+ukbd0: at uhub4, port 3, addr 3 (disconnected)
+uhid0: at uhub4, port 3, addr 3 (disconnected)
+(ada0:ahcich2:0:0:0): lost device
+GEOM_ELI: Detached ada6p1.eli on last close.
+(ada6:ahcich1:0:0:0): lost device
+(ada6:ahcich1:0:0:0): removing device entry
+ugen2.3: <Logitech> at usbus2
+ukbd0: <USB Keyboard> on usbus2
+kbd2 at ukbd0
+uhid0: <USB Keyboard> on usbus2
+ugen2.4: <Logitech> at usbus2
+ums0: <Logitech USB Optical Mouse, class 0/0, rev 2.00/72.00, addr 4> on usbus2
+ums0: 3 buttons and [XYZ] coordinates ID=0
+ugen2.3: <Logitech> at usbus2 (disconnected)
+ukbd0: at uhub4, port 3, addr 3 (disconnected)
+uhid0: at uhub4, port 3, addr 3 (disconnected)
+ugen2.4: <Logitech> at usbus2 (disconnected)
+ums0: at uhub4, port 4, addr 4 (disconnected)
+swap_pager: I/O error - pagein failed; blkno 70,size 4096, error 6
+vm_fault: pager read error, pid 2840 (python)
+pid 2840 (python), uid 0: exited on signal 11
freenas.WORKGROUP login failures:
freenas.WORKGROUP refused connections:
-- End of security output --
NOTE: The following series of messages is expected, and is caused by a KVM switch:
+ugen2.3: <Logitech> at usbus2
+ukbd0: <USB Keyboard> on usbus2
+kbd2 at ukbd0
+uhid0: <USB Keyboard> on usbus2
+ugen2.3: <Logitech> at usbus2 (disconnected)
+ukbd0: at uhub4, port 3, addr 3 (disconnected)
+uhid0: at uhub4, port 3, addr 3 (disconnected)
I have 4 computers on a KVM switch (Windows Box, 2 Linux Boxes, FreeNAS), so every time I switch from machine to machine the mouse/keyboard come and go. I have been doing this for several years without any problem.
The following messages were caused by an intentional "Detach" executed from the GUI followed by unplugging the drive.
+GEOM_ELI: Detached ada6p1.eli on last close.
+(ada6:ahcich1:0:0:0): lost device
+(ada6:ahcich1:0:0:0): removing device entry
I am not sure what this means?
+++ /tmp/security.JumAse6s 2017-08-20 03:01:02.000000000 -0400
@@ -4,7 +4,6 @@
Here are the last few lines of dmesg:
Code:
kbd2 at ukbd0
uhid0: <USB Keyboard> on usbus2
ugen2.3: <Logitech> at usbus2 (disconnected)
ukbd0: at uhub4, port 3, addr 3 (disconnected)
uhid0: at uhub4, port 3, addr 3 (disconnected)
(ada0:ahcich2:0:0:0): lost device
GEOM_ELI: Detached ada6p1.eli on last close.
(ada6:ahcich1:0:0:0): lost device
(ada6:ahcich1:0:0:0): removing device entry
ugen2.3: <Logitech> at usbus2
ukbd0: <USB Keyboard> on usbus2
kbd2 at ukbd0
uhid0: <USB Keyboard> on usbus2
ugen2.4: <Logitech> at usbus2
ums0: <Logitech USB Optical Mouse, class 0/0, rev 2.00/72.00, addr 4> on usbus2
ums0: 3 buttons and [XYZ] coordinates ID=0
ugen2.3: <Logitech> at usbus2 (disconnected)
ukbd0: at uhub4, port 3, addr 3 (disconnected)
uhid0: at uhub4, port 3, addr 3 (disconnected)
ugen2.4: <Logitech> at usbus2 (disconnected)
ums0: at uhub4, port 4, addr 4 (disconnected)
swap_pager: I/O error - pagein failed; blkno 70,size 4096, error 6
vm_fault: pager read error, pid 2840 (python)
pid 2840 (python), uid 0: exited on signal 11
Other than the device lost message there is no hint of anything happening.
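Because the intentional ada6 removal and the unexpected ada0 loss both log the same "lost device" text, a filter like the one below could list just those events together with the AHCI channel each disk was on. It is a sketch only: `lost_devices` is a made-up name, and the pattern assumes the `(device:channel:...)` prefix seen in the messages above.

```shell
# lost_devices: read kernel log text on stdin and print "<device> <channel>"
# for every CAM "lost device" line, for example
#   (ada0:ahcich2:0:0:0): lost device   ->   ada0 ahcich2
# Hypothetical helper; a leading "+" or syslog timestamp before the
# parenthesis is ignored.
lost_devices() {
    sed -n 's/.*(\([a-z]\{1,\}[0-9]\{1,\}\):\([a-z]\{1,\}[0-9]\{1,\}\):[0-9:]*): lost device.*/\1 \2/p'
}

# Usage (on the NAS itself):
#   dmesg | lost_devices
#   lost_devices < /var/log/messages
```

Run against the dmesg lines quoted above, this would report ada0 on ahcich2 and ada6 on ahcich1, so the two disks were on different AHCI channels.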
After I noticed the error, I detached the backup for safekeeping and stopped the scrub with zpool scrub -s TANK
The system is sitting idle with the power still on and a console shell open so as not to destroy any information. Anything else I should check?
How can I determine why the drive became detached?
System History/Possible explanations:
It is possible that adding the hot swap disk caddy may have loosened a cable on /dev/ada0 (although the pool did run fine for a couple of days).
It is also possible I may have had some freak memory error (I am not running ECC memory; the system is built with an ASUS consumer-grade motherboard). The system has been very stable over the last year, with only one crash, caused by a crash on a Windows box that was connected via a CIFS share. When the Windows box hung, FreeNAS locked up and I had to reset it. Other than that it has been rock solid, with no trouble at all.
Unless someone has a better (safer) suggestion, my plan is to:
- shut down, reseat the cable on ada0
- restart and see if the BIOS recognized the disk (if not RMA!)
- if disk recognized, attempt to do a smartctl -x /dev/ada0 to see what is going on (if something obvious RMA/replace disk)
- Run a long self test (if something found RMA/replace disk)
- If nothing found to this point run memtest for an hour or so
- If all ok zpool clear TANK and run another scrub.
- If OK return system to production and observe carefully for a week or two
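In command form, I expect the software side of this plan to look roughly like the sketch below. It assumes the disk comes back as /dev/ada0 after the reboot (the device name could change) and that the pool is still named TANK. The `run` wrapper and the `DRY_RUN` variable are my own additions so the commands can be printed and reviewed before anything is executed. The memtest step is not shown because it runs from separate boot media.

```shell
# Sketch of the plan above. DISK and POOL default to the names used in
# this post; set DRY_RUN=1 to print the commands instead of running them.
DISK=${DISK:-/dev/ada0}
POOL=${POOL:-TANK}

# run: hypothetical wrapper that either echoes or executes a command.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# After reseating the cable and rebooting:
check_disk() {
    run smartctl -x "$DISK"           # full SMART report, including error logs
    run smartctl -t long "$DISK"      # start the extended self-test in the background
}

# Once the self-test has finished and memtest has passed:
resume_pool() {
    run smartctl -l selftest "$DISK"  # show the self-test log and its result
    run zpool clear "$POOL"           # reset the pool's error counters
    run zpool scrub "$POOL"           # start a fresh scrub
    run zpool status -v "$POOL"       # check progress and any new errors
}
```

Running `DRY_RUN=1 check_disk` first only prints the two smartctl commands, which makes it easy to confirm the device name before touching the disk.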
Thanks in advance for any assistance/suggestions/recommendations.