Andrew Barnes
Dabbler | Joined: Dec 4, 2014 | Messages: 21
Hi,
I noticed yesterday that my pool was degraded (rookie mistake #1: not setting up email notifications). It must be a recent development, though, as I've been using the GUI a lot lately.
The pool is a RAIDZ2 over 6x 3TB WD Reds. The FreeNAS server has 16GB RAM, an LSI M1015 HBA (P16 IT-mode firmware), and a 3.5-3.9GHz Haswell E3 (1 socket, 4 cores, 2 threads each).
It runs under QEMU/KVM (a type-1 hypervisor) with the HBA assigned directly to the guest, i.e. FreeNAS owns the entire HBA just like bare metal (with a little performance loss on interrupt handling).
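Under the hood this is just PCI passthrough of the controller. As a sketch only (the PCI address and the -smp/-m values are placeholders, and my real setup is driven through libvirt rather than a raw command line):
Code:
# hand the whole M1015 to the FreeNAS guest via VFIO passthrough
# (01:00.0 is a placeholder; find the real address with `lspci | grep LSI`)
qemu-system-x86_64 -machine q35,accel=kvm -cpu host -smp 4 -m 16384 \
    -device vfio-pci,host=01:00.0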
The host is a Supermicro X10SLL+-F, E3-1241, 32GB Samsung ECC (on the SM QVL), SM SC745 chassis, SM 500W PSU (or more, can't remember!).
It's non-production, just my personal project. All data that needs to be backed up is snapshotted regularly and replicated to an HP MicroServer Gen8 running FreeNAS. However, there is still data here that I'd really prefer to save (ripped movies etc.). I don't back that up because of its size and because I have the originals, but it would still be a big job to re-rip it all.
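The replication itself is just the usual recursive snapshot plus incremental send, roughly like this (dataset, snapshot, and host names here are placeholders, not my actual config):
Code:
# take a recursive snapshot, then send the delta to the Microserver
zfs snapshot -r zeta/important@auto-20160314
zfs send -R -i zeta/important@auto-20160307 zeta/important@auto-20160314 | \
    ssh microserver zfs receive -F backup/important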
About a week ago I resilvered (online, in parallel) to swap one 3TB WD Green for a 3TB WD Red. No problems. Before that, there had been one successful scrub with no errors.
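(For reference, that was the standard ZFS online replace, roughly the following; the gptid labels are placeholders for the old Green and the new Red:)
Code:
# online replace: resilvers from parity while the pool stays available
zpool replace zeta gptid/OLD-GREEN.eli gptid/NEW-RED.eli
zpool status -v zeta   # watch the resilver progress
Anyway, here is the zpool status I found yesterday: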
Code:
  pool: zeta
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 1.68T in 10h52m with 0 errors on Wed Mar 2 04:29:58 2016
config:

        NAME                                                STATE     READ WRITE CKSUM
        zeta                                                DEGRADED     0     0    23
          raidz2-0                                          DEGRADED     0     0    46
            gptid/468ca5ae-dfd4-11e5-8d7a-006f62630200.eli  ONLINE       0     0     0
            gptid/be9593cc-e1d5-11e4-9466-006f62630200.eli  ONLINE       0     0     2
            gptid/befd712f-e1d5-11e4-9466-006f62630200.eli  DEGRADED     0     0     5  too many errors
            gptid/bf6c1201-e1d5-11e4-9466-006f62630200.eli  ONLINE       0     0     1
            gptid/bfe1483a-e1d5-11e4-9466-006f62630200.eli  ONLINE       0     0     2
            gptid/c055ce20-e1d5-11e4-9466-006f62630200.eli  DEGRADED     0     0     7  too many errors

errors: Permanent errors have been detected in the following files:

        zeta/vm/asterix:<0x1>
        /mnt/zeta/media/1.flac
        /mnt/zeta/media/2.flac
        /mnt/zeta/media/3.flac
        /mnt/zeta/media/4.flac
        /mnt/zeta/media/5.flac
        /mnt/zeta/media/6.flac
        /mnt/zeta/media/7.flac
        /mnt/zeta/media/8.flac
        /mnt/zeta/media/9.flac
        /mnt/zeta/media/10.flac
        /mnt/zeta/media/11.flac
        /mnt/zeta/media/12.flac
(I've replaced the actual file paths with numbers.)
The degraded drives are /dev/da1 and /dev/da4.
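(To map the gptid labels in the status output to daX device names, glabel should do it:)
Code:
# list GPT labels alongside their backing daX devices
glabel status | grep gptid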
What I did:
1. Deleted the parent folder for each file (i.e. removed the entire album). The first item in the list is a zvol backing an iSCSI device used by a VM that was having I/O issues; I removed all iSCSI configuration for that zvol, deleted all of its snapshots, and finally destroyed the zvol itself.
2. Cleared the pool's error counters (zpool clear).
3. Ran a short SMART test on all drives: NO ERRORS (see the command sketch below).
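Roughly, steps 1-3 in command terms (the album path is a placeholder; some of this went through the GUI):
Code:
# step 1: remove the damaged files and the zvol behind the failing iSCSI extent
rm -r "/mnt/zeta/media/ALBUM"        # repeated per damaged album
zfs destroy -r zeta/vm/asterix       # the zvol plus all of its snapshots
# step 2: reset the pool's error counters and error log
zpool clear zeta
# step 3: short SMART self-test on every disk, then read back the results
for d in da0 da1 da2 da3 da4 da5; do smartctl -t short /dev/$d; done
smartctl -a /dev/da1                 # repeated per disk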
After removing the damaged files, zpool status still referred to 13 files, but without complete paths:
Code:
errors: Permanent errors have been detected in the following files:

        <0x210>:<0x1>
        zeta/media/music:<0x2d1a>
        zeta/media/music:<0xb40>
        zeta/media/music:<0x2c59>
        zeta/media/music:<0x2c60>
        zeta/media/music:<0x296f>
        zeta/media/music:<0x2c7f>
        zeta/media/music:<0x28ba>
        zeta/media/music:<0x28bb>
        zeta/media/music:<0x2abc>
        zeta/media/music:<0x2abe>
        zeta/media/music:<0x2abf>
        zeta/media/music:<0x2ac3>
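Those entries are dataset:object-number pairs; since the files have been deleted, ZFS can no longer resolve the object numbers back to paths. Out of curiosity, one can ask zdb about an object (it takes a decimal object number, so 0x2d1a becomes 11546); a sketch, not something I've run here:
Code:
# dump metadata for object 0x2d1a (11546 decimal) in the music dataset
zdb -dddd zeta/media/music 11546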
I proceeded...
4. Scrubbed the pool.
On completion I now have:
Code:
zpool status -v
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Feb 28 03:45:12 2016
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          vtbd0p2     ONLINE       0     0     0

errors: No known data errors

  pool: zeta
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 840K in 7h58m with 0 errors on Mon Mar 14 08:34:42 2016
config:

        NAME                                                STATE     READ WRITE CKSUM
        zeta                                                ONLINE       0     0     0
          raidz2-0                                          ONLINE       0     0     0
            gptid/468ca5ae-dfd4-11e5-8d7a-006f62630200.eli  ONLINE       0     0     6
            gptid/be9593cc-e1d5-11e4-9466-006f62630200.eli  ONLINE       0     0     8
            gptid/befd712f-e1d5-11e4-9466-006f62630200.eli  ONLINE       0     0     0
            gptid/bf6c1201-e1d5-11e4-9466-006f62630200.eli  ONLINE       0     0     9
            gptid/bfe1483a-e1d5-11e4-9466-006f62630200.eli  ONLINE       0     0     5
            gptid/c055ce20-e1d5-11e4-9466-006f62630200.eli  ONLINE       0     0     0

errors: No known data errors
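If I follow the action text above, the leftover per-device counters and the stale error log would be reset and re-verified like so (I haven't done this yet):
Code:
# reset the error counters, then confirm with a fresh scrub
zpool clear zeta
zpool scrub zeta
zpool status -v zeta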
At this point several things concern me.
1) If the pool is now okay (it says no errors and ONLINE), why doesn't the status message match the state?
2) How come the two drives that showed checksum errors before the scrub (and were therefore marked DEGRADED) are now the only two drives without checksum errors?
3) Why do so many drives have checksum issues? Is it because I haven't updated the HBA firmware to P20? Could I have a memory fault? (See the firmware check sketched after this list.)
4) Am I asking the right questions? I don't know!
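For question 3, I believe the controller's firmware level can be checked with the LSI tool (a sketch; I haven't verified the output format):
Code:
# report controller firmware/BIOS versions (P16 vs P20)
sas2flash -listall
# the mps(4) driver also logs the firmware version at boot
dmesg | grep -i mps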
I would be very keen to hear how others interpret this; I hope I've provided all the relevant info.
Thanks,
Andy