Rokkahorra
Cadet
Joined: Mar 6, 2024
Messages: 1
Thanks in advance for your assistance!
So a couple of weeks ago my main zpool crashed unexpectedly, and I'm curious about 1) why, and 2) whether it can be undone. The data isn't exactly mission-critical; I'd like to save it if possible, but I'm willing to scrap the pool and rebuild if that's not in the cards. The pool is raidz2 with four vdevs, each composed of eight 4TB disks. When I got home and logged in, the pool was showing up as exported and couldn't be imported because five of the disks in a single vdev were unavailable. I can hear those disks spin up when they're connected, but they don't show up in TrueNAS or in the LSI SAS topology at boot. Unfortunately I don't have an offsite backup, otherwise I would just swap out the disks and restore from it.
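For reference, this is roughly what I've been running from the shell to check the pool (pool name tank1 as it appears in the log below; I'm typing these from memory, so treat them as approximate):
Code:
# List pools available for import and show the state of each vdev member
zpool import

# Try the import by name; this is where it complains about unavailable devices / insufficient replicas
zpool import tank1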
I have changed both SAS cables and the power-splitter cables, swapped the expander cards and the SAS controller cards, and shuffled the HDDs into different positions, all to no avail.
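In case it's useful, this is more or less how I've been checking whether the missing disks are visible to Linux and to the HBA at all while swapping hardware around (I'm not certain whether my controller wants sas2ircu or sas3ircu, and lsscsi may need installing, so take these as a rough sketch):
Code:
# Do the disks enumerate as block devices at all?
lsblk -o NAME,SIZE,MODEL,SERIAL

# SCSI-level view of attached devices
lsscsi

# Ask the LSI HBA directly which drives it sees (controller 0)
sas2ircu LIST
sas2ircu 0 DISPLAY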
Going through /var/log/messages, I can't find anything that specifically shows FIVE disks failing, but I also don't fully understand what the log is telling me in the lead-up to the first boot with the failed pool. Here's the excerpt I can't make sense of:
Code:
Feb 18 00:00:15 truenas kernel: sd 0:0:20:0: [sdw] tag#2816 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Feb 18 00:00:15 truenas kernel: sd 0:0:20:0: [sdw] tag#2816 Sense Key : Hardware Error [current]
Feb 18 00:00:15 truenas kernel: sd 0:0:20:0: [sdw] tag#2816 Add. Sense: Internal target failure
Feb 18 00:00:15 truenas kernel: sd 0:0:20:0: [sdw] tag#2816 CDB: Read(16) 88 00 00 00 00 01 d1 c0 bc 90 00 00 00 10 00 00
Feb 18 00:00:15 truenas kernel: zio pool=tank1 vdev=/dev/disk/by-partuuid/95ee6edf-ebe0-4812-bcd0-4e18a842b36c error=121 type=1 offset=3998639202304 size=8192 flags=b08c1
Feb 19 00:38:47 truenas syslog-ng[138101]: Configuration reload request received, reloading configuration;
Feb 19 00:38:47 truenas syslog-ng[138101]: Configuration reload finished;
Feb 19 00:38:57 truenas syslog-ng[138101]: Configuration reload request received, reloading configuration;
Feb 19 00:38:57 truenas syslog-ng[138101]: Configuration reload finished;
Feb 19 00:38:16 truenas zed[138048]: Missed 703 events
Feb 19 00:38:16 truenas zed[138048]: Bumping queue length to 1073741824
Feb 19 01:20:59 truenas kernel: zed invoked oom-killer: gfp_mask=0x42dc0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO), order=2, oom_score_adj=0
Feb 19 01:20:59 truenas kernel: CPU: 8 PID: 138048 Comm: zed Tainted: P OE 5.15.79+truenas #1
Feb 19 01:20:59 truenas kernel: Hardware name: Supermicro H8QG6/H8QG6, BIOS 3.5b 03/25/2016
Feb 19 01:20:59 truenas kernel: Call Trace:
Feb 19 01:20:59 truenas kernel: <TASK>
Feb 19 01:20:59 truenas kernel: dump_stack_lvl+0x46/0x5e
Feb 19 01:20:59 truenas kernel: dump_header+0x4a/0x1f4
Feb 19 01:20:59 truenas kernel: oom_kill_process.cold+0xb/0x10
Feb 19 01:20:59 truenas kernel: out_of_memory+0x1bd/0x4f0
Feb 19 01:20:59 truenas kernel: __alloc_pages_slowpath.constprop.0+0xc30/0xd00
Feb 19 01:20:59 truenas kernel: __alloc_pages+0x1e9/0x220
Feb 19 01:20:59 truenas kernel: kmalloc_large_node+0x40/0xb0
Feb 19 01:20:59 truenas kernel: __kmalloc_node+0x3d6/0x480
Feb 19 01:20:59 truenas kernel: spl_kmem_alloc_impl+0x79/0xc0 [spl]
Feb 19 01:20:59 truenas kernel: zfsdev_ioctl+0x28/0xe0 [zfs]
Feb 19 01:20:59 truenas kernel: __x64_sys_ioctl+0x8b/0xc0
Feb 19 01:20:59 truenas kernel: do_syscall_64+0x3b/0xc0
Feb 19 01:20:59 truenas kernel: entry_SYSCALL_64_after_hwframe+0x61/0xcb
Feb 19 01:20:59 truenas kernel: RIP: 0033:0x7fe77ea716b7
Feb 19 01:20:59 truenas kernel: Code: Unable to access opcode bytes at RIP 0x7fe77ea7168d.
Feb 19 01:20:59 truenas kernel: RSP: 002b:00007ffc3e52a2c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Feb 19 01:20:59 truenas kernel: RAX: ffffffffffffffda RBX: 00007ffc3e52d8e0 RCX: 00007fe77ea716b7
Feb 19 01:20:59 truenas kernel: RDX: 00007ffc3e52a2e0 RSI: 0000000000005a81 RDI: 0000000000000005
Feb 19 01:20:59 truenas kernel: RBP: 00007ffc3e52d8d0 R08: 0000561f9c52f170 R09: 00007fe77eb4ebe0
Feb 19 01:20:59 truenas kernel: R10: 0000000000000040 R11: 0000000000000246 R12: 0000000000000000
Feb 19 01:20:59 truenas kernel: R13: 0000561f9c511d40 R14: 00007ffc3e52a2e0 R15: 00007ffc3e52d8e8
Feb 19 01:20:59 truenas kernel: </TASK>
Feb 19 01:20:59 truenas kernel: Mem-Info:
Feb 19 01:20:59 truenas kernel: active_anon:5028 inactive_anon:201908 isolated_anon:1048 active_file:2513 inactive_file:327 isolated_file:30 unevictable:960 dirty:545 writeback:6302 slab_reclaimable:31428 slab_unreclaimable:63596847 mapped:2884 shmem:972 pagetables:18848 bounce:0 kernel_misc_reclaimable:0 free:323633 free_pcp:187 free_cma:14449
Feb 19 01:20:59 truenas kernel: Node 0 active_anon:8696kB inactive_anon:198172kB active_file:5696kB inactive_file:2128kB unevictable:3840kB isolated(anon):2692kB isolated(file):120kB mapped:4884kB dirty:860kB writeback:10348kB shmem:3868kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 16384kB writeback_tmp:0kB kernel_stack:7936kB pagetables:6616kB all_unreclaimable? no
Feb 19 01:20:59 truenas kernel: Node 1 active_anon:0kB inactive_anon:85304kB active_file:0kB inactive_file:100kB unevictable:0kB isolated(anon):56kB isolated(file):0kB mapped:0kB dirty:0kB writeback:288kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 30720kB writeback_tmp:0kB kernel_stack:6992kB pagetables:11448kB all_unreclaimable? no
Feb 19 01:20:59 truenas kernel: Node 2 active_anon:0kB inactive_anon:35476kB active_file:0kB inactive_file:16kB unevictable:0kB isolated(anon):20kB isolated(file):0kB mapped:0kB dirty:0kB writeback:156kB shmem:4kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 16384kB writeback_tmp:0kB kernel_stack:7456kB pagetables:5376kB all_unreclaimable? no
Feb 19 01:20:59 truenas kernel: Node 3 active_anon:0kB inactive_anon:18048kB active_file:48kB inactive_file:8kB unevictable:0kB isolated(anon):4kB isolated(file):0kB mapped:8kB dirty:4kB writeback:216kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 12288kB writeback_tmp:0kB kernel_stack:4340kB pagetables:3940kB all_unreclaimable? no
Feb 19 01:20:59 truenas kernel: Node 4 active_anon:1000kB inactive_anon:32744kB active_file:116kB inactive_file:644kB unevictable:0kB isolated(anon):236kB isolated(file):0kB mapped:1084kB dirty:56kB writeback:1616kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 14336kB writeback_tmp:0kB kernel_stack:4916kB pagetables:8560kB all_unreclaimable? no
Feb 19 01:20:59 truenas kernel: Node 5 active_anon:96kB inactive_anon:59992kB active_file:0kB inactive_file:8kB unevictable:0kB isolated(anon):20kB isolated(file):0kB mapped:0kB dirty:8kB writeback:128kB shmem:4kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 43008kB writeback_tmp:0kB kernel_stack:8324kB pagetables:4020kB all_unreclaimable? no
Feb 19 01:20:59 truenas kernel: Node 6 active_anon:4160kB inactive_anon:98884kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):256kB isolated(file):0kB mapped:3520kB dirty:88kB writeback:1624kB shmem:8kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 36864kB writeback_tmp:0kB kernel_stack:5844kB pagetables:5500kB all_unreclaimable? no
Feb 19 01:20:59 truenas kernel: Node 7 active_anon:6764kB inactive_anon:278316kB active_file:5276kB inactive_file:1424kB unevictable:0kB isolated(anon):780kB isolated(file):0kB mapped:1180kB dirty:1164kB writeback:10260kB shmem:4kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 159744kB writeback_tmp:0kB kernel_stack:11760kB pagetables:29932kB all_unreclaimable? no
Feb 19 01:20:59 truenas kernel: Node 0 DMA free:7136kB min:12kB low:24kB high:36kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15952kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Feb 19 01:20:59 truenas kernel: lowmem_reserve[]: 0 3217 32015 32015 32015
Feb 19 01:20:59 truenas kernel: Node 0 DMA32 free:119680kB min:4860kB low:8152kB high:11444kB reserved_highatomic:2048KB active_anon:0kB inactive_anon:2332kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3520960kB managed:3455424kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Feb 19 01:20:59 truenas kernel: lowmem_reserve[]: 0 0 28797 28797 28797
Feb 19 01:20:59 truenas kernel: Node 0 Normal free:159180kB min:145240kB low:174728kB high:204216kB reserved_highatomic:2048KB active_anon:9656kB inactive_anon:196480kB active_file:5332kB inactive_file:1920kB unevictable:3840kB writepending:11036kB present:30015488kB managed:29489076kB mlocked:0kB bounce:0kB free_pcp:728kB local_pcp:0kB free_cma:57976kB
Feb 19 01:20:59 truenas kernel: lowmem_reserve[]: 0 0 0 0 0
Feb 19 01:20:59 truenas kernel: Node 1 Normal free:125172kB min:147024kB low:180052kB high:213080kB reserved_highatomic:2048KB active_anon:84kB inactive_anon:84984kB active_file:300kB inactive_file:0kB unevictable:0kB writepending:288kB present:33554432kB managed:33028028kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Feb 19 01:20:59 truenas kernel: lowmem_reserve[]: 0 0 0 0 0
Feb 19 01:20:59 truenas kernel: Node 2 Normal free:88740kB min:140828kB low:173856kB high:206884kB reserved_highatomic:0KB active_anon:0kB inactive_anon:35504kB active_file:0kB inactive_file:8kB unevictable:0kB writepending:156kB present:33554432kB managed:33028032kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Feb 19 01:20:59 truenas kernel: lowmem_reserve[]: 0 0 0 0 0
Feb 19 01:20:59 truenas kernel: Node 3 Normal free:158964kB min:169352kB low:202340kB high:235328kB reserved_highatomic:0KB active_anon:0kB inactive_anon:18488kB active_file:104kB inactive_file:0kB unevictable:0kB writepending:220kB present:33554432kB managed:32998116kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Feb 19 01:20:59 truenas kernel: lowmem_reserve[]: 0 0 0 0 0
Feb 19 01:20:59 truenas kernel: Node 4 Normal free:182180kB min:169552kB low:202580kB high:235608kB reserved_highatomic:2048KB active_anon:3040kB inactive_anon:31624kB active_file:884kB inactive_file:356kB unevictable:0kB writepending:1672kB present:33554432kB managed:33028032kB mlocked:0kB bounce:0kB free_pcp:20kB local_pcp:0kB free_cma:0kB
Feb 19 01:20:59 truenas kernel: lowmem_reserve[]: 0 0 0 0 0
Feb 19 01:20:59 truenas kernel: Node 5 Normal free:96564kB min:169552kB low:202580kB high:235608kB reserved_highatomic:0KB active_anon:16kB inactive_anon:60072kB active_file:0kB inactive_file:8kB unevictable:0kB writepending:136kB present:33554432kB managed:33028028kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Feb 19 01:20:59 truenas kernel: lowmem_reserve[]: 0 0 0 0 0
Feb 19 01:20:59 truenas kernel: Node 6 Normal free:186896kB min:169552kB low:202580kB high:235608kB reserved_highatomic:2048KB active_anon:3528kB inactive_anon:99508kB active_file:60kB inactive_file:0kB unevictable:0kB writepending:1712kB present:33554432kB managed:33028032kB mlocked:0kB bounce:0kB free_pcp:216kB local_pcp:0kB free_cma:0kB
Feb 19 01:20:59 truenas kernel: lowmem_reserve[]: 0 0 0 0 0
Feb 19 01:20:59 truenas kernel: Node 7 Normal free:172552kB min:169532kB low:202556kB high:235580kB reserved_highatomic:2048KB active_anon:6860kB inactive_anon:278264kB active_file:5188kB inactive_file:1804kB unevictable:0kB writepending:11228kB present:33554432kB managed:33025640kB mlocked:0kB bounce:0kB free_pcp:1636kB local_pcp:0kB free_cma:0kB
Feb 19 01:20:59 truenas kernel: lowmem_reserve[]: 0 0 0 0 0
Feb 19 01:20:59 truenas kernel: Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (E) 1*64kB (E) 1*128kB (E) 1*256kB (E) 1*512kB (E) 2*1024kB (UE) 2*2048kB (ME) 0*4096kB 0*8192kB 0*16384kB 0*32768kB = 7136kB
As far as I can tell, one drive appeared to be in the process of failing, but I don't understand why that would take out the four disks around it, nor do I see any report of those disks failing in the logs.
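For what it's worth, my plan for the drive throwing the hardware errors above (sdw at the time, though the device name has probably moved around after all the reshuffling) is to pull SMART data with something along these lines once it shows up again:
Code:
# Full SMART/health report for the suspect drive (device name from the Feb 18 log)
smartctl -a /dev/sdw

# List everything smartctl can currently see, to re-map device names
smartctl --scan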
I'm hoping someone can shed light on exactly what happened, and on whether there's any reason the SAS controller would be unable to see the affected disks at a logical level. Otherwise my guess is that some sort of freak power surge, or something like it, knocked out five disks on the same vdev at once.
I'll attach the full log file as well. I believe the pool failed over Feb 18-19, and on Feb 24 the system booted without the disks in question. I'd really like to understand what went wrong, even if the data is beyond recovering, before I scrap the pool and rebuild it. Any assistance or information is appreciated :-D
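If anyone wants to skim the attachment, this is roughly how I pulled the Feb 18-19 window quoted above out of my copy of /var/log/messages (adjust the filename to wherever the attached copy ends up):
Code:
# Grab everything logged on Feb 18 and Feb 19
grep -E '^Feb 1[89] ' messages.txt

# Narrow it down to disk/ZFS-related lines
grep -E '^Feb 1[89] ' messages.txt | grep -Ei 'sd[a-z]+|zio|zfs|zed|oom'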
TL;DR: raidz2 pool with 4 vdevs of 8x 4TB drives each; 5 of the 8 disks in one vdev are currently listed as "UNAVAILABLE", making the pool unimportable due to insufficient replicas.