Hello,
I need your help troubleshooting an issue we have with a FreeNAS backup server.
I'll go into detail below, but in brief: from time to time there is an error accessing some of the disks, which ends with the host aborting the request and detaching the disks (usually 2, sometimes 4 of them, all at the same time), only to re-detect and re-attach them 10-15 seconds later.
So even though we have RAIDZ2, this often causes some corruption.
My question is: what could cause this type of failure?
Here are some details on the system, which is running FreeNAS-9.10.2-U5 (yes, I know it's a bit old):
board: Supermicro H8QG6
memory: 80 GB
HBA controller: LSI 2308
SAS expander: Intel RES2SV240
volume disks: 20x Seagate NAS HDD ST8000VN0002
other disks: 2x Intel S3700 100GB for ZIL (SLOG) and 2x Samsung SM863 240GB for L2ARC
zpool configuration:
Code:
  pool: vol1
 state: ONLINE
  scan: scrub in progress since Fri Jul 5 12:01:35 2019
        5.67G scanned out of 30.5T at 10.7M/s, (scan is slow, no estimated time)
        0 repaired, 0.02% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        vol1                                            ONLINE       0    94     0
          raidz2-0                                      ONLINE       0   188     0
            gptid/49d0dd58-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
            gptid/4ad07498-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
            gptid/4bd35639-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
            gptid/4cd2284c-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
            gptid/4ddefa41-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
            gptid/4ee2bdf9-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
            gptid/4fdf1223-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
            gptid/50ecf07b-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
            gptid/51f11d2c-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
            gptid/52fb6848-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
          raidz2-1                                      ONLINE       0     0     0
            gptid/54096b3c-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
            gptid/55106d03-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
            gptid/561d43f4-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
            gptid/5729d98b-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
            gptid/582acafb-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
            gptid/59506a31-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
            gptid/5a6461cc-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
            gptid/5b71ce0f-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
            gptid/5c893896-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
            gptid/5d9d1ed0-96b2-11e6-b0d1-002590585018  ONLINE       0     0     0
        logs
          mirror-2                                      ONLINE       0     0     0
            gptid/70e566a7-96b3-11e6-b0d1-002590585018  ONLINE       0     0     0
            gptid/7146cdd6-96b3-11e6-b0d1-002590585018  ONLINE       0     0     0
        cache
          gptid/230deadc-96b4-11e6-b0d1-002590585018    ONLINE       0     0     0
          gptid/2360c322-96b4-11e6-b0d1-002590585018    ONLINE       0     0     0
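In case it helps, this is roughly how I've been mapping the gptid labels from `zpool status` back to device nodes (and from there to serial numbers) on the box. The `glabel status` sample below is illustrative only, and the `da3p2`/`da4p2` device names in it are placeholders, not the actual mapping on my system:

```shell
# Sketch: resolve a ZFS gptid label to its da device, then read the serial.
# On FreeBSD/FreeNAS, `glabel status` lists gptid -> partition mappings.
# The sample output below is hand-written for illustration (hypothetical daX names):
glabel_sample='                                      Name  Status  Components
gptid/49d0dd58-96b2-11e6-b0d1-002590585018     N/A  da3p2
gptid/4ad07498-96b2-11e6-b0d1-002590585018     N/A  da4p2'

# Pick the row for the gptid of interest, take the Components column,
# and strip the partition suffix (da3p2 -> da3):
dev=$(printf '%s\n' "$glabel_sample" \
  | awk '/49d0dd58/ {print $3}' \
  | sed 's/p[0-9][0-9]*$//')
echo "$dev"

# On the live system, the serial number then comes from, e.g.:
#   smartctl -i /dev/$dev | grep 'Serial Number'
```

On the real machine the pipeline is the same, just fed from `glabel status` directly instead of the sample string.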
Attached is a portion of the messages log from one of these disconnect/reconnect events.
Looking at the disk serial numbers in the logs across all events, it's always the same 4 disks (though sometimes only 2 of them), and they are all located in the vdev raidz2-0.
Interestingly, they are also the 4 disks attached to port #3 on the SAS expander. Suspecting a faulty cable or connection,
yesterday I replaced that cable with a new SFF-8087 terminated one, but it seems that did not help, as the issue still persists.
Disk SMART tests don't show any failure.
So, I'm a little puzzled by this multiple-disk failure (which doesn't seem to be permanent anyway, as the disks happily rejoin the pool shortly after). What, in your opinion, could be the cause, or at least, what would you rule out as a primary suspect?
Thanks