jolness1
Dabbler
- Joined
- May 21, 2020
- Messages
- 29
First off, Happy New Year (or almost New Year, depending on when and where you read this).
This morning I rebooted my MacBook to install an update. When it loaded into the desktop I got an error about being unable to connect to the NFS share on my TrueNAS box. I tried to connect manually and no dice. So I decided to load the Web GUI and see what was going on, maybe reboot the system. The Web GUI wasn't loading so I tried SSH, no luck there either. I then loaded into the IPMI, where everything was working fine, so I rebooted, figuring it was just some weird edge-case bug that I have a knack for finding.
Upon reboot I got "failed command: WRITE FPDMA QUEUED" errors on ATA 10 (one of 2 SATA DOM ports on my board). I tried rebooting again, and the drive was not detected and the board tried to boot from LAN (which of course doesn't work). I rebooted again, manually loaded the boot menu, and the drive was there this time. After several cycles of this I decided to boot using the debug mode, and got an "error: checksum verification failed" while loading the initial ramdisk.
At this point I suspected the SATA DOM had failed. That would be surprising as it's only a few years old but, again, I have a knack for having uncommon issues. No worries, I have a spare that I had been meaning to set up as a mirror (which I have a hunch wouldn't have helped here, but I still should have done it). Upon trying to install TrueNAS SCALE to this new SATA DOM (in ATA10, or port I-SATA 5 as it's called in Supermicro's manual) I got a "COMRESET failed (errno=-16)", which I had seen earlier as well. Now I began to suspect the port was dead. No big deal, there is another port (I-SATA 4 or ATA9) for a DOM, so I tried that. This yielded the same issue.
Out of curiosity I tried to install TrueNAS anyway on the new SATA DOM, on both ATA9 and ATA10, with the same result (having unhooked all other drives in my system, the data pool and an SSD I use for running VMs and applications, just to be safe). It appears the commands go through sometimes, as it will eventually begin to install and then throw errors about failed commands again.
At this point I am thinking the mainboard has an issue. The Supermicro documentation says these SATA ports run off the chipset, so I am thinking the failure may be there: I know both ports worked when I built this server 3.5 years ago, and it seems unlikely (albeit possible) that they both failed. Visually there are no issues I can see with the ports or anything else on the board.
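One way to make the "does the error follow the port or the module?" comparison less anecdotal is to tally the kernel messages per ATA port across the different boot attempts. Below is a minimal Python sketch, assuming the log lines look like the two messages quoted above with the kernel's usual `ataN` or `ataN.00` prefix. The function name `tally_ata_errors` and the sample lines are hypothetical, made up for illustration rather than copied from this system's logs.

```python
import re
from collections import Counter

# Assumed line shapes (based on the two errors quoted in this post):
#   "ata10.00: failed command: WRITE FPDMA QUEUED"
#   "ata10: COMRESET failed (errno=-16)"
ATA_LINE = re.compile(
    r"\bata(\d+)(?:\.\d+)?: (failed command: .+|COMRESET failed.*)"
)


def tally_ata_errors(log_text):
    """Count matching error messages per (port, message) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ATA_LINE.search(line)
        if match:
            port = "ata" + match.group(1)
            message = match.group(2).strip()
            counts[(port, message)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample text; replace with saved kernel log output.
    sample = "\n".join([
        "ata10.00: failed command: WRITE FPDMA QUEUED",
        "ata10: COMRESET failed (errno=-16)",
        "ata9: COMRESET failed (errno=-16)",
    ])
    for (port, message), n in sorted(tally_ata_errors(sample).items()):
        print(port, n, message)
```

If the same messages show up on both ata9 and ata10 regardless of which DOM is plugged in, that points away from a single module and toward something the two ports share, though it would not by itself prove the chipset is at fault.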
Apologies for the long-winded explanation; I have been at this for a long time and am trying to make sure I give as much information as possible. I would like to be able to conclusively (or somewhat conclusively) diagnose the issue before firing the parts cannon at the problem. A replacement board is going to cost more than it did a few years ago, but an entirely new system with new RAM and CPU looks like it'll be substantially more. With the layoff (part of being in software right now, it seems) I am extra conscious of fixing this issue as cheaply as possible, but I also want to actually fix it in a way that isn't "I'll just jury-rig this until it blows up again". Photos attached are in chronological order. Some are somewhat redundant, but this way any further information I have is there in case it helps someone who knows more than I do point me in the right direction.
I have the following questions:
1) Both SATA DOM modules I have come with external power cables (one uses USB2 header power, the other the port Supermicro puts on the board specifically for the SATA DOM). Would I be able to try them in another SATA port? It seems the answer is yes, and since power is all the DOM is looking for, there shouldn't be any risk, but I don't want to totally toast the board.
2) Any tips for diagnosing the chipset issue more definitively?
3) Should I just get a boot SSD and run it off the HBA, or, if there is an issue with the chipset, should I just move on from this motherboard before I risk it causing an issue with the data? I have it backed up onsite and offsite, but if I can avoid some sort of issue, I would prefer that; there are enough things that can go wrong even when observing best practices.
Thanks in advance. Please let me know if I can clarify anything, and apologies again for the ramble; the combination of exhaustion and panic is not helping me at all at the moment.
System Specs (I realize some may not be relevant but just want to make sure I am giving any info that might help):
TrueNAS SCALE 23.10.1 (*edit* forgot to put this on)
Supermicro X10DRL-i
Xeon E5-2660 v3
64GB (4x16GB) ECC 2133 MHz Samsung DDR4 (M393A2G40DB0)
LSI SAS3008 HBA card (data pool hooked up to this)
5 HGST He12 Drives
Attachments
- Screenshot 2023-12-31 at 10.29.45 AM.png (312.5 KB)
- Screenshot 2023-12-31 at 11.34.18 AM.png (20.1 KB)
- Screenshot 2023-12-31 at 11.02.08 AM.png (223.4 KB)
- Screenshot 2023-12-31 at 11.01.00 AM.png (308.8 KB)
- Screenshot 2023-12-31 at 10.47.06 AM.png (240.2 KB)
- image.png (73.2 KB)