Eating boot drives

Urumiko · Feb 15, 2023

Right. Game plan: There are 2 SATA controllers onboard, but one hosts the 4 data drives and the other only has a single port.
There is potential for the long SATA cable to get caught in the case. Will do as follows:

Try another SATA cable
Try updating controller firmware and re-install
Try diagnostics on the drive in another machine or using SystemRescue.
Possibly try yet another drive
Use internal USB port to host USB SSD as opposed to mem stick.

Sound plan like?

Whattteva · Feb 15, 2023

WI_Hedgehog said:
Except you lose S.M.A.R.T. reporting because TrueNAS or smartctl will poll the array instead of the drives.

Good point. That is indeed the trade-off for that setup.

Urumiko · Feb 15, 2023

Hmm, OK. So I tried a new SATA cable and initially thought it had fixed the issue (either that, or having cooled down while powered off helped). It booted and accepted the config OK, but then I found it was rebooting after a few minutes. It boots fine every time and works for a few minutes, but then reboots. It is difficult to catch the console output, but I think it detects the swap partition going offline or becoming corrupt and reboots.
So yup... drive or controller issue, I am guessing. Will persevere...

WI_Hedgehog · Feb 16, 2023

Urumiko said:
Right. Game plan: There are 2 SATA controllers onboard, but one hosts the 4 data drives and the other only has a single port.
There is potential for the long SATA cable to get caught in the case. Will do as follows:

Try another SATA cable

Try updating controller firmware and re-install

Try diagnostics on the drive in another machine or using SystemRescue.

Possibly try yet another drive

Use internal USB port to host USB SSD as opposed to mem stick.

Sound plan like?

I'm not sure if the initial install went in clean, what hardware/software/firmware you're running and if there are conflicts, if cooling has always been adequate or something got cooked at some point (including thermal compound), the "internal dust volume" situation, if all your power filter capacitors are in good shape, how well-controlled the environment is (especially humidity), the quality of your add-on parts (did you use discount-barrel power breakout cables?), etc.

Bottom line: If you didn't burn-in your system, off-load your data to a backup server and build this server from scratch, assuming every component is junk until proven otherwise. I assume every brand-new-in-the-box straight-from-the-OEM piece of equipment is defective until tested and proven good, at which point it gets an Asset Tag with its life-history logged in an Excel spreadsheet (because spreadsheets always work, database software becomes outdated in 7 years and is no longer accessible). Most people won't invest the time up-front doing this, but lemmetellyou,* this is an invaluable resource when troubleshooting complex systems.

---
*lemmetellyou (verb): A verbal cue signalling importance. Used when the speaker is fairly certain the information being conveyed will be ignored, to the detriment of the intended recipient.

Urumiko · Feb 16, 2023

The system was bought new about 10 years ago, so yes, it is reaching the age where caps can go squiffy, but it would be odd for that to manifest in this way only. I am handy with a multimeter etc., but I usually work on old valve amps. Temps, dust etc. all good. System has been in the same physical config all that time: 4x NAS drives, 16 GB HP ECC RAM, and a single boot drive in the CD bay. I originally used USB/onboard microSD, which was obviously not suitable. It's difficult to recall the life span of the SATA drives I used, but I think they were decent. I'll check if the iLO or BIOS has onboard voltage readouts. Might also look into whether the BIOS supports PCIe boot. Probably not. Tonight's plan is to try the failed drives in other machines and run diags, and try a live USB stick in that machine and see if it can update the SATA firmware.

Urumiko · Feb 16, 2023

Oh. And I've ordered a third drive this time from a leading parts vendor instead of Amazon.

Whattteva · Feb 16, 2023

Urumiko said:
Oh. And I've ordered a third drive this time from a leading parts vendor instead of Amazon.

Be interested to know if that really is the issue. It would certainly really surprise me. Do keep us posted.

Urumiko · Feb 16, 2023

Whattteva said:
Be interested to know if that really is the issue. It would certainly really surprise me. Do keep us posted.

It's a conundrum. I really thought it was due to something paging over and over and wearing out the flash, but I don't know enough about the OS or flash drives to know if that's a valid concern. Don't know if the SATA controller firmware could have an impact on that? I'm pretty sure the last time I installed, before all this trouble, I had no swap partition set up for that reason. I am reassured by the number of people having no such issues, though.

Urumiko · Feb 16, 2023

Hi all, thanks for your continued support. I think I'm getting somewhere with this, but it's complicated, so bear with me...

SO, This is crystal disk output from a drive I thought had failed. This is the drive I was running with for quite some time before the recent bout of failures, and was the first I thought had failed:

I tried doing a full format in Windows, copying some large files over with TeraCopy, and verifying the copied data. Looks OK to me:

Back on the problem server / drive, I installed Windows 10,
ran updates, ran chkdsk, left it a while. Seems fine.

Unfortunately my BIOS is already the most recent and I cannot see any explicit firmware update for the storage controller in question. The storage array firmware which is available appears to be for a RAID add-in card, not the onboard chip.

I decided to pull the existing drive. Here is its SMART data:

This led me to my current rabbit hole.
When googling SATA CRC errors on seemingly non-broken kit, I came across posts talking about firmware incompatibilities between SSDs and the controller firmware... I saw suggestions of running the drive in IDE mode.

When I checked out my BIOS I saw that the SATA mode was already set to Legacy, so I changed it to AHCI.
Upon reboot I get a totally different firmware splash/readout, which is encouraging. However, weirdly, in this mode all 4 data drives and the boot drive now appear under 1 single controller, not 2 separate ones as before.
Then I remembered why it's set that way. The HP BIOS does not allow you to select which drive to boot from and will always just select the first drive on the selected controller. The standalone SATA port is actually just intended for an optical drive, despite everyone using it for an SSD, so they only expect a single boot pool.

So for the time being I have managed to get back online by booting the SSD attached to USB via a cheap caddy backplane and choosing USB boot. In theory, as the server has an internal USB port, I could live with this and get a slightly more robust adapter or purpose-built USB drive.

I plan to stay online in this configuration for a day or 2 and resilver before any more major upsets.
However, I'm looking at other options, but it's difficult to predict system behaviour.
I want to avoid HP RAID cards. I don't want to use hardware RAID on my data drives.
Any other form of PCIe storage controller would likely be non-bootable.

SO option 1:
I noticed that the data drives are actually connected via a mini-SAS to 4-port SATA breakout cable. I'm thinking that if I stay in AHCI mode I could, in theory, find the first drive on the controller by trial and error and swap its data cable with that of the boot drive. My understanding is ZFS won't care what order the drives are connected in and should adapt itself?

Option 2: Look into a simple non-bootable PCIe SATA card that TrueNAS supports, connect all my data drives to that, and leave only the boot drives on the onboard controller. I could even mirror if that worked.

Option 3:
Other devices... As a slightly more outlandish but fun idea: I have an older Gen 7 server waiting in the wings that I intend to use as a CCTV recorder, and I also have one of those AliExpress mini router platforms on its way.
It has a fair bit more CPU than the MicroServer. If there are fun things we could play with involving iSCSI, network boot, clustering, anything like that, I'm open to suggestions :)

If you read this far. Thank you for your interest and support, all opinions and advice greatly appreciated :)

Whattteva · Feb 16, 2023

Glad that it seems like you discovered a lot of stuff and made significant progress. It's looking more like an anomaly with HP systems than a problem with your drives.

Urumiko said:
SO option 1:
I noticed that the data drives are actually connected via a mini-SAS to 4-port SATA breakout cable. I'm thinking that if I stay in AHCI mode I could, in theory, find the first drive on the controller by trial and error and swap its data cable with that of the boot drive. My understanding is ZFS won't care what order the drives are connected in and should adapt itself?

Yes, ZFS is pretty flexible and as long as it has direct access to the disks, you can mix and match drives between different ports or even different controllers.

Urumiko said:
Option 2: Look into a simple non-bootable PCIe SATA card that TrueNAS supports, connect all my data drives to that, and leave only the boot drives on the onboard controller. I could even mirror if that worked.

I wouldn't advise this, as most PCIe SATA cards tend to be bad. HPE does make good LSI HBA cards though. In fact, I am using one in my signature for my data drives (HP H220). I got it for only $30 or so on eBay and it's pretty solid. I did have to cross-flash it to the latest LSI firmware though. You may be able to get away without cross-flashing since you're using it on an HP system, but my Supermicro board freezes up and refuses to boot with the stock HP firmware.

Urumiko said:
Option 3:
Other devices... As a slightly more outlandish but fun idea: I have an older Gen 7 server waiting in the wings that I intend to use as a CCTV recorder, and I also have one of those AliExpress mini router platforms on its way.
It has a fair bit more CPU than the MicroServer. If there are fun things we could play with involving iSCSI, network boot, clustering, anything like that, I'm open to suggestions :)

I'd honestly stay away from this option, but you're welcome to be adventurous. I would make sure to have a good backup strategy though.

WI_Hedgehog · Feb 17, 2023

Did you guys see how 01: Raw Read Error Rate and CC: Soft ECC Correction Rate match and jumped from 0 to D5A7F51?

Whattteva · Feb 17, 2023

WI_Hedgehog said:
Did you guys see how 01: Raw Read Error Rate and CC: Soft ECC Correction Rate match and jumped from 0 to D5A7F51?

Totally missed that. That's a massive jump. I think a jump from 0 to 224 million in such a short space of time is very unlikely to be caused by a bad SSD.
It's probably something with the controller, cables, RAM, CPU, or any combination of those.

WI_Hedgehog · Feb 17, 2023

Whattteva said:
Totally missed that. That's a massive jump. I think a jump from 0 to 224 million in such a short space of time is very unlikely to be caused by a bad SSD.
It's probably something with the controller, cables, RAM, CPU, or any combination of those.

And not even a 'Like', man you guys are stingy.

HoneyBadger · Feb 17, 2023

OCZ, like Seagate, uses a 48-bit value for their RAW_READ_ERROR_RATE.

The most-significant 16 bits are the error count, the least-significant 32 bits are the number of reads. So in this case, it's actually 0x00000D5A7F51, and dash-separated for visibility it's 0000-0D5A7F51 - or "zero errors in 224,034,641 reads" - and the latter 32-bit section loops around.

Once that first section jumps to 0001-0D5A7F51, then you have an error. In decimal value, once you break 2^32 (4,294,967,296) then be concerned.
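To make the arithmetic concrete, here is a minimal Python sketch of the split described above. It assumes the 48-bit layout exactly as stated in this post (upper 16 bits are the error count, lower 32 bits are the read count). The function name `decode_raw_read_error_rate` is made up for illustration and is not part of smartctl or any vendor tool.

```python
from typing import Tuple


def decode_raw_read_error_rate(raw: int) -> Tuple[int, int]:
    """Split a 48-bit raw SMART value into (error_count, read_count).

    Assumed layout (from the post above, not from a vendor spec):
      bits 47..32 -> number of errors
      bits 31..0  -> number of reads (wraps around at 2**32)
    """
    if not 0 <= raw < (1 << 48):
        raise ValueError("raw value does not fit in 48 bits")
    error_count = raw >> 32          # most-significant 16 bits
    read_count = raw & 0xFFFFFFFF    # least-significant 32 bits
    return error_count, read_count


# The value from the screenshot, padded out to 48 bits.
print(decode_raw_read_error_rate(0x00000D5A7F51))  # (0, 224034641)

# What the same read count would look like with one logged error.
print(decode_raw_read_error_rate(0x00010D5A7F51))  # (1, 224034641)
```

Any raw value below 2^32 therefore decodes to zero errors under this layout, which is why the decimal threshold above is the thing to watch.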

HoneyBadger · Feb 17, 2023

Urumiko said:
So for the time being I have managed to get back online by booting the SSD attached to USB via a cheap caddy backplane and choosing USB boot. In theory, as the server has an internal USB port, I could live with this and get a slightly more robust adapter or purpose-built USB drive.

This is actually the solution I would take. The discouraging of USB boot devices is more about the lack of endurance of the underlying NAND (discount thumbdrives aren't built to endure multiple P/E cycles) than the protocol itself. If you took the backplane out of its chassis, though, you'll want to insulate the electronics so it doesn't suffer a "spontaneous grounding event".

WI_Hedgehog · Feb 17, 2023

HoneyBadger said:
OCZ, like Seagate, uses a 48-bit value for their RAW_READ_ERROR_RATE.

The most-significant 16 bits are the error count, the least-significant 32 bits are the number of reads. So in this case, it's actually 0x00000D5A7F51, and dash-separated for visibility it's 0000-0D5A7F51 - or "zero errors in 224,034,641 reads" - and the latter 32-bit section loops around.

Once that first section jumps to 0001-0D5A7F51, then you have an error. In decimal value, once you break 2^32 (4,294,967,296) then be concerned.

Good to know. How'd the drive start out at 0, given:

Urumiko said:
SO, This is crystal disk output from a drive I thought had failed. This is the drive I was running with for quite some time before the recent bout of failures, and was the first I thought had failed:

View attachment 63693

@Whattteva : HB made an informative point, therefore...

HoneyBadger · Feb 17, 2023

WI_Hedgehog said:
Good to know. How'd the drive start out at 0, given ("This is the drive I was running with for quite some time before the recent bout of failures, and was the first I thought had failed")

That's a good question. I would surmise that OCZ might reset the RAW_READ_ERROR_RATE value on each power loss.

(It wouldn't be the first time an OCZ device lost data I would consider valuable.)

WI_Hedgehog · Feb 17, 2023

HoneyBadger said:
That's a good question. I would surmise that OCZ might reset the RAW_READ_ERROR_RATE value on each power loss.

(It wouldn't be the first time an OCZ device lost data I would consider valuable.)

I've only seen that when issuing a SCSI format command to the drive controller on some HP drives, however that is something to consider...

Whattteva · Feb 17, 2023

WI_Hedgehog said:
And not even a 'Like', man you guys are stingy.

Haha, sorry about that. There ya go. I'm not a grinch!

Whattteva · Feb 17, 2023

WI_Hedgehog said:
Good to know. How'd the drive start out at 0, given:

@Whattteva : HB made an informative point, therefore...
View attachment 63722

Way to rat me out, you snitch!
