HUH721010AL4200 Firmware Hell

schmidtd0039 · Feb 21, 2024

Hello everyone!

In light of the scare I had in https://www.truenas.com/community/t...-spanned-2xvdev-with-8x10tb-hdds-each.116288/, I'm combing through my NAS config to find things I missed in my original setup, as well as restoring my two hotspare drives that were consumed in the mentioned double resilver. Note that both these failed drives were HUH721010AL42C0's on A3Z4.

However, I've fallen into the HGST firmware rabbit hole.

In my NAS, I have 15 HUH721010AL42C0 drives and one HUH721010AL4200. Thankfully, both types use 4K sectors, but they are on mismatched firmware despite being sold as the same drive by the same reseller (deals2day). The HUH721010AL42C0 drives all match on firmware A3Z4; the HUH721010AL4200 is on A21D. I never noticed the one mismatched drive at the time of the build. From what I've read online, these are the exact same drives, and the "C" indicates Cisco branding and firmware. Easy enough, and it seems cross-flashing works fine.

---So my first question here: what is the BEST and safest way to upgrade the firmware on that single in-use drive? Do I have to remove it and resilver with another drive? Or can I shut down the NAS, pull the drive, update the firmware to A3Z4 (for matching), and put it back in without TrueNAS noticing? My concern is that flashing firmware, especially cross-"vendor", would harm the data or how TrueNAS recognizes the drive's membership in the pool.

On top of this decision, I'm having a very hard time finding change notes or any information on HGST firmware updates. In HDD Guru's files, the newest firmware revision listed for the HUH721010AL4200 line of drives is LHGNAB01; however, in my searches I don't see anyone using that version, while LHGNA9G0, from 2023, is referenced as people's go-to. This choice matters, as all three spare drives I just ordered from the same vendor came as HUH721010AL4200 with firmware A21D, which I see online has known issues. I'd like to address that before adding them to the pool as hotspares, but I'm on the fence as to whether they should be flashed to AB01, A9G0, A3Z4, or something else entirely.

---So second question: is it smarter for me to update these firmwares (overall) to A3Z4 to match the rest of my drives as a baseline? Or do I update my newest drives (and potentially the 1 rogue A21D) to the newest available AB01, and, depending on the risk/impact, maybe even flash the other 15 after taking some fresh, full backups?

Any advice here would be appreciated!

schmidtd0039 · Feb 21, 2024

So in a bout of curiosity, I've decided to flash two of my new hotspares, one to each FW rev: one to the timestamped-latest AB01 version, and the other to my existing Cisco A3Z4, using Hugo on Windows:
.\hugo update -s {drive SN} -f {firmware file}.bin

That went great from an early perspective, no issues there, and I've now started my burn-in process using BadBlocks with the drives mounted to the local WSL2 Ubuntu instance. Interestingly, the Generic AB01 firmware drive is running more than twice as fast, percentage-wise, as the Cisco A3Z4 one...
After 20 minutes of "sudo badblocks -svw -t random -b 4096 /dev/sdx", A3Z4 is at 1.2%, and AB01 is at 3.3%. Quite a significant disparity. Additionally, the disks are also reporting significantly different perf numbers - with the Cisco having much higher latency and slower performance than the Generic (see screenshot). The Cisco is performing like a 5400RPM drive from the early 2000s, and the Generic is performing like I'd expect a few-year-old DC drive to.
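
As a rough sanity check on those progress figures, here is a minimal Python sketch that turns a badblocks percentage and elapsed time into a projected time per pass and a speed ratio. It assumes progress is linear across the whole pass, which real drives only approximate (outer tracks are faster than inner ones), and the function name is purely illustrative.

```python
def projected_pass_hours(percent_done: float, elapsed_minutes: float) -> float:
    """Extrapolate the length of one full badblocks pass, assuming linear progress."""
    if percent_done <= 0:
        raise ValueError("percent_done must be positive")
    total_minutes = elapsed_minutes / (percent_done / 100.0)
    return total_minutes / 60.0


# Figures quoted in the post: 20 minutes into the first write pass.
a3z4_hours = projected_pass_hours(1.2, 20)
ab01_hours = projected_pass_hours(3.3, 20)
speed_ratio = 3.3 / 1.2

print(f"A3Z4: ~{a3z4_hours:.1f} h per pass")   # ~27.8 h
print(f"AB01: ~{ab01_hours:.1f} h per pass")   # ~10.1 h
print(f"AB01 is ~{speed_ratio:.2f}x faster")   # ~2.75x
```

With `-w` and a single `-t random` pattern, badblocks should do one write pass followed by one read-back pass, so a full run would be roughly double the per-pass figure for each drive.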

Unfortunately, these runs typically take a few days, so it'll be difficult to gather more data to ascertain whether these are just two wildly different-performing drives, whether one is having issues, or whether the firmware truly resulted in that degree of a performance bump. I did a quick SMART test and viewed the error counts on both before BadBlocks, and it was straight zeros on both, so no signs of early failure there.

After badblocks, I typically run a long-smart (which takes another day), and view the counts again to be sure the drive is GTG.

This time around, as long as my NAS doesn't lose a 3rd drive this month, I might experiment more on this by flipping the firmware versions on these two drives, and rerunning badblocks and some other tests to see if the performance moves with the firmware or stays with the drive.

If the firmware really does result in 3x the perf gains, boy do I want to update the 16 drives in my NAS to that version, as being on A3Z4 for my pool could very well be a contributing reason my 10TB drive resilvers take 60+ hours. That said, I'm still having trouble finding an answer to my first question: whether there's a best practice for updating the firmware of drives in an in-use pool, or whether anyone has experience doing so.

schmidtd0039 · Feb 22, 2024

Any suggestions here? The lack of response makes me feel like I might be in uncharted territory.

To update, the performance gap persists as badblocks is still running on both drives. The Generic has almost fully lapped the Cisco: the Generic is almost done with its read test, while the Cisco is still writing its first pass after 24 hours.

Arwen · Feb 22, 2024

Sometimes proprietary firmware does things differently than OEM firmware.

For example, Sun Microsystems used to reduce the size of all its OEM disks to a common size. This made replacing a Sun disk of a specific size trivial and no hassle at all.

Another example, is a hardware RAID-5 array that I worked on years ago, (and was years old even then). It did Read after Write for verification. This actually saved the application a bit, by activating spare sectors on write failures, keeping these hardware RAID-5 arrays functioning. (They were ignored before I took over, so I had a lot of cleanup to do, like replacing all the bad disks and activating new hot spares.)

Now in your case, it appears Cisco firmware has done something else.

For some of us forum people, your problem is outside of our experience. Sorry.

schmidtd0039 · Feb 24, 2024

Hi Arwen!

Thanks for your reply! Understandably, I may be treading too deep into uncharted waters, so I've taken my questions about the firmware differences, and which version is considered "golden", to HGST / WD support. Whether or not they'll help remains to be seen.

Regardless, I'm continuing to do testing on my 2 new hotspares with firmware differences to see what I can find out on my own - and updating here just to maybe help others that fall into this same journey.

ADDITIONALLY, in my testing on these new hotspares, I hit a "bug" on the Cisco A3Z4 test drive similar to what killed 2 drives in my primary array: a SMART Long test gets stuck and runs indefinitely, even after reboots and drive reseats (the test ran for 48 hours compared to the 17 it should take per others). All subsequent SMART tests fail due to this hung test, and drive latency spikes to 40000ms, making the drive unusable (reseating it briefly fixes it). The drive never dies or stops communicating, but disk activity grinds to a complete halt. I couldn't find a way to clear that bad test with the drives as-is; however, updating a "stuck" drive to AB01 immediately aborts the test, and new tests run and pass fine. Clearing an active SMART test could just be a side effect of any firmware update, but every other attempt to clear it failed, whereas the firmware update both cleared the bad test AND seemingly returned the drive to normal.
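
One simple way to catch a stuck long test early is to compare its elapsed time against the expected duration. The snippet below is only a sketch of that heuristic: the 1.5x slack factor and the function name are assumptions for illustration, not anything defined by SMART or smartmontools.

```python
def smart_test_looks_hung(elapsed_hours: float,
                          expected_hours: float,
                          slack: float = 1.5) -> bool:
    """Flag a self-test as likely hung once it runs past expected_hours * slack.

    The slack factor is an arbitrary, illustrative margin for a busy drive.
    """
    if expected_hours <= 0 or slack < 1:
        raise ValueError("expected_hours must be > 0 and slack must be >= 1")
    return elapsed_hours > expected_hours * slack


# Figures from the post: 48 h elapsed against roughly 17 h expected.
print(smart_test_looks_hung(48, 17))  # True
print(smart_test_looks_hung(20, 17))  # False, still within the margin
```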

I'm experimenting with this more as we speak. I've taken my 2 "dead" drives from the last 2 weeks, verified the hung SMART tests were still present and causing issues via my Windows test box, reseated them, and then committed them to firmware updates for AB01. Just like the hotspare I thought I had bricked but fixed, these two formerly bad drives seemingly came back to life - and as noted before, are performing better than on their prior firmware even.

I in no way trust these formerly problem drives to any capacity yet, this is more for the sake of science, but I'm letting a few runs of BadBlocks rip on them now to see if I can get any semblance of failure on them to repeat. (Long test to follow after a few days of burn-in)

However, this does make me worry greatly that my "lab production" drives running A3Z4 are "timebombed" or bugged in some way where they may all encounter this issue someday soon, making me more anxious and seeking answers.

Or, you know, I'm also debating just shelling out the $2K I'd need to build a second pool out of better-supported 20TB drives in mirrors instead, and leaving these HGSTs behind me. Not sure yet. What irks me is that this setup has been rock-solid for going on 3 years now, and suddenly I've had 2 back-to-back live drives have issues, and one "fresh" spare do the same....

*I also noted that, at least with Windows volumes and some simple/dumb data loaded on them, data is not lost when cross-flashing between A3Z4 and AB01: SMART tests still pass fine and the data is accessible. Drives are recognized with exactly the same sector and storage size between versions; only the Model Number and how some SMART stats are displayed seem to change. I have not yet tested this with TrueNAS data, but intend to. I think once I wrap up what I'm doing now (below), I will build a test TrueNAS server on some older hardware using these non-currently-used drives, and test my firmware updates there on a test pool and SMB share to see what behavior I can expect.

Tentative process to validate:

Take a fresh, full offline backup on my external drives, just in case
Power down the NAS
Remove one drive
Update single drive on a secondary Windows system (that I use for running backups off the share) with HGST-Hugo to AB01, reboot, short-SMART test to verify, and add back to the NAS. Power back up and make sure the pool doesn't degrade, and the disk is still recognized as an active member but on new firmware. (Maybe even test some writes to and from the pool and check the stats on that specific drive)
Depending on my anxiety after that, update another one from the opposite vDev, test and verify again.
Then maybe start doing two at a time, one from each vDev to flesh out the rest of the flashes.
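
The rollout order in the steps above can be sketched as a small batch planner. This is only an illustration under stated assumptions: the pool is taken to be two vdevs of eight drives each (per the linked thread's title), the drive labels are made up, and `plan_flash_batches` is a hypothetical helper, not part of any TrueNAS or Hugo tooling.

```python
from itertools import zip_longest


def plan_flash_batches(vdev_a, vdev_b, solo_rounds=1):
    """Order drives for flashing: single drives first, then one per vdev at a time.

    No batch ever contains more than one drive from the same vdev.
    """
    a, b = list(vdev_a), list(vdev_b)
    batches = []
    # Cautious phase: flash one drive at a time, alternating between vdevs.
    for _ in range(solo_rounds):
        if a:
            batches.append([a.pop(0)])
        if b:
            batches.append([b.pop(0)])
    # Faster phase: pair the remaining drives, one from each vdev per batch.
    for pair in zip_longest(a, b):
        batches.append([drive for drive in pair if drive is not None])
    return batches


# Hypothetical drive labels for a 2 x 8 pool.
vdev_a = [f"A{i}" for i in range(1, 9)]
vdev_b = [f"B{i}" for i in range(1, 9)]
batches = plan_flash_batches(vdev_a, vdev_b)
for step, batch in enumerate(batches, start=1):
    print(step, batch)
```

Limiting each batch to one drive per vdev means a bad flash costs any vdev at most one member at a time, which the vdev's redundancy should be able to absorb.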

Arwen · Feb 25, 2024

One thing I did on my prior 4-disk RAID-Z2 pool was to scatter purchases and models for the drives:

Buy 1 WD Red 4TB, (CMR), retail
Order 1 WD Red Pro 4TB mail order
Wait, then buy 1 WD Red 4TB, (CMR), retail but in bulk packaging
Wait, order 1 WD Red Pro 4TB mail order

The intent was to play with the software in the early days, before I had all the disks, and to determine whether I wanted two 2-disk mirrors or RAID-Z2. Same amount of storage, different failure modes.

Then, with scattered purchases and models, I had less to worry about in single model failure. Like due to firmware.

My new NAS does something similar.

Now can everyone do this?
Perhaps not.

Is this going to guarantee safety with the disks?
Of course not.

Can purchasing different models, or even different vendors, cause problems?
Absolutely. It is a known fact that Seagate, Western Digital and others don't use the exact same number of blocks in a given size tier. So a 10TB disk from one vendor might not be able to replace a 10TB disk from another vendor if the replacement disk is even 1 block smaller. (TrueNAS sometimes has swap on data disks, which can compensate for that problem...)

With only a few vendors, WD, Seagate and Toshiba, we have a lot fewer options to scatter models.
