Running into read/write timeouts during burn-in

kevitech

Cadet
Joined
Oct 27, 2022
Messages
3
Hi all,

I'm building my second TrueNAS system and trying to use server-grade hardware this time around. The machine I built is inspired by this [other post](https://www.truenas.com/community/threads/will-this-ryzen-build-freenas-and-should-i-go-with-scale-over-core.99493/) because I want a relatively compact NAS.

Specs:
- CPU: Ryzen 9 5900X
- Motherboard: ASRock Rack X570D4U
- RAM: 2x16GB Kingston KSM32ES8/16MF
- Case: Fractal Node 804
- PSU: EVGA G6 850W
- OS: currently TrueNAS-SCALE-22.12-BETA.2 (originally was on the "stable" version)
- Drives: 5x 4TB WD Reds (4x WD40EFZX, 1x WD40EFRX) all currently plugged into SATA ports on motherboard

I bought the drives over a period of several months as they went on sale. I have five new drives (the plan is to have 6 in the array, but the other two are in a mirror in the existing NAS). Everything is built; I ran memtest for a day or two, ran all the SMART tests, and have been trying to burn in the drives using badblocks as described in the [burn-in test guide](https://www.truenas.com/community/resources/hard-drive-burn-in-testing.92/).

The problem I am running into is that when I run badblocks on all five drives at once, only one of them (always the same drive) completes a write pass over the entire drive. The rest run for some amount of time and then hang.
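For readers who want to reproduce the parallel run, here is a minimal sketch of how one badblocks process per drive might be started. The function name `burn_in`, the log directory, and the device names are hypothetical, and `-b 4096` is an assumption chosen to match the drives' 4096-byte physical sectors; the exact flags used in the original run are not stated in this post. The sketch defaults to a dry run because `badblocks -w` destroys all data on the target.

```shell
# Start (or just print) one destructive badblocks run per device.
#   $1 = "run" to execute, anything else for a dry run
#   $2 = directory that receives one bad-block list per drive
#   remaining arguments = block devices to test
# badblocks flags: -w destructive write test, -s show progress,
# -v verbose, -b block size in bytes, -o output file for bad blocks.
burn_in() {
    mode="$1"; logdir="$2"; shift 2
    for dev in "$@"; do
        out="$logdir/$(basename "$dev").bad"
        if [ "$mode" = "run" ]; then
            # WARNING: this erases everything on $dev.
            badblocks -b 4096 -wsv -o "$out" "$dev" &
        else
            # Dry run: only show what would be executed.
            echo "badblocks -b 4096 -wsv -o $out $dev"
        fi
    done
    # Block until every background badblocks process has exited.
    wait
}
```

Calling `burn_in dry /tmp/burnin /dev/sdX /dev/sdY` prints the two commands without touching the disks; replacing `dry` with `run` launches them in the background and waits for all of them.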

Code:
dmesg --level=emerg,alert,crit,err

[375916.109136] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0xd0000 action 0x6 frozen
[375916.116964] ata2: SError: { PHYRdyChg CommWake 10B8B }
[375916.122522] ata2.00: failed command: WRITE DMA EXT
[375916.127739] ata2.00: cmd 35/00:00:00:26:e1/00:02:16:00:00/e0 tag 12 dma 262144 out
                         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
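
The kernel log names the failing link as `ata2`, not as a `/dev/sdX` device. On Linux the resolved sysfs path of a disk usually contains its `ataN` port, so a small helper can map one to the other. This is a sketch under that assumption; the function names are made up for illustration and the exact path layout can differ between controllers.

```shell
# Extract the "ataN" component from a resolved sysfs device path.
# Prints nothing if the path contains no such component.
ata_port_from_path() {
    printf '%s\n' "$1" | sed -n 's#.*/\(ata[0-9][0-9]*\)/.*#\1#p'
}

# Print "ataN sdX" for every sd* disk the kernel currently exposes.
list_ata_ports() {
    for blk in /sys/block/sd*; do
        # Skip the literal pattern when no sd* disks exist.
        [ -e "$blk" ] || continue
        port=$(ata_port_from_path "$(readlink -f "$blk")")
        if [ -n "$port" ]; then
            echo "$port $(basename "$blk")"
        fi
    done
}
```

Running `list_ata_ports` on the affected machine should show which disk sits behind each port reported in `dmesg`.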


Smartctl info for one of the bad drives (after running badblocks):
Code:
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFZX-68AWUN0
Serial Number:    WD-WX52DB1E1REX
LU WWN Device Id: 5 0014ee 26a3b56e3
Firmware Version: 81.00B81
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is:    Wed Oct  5 22:55:22 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


I notice that the SATA link speed drops to 1.5 Gb/s. Disk usage drops to near zero for those drives, and the regular SMART checks that TrueNAS runs report errors because they are unable to read the SMART attributes. On restart, everything is back to normal.
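
The link-speed drop can be checked per drive without reading the whole report. Below is a minimal sketch that parses the `SATA Version is:` line of `smartctl -i` output, in the format shown above, and reports whether the current speed differs from the maximum. The function names are hypothetical and the parsing assumes that exact line format.

```shell
# Read `smartctl -i` output on stdin and print "<max> <current>" in Gb/s,
# e.g. "6.0 1.5". Prints nothing if no matching line is found.
link_speed() {
    sed -n 's/^SATA Version is:.*, *\([0-9.]*\) Gb\/s (current: *\([0-9.]*\) Gb\/s).*/\1 \2/p'
}

# Exit status 0 if the link runs below its maximum speed,
# 1 if it runs at full speed, 2 if the speeds could not be parsed.
link_downgraded() {
    speeds=$(link_speed)
    [ -n "$speeds" ] || return 2
    # Split "<max> <current>" into $1 and $2.
    set -- $speeds
    [ "$1" != "$2" ]
}
```

A possible use is `smartctl -i /dev/sdX | link_downgraded && echo "link renegotiated below maximum"`, run for each drive while badblocks is active.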

Troubleshooting steps I've taken:
* Reseated all SATA cables on both sides
* Swapped the good SATA connection on the mobo with a bad one, with no effect: the good drive still works and the bad drive still does not, so it's not a bad port
* Tried Ubuntu and TrueNAS Core, and have now also updated to the latest SCALE beta to get the latest kernel
* Booted to Windows to run WD software (thought I would be able to update the firmware)
* Contacted WD support; they told me to talk to the vendor

I'm not sure if it's a red herring, but the one drive that works is the WD40EFRX, which is the WD Red from before the rebranding to "Plus". It has a newer firmware than the newer Plus drives. WD says they do not provide firmware updates for drives. In any case, I'm not really sure what else to try at this point. It seems pretty unlikely that all four of those drives are bad, but I don't know what other steps I should take to understand and resolve the problem.

Please let me know if there is other information that I should post for more context. This is my first time posting, though I've been lurking on and off for a while.

Thanks!
 

kevitech

Cadet
Joined
Oct 27, 2022
Messages
3
Since I posted, I've now also done these steps:

* Updated BIOS to latest version which came out a few weeks ago
* Moved one "bad" drive over to my existing NAS, and the drive is working fine during a burn-in (badblocks). It's gotten a full drive write done and is currently reading/comparing (it never gets anywhere near this far in the test in the new NAS). So the drives are likely to be fully functioning, just incompatible somehow
* Created a support case with ASRock today with my current findings. I haven't heard back with any solutions yet, but they may try to replicate my issue

I was hoping I could avoid using an HBA because I have all the SATA ports I need on the board already, and I also don't like the idea of spending $400+ on a server motherboard for a NAS but it can't natively handle IO for HDDs. I would love to know what the reason is for this behavior, because the plan for this NAS is to handle mostly everything in my house so I would like for it to be rock solid.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I was hoping I could avoid using an HBA because I have all the SATA ports I need on the board already, and I also don't like the idea of spending $400+ on a server motherboard for a NAS but it can't natively handle IO for HDDs.

It is unclear that you've really purchased a "server motherboard". The SATA controllers known to work correctly are mostly Intel PCH or Intel SCU based. There are some others. I don't really expect the AMD PCH to be faulty, but on the other hand, it's an unknown part and you're having trouble.
 

kevitech

Cadet
Joined
Oct 27, 2022
Messages
3
That's a fair statement. I'm very curious to hear what the issue is and if I find anything out I'll post back, but in the meantime I'll get an HBA. While I wouldn't necessarily expect the onboard SATA controller to be as performant as an HBA, I would at least expect that it could handle sustained writes from a single drive without a failure.

Looking at getting https://www.ebay.com/itm/162862201664 (LSI 9217-8i) and 2x https://www.ebay.com/itm/163903342497 (4x SATA breakout). Might be more expensive than other options but I'm fine with paying extra for what appears to be validation and responsive support.
 