new setup, immediate disk errors

mpainter701

Dabbler
Joined
Nov 4, 2022
Messages
11
New setup, Supermicro chassis, NVMe disks.
Tried NFSv3, NFSv4, and iSCSI, but whenever I put data on the pools (2 pools, different disks) they spit out these disk errors regardless.

Originally I was getting 'degraded' states, etc.; it seemed all over the place and felt so flaky I figured it was a firmware bug. I've updated the host to the latest firmware and TrueNAS to the 11/1/22 patch, rebuilt the pools, migrated cold data over, and let it sit overnight, and got the error below and the status printout. It will probably go degraded if I put hot data on it. I'm not convinced my enterprise drives, from 2 brands, are all failed. Any thoughts?

Pool Main-Pool state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.



root@truenas[~]# zpool status -v
  pool: Backup-pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:09:46 with 0 errors on Fri Nov 4 06:41:51 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        Backup-pool                                     ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/a18b6df2-5acc-11ed-b96f-3cecefbffbd4  ONLINE       0     0     4
            gptid/a18c2c3f-5acc-11ed-b96f-3cecefbffbd4  ONLINE       0     0     4

errors: No known data errors

  pool: Main-Pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:02:21 with 0 errors on Fri Nov 4 06:29:38 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        Main-Pool                                       ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/ed66172b-5bba-11ed-a060-3cecefbffbd4  ONLINE       0    30     0
            gptid/ed682fe1-5bba-11ed-a060-3cecefbffbd4  ONLINE       0    28     0
            gptid/ed6606ef-5bba-11ed-a060-3cecefbffbd4  ONLINE       0    32     0
            gptid/ed695fe2-5bba-11ed-a060-3cecefbffbd4  ONLINE       0    28     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:06 with 0 errors on Fri Nov 4 03:45:06 2022
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          ada1p2    ONLINE       0     0     0

errors: No known data errors
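
For reference, the 'action' text above maps to commands like these once a root cause has been found and fixed (a sketch only; clearing counters without fixing the cause just hides the problem):

zpool clear Main-Pool        # reset the READ/WRITE/CKSUM counters
zpool scrub Main-Pool        # force a full verification pass
zpool status -v Main-Pool    # see whether errors reappear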
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Welcome to the forums!

Crystal ball not working today - please supply full info as requested in the Forum Rules (red link on masthead) in order to give background material for your issue. Please be sure to identify drive interfaces (and what's connected to what if you have add-on cards).
 

mpainter701

Dabbler
Joined
Nov 4, 2022
Messages
11
  • Motherboard make and model: Supermicro X12SPO-NTF
  • CPU make and model: 2x Intel(R) Xeon(R) Gold 5317
  • RAM quantity: 128GB DDR4
  • Hard drives, quantity, model numbers, and RAID configuration, including boot drives:
    Boot: Supermicro 64GB SATA DOM
    Pool A: 4x Intel DC P4510 8TB (SSDPE2KX080T8)
    Pool B: 2x Dell DC NVMe PE8010 RI U.2 3.84TB
  • Hard disk controllers: N/A
  • Network cards:
    Supermicro TrueNAS box has 2x integrated 10GBase-T NICs
    Switch has 10Gb SFP+ ports with copper 10Gb modules
  • MLAG setup with LACP and jumbo frames enabled; also tried 9000 MTU on the NAS side (a rough NAS-side sketch follows after the port config)
  • Port config:
    interface agg4
      description NAS-MLAG-LACP
      switchport mode trunk
      switchport trunk native vlan 150
      switchport trunk allowed vlan all
      mlag 4
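
For context, a rough FreeBSD-side counterpart to that switch config might look like the lines below. A sketch only: TrueNAS normally builds the lagg through its web UI, and the interface names ix0/ix1 are assumptions for the two onboard 10GBase-T ports.

ifconfig lagg0 create
ifconfig lagg0 laggproto lacp laggport ix0 laggport ix1 mtu 9000 up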


 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
they spit out these disk errors regardless.

Did the disk I/O systems test okay during your burn-in testing? The time to isolate problems with a new system, especially if you are unfamiliar with the hardware, is during the burn-in phase; loading TrueNAS onto a questionable platform and hoping for the best is just asking for disaster.

 

mpainter701

Dabbler
Joined
Nov 4, 2022
Messages
11
I guess I ran SMART and moved on after

SMART overall-health self-assessment test result: PASSED

came back on all of them...

It's not a makeshift box; I bought all enterprise hardware, and the server was OEM built.
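
For what it's worth, the overall-health flag alone says very little. A fuller check would look something like the lines below; a sketch only, since the device names are assumptions and the long self-test requires a drive that implements the optional NVMe self-test plus a reasonably recent smartmontools.

for d in /dev/nvme0 /dev/nvme1 /dev/nvme2 /dev/nvme3; do
    smartctl -a "$d"             # full health attributes and error log, not just PASSED
done
smartctl -t long /dev/nvme0      # long self-test, where the drive supports it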
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I guess I ran SMART and moved on after
SMART overall-health self-assessment test result: PASSED
came back on all of them...

So what this tells me is that you did next to no meaningful testing.

It's not a makeshift box; I bought all enterprise hardware, and the server was OEM built.

I once had Dell send me a massive 4U server with eight external SCSI busses (back in the day) which were intended to drive eight shelves each holding nine 72GB drives; this literally was a full rack of server. Dell specifically sold us a PERC controller that was supposed to be able to cope with this much I/O. I wrote the original solnet-array-test script because it turned out that the PERC was crap and couldn't maintain I/O rates. When Dell then gave us trouble about returning the server, I blacklisted them and have refused to buy their server products for 22 years now.

Your purchase of "enterprise hardware" that was "OEM built" is not particularly persuasive as to its correctness or quality. I bought an X12 BTO system from Supermicro earlier this year and was annoyed to find that they had installed some parts that weren't to spec. This was really annoying but parts availability in the channel is really screwed to hell right now and a BTO was the only way I could even get the (fairly specific and unusual) stuff I wanted.

Stuff happens. Your system is not guaranteed to work as you expect just because you ordered it from somewhere. Most companies who are in my line of work will throw stuff together, do basic testing, box it, and ship it as fast as they can. I've been told that the weeks of burn-in we do around here is excessively paranoid. It is, except when it catches a problem, which does indeed happen now and then.
 

mpainter701

Dabbler
Joined
Nov 4, 2022
Messages
11
Fair enough, I've been reading through the burn-in test guides.

Doing the longer disk tests now; will do the memory test over the weekend.
 

mpainter701

Dabbler
Joined
Nov 4, 2022
Messages
11
So I did a burn-in:

memtest, CPU test, and I got a perfect diagnostic report from Supermicro's tool, since I use their board.
I also checked for firmware patches on the NVMe disks; I'm on the latest.

SAME ISSUE persists!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680

mpainter701

Dabbler
Joined
Nov 4, 2022
Messages
11
Yeah, but I've followed the documented burn-in methods you all recommend; in addition, I ran this tool here for exactly that, to see if there are some obvious known issues going on.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yeah, but I've followed the documented burn-in methods you all recommend; in addition, I ran this tool here for exactly that, to see if there are some obvious known issues going on.

I haven't seen any results from solnet-array-test here; I would generally expect it to pick up on many of the same issues that ZFS might pick up on during stressy concurrent I/O. As I previously said, the time to shake out problems is during burn-in. If you think you passed burn-in but then are having problems running TrueNAS, my professional opinion is that you didn't try sufficiently hard during the burn-in.
 

mpainter701

Dabbler
Joined
Nov 4, 2022
Messages
11
I switched to Ubuntu with OpenZFS, and we've not run into any disk or checksum errors, whereas with TrueNAS it happened within the first 10 minutes of pushing data: Pool A would always give disk write errors, and Pool B would give checksum errors.

I didn't post the logs for burn-in as it found nothing. I'm convinced there's some NVMe driver/FreeBSD stuff going on here with the TrueNAS-provided kernel. I'm just a network engineer though; I don't know jack about this stuff, and frankly all the guides here are a 1/4" too deep for me to process entirely, or are not 1:1 with what I'm doing. Maybe I'm losing something in translation.

My end goal is definitely to use TrueNAS, but I'm lost on how to proceed. Burn-in tests don't make sense for me to keep running if it works fine on another OS.


Ubuntu:
root@storage:/alpha# zpool status
  pool: alpha
 state: ONLINE
config:

        NAME         STATE     READ WRITE CKSUM
        alpha        ONLINE       0     0     0
          raidz1-0   ONLINE       0     0     0
            nvme0n1  ONLINE       0     0     0
            nvme1n1  ONLINE       0     0     0
            nvme2n1  ONLINE       0     0     0
            nvme3n1  ONLINE       0     0     0

errors: No known data errors

  pool: beta
 state: ONLINE
config:

        NAME         STATE     READ WRITE CKSUM
        beta         ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            nvme4n1  ONLINE       0     0     0
            nvme5n1  ONLINE       0     0     0

errors: No known data errors
root@storage:/alpha#
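
For anyone reproducing the A/B test, pools shaped like the ones above would be created on Ubuntu with something like the following (a sketch; the poster's exact options, e.g. ashift or compression, are unknown):

zpool create alpha raidz1 nvme0n1 nvme1n1 nvme2n1 nvme3n1
zpool create beta mirror nvme4n1 nvme5n1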


versus endless logs like this on TrueNAS:

Nov 8 09:03:54 truenas nvme2: WRITE sqid:1 cid:119 nsid:1 lba:2009928616 len:88
Nov 8 09:03:54 truenas nvme2: DATA TRANSFER ERROR (00/04) sqid:1 cid:119 cdw0:0
Nov 8 09:03:54 truenas nvme2: async event occurred (type 0x0, info 0x04, page 0x01)
Nov 8 09:03:54 truenas nvme2: WRITE sqid:2 cid:125 nsid:1 lba:2253114024 len:32
Nov 8 09:03:54 truenas nvme1: WRITE sqid:1 cid:123 nsid:1 lba:2253113992 len:24
Nov 8 09:03:54 truenas nvme2: async event occurred (type 0x0, info 0x04, page 0x01)
Nov 8 09:03:54 truenas nvme2: DATA TRANSFER ERROR (00/04) sqid:2 cid:125 cdw0:0
Nov 8 09:03:54 truenas nvme1: async event occurred (type 0x0, info 0x04, page 0x01)
Nov 8 09:03:54 truenas nvme1: DATA TRANSFER ERROR (00/04) sqid:1 cid:123 cdw0:0
Nov 8 09:03:54 truenas 1 2022-11-08T09:03:54.397333-06:00 truenas.np.run smartd 2644 - - Device: /dev/nvme1, number of Error Log entries increased from 5009290 to 5009292
Nov 8 09:33:54 truenas 1 2022-11-08T09:33:54.530773-06:00 truenas.np.run smartd 2644 - - Device: /dev/nvme3, number of Error Log entries increased from 124 to 135
Nov 8 09:33:54 truenas 1 2022-11-08T09:33:54.686813-06:00 truenas.np.run smartd 2644 - - Device: /dev/nvme2, number of Error Log entries increased from 173 to 180
Nov 8 09:33:54 truenas 1 2022-11-08T09:33:54.686992-06:00 truenas.np.run smartd 2644 - - Device: /dev/nvme1, number of Error Log entries increased from 5009292 to 5009302
Nov 8 09:33:54 truenas 1 2022-11-08T09:33:54.687146-06:00 truenas.np.run smartd 2644 - - Device: /dev/nvme0, number of Error Log entries increased from 177 to 183
Nov 8 09:33:54 truenas nvme0: WRITE sqid:2 cid:122 nsid:1 lba:2487692184 len:88
 

mpainter701

Dabbler
Joined
Nov 4, 2022
Messages
11
Also FWIW, it does appear Ubuntu allows faster file transfers as well; I was never able to get over 1.2 GB/s write speeds on TrueNAS, whereas it appears to have doubled here.

[attached screenshots: file transfer speeds]
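
A more repeatable comparison than eyeballing file transfers would be running the same fio job on both OSes; a sketch, with the dataset path and sizes as placeholders (fio is packaged for both FreeBSD and Ubuntu):

fio --name=seqwrite --directory=/mnt/Main-Pool --rw=write \
    --bs=1M --size=16G --numjobs=4 --group_reporting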
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Burn-in tests don't make sense for me to keep running if it works fine on another OS.

When your car engine isn't working the way you expect when everyone else's is, you don't take the car on a cross-country trip. You keep it in the garage and pursue the problem until you discover which thing is being problematic. If you don't think that makes sense, then I can't help you.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
What specifically do you want me to do for us to proceed?

I STILL haven't seen any results from solnet-array-test here; I would generally expect it to pick up on many of the same issues that ZFS might pick up on during stressy concurrent I/O. As I previously said, the time to shake out problems is during burn-in. If you think you passed burn-in but then are having problems running TrueNAS, my professional opinion is that you didn't try sufficiently hard during the burn-in.

So what I want is to cause problems to pop up outside of the NAS environment. The presence of SMART errors for multiple devices is cause for concern and is *sort* of what I'm referring to; we should look at those, but I'm not really an expert on the P4510's. solnet-array-test is a somewhat tepid test in that it only does reads, but it is looking for things that "are unequal", which can be very helpful. Additionally, it is super-easy to run tests on subsets of your drives. This is all rather difficult while running ZFS. Your initial problem report that this involves writes might make this irrelevant, but it would be interesting if problems showed up here. SSD's should be able to sustain massive amounts of reads with no errors, so it is an easy low-hanging fruit to test basic functionality.
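
Not the real script, but the general shape of that kind of concurrent read-only stress looks something like this. A sketch only: the nvd0..nvd3 device names are assumptions for FreeBSD NVMe disks, and solnet-array-test itself does considerably more, such as comparing per-drive rates for the "unequal" check.

# stream sequential reads from every drive at once (FreeBSD dd syntax)
for d in nvd0 nvd1 nvd2 nvd3; do
    dd if=/dev/$d of=/dev/null bs=1m &
done
wait
# watch the console/dmesg during the run; unequal throughput between
# identical drives, or any I/O errors, point at the platform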
 

mpainter701

Dabbler
Joined
Nov 4, 2022
Messages
11
UPDATE/Resolved.


Turned out to be a BIOS setting.

Supermicro has some Intel VMD service that needs to be disabled. They have this VROC/Intel software RAID junk; even if you're not using it, there's a BIOS setting, three pages deep, that still needs to be disabled. Otherwise the OS gets confuzzled about drivers, from what the logs appeared to show.

I'm not sure why Ubuntu with OpenZFS was more stable than TrueNAS, but it's irrelevant at this point.
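
For anyone else hitting this: you can also check from the OS whether Intel VMD is sitting in front of the NVMe drives, rather than digging through the BIOS first. A sketch, with the caveat that the exact device description strings vary by platform:

# FreeBSD / TrueNAS CORE
pciconf -lv | grep -B3 -i 'volume management'
# Linux (TrueNAS SCALE / Ubuntu)
lspci | grep -i 'volume management'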
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
UPDATE/Resolved.

Turned out to be a BIOS setting.

Supermicro has some Intel VMD service that needs to be disabled. They have this VROC/Intel software RAID junk; even if you're not using it, there's a BIOS setting, three pages deep, that still needs to be disabled. Otherwise the OS gets confuzzled about drivers, from what the logs appeared to show.

I'm not sure why Ubuntu with OpenZFS was more stable than TrueNAS, but it's irrelevant at this point.

Glad you got to the bottom of this. Is it now stable in both TrueNAS CORE and SCALE?

For those wanting to disable Intel VROC (Virtual RAID on CPU) or verify its state on their Supermicro board, please see the PDF manuals below for the X11/X12 series respectively:


 