new setup, immediate disk errors

mpainter701

Dabbler
Joined
Nov 4, 2022
Messages
11
New setup, Supermicro chassis, NVMe disks.
Tried NFSv3, NFSv4, and iSCSI, but whenever I put data on the pools (2 pools, different disks) they spit out these disk errors regardless.

Originally I was getting 'degraded' states, etc.; it seemed all over the place and felt so flaky I figured it was a firmware bug. I've updated the host to the latest firmware and TrueNAS to the 11/1/22 patch, rebuilt the pools, migrated cold data over, and let it sit overnight, and got the error below and the status printout. It will probably go degraded if I put hot data on it. I'm not convinced my enterprise drives, from 2 brands, are all failed. Any thoughts?

Pool Main-Pool state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.



root@truenas[~]# zpool status -v
  pool: Backup-pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:09:46 with 0 errors on Fri Nov 4 06:41:51 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        Backup-pool                                     ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/a18b6df2-5acc-11ed-b96f-3cecefbffbd4  ONLINE       0     0     4
            gptid/a18c2c3f-5acc-11ed-b96f-3cecefbffbd4  ONLINE       0     0     4

errors: No known data errors

  pool: Main-Pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:02:21 with 0 errors on Fri Nov 4 06:29:38 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        Main-Pool                                       ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/ed66172b-5bba-11ed-a060-3cecefbffbd4  ONLINE       0    30     0
            gptid/ed682fe1-5bba-11ed-a060-3cecefbffbd4  ONLINE       0    28     0
            gptid/ed6606ef-5bba-11ed-a060-3cecefbffbd4  ONLINE       0    32     0
            gptid/ed695fe2-5bba-11ed-a060-3cecefbffbd4  ONLINE       0    28     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:06 with 0 errors on Fri Nov 4 03:45:06 2022
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          ada1p2    ONLINE       0     0     0

errors: No known data errors
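
For reference, the 'action' text above maps to commands like these once a root cause has been found and fixed (a sketch only; clearing counters without fixing the cause just hides the problem):

zpool clear Main-Pool        # reset the READ/WRITE/CKSUM counters
zpool scrub Main-Pool        # force a full verification pass
zpool status -v Main-Pool    # see whether errors reappear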
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Welcome to the forums!

Crystal ball not working today - please supply full info as requested in the Forum Rules (red link on masthead) in order to give background material for your issue. Please be sure to identify drive interfaces (and what's connected to what if you have add-on cards).
 

mpainter701

Dabbler
Joined
Nov 4, 2022
Messages
11
  • Motherboard make and model: Supermicro X12SPO-NTF
  • CPU make and model: 2x Intel(R) Xeon(R) Gold 5317
  • RAM quantity: 128GB DDR4
  • Hard drives, quantity, model numbers, and RAID configuration, including boot drives:
    Boot: Supermicro 64GB SATA DOM
    Pool A: 4x Intel DC P4510 8TB (SSDPE2KX080T8)
    Pool B: 2x Dell DC NVMe PE8010 RI U.2 3.84TB
  • Hard disk controllers: N/A
  • Network cards:
    Supermicro TrueNAS box has 2x integrated 10GBase-T NICs
    Switch has 10Gb SFP+ ports with copper 10Gb modules
  • MLAG setup with LACP and jumbo frames enabled; also tried 9000 MTU on the NAS side (a rough NAS-side sketch follows after the port config)
  • Port config:
    interface agg4
      description NAS-MLAG-LACP
      switchport mode trunk
      switchport trunk native vlan 150
      switchport trunk allowed vlan all
      mlag 4
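
For context, a rough FreeBSD-side counterpart to that switch config might look like the lines below. A sketch only: TrueNAS normally builds the lagg through its web UI, and the interface names ix0/ix1 are assumptions for the two onboard 10GBase-T ports.

ifconfig lagg0 create
ifconfig lagg0 laggproto lacp laggport ix0 laggport ix1 mtu 9000 up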


 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
they spit out these disk errors regardless.

Did the disk I/O systems test okay during your burn-in testing? The time to isolate problems with a new system, especially if you are unfamiliar with the hardware, is during the burn-in phase; loading TrueNAS onto a questionable platform and hoping for the best is just asking for disaster.

 

mpainter701

Dabbler
Joined
Nov 4, 2022
Messages
11
I guess I ran SMART and moved on after

SMART overall-health self-assessment test result: PASSED

came back on all of them...

It's not a makeshift box; I bought all enterprise hardware, and the server was OEM built.
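
For what it's worth, the overall-health flag alone says very little. A fuller check would look something like the lines below; a sketch only, since the device names are assumptions and the long self-test requires a drive that implements the optional NVMe self-test plus a reasonably recent smartmontools.

for d in /dev/nvme0 /dev/nvme1 /dev/nvme2 /dev/nvme3; do
    smartctl -a "$d"             # full health attributes and error log, not just PASSED
done
smartctl -t long /dev/nvme0      # long self-test, where the drive supports it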
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I guess I ran SMART and moved on after
SMART overall-health self-assessment test result: PASSED
came back on all of them...

So what this tells me is that you did next to no meaningful testing.

It's not a makeshift box; I bought all enterprise hardware, and the server was OEM built.

I once had Dell send me a massive 4U server with eight external SCSI busses (back in the day) which were intended to drive eight shelves each holding nine 72GB drives; this literally was a full rack of server. Dell specifically sold us a PERC controller that was supposed to be able to cope with this much I/O. I wrote the original solnet-array-test script because it turned out that the PERC was crap and couldn't maintain I/O rates. When Dell then gave us trouble about returning the server, I blacklisted them and have refused to buy their server products for 22 years now.

Your purchase of "enterprise hardware" that was "OEM built" is not particularly persuasive as to its correctness or quality. I bought an X12 BTO system from Supermicro earlier this year and was annoyed to find that they had installed some parts that weren't to spec. This was really annoying but parts availability in the channel is really screwed to hell right now and a BTO was the only way I could even get the (fairly specific and unusual) stuff I wanted.

Stuff happens. Your system is not guaranteed to work as you expect just because you ordered it from somewhere. Most companies who are in my line of work will throw stuff together, do basic testing, box it, and ship it as fast as they can. I've been told that the weeks of burn-in we do around here is excessively paranoid. It is, except when it catches a problem, which does indeed happen now and then.
 

mpainter701

Dabbler
Joined
Nov 4, 2022
Messages
11
Fair enough, I've been reading through the burn-in test guides.

Doing the longer disk tests now; will do the memory test over the weekend.
 

mpainter701

Dabbler
Joined
Nov 4, 2022
Messages
11
So I did a burn-in:

memtest, CPU test, and I got a perfect diagnostic report from Supermicro's tool, since I use their board.
I also checked for firmware patches on the NVMe disks; I'm on the latest.

SAME ISSUE persists!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680

mpainter701

Dabbler
Joined
Nov 4, 2022
Messages
11
Yeah, but I've followed the documented burn-in methods you all recommend; in addition, I ran this tool here for exactly that, to see if there are some obvious known issues going on.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yeah, but I've followed the documented burn-in methods you all recommend; in addition, I ran this tool here for exactly that, to see if there are some obvious known issues going on.

I haven't seen any results from solnet-array-test here; I would generally expect it to pick up on many of the same issues that ZFS might pick up on during stressy concurrent I/O. As I previously said, the time to shake out problems is during burn-in. If you think you passed burn-in but then are having problems running TrueNAS, my professional opinion is that you didn't try sufficiently hard during the burn-in.
 

mpainter701

Dabbler
Joined
Nov 4, 2022
Messages
11
I switched to Ubuntu with OpenZFS, and we've not run into any disk or checksum errors, whereas with TrueNAS it happened within the first 10 minutes of pushing data: Pool A would always give disk write errors, and Pool B would give checksum errors.

I didn't post the logs for burn-in as it found nothing. I'm convinced there's some NVMe driver/FreeBSD stuff going on here with the TrueNAS-provided kernel. I'm just a network engineer though; I don't know jack about this stuff, and frankly all the guides here are a 1/4" too deep for me to process entirely, or are not 1:1 with what I'm doing. Maybe I'm losing something in translation.

My end goal is definitely to use TrueNAS, but I'm lost on how to proceed. Burn-in tests don't make sense for me to keep running if it works fine on another OS.


Ubuntu:
root@storage:/alpha# zpool status
  pool: alpha
 state: ONLINE
config:

        NAME         STATE     READ WRITE CKSUM
        alpha        ONLINE       0     0     0
          raidz1-0   ONLINE       0     0     0
            nvme0n1  ONLINE       0     0     0
            nvme1n1  ONLINE       0     0     0
            nvme2n1  ONLINE       0     0     0
            nvme3n1  ONLINE       0     0     0

errors: No known data errors

  pool: beta
 state: ONLINE
config:

        NAME         STATE     READ WRITE CKSUM
        beta         ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            nvme4n1  ONLINE       0     0     0
            nvme5n1  ONLINE       0     0     0

errors: No known data errors
root@storage:/alpha#
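
For anyone reproducing the A/B test, pools shaped like the ones above would be created on Ubuntu with something like the following (a sketch; the poster's exact options, e.g. ashift or compression, are unknown):

zpool create alpha raidz1 nvme0n1 nvme1n1 nvme2n1 nvme3n1
zpool create beta mirror nvme4n1 nvme5n1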


versus endless logs like this on TrueNAS:

Nov 8 09:03:54 truenas nvme2: WRITE sqid:1 cid:119 nsid:1 lba:2009928616 len:88
Nov 8 09:03:54 truenas nvme2: DATA TRANSFER ERROR (00/04) sqid:1 cid:119 cdw0:0
Nov 8 09:03:54 truenas nvme2: async event occurred (type 0x0, info 0x04, page 0x01)
Nov 8 09:03:54 truenas nvme2: WRITE sqid:2 cid:125 nsid:1 lba:2253114024 len:32
Nov 8 09:03:54 truenas nvme1: WRITE sqid:1 cid:123 nsid:1 lba:2253113992 len:24
Nov 8 09:03:54 truenas nvme2: async event occurred (type 0x0, info 0x04, page 0x01)
Nov 8 09:03:54 truenas nvme2: DATA TRANSFER ERROR (00/04) sqid:2 cid:125 cdw0:0
Nov 8 09:03:54 truenas nvme1: async event occurred (type 0x0, info 0x04, page 0x01)
Nov 8 09:03:54 truenas nvme1: DATA TRANSFER ERROR (00/04) sqid:1 cid:123 cdw0:0
Nov 8 09:03:54 truenas 1 2022-11-08T09:03:54.397333-06:00 truenas.np.run smartd 2644 - - Device: /dev/nvme1, number of Error Log entries increased from 5009290 to 5009292
Nov 8 09:33:54 truenas 1 2022-11-08T09:33:54.530773-06:00 truenas.np.run smartd 2644 - - Device: /dev/nvme3, number of Error Log entries increased from 124 to 135
Nov 8 09:33:54 truenas 1 2022-11-08T09:33:54.686813-06:00 truenas.np.run smartd 2644 - - Device: /dev/nvme2, number of Error Log entries increased from 173 to 180
Nov 8 09:33:54 truenas 1 2022-11-08T09:33:54.686992-06:00 truenas.np.run smartd 2644 - - Device: /dev/nvme1, number of Error Log entries increased from 5009292 to 5009302
Nov 8 09:33:54 truenas 1 2022-11-08T09:33:54.687146-06:00 truenas.np.run smartd 2644 - - Device: /dev/nvme0, number of Error Log entries increased from 177 to 183
Nov 8 09:33:54 truenas nvme0: WRITE sqid:2 cid:122 nsid:1 lba:2487692184 len:88
 

mpainter701

Dabbler
Joined
Nov 4, 2022
Messages
11
Also FWIW, it does appear Ubuntu allows faster file transfers as well; I was never able to get over 1.2 GB/s write speeds on TrueNAS, whereas it appears to have doubled here.

[attached screenshots: file transfer speeds]
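
A more repeatable comparison than eyeballing file transfers would be running the same fio job on both OSes; a sketch, with the dataset path and sizes as placeholders (fio is packaged for both FreeBSD and Ubuntu):

fio --name=seqwrite --directory=/mnt/Main-Pool --rw=write \
    --bs=1M --size=16G --numjobs=4 --group_reporting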
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Burn-in tests don't make sense for me to keep running if it works fine on another OS.

When your car engine isn't working the way you expect when everyone else's is, you don't take the car on a cross-country trip. You keep it in the garage and pursue the problem until you discover which thing is being problematic. If you don't think that makes sense, then I can't help you.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
What specifically do you want me to do for us to proceed?

I STILL haven't seen any results from solnet-array-test here; I would generally expect it to pick up on many of the same issues that ZFS might pick up on during stressy concurrent I/O. As I previously said, the time to shake out problems is during burn-in. If you think you passed burn-in but then are having problems running TrueNAS, my professional opinion is that you didn't try sufficiently hard during the burn-in.

So what I want is to cause problems to pop up outside of the NAS environment. The presence of SMART errors for multiple devices is cause for concern and is *sort* of what I'm referring to; we should look at those, but I'm not really an expert on the P4510's. solnet-array-test is a somewhat tepid test in that it only does reads, but it is looking for things that "are unequal", which can be very helpful. Additionally, it is super-easy to run tests on subsets of your drives. This is all rather difficult while running ZFS. Your initial problem report that this involves writes might make this irrelevant, but it would be interesting if problems showed up here. SSD's should be able to sustain massive amounts of reads with no errors, so it is an easy low-hanging fruit to test basic functionality.
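
Not the real script, but the general shape of that kind of concurrent read-only stress looks something like this. A sketch only: the nvd0..nvd3 device names are assumptions for FreeBSD NVMe disks, and solnet-array-test itself does considerably more, such as comparing per-drive rates for the "unequal" check.

# stream sequential reads from every drive at once (FreeBSD dd syntax)
for d in nvd0 nvd1 nvd2 nvd3; do
    dd if=/dev/$d of=/dev/null bs=1m &
done
wait
# watch the console/dmesg during the run; unequal throughput between
# identical drives, or any I/O errors, point at the platform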
 

mpainter701

Dabbler
Joined
Nov 4, 2022
Messages
11
UPDATE/Resolved.


Turned out to be a BIOS setting.

Supermicro has some Intel VMD service that needs to be disabled. They have this VROC/Intel software RAID junk; even if you're not using it, there's a BIOS setting, three pages deep, that still needs to be disabled. Otherwise the OS gets confuzzled about drivers, from what the logs appeared to show.

I'm not sure why Ubuntu with OpenZFS was more stable than TrueNAS, but it's irrelevant at this point.
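
For anyone else hitting this: you can also check from the OS whether Intel VMD is sitting in front of the NVMe drives, rather than digging through the BIOS first. A sketch, with the caveat that the exact device description strings vary by platform:

# FreeBSD / TrueNAS CORE
pciconf -lv | grep -B3 -i 'volume management'
# Linux (TrueNAS SCALE / Ubuntu)
lspci | grep -i 'volume management'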
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
UPDATE/Resolved.

Turned out to be a BIOS setting.

Supermicro has some Intel VMD service that needs to be disabled. They have this VROC/Intel software RAID junk; even if you're not using it, there's a BIOS setting, three pages deep, that still needs to be disabled. Otherwise the OS gets confuzzled about drivers, from what the logs appeared to show.

I'm not sure why Ubuntu with OpenZFS was more stable than TrueNAS, but it's irrelevant at this point.

Glad you got to the bottom of this. Is it now stable in both TrueNAS CORE and SCALE?

For those wanting to disable Intel VROC (Virtual RAID on CPU) or verify its state on their Supermicro board, please see the PDF manuals below for the X11/X12 series respectively:


 