TrueNAS Alert: "Device is causing slow I/O on pool tank", randomly on all HDDs

Hackslash

Cadet | Joined: Jun 2, 2021 | Messages: 9
Long-time listener, first-time caller. I've seen similar threads, but I haven't found a good explanation of what causes this alert, i.e. what the threshold or criteria are. Nor have I found a resolution for my situation.

I have a zpool with an array of 5 disks. Ever since I turned on email alerts, I get around 4 alerts daily with the same message, "Device is causing slow I/O on pool tank", citing various disks. Sometimes it's one disk; sometimes it's all 5. The disk IDs keep switching between slow and cleared. Sometimes it happens 5 days in a row; sometimes it goes 5 days with no alerts. I say "randomly" because I have yet to see a pattern. There may be some system or scheduled task that is causing it.

Questions:
  • What is the criteria that throws this alert?
  • Is it configurable?
  • Is this a normal alert that happens when certain tasks run, or does it indicate a problem?
  • Should I be trying to tune this alert out? Or should I be troubleshooting a hardware issue?
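One way to look for the pattern behind "certain tasks run" is to bucket the alert times by hour of day and by weekday. The sketch below is only illustrative: it assumes the alert timestamps have been copied by hand from the alert e-mails into ISO-8601 strings (a hypothetical input format, not something TrueNAS exports), and the sample data is made up.

```python
from collections import Counter
from datetime import datetime

# Weekday names indexed by datetime.weekday() (Monday == 0); used instead of
# strftime("%a") so the labels do not depend on the system locale.
WEEKDAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def bucket_alerts(timestamps):
    """Count alert timestamps by hour of day and by weekday.

    `timestamps` is a list of ISO-8601 strings, e.g. copied by hand from
    the alert e-mails (hypothetical input format).
    """
    by_hour = Counter()
    by_weekday = Counter()
    for ts in timestamps:
        dt = datetime.fromisoformat(ts)
        by_hour[dt.hour] += 1
        by_weekday[WEEKDAYS[dt.weekday()]] += 1
    return by_hour, by_weekday


if __name__ == "__main__":
    # Hypothetical sample timestamps, not taken from the thread.
    sample = [
        "2021-06-01T00:05:00",
        "2021-06-01T00:40:00",
        "2021-06-03T03:10:00",
        "2021-06-06T00:15:00",
    ]
    hours, days = bucket_alerts(sample)
    print("Busiest hours:", hours.most_common(3))
    print("By weekday:", days.most_common())
```

If the counts cluster at particular hours or days, it might be worth comparing them with the schedules of periodic tasks such as scrubs, S.M.A.R.T. tests, or snapshot/replication tasks.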

sretalla

Powered by Neutrality | Moderator | Joined: Jan 1, 2016 | Messages: 9,703
Maybe a few points of courtesy first... what hardware are you running? ... what version of TrueNAS?

Most importantly, are those disks SMR?

Redcoat

MVP | Joined: Feb 18, 2014 | Messages: 2,925

sretalla

They're on the list as DM-SMR.
For some reason I hadn't seen the hardware in the signature earlier... good that you did the work to check them.

That will certainly be the explanation for the error, and the fix is to replace the SMR drives with CMR ones.

Hackslash

Thanks! That answers some of my questions. Yes, it's a hardware issue, down to my choosing cheap disks.
I'm still curious about what the criteria are for this alert.

sretalla

Hackslash said: "I'm still curious about what the criteria are for this alert."
The output of dmesg will probably show it to you...

You will see a bunch of CAM messages reporting timeouts on one or another of the disks while they perform write operations.

I don't know the specific thresholds for how many need to appear before you get the alert, but I don't think it's many.
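To get a rough per-disk tally of those CAM timeouts, something like the sketch below could be run over the dmesg text. It assumes the FreeBSD/TrueNAS CORE message shape `(ada1:ahcich1:0:0:0): CAM status: Command timeout`; the exact wording varies by driver and version (and TrueNAS SCALE, being Linux, logs differently), so check the pattern against your own dmesg first. It does not reproduce the alert's own threshold, which remains unknown here, and the sample lines are illustrative, not from the thread.

```python
import re
from collections import Counter

# Assumed FreeBSD CAM timeout line, e.g.
#   (ada1:ahcich1:0:0:0): CAM status: Command timeout
# The disk device name is the first field inside the parentheses.
CAM_TIMEOUT = re.compile(
    r"^\((?P<dev>[a-z]+\d+):[^)]*\): CAM status: Command timeout"
)


def count_cam_timeouts(dmesg_text):
    """Return a Counter of CAM command-timeout lines per disk device."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        m = CAM_TIMEOUT.match(line.strip())
        if m:
            counts[m.group("dev")] += 1
    return counts


# Illustrative sample lines (hypothetical, not taken from the thread).
SAMPLE = """\
(ada1:ahcich1:0:0:0): CAM status: Command timeout
(ada1:ahcich1:0:0:0): Retrying command
(ada3:ahcich3:0:0:0): CAM status: Command timeout
(ada1:ahcich1:0:0:0): CAM status: Command timeout
"""

if __name__ == "__main__":
    for dev, n in count_cam_timeouts(SAMPLE).most_common():
        print(f"{dev}: {n} command timeout(s)")
```

On the NAS itself, the sample could be replaced with `subprocess.run(["dmesg"], capture_output=True, text=True).stdout` to scan the live kernel message buffer.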