Very slow performance after faulty disk

Sentinal-Sys

Dabbler
Joined
Jan 9, 2020
Messages
21
Good morning, everyone,

I have a small problem or rather a question of understanding.

I have TrueNAS CORE running here at home on DELL R720xd hardware. There are a total of 12 x 8 TB HDDs in a ZFS pool, which consists of three 4-disk RAID-Z1 vdevs.

Yesterday one of these HDDs seems to have failed. At least I got several warnings that the pool is "unhealthy". Well, the system continued to run and all the data was still there, but it was incredibly slow. All my VMs, which are stored on an NFS share on the NAS, either kernel-panicked or were just insanely slow.
I rebooted everything but it only got worse. In the end, I pulled the broken disk out of the system and at that moment everything was back to normal. Well, except for a "degraded" pool.

What I don't understand is why everything was so slow. Shouldn't the RAID-Z1 prevent exactly that? I mean okay, data is still there and I can work normally again. So it seems to have worked somehow.

Can I look somewhere where the problem was and possibly prevent this from happening in the future?

Thank you very much

Greetings
Joe
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
What I don't understand is why everything was so slow. Shouldn't the RAID-Z1 prevent exactly that?
No, RAIDZ1 saved you from losing all the data on the pool. In a RAIDZ configuration there is no dedicated parity drive; the data and parity are spread across all the drives, so with a faulty drive it's not surprising that the whole pool takes a performance hit.

I rebooted everything but it only got worse. In the end, I pulled the broken disk out of the system and at that moment everything was back to normal. Well, except for a "degraded" pool.
Replace the broken disk ASAP. Make sure you have current backups.

In my personal opinion RAIDZ1 is not enough for a 12 wide pool. Consider RAIDZ2 or RAIDZ3 in the future (you have to rebuild the whole pool for that though).
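
In case it's useful, the CLI side of the replacement is only a couple of commands (the GUI route via the pool's Status page does the same thing and is generally safer, since TrueNAS references disks by gptid). Pool and device names below are just placeholders, so adjust them for your system:

zpool status -v              (find the pool name and the FAULTED member)
zpool offline tank da10      (take the dying disk out of service, if ZFS hasn't dropped it already)
(physically swap the drive, then)
zpool replace tank da10 da12 (start the resilver onto the new disk)
zpool status -v              (watch the resilver progress)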
 

Sentinal-Sys

Dabbler
Joined
Jan 9, 2020
Messages
21
Good morning, everyone,

Thank you for your quick reply.

I already know how RAID-Z1 works and that the parity is distributed across all the disks. But now one disk is missing, so the missing information has to be calculated from the parity on the remaining disks. And that is exactly the part that is not slow.

The pool was only slow while the broken disk was still in the system. The only explanation I can come up with is that the system kept trying to read the data from the broken disk instead of using the parity, and the broken disk slowed down the whole system.

I've already ordered a new disk and I'm also going to buy one or two spares.

Why do you think I should go for a RAID-Z2? I have three VDEVs with four disks each. Each in RAID-Z1. So one disk per VDEV can fail. I think the price/performance ratio is actually quite good.

And yes, I even have several backups.

Greetings
Joe
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
Why do you think I should go for a RAID-Z2? I have three VDEVs with four disks each.
Sorry, hadn't had my coffee yet. Forget what I said, you are right; I misread it and thought you had a single 12-wide vdev.
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
The only explanation I can come up with is that the system kept trying to read the data from the broken disk instead of using the parity, and the broken disk slowed down the whole system.
I'd argue differently: when the system tries to read from the faulty disk, the read fails and the data has to be reconstructed from parity every time, which introduces a higher load.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Nope...

What's likely is that while that bad disk was still in the system, it was wreaking havoc with your storage controller, which was interfering with every other disk's access to the system (probably causing frequent controller resets... maybe dmesg or the messages archive files will give you some clues about that).

That may be an indicator that a controller firmware upgrade might be helpful, or maybe that you need to think about other risk mitigations, but either way, removing a misbehaving disk ASAP is recommended in cases such as yours.
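
Something along these lines should pull out the relevant entries (log locations are the CORE defaults and the grep patterns are just a starting point, so adjust to taste):

dmesg | grep -iE 'mps|cam status'                       (whatever is still in the kernel message buffer)
grep -i 'cam status' /var/log/messages                  (the current system log)
bzcat /var/log/messages.*.bz2 | grep -i 'cam status'    (rotated archives, if present; they may be .gz or .xz depending on newsyslog settings)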
 

Sentinal-Sys

Dabbler
Joined
Jan 9, 2020
Messages
21
Hello everyone,

Here is a small excerpt from dmesg; these messages were repeated constantly.

Unfortunately, I can't make much sense of it. Maybe someone can explain the excerpt to me.

(da10:mps0:0:19:0): CAM status: CCB request completed with an error
(da10:mps0:0:19:0): Retrying command, 3 more tries remain
mps0: Controller reported scsi ioc terminated tgt 19 SMID 2095 loginfo 31170000
mps0: Controller reported scsi ioc terminated tgt 19 SMID 254 loginfo 31170000
(da10:mps0:0:19:0): READ(16). CDB: 88 00 00 00 00 01 d0 82 a4 88 00 00 00 08 00 00
(da10:mps0:0:19:0): CAM status: CCB request completed with an error
(da10:mps0:0:19:0): Retrying command, 3 more tries remain
(da10:mps0:0:19:0): READ(16). CDB: 88 00 00 00 00 01 96 91 c2 48 00 00 00 28 00 00
(da10:mps0:0:19:0): CAM status: CCB request completed with an error
(da10:mps0:0:19:0): Retrying command, 3 more tries remain
mps0: Controller reported scsi ioc terminated tgt 19 SMID 679 loginfo 31170000
mps0: Controller reported scsi ioc terminated tgt 19 SMID 282 loginfo 31170000
mps0: Controller reported scsi ioc terminated tgt 19 SMID 301 loginfo 31170000
mps0: Controller reported scsi ioc terminated tgt 19 SMID 383 loginfo 31170000
(da10:mps0:0:19:0): READ(16). CDB: 88 00 00 00 00 01 e5 0c 80 d8 00 00 00 08 00

I would be particularly interested in the topic of updating the SAS controller. I only flashed the installed DELL PERC H710 mini into IT mode a good year ago. I used the following instructions for this: https://fohdeesha.com/docs/H710-B0.html

Is there anything more up-to-date?

Greetings
Joe
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Is there anything more up-to-date?
I think 20.00.07 is the right one for that generation of LSI chip.

Maybe someone can explain the excerpt to me.
As I understand it, you're looking at the controller trying to get some data from that bad disk,
(da10:mps0:0:19:0): READ(16). CDB: 88 00 00 00 00 01 d0 82 a4 88 00 00 00 08 00 00
failing,
(da10:mps0:0:19:0): CAM status: CCB request completed with an error
then retrying...
(da10:mps0:0:19:0): Retrying command, 3 more tries remain
I suspect that cycle of retries was somehow blocking the whole controller or causing it to reset... probably indicated by:
mps0: Controller reported scsi ioc terminated tgt 19 SMID 383 loginfo 31170000
I would say that with a relatively solid confirmation of the firmware version, we might have to start thinking about the controller itself being fried (have you checked the temps on it?).

Although I know it may just be a single bad disk, it shouldn't produce the result you saw in normal circumstances.
 

Sentinal-Sys

Dabbler
Joined
Jan 9, 2020
Messages
21
I would say that with a relatively solid confirmation of the firmware version, we might have to start thinking about the controller itself being fried (have you checked the temps on it?).
Hello,

Thank you for your reply. Can you give me a quick guide? I haven't looked at the controller for what feels like an eternity. How can I read out the temperatures or the firmware version?

Greetings
Joe
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Although I know it may just be a single bad disk, it shouldn't produce the result you saw in normal circumstances.
The challenge here is that a "failed" disk is dropped off the device chain and won't cause the retry/timeouts - but a "failing" device will still show up and interact with the SCSI layer enough to make it attempt to continue negotiating. A drive that is stuck in a firmware-level reset loop or has a short will behave in this manner - it will drop off the SCSI bus, restart, potentially receive and respond to a command or two, and then drop out again. Loose data or power cables can also cause this, but in your case with a solid backplane it's less likely to occur.

It's much better for the drive to fail completely, causing ZFS to kick it out of the array and rebuild the missing data from parity on the other disks in real time (as it's doing now) on the CPU. But the decision on when to declare a disk non-responsive enough to be kicked out of the array is a difficult one, especially if it could impact data availability.

@Sentinal-Sys are you able to file a bug through the "Report A Bug" function at the top of the page, and include a debug file (System -> Advanced -> Save Debug)? Make it clear that the bug isn't that a drive failed, but that you experienced degraded performance while the drive was failing.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Thank you for your reply. Can you give me a quick guide? I haven't looked at the controller for what feels like an eternity. How can I read out the temperatures or the firmware version?
Looks like that card is SAS2, so sas2flash -listall should get you somewhere...

I think mprutil is the right one for CORE and that generation of adapter to give you the temperature: mprutil show adapter | grep -i temp

If that's not enough, this should be an option:

Compile it in a jail then run it on the host.
 

Sentinal-Sys

Dabbler
Joined
Jan 9, 2020
Messages
21
The challenge here is that a "failed" disk is dropped off the device chain and won't cause the retry/timeouts - but a "failing" device will still show up and interact with the SCSI layer enough to make it attempt to continue negotiating. A drive that is stuck in a firmware-level reset loop or has a short will behave in this manner - it will drop off the SCSI bus, restart, potentially receive and respond to a command or two, and then drop out again. Loose data or power cables can also cause this, but in your case with a solid backplane it's less likely to occur.

It's much better for the drive to fail completely, causing ZFS to kick it out of the array and rebuild the missing data from parity on the other disks in real time (as it's doing now) on the CPU. But the decision on when to declare a disk non-responsive enough to be kicked out of the array is a difficult one, especially if it could impact data availability.

@Sentinal-Sys are you able to file a bug through the "Report A Bug" function at the top of the page, and include a debug file (System -> Advanced -> Save Debug)? Make it clear that the bug isn't that a drive failed, but that you experienced degraded performance while the drive was failing.

Created a bug report: https://ixsystems.atlassian.net/browse/NAS-127454
 

Sentinal-Sys

Dabbler
Joined
Jan 9, 2020
Messages
21
Looks like that card is SAS2, so sas2flash -listall should get you somewhere...

I think mprutil is the right one for CORE and that generation of adapter to give you the temperature: mprutil show adapter | grep -i temp

If that's not enough, this should be an option:

Compile it in a jail then run it on the host.
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

Adapter Selected is a LSI SAS: SAS2308_2(B0)
Firmware 20.00.07.00

mprutil: mps open: No such file or directory
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
mprutil: mps open: No such file or directory
You might have luck with storcli, but I think the version in CORE can't do temps, so it's back to the gitlab post and compile yourself the temp tool.
 

Sentinal-Sys

Dabbler
Joined
Jan 9, 2020
Messages
21
You might have luck with storcli, but I think the version in CORE can't do temps, so it's back to the gitlab post and compile yourself the temp tool.
storcli in Core unfortunately really has no temps.

I'll have a look at the gitlab link. Unfortunately I won't get around to it today; hopefully tomorrow.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
back to the gitlab post and compile yourself the temp tool
Maybe that sounds harder than it is...

Just these few steps:

iocage create -n "compiler" -r 13.2-RELEASE dhcp="on" bpf="on" vnet="on"
iocage start compiler
iocage console compiler
pkg install gcc (answer y once or twice when prompted)
ee lsi_temp.c (then paste in the contents of the .c file from gitlab, and press Esc to exit and save)
gcc lsi_temp.c -o lsi_temp

Bingo... then you have a file in your jail's home directory called lsi_temp.

Copy it somewhere on your host, then make it executable and run it like ./lsi_temp /dev/mps0

iocage destroy compiler (if you don't want to keep that jail)
 

Sentinal-Sys

Dabbler
Joined
Jan 9, 2020
Messages
21
Copy it somewhere on your host, then make it executable and run it like ./lsi_temp /dev/mps0
Okay, got the file lsi_temp

Stupid question. How do I get this onto the host?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
It's actually already there, just in an iocage jail path...

/mnt/pool/iocage/jails/compiler/root/root/...
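
So assuming the pool is really called "pool" and you compiled in root's home directory inside the jail (adjust the path to match your setup), something like this should do it:

cp /mnt/pool/iocage/jails/compiler/root/root/lsi_temp /root/
chmod +x /root/lsi_temp
/root/lsi_temp /dev/mps0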
 

Sentinal-Sys

Dabbler
Joined
Jan 9, 2020
Messages
21

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
High..ish, but probably not the cause of the errors.
 