Very slow performance after faulty disk

Sentinal-Sys

Dabbler
Joined
Jan 9, 2020
Messages
21
Good morning, everyone,

I have a small problem or rather a question of understanding.

I have TrueNAS CORE running here at home on DELL R720xd hardware. There are a total of 12 x 8 TB HDDs in a ZFS pool, which consists of three 4-disk RAID-Z1 vdevs.

Yesterday one of these HDDs seems to have failed. At least I got several warnings that the pool is "unhealthy". Well, the system continued to run and all the data was still there, but it was incredibly slow. All my VMs, which are stored on an NFS share on the NAS, either kernel-panicked or were just insanely slow.
I rebooted everything but it only got worse. In the end, I pulled the broken disk out of the system and at that moment everything was back to normal. Well, except for a "degraded" pool.

What I don't understand is why everything was so slow. Shouldn't the RAID-Z1 prevent exactly that? I mean okay, data is still there and I can work normally again. So it seems to have worked somehow.

Can I look somewhere where the problem was and possibly prevent this from happening in the future?

Thank you very much

Greetings
Joe
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
What I don't understand is why everything was so slow. Shouldn't the RAID-Z1 prevent exactly that?
No, RAIDZ1 saved you from losing all the data on the pool. In a RAIDZ configuration there is no dedicated parity drive; the data and parity are spread across all the drives, so with a faulty drive it's not surprising that the whole pool takes a performance hit.

I rebooted everything but it only got worse. In the end, I pulled the broken disk out of the system and at that moment everything was back to normal. Well, except for a "degraded" pool.
Replace the broken disk ASAP. Make sure you have current backups.

In my personal opinion RAIDZ1 is not enough for a 12 wide pool. Consider RAIDZ2 or RAIDZ3 in the future (you have to rebuild the whole pool for that though).
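
In case it's useful, the CLI side of the replacement is only a couple of commands (the GUI route via the pool's Status page does the same thing and is generally safer, since TrueNAS references disks by gptid). Pool and device names below are just placeholders, so adjust them for your system:

zpool status -v              (find the pool name and the FAULTED member)
zpool offline tank da10      (take the dying disk out of service, if ZFS hasn't dropped it already)
(physically swap the drive, then)
zpool replace tank da10 da12 (start the resilver onto the new disk)
zpool status -v              (watch the resilver progress)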
 

Sentinal-Sys

Dabbler
Joined
Jan 9, 2020
Messages
21
Good morning, everyone,

Thank you for your quick reply.

I already know how RAID-Z1 works and that the parity is distributed across all the disks. But now one disk is missing, so the missing information has to be calculated from the parity on the remaining disks. And that is exactly the part that is not slow.

The pool was only slow while the broken disk was still in the system. The only explanation I can come up with is that the system kept trying to read the data from the broken disk instead of using the parity, and the broken disk slowed down the whole system.

I've already ordered a new disk and I'm also going to buy one or two spares.

Why do you think I should go for a RAID-Z2? I have three VDEVs with four disks each. Each in RAID-Z1. So one disk per VDEV can fail. I think the price/performance ratio is actually quite good.

And yes, I even have several backups.

Greetings
Joe
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
Why do you think I should go for a RAID-Z2? I have three VDEVs with four disks each.
Sorry, hadn't had my coffee yet. Forget what I said, you are right; I misread it and thought you had a single 12-wide vdev.
 

chuck32

Guru
Joined
Jan 14, 2023
Messages
623
The only explanation I can come up with is that the system kept trying to read the data from the broken disk instead of using the parity, and the broken disk slowed down the whole system.
I'd argue differently: when the system tries to read from the faulty disk, the read fails and the data has to be reconstructed from parity every time, which introduces a higher load.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Nope...

What's likely is that while that bad disk was still in the system, it was wreaking havoc with your storage controller, which was interfering with every other disk's access to the system (probably causing frequent controller resets... maybe dmesg or the messages archive files will give you some clues about that).

That may be an indicator that a controller firmware upgrade might be helpful, or maybe that you need to think about other risk mitigations, but either way, removing a misbehaving disk ASAP is recommended in cases such as yours.
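
Something along these lines should pull out the relevant entries (log locations are the CORE defaults and the grep patterns are just a starting point, so adjust to taste):

dmesg | grep -iE 'mps|cam status'                       (whatever is still in the kernel message buffer)
grep -i 'cam status' /var/log/messages                  (the current system log)
bzcat /var/log/messages.*.bz2 | grep -i 'cam status'    (rotated archives, if present; they may be .gz or .xz depending on newsyslog settings)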
 

Sentinal-Sys

Dabbler
Joined
Jan 9, 2020
Messages
21
Hello everyone,

Here is a small excerpt from dmesg; these messages were repeated constantly.

Unfortunately, I can't make much sense of it. Maybe someone can explain the excerpt to me.

(da10:mps0:0:19:0): CAM status: CCB request completed with an error
(da10:mps0:0:19:0): Retrying command, 3 more tries remain
mps0: Controller reported scsi ioc terminated tgt 19 SMID 2095 loginfo 31170000
mps0: Controller reported scsi ioc terminated tgt 19 SMID 254 loginfo 31170000
(da10:mps0:0:19:0): READ(16). CDB: 88 00 00 00 00 01 d0 82 a4 88 00 00 00 08 00 00
(da10:mps0:0:19:0): CAM status: CCB request completed with an error
(da10:mps0:0:19:0): Retrying command, 3 more tries remain
(da10:mps0:0:19:0): READ(16). CDB: 88 00 00 00 00 01 96 91 c2 48 00 00 00 28 00 00
(da10:mps0:0:19:0): CAM status: CCB request completed with an error
(da10:mps0:0:19:0): Retrying command, 3 more tries remain
mps0: Controller reported scsi ioc terminated tgt 19 SMID 679 loginfo 31170000
mps0: Controller reported scsi ioc terminated tgt 19 SMID 282 loginfo 31170000
mps0: Controller reported scsi ioc terminated tgt 19 SMID 301 loginfo 31170000
mps0: Controller reported scsi ioc terminated tgt 19 SMID 383 loginfo 31170000
(da10:mps0:0:19:0): READ(16). CDB: 88 00 00 00 00 01 e5 0c 80 d8 00 00 00 08 00

I would be particularly interested in the topic of updating the SAS controller. I only flashed the installed DELL PERC H710 mini into IT mode a good year ago. I used the following instructions for this: https://fohdeesha.com/docs/H710-B0.html

Is there anything more up-to-date?

Greetings
Joe
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Is there anything more up-to-date?
I think 20.00.07 is the right one for that generation of LSI chip.

Maybe someone can explain the excerpt to me.
As I understand it, you're looking at the controller trying to get some data from that bad disk,
(da10:mps0:0:19:0): READ(16). CDB: 88 00 00 00 00 01 d0 82 a4 88 00 00 00 08 00 00
failing,
(da10:mps0:0:19:0): CAM status: CCB request completed with an error
then retrying...
(da10:mps0:0:19:0): Retrying command, 3 more tries remain
I suspect that cycle of retries was somehow blocking the whole controller or causing it to reset... probably indicated by:
mps0: Controller reported scsi ioc terminated tgt 19 SMID 383 loginfo 31170000
I would say that with a relatively solid confirmation of the firmware version, we might have to start thinking about the controller itself being fried (have you checked the temps on it?).

Although I know it may just be a single bad disk, it shouldn't produce the result you saw in normal circumstances.
 

Sentinal-Sys

Dabbler
Joined
Jan 9, 2020
Messages
21
I would say that with a relatively solid confirmation of the firmware version, we might have to start thinking about the controller itself being fried (have you checked the temps on it?).
Hello,

Thank you for your reply. Can you give me a quick guide? I haven't looked at the controller for what feels like an eternity. How can I read out the temperatures or the firmware version?

Greetings
Joe
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Although I know it may just be a single bad disk, it shouldn't produce the result you saw in normal circumstances.
The challenge here is that a "failed" disk is dropped off the device chain and won't cause the retry/timeouts - but a "failing" device will still show up and interact with the SCSI layer enough to make it attempt to continue negotiating. A drive that is stuck in a firmware-level reset loop or has a short will behave in this manner - it will drop off the SCSI bus, restart, potentially receive and respond to a command or two, and then drop out again. Loose data or power cables can also cause this, but in your case with a solid backplane it's less likely to occur.

It's much better for the drive to fail completely, causing ZFS to kick it out of the array and rebuild the missing data from parity on the other disks in real time (as it's doing now) on the CPU. But the decision on when to declare a disk non-responsive enough to be kicked out of the array is a difficult one, especially if it could impact data availability.

@Sentinal-Sys are you able to file a bug through the "Report A Bug" function at the top of the page, and include a debug file (System -> Advanced -> Save Debug)? Make it clear that the bug isn't that a drive failed, but that you experienced degraded performance while the drive was failing.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Thank you for your reply. Can you give me a quick guide? I haven't looked at the controller for what feels like an eternity. How can I read out the temperatures or the firmware version?
Looks like that card is SAS2, so sas2flash -listall should get you somewhere...

I think mprutil is the right one for CORE and that generation of adapter to give you the temperature: mprutil show adapter | grep -i temp

If that's not enough, this should be an option:

Compile it in a jail then run it on the host.
 

Sentinal-Sys

Dabbler
Joined
Jan 9, 2020
Messages
21
The challenge here is that a "failed" disk is dropped off the device chain and won't cause the retry/timeouts - but a "failing" device will still show up and interact with the SCSI layer enough to make it attempt to continue negotiating. A drive that is stuck in a firmware-level reset loop or has a short will behave in this manner - it will drop off the SCSI bus, restart, potentially receive and respond to a command or two, and then drop out again. Loose data or power cables can also cause this, but in your case with a solid backplane it's less likely to occur.

It's much better for the drive to fail completely, causing ZFS to kick it out of the array and rebuild the missing data from parity on the other disks in real time (as it's doing now) on the CPU. But the decision on when to declare a disk non-responsive enough to be kicked out of the array is a difficult one, especially if it could impact data availability.

@Sentinal-Sys are you able to file a bug through the "Report A Bug" function at the top of the page, and include a debug file (System -> Advanced -> Save Debug)? Make it clear that the bug isn't that a drive failed, but that you experienced degraded performance while the drive was failing.

Created a bug report: https://ixsystems.atlassian.net/browse/NAS-127454
 

Sentinal-Sys

Dabbler
Joined
Jan 9, 2020
Messages
21
Looks like that card is SAS2, so sas2flash -listall should get you somewhere...

I think mprutil is the right one for CORE and that generation of adapter to give you the temperature: mprutil show adapter | grep -i temp

If that's not enough, this should be an option:

Compile it in a jail then run it on the host.
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

Adapter Selected is a LSI SAS: SAS2308_2(B0)
Firmware 20.00.07.00

mprutil: mps open: No such file or directory
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
mprutil: mps open: No such file or directory
You might have luck with storcli, but I think the version in CORE can't do temps, so it's back to the gitlab post and compile yourself the temp tool.
 

Sentinal-Sys

Dabbler
Joined
Jan 9, 2020
Messages
21
You might have luck with storcli, but I think the version in CORE can't do temps, so it's back to the gitlab post and compile yourself the temp tool.
storcli in Core unfortunately really has no temps.

I'll have a look at the gitlab link. Unfortunately I won't get around to it today; hopefully tomorrow.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
back to the gitlab post and compile yourself the temp tool
Maybe that sounds harder than it is...

Just these few steps:

iocage create -n "compiler" -r 13.2-RELEASE dhcp="on" bpf="on" vnet="on"
iocage start compiler
iocage console compiler
pkg install gcc (answer y once or twice when prompted)
ee lsi_temp.c (then paste in the contents of the .c file from gitlab, and press Esc to exit and save)
gcc lsi_temp.c -o lsi_temp

Bingo... then you have a file in your jail's home directory called lsi_temp.

Copy it somewhere on your host, then make it executable and run it like ./lsi_temp /dev/mps0

iocage destroy compiler (if you don't want to keep that jail)
 

Sentinal-Sys

Dabbler
Joined
Jan 9, 2020
Messages
21
Copy it somewhere on your host, then make it executable and run it like ./lsi_temp /dev/mps0
Okay, got the file lsi_temp

Stupid question. How do I get this onto the host?
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
It's actually already there, just in an iocage jail path...

/mnt/pool/iocage/jails/compiler/root/root/...
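
So assuming the pool is really called "pool" and you compiled in root's home directory inside the jail (adjust the path to match your setup), something like this should do it:

cp /mnt/pool/iocage/jails/compiler/root/root/lsi_temp /root/
chmod +x /root/lsi_temp
/root/lsi_temp /dev/mps0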
 

Sentinal-Sys

Dabbler
Joined
Jan 9, 2020
Messages
21

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
High..ish, but probably not the cause of the errors.
 