Attempting to Offline a disk exhibiting errors breaks iSCSI

Emile.Belcourt (Dabbler)

Hello all,

I've had this issue twice now in the past month, and since quite a few of my disks are from the same build date I'm thinking of pre-emptively cycling some of them out. But back to the issue: I first get an alarm that a disk is causing degraded I/O performance:
Device /dev/gptid/ff61160b-72c0-11eb-8b6f-0cc47a63b90c is causing slow I/O on pool Warm_HDD_Pool

This alert (as on the previous occasion) was followed five minutes later by an unrecoverable error:
Pool Warm_HDD_Pool state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected

In this situation, all service activity continues fine, so I prepare to replace the disk: crack out a new one and get it into its sled, then follow the recommended procedure of OFFLINE-ing the failing disk, ejecting it, inserting the new one and choosing Replace for the failed disk in the pool, after which resilvering starts with no problem.
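For reference, my understanding of the rough CLI equivalent of that procedure (the new gptid below is just a placeholder, and in practice the TrueNAS UI takes care of partitioning the replacement disk) is:

# Map the failing gptid back to a physical device (it resolved to da3 here)
glabel status | grep ff61160b

# Take the failing member offline, swap the physical disk, then replace it in the pool
zpool offline Warm_HDD_Pool gptid/ff61160b-72c0-11eb-8b6f-0cc47a63b90c
zpool replace Warm_HDD_Pool gptid/ff61160b-72c0-11eb-8b6f-0cc47a63b90c gptid/<new-partition-gptid>

# Watch the resilver progress
zpool status -v Warm_HDD_Pool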

However, when I press OFFLINE it just hangs on the spinner, with no response from the console (directly), via SSH or via the web UI. I cannot get any response from anything relying on middlewared (which is practically everything), and no zpool status information.

I then start getting Zabbix outage alerts for my VMware cluster: VMs no longer respond to heartbeat messages, and I can initiate an RDP session but cannot complete the connection. Then I notice the ESXi disk monitoring showing *zero* activity, meaning the iSCSI link had stopped responding; the session was active but nothing was going in or out. The latency briefly spiked to ~80,000 ms and then nothing:
[Attached screenshot: ESXi datastore latency/activity graph showing the ~80,000 ms spike followed by zero activity]


The fluctuation at 07:00 is when the storage server stated it failed to read SMART values from the failing disk:
Device: /dev/da3, failed to read SMART values.

The first time this happened, I went to the storage server and all but one of the disk activity lights were solid, with every disk audibly running at full pelt, while the failed disk was dark. So I broke procedure and yanked the disk; activity immediately returned to normal, the storage server unlocked itself and all services recovered.

This time I was not in the office and just had to wait it out; after half an hour I got:
* smartd is not running.
* Pool Warm_HDD_Pool state is DEGRADED: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
The following devices are not healthy:

  • Disk SEAGATE ST4000NM0023 Z1Z8GHLL0000C536E5P6 is OFFLINE
Services had returned to normal, but the UI still said the disk was ONLINE. At this point I left it until I was in the office so I could prepare the replacement disk. I mistakenly detached instead of replaced, but either way I was able to get the vdev back into a mirror with no problem and it started resilvering.
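For anyone hitting the same mistake: what I effectively ended up doing was an attach rather than a replace. A rough CLI equivalent (the new partition's gptid is just a placeholder) would be:

# After a detach, the surviving disk is a lone vdev member; attaching a new
# device to it rebuilds the mirror and kicks off a resilver
zpool attach Warm_HDD_Pool gptid/00176797-72c1-11eb-8b6f-0cc47a63b90c gptid/<new-partition-gptid>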

Attached are excerpts of the log data at the time of attempting to OFFLINE the faulting disk:
  • daemon.log - This one shows a casefile being created very early on, and after that exhibits iSCSI errors
  • messages - This log is the most damning; very early on it shows write errors being reported against the failing disk (a quick way to pull these lines out of the log is sketched after this list):
May 25 06:41:58 stor-01.domain.com (da3:mpr0:0:11:0): WRITE(10). CDB: 2a 00 47 63 40 28 00 00 20 00
May 25 06:41:58 stor-01.domain.com (da3:mpr0:0:11:0): CAM status: SCSI Status Error
May 25 06:41:58 stor-01.domain.com (da3:mpr0:0:11:0): SCSI status: Check Condition
May 25 06:41:58 stor-01.domain.com (da3:mpr0:0:11:0): SCSI sense: HARDWARE FAILURE asc:3,0 (Peripheral device write fault)
May 25 06:41:58 stor-01.domain.com (da3:mpr0:0:11:0): Info: 0x47634028
May 25 06:41:58 stor-01.domain.com (da3:mpr0:0:11:0): Field Replaceable Unit: 8
May 25 06:41:58 stor-01.domain.com (da3:mpr0:0:11:0): Command Specific Info: 0x7004f00
May 25 06:41:58 stor-01.domain.com (da3:mpr0:0:11:0): Actual Retry Count: 23
May 25 06:41:58 stor-01.domain.com (da3:mpr0:0:11:0): Descriptor 0x80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00
May 25 06:41:58 stor-01.domain.com (da3:mpr0:0:11:0): Retrying command (per sense data)
May 25 06:42:01 stor-01.domain.com (da3:mpr0:0:11:0): WRITE(10). CDB: 2a 00 47 63 40 28 00 00 20 00
May 25 06:42:01 stor-01.domain.com (da3:mpr0:0:11:0): CAM status: SCSI Status Error
May 25 06:42:01 stor-01.domain.com (da3:mpr0:0:11:0): SCSI status: Check Condition
May 25 06:42:01 stor-01.domain.com (da3:mpr0:0:11:0): SCSI sense: HARDWARE FAILURE asc:3,0 (Peripheral device write fault)
May 25 06:42:01 stor-01.domain.com (da3:mpr0:0:11:0): Info: 0x47634028
May 25 06:42:01 stor-01.domain.com (da3:mpr0:0:11:0): Field Replaceable Unit: 8
May 25 06:42:01 stor-01.domain.com (da3:mpr0:0:11:0): Command Specific Info: 0x7004f00
May 25 06:42:01 stor-01.domain.com (da3:mpr0:0:11:0): Actual Retry Count: 23
May 25 06:42:01 stor-01.domain.com (da3:mpr0:0:11:0): Descriptor 0x80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00
May 25 06:42:01 stor-01.domain.com (da3:mpr0:0:11:0): Retrying command (per sense data)
  • middlewared.log - nothing that looks special, apart from web UI/socket errors because everything was locked up
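If it helps anyone digging through the attachments, the relevant lines can be pulled straight out of the messages log with something like:

# CAM / SCSI sense errors for the suspect disk (da3 in this case)
grep 'da3' /var/log/messages | grep -E 'CAM status|SCSI sense|Retrying|WRITE'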
The hardware specs and version are:
  • X10DRU-i+
  • Dual Xeon E5-2620 v3
  • 128GB DDR4-2166 ECC
  • LSI 3008 HBA
  • SAS 6Gbps backplane (I know, it's a pain)
  • 12x Seagate ES.3 4TB SAS HDDs in a stripe of six mirrored-pair vdevs
    • 16GB Optane SLOG
    • 2TB NVMe Cache
  • TrueNAS Core 12.0-U8
Current zpool status -v output:
  pool: Warm_HDD_Pool
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed May 25 10:04:20 2022
        11.8T scanned at 1.43G/s, 9.78T issued at 1.19G/s, 11.8T total
        15.4G resilvered, 83.16% done, 00:28:22 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        Warm_HDD_Pool                                   ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/005c7367-72c1-11eb-8b6f-0cc47a63b90c  ONLINE       0     0     0
            gptid/01411de8-72c1-11eb-8b6f-0cc47a63b90c  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/00176797-72c1-11eb-8b6f-0cc47a63b90c  ONLINE       0     0     0
            gptid/a83c83bb-dc09-11ec-8b33-0cc47a63b90c  ONLINE       0     0     0  (resilvering)
          mirror-2                                      ONLINE       0     0     0
            gptid/01abc09b-72c1-11eb-8b6f-0cc47a63b90c  ONLINE       0     0     0
            gptid/01d48b31-72c1-11eb-8b6f-0cc47a63b90c  ONLINE       0     0     0
          mirror-3                                      ONLINE       0     0     0
            gptid/7bf2ef84-d05d-11ec-8b33-0cc47a63b90c  ONLINE       0     0     0
            gptid/017356ce-72c1-11eb-8b6f-0cc47a63b90c  ONLINE       0     0     0
          mirror-4                                      ONLINE       0     0     0
            gptid/017ddc83-72c1-11eb-8b6f-0cc47a63b90c  ONLINE       0     0     0
            gptid/01a302f7-72c1-11eb-8b6f-0cc47a63b90c  ONLINE       0     0     0
          mirror-5                                      ONLINE       0     0     0
            gptid/02634146-72c1-11eb-8b6f-0cc47a63b90c  ONLINE       0     0     0
            gptid/9690208f-478c-11ec-9686-0cc47a63b90c  ONLINE       0     0     0
        logs
          gptid/0152abbd-72c1-11eb-8b6f-0cc47a63b90c    ONLINE       0     0     0
        cache
          gptid/026cf090-72c1-11eb-8b6f-0cc47a63b90c    ONLINE       0     0     0

errors: No known data errors

To summarise all of the above:

Why does OFFLINE-ing a failed/degraded disk cause the iSCSI service to stop responding and subsequently cause issues on our VMware host? This time it corrupted a VM; thankfully no changes had been made since the 4am backup, so I was able to restore it without any service disruption, but it's still highly concerning. It could be that I'm asking the wrong question and OFFLINE-ing the disk was not the cause, only part of the symptoms of the disk array going unhealthy, but it's too coincidental that on two occasions a disk degrades, faults, I OFFLINE it, and then iSCSI (and TrueNAS itself) goes unresponsive and causes issues for the virtual host.

Hopefully I've managed to provide enough information to help with analysis! :)

Many thanks,
Emile
 

Attachments

  • daemon.txt (188.4 KB)
  • middlewared.txt (35.3 KB)
  • messages.txt (356.2 KB)

sretalla (Moderator)

Why does OFFLINE-ing a failed/degraded disk cause the iSCSI service to stop responding and subsequently cause issues on our VMware host?
A failed disk is already offline.

It could be that I'm asking the wrong question and OFFLINE-ing the disk was not the cause, only part of the symptoms of the disk array going unhealthy
I agree.

it's too coincidental that on two occasions a disk degrades, faults, I OFFLINE it, and then iSCSI (and TrueNAS itself) goes unresponsive and causes issues for the virtual host.
I agree that it's not normal/expected; a disk going offline, whether by fault or on purpose, isn't supposed to cause a pool to become unstable.

I would suggest that it may be some kind of trouble with your HBA handling the doubtless innumerable retries to a misbehaving disk if the pool is busy.
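If you want to start ruling the HBA in or out, a few things worth grabbing (a sketch; sas3flash is the Broadcom/LSI flash utility and may not be on every build, and da3 is your suspect disk):

# Firmware vs driver version that the mpr driver reported at boot
grep -i 'mpr0: firmware' /var/run/dmesg.boot

# Controller and firmware details from the LSI utility (if present on the system)
sas3flash -list

# CAM queue state for the suspect disk (outstanding vs allowed openings)
camcontrol tags da3 -v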

2TB NVMe Cache
I can imagine this raising some eyebrows as being an enormous L2ARC size, even with 128GB of RAM. Are your ARC hits really low enough to warrant even any L2ARC at all? (let alone 2 TB of it)
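If you want to check, the raw ARC and L2ARC hit/miss counters are exposed as sysctls; a quick look would be:

# ARC hit/miss counters (hit rate = hits / (hits + misses))
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses

# L2ARC counters, to see whether that 2TB cache device is actually being hit
sysctl kstat.zfs.misc.arcstats.l2_hits kstat.zfs.misc.arcstats.l2_misses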
 

Emile.Belcourt (Dabbler)

A failed disk is already offline.
Unusual, then, that the UI (and I think zpool status -v the first time it happened) still stated it as ONLINE and gave the option to OFFLINE it. Could it be a UI error, where it was trying to OFFLINE a disk that was already in that state and got itself mucked up?

But as I've mentioned above, it said it was still "ONLINE"

I agree that it's not normal/expected; a disk going offline, whether by fault or on purpose, isn't supposed to cause a pool to become unstable.

I would suggest that it may be some kind of trouble with your HBA handling the doubtless innumerable retries to a misbehaving disk if the pool is busy.
Not sure how to find that information. I am tempted to suspend all VM activity during a maintenance window and OFFLINE a disk to see if it disrupts iSCSI. What is also interesting is that it reported failing to read SMART values just as I attempted to OFFLINE the disk:

New alerts:
* Device: /dev/da3, failed to read SMART values.

Current alerts:
* Device /dev/gptid/ff61160b-72c0-11eb-8b6f-0cc47a63b90c is causing slow I/O on pool Warm_HDD_Pool.
* Pool Warm_HDD_Pool state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
* Device: /dev/da3, failed to read SMART values.
Could it be that it was attempting to read SMART data while the disk was failing? The next SHORT test isn't due until tomorrow morning and the LONG test not until the 3rd of next month. When a disk is exhibiting failures, does TrueNAS then go and examine SMART data (it would make sense if it did)?
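For what it's worth, I can also poll SMART by hand rather than waiting for the scheduled tests (da3 being the current suspect):

# Full SMART output; for SAS disks this includes the grown defect list and error counters
smartctl -a /dev/da3

# Quick health verdict only
smartctl -H /dev/da3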

However, a potential counterpoint to your suggestion is that the SCSI layer was reporting bad health long before TrueNAS alerted me; the first such message read:
May 25 06:41:58 stor-01.domain.com (da3:mpr0:0:11:0): SCSI sense: HARDWARE FAILURE asc:3,0 (Peripheral device write fault)
The first TrueNAS alert was at 06:47 for slow I/O, but that could just be TrueNAS picking up the error in a different way, with the SCSI sense data simply being more sensitive?

I can imagine this raising some eyebrows as being an enormous L2ARC size, even with 128GB of RAM. Are your ARC hits really low enough to warrant even any L2ARC at all? (let alone 2 TB of it)
That would be a question for another thread. Short and sweet: it was going spare with no obvious future need, so I threw it in. I will potentially move it to a mirrored pair of 500GB/1TB disks but haven't decided; either way, that's not the subject of this thread atm :)
 

sretalla (Moderator)

Have a look at dmesg while the problem is happening (run it a few times) and see the CAM messages.

Emile.Belcourt (Dabbler)

Have a look at dmesg while the problem is happening (run it a few times) and see the CAM messages.
I'll try to remember to do that if/when the issue happens.
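To make sure I actually capture it next time, I'm planning to leave something like this running in a shell (the file path is just my choice):

# Append a timestamped snapshot of the kernel message buffer every 30 seconds
while true; do
    date >> /root/cam_capture.log
    dmesg | tail -n 50 >> /root/cam_capture.log
    sleep 30
done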

Hopefully the log data I've provided can shed some light on what is happening, beyond simply suspecting the HBA.
 

sretalla (Moderator)

Your included messages log already shows a good example with dozens of aborted commands per second.
 

Emile.Belcourt (Dabbler)

Your included messages log already shows a good example with dozens of aborted commands per second.

Are those aborted commands solely due to an overloaded HBA when the disk pool is busy?

Before I start investigating whether the HBA or TrueNAS is at fault, I just want to confirm whether that is the only likely explanation.
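When it next starts degrading I'll also try to grab per-disk load, to see whether it actually looks like saturation (a sketch):

# Per-disk queue depth, latency and %busy via GEOM stats, refreshed every 5 seconds
gstat -p -I 5s

# Per-vdev I/O for the pool, refreshed every 5 seconds
zpool iostat -v Warm_HDD_Pool 5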

Edit: It looks like a lot of dmesg data is still there, left over from when the last disk failed (da6; today's was da3). I have attached the dmesg output.
 

Attachments

  • dmesgoutput.txt (317.5 KB)

Emile.Belcourt (Dabbler)

Are there any further details from the dmesg output I have provided?

If the HBA is struggling, why did the struggles only start when I attempted to OFFLINE a disk that was showing as ONLINE, even though, having faulted, it should have already been OFFLINE? It seems a little nonsensical that this happened the way it did if the drive was already offline, and that TrueNAS itself, not just the HBA, locked up. The TrueNAS lockup could have been a symptom of the misbehaving HBA, but if so then I either need to prove fault with the HBA or raise an issue with iXsystems.
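In the meantime, my plan for gathering evidence for or against the HBA is roughly this (commands as I understand them; happy to be corrected):

# Any controller-level resets, faults or timeouts around the lockup window
grep -iE 'mpr0.*(reset|timeout|fault)' /var/log/messages

# Rough count of aborted/timed-out commands per minute, to size the "storm"
grep -iE 'abort|timeout' /var/log/messages | cut -c1-12 | sort | uniq -c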
 