Can't Offline Disk

Warren Ince

Cadet
Joined
Jul 18, 2023
Messages
1
Hiya,

I have a failed disk that I'm trying to offline (following the link below) before replacing it, but nothing happens at the step "Select Confirm to activate the OFFLINE button, then click OFFLINE. The disk should now be offline."


Version: TrueNAS-12.0-U3
Supermicro (Family SMC X10)
Product Name: X10DRH-iT
2 x Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz

Has anyone else experienced this? Zpool status below.

root@LON-ESXi-NAS-02[~]# zpool status
  pool: SAS-3-Mirror-SLOG-Cache
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 08:09:18 with 0 errors on Sun Jul 2 08:09:26 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        SAS-3-Mirror-SLOG-Cache                         DEGRADED     0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/3e17d4f3-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/3f834ec0-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/34d2add4-e3fd-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/3eb510eb-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/41136a76-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/414b449f-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
          mirror-2                                      DEGRADED     0     0     0
            gptid/402f7a69-9df6-11eb-af7d-3cfdfebaf4f0  FAULTED     72     0 72.0M  too many errors
            gptid/431a58ec-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/435cb0bd-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
          mirror-3                                      ONLINE       0     0     0
            gptid/41876105-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/44776db4-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/44c4412d-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
          mirror-4                                      ONLINE       0     0     0
            gptid/4397a148-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/45d7cbf9-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/4738a27d-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
          mirror-5                                      ONLINE       0     0     0
            gptid/46bb93b2-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/488689ed-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/4968a4b5-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
          mirror-6                                      ONLINE       0     0     0
            gptid/4aa8b5fa-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/4b6d8656-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/4baf8a36-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
          mirror-7                                      ONLINE       0     0     0
            gptid/4c49a23a-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/4c2f09be-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/4c9b3374-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
        logs
          mirror-8                                      ONLINE       0     0     0
            gptid/4b6a05ba-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
        cache
          gptid/4bf21e06-9df6-11eb-af7d-3cfdfebaf4f0    ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:03 with 0 errors on Sun Jul 16 03:45:04 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          ada0p2    ONLINE       0     0     0

errors: No known data errors
root@LON-ESXi-NAS-02[~]#
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hi @Warren Ince

My understanding of the UI here is that the FAULTED status will always be displayed over OFFLINE, to indicate the presence of a bad disk or connection, as opposed to "this disk is healthy, but was manually set to OFFLINE by an administrator."

If you have a spare disk installed in the system, you can start the replacement process right away - if you need to remove the faulted disk, you'll want to identify it physically first. Based on the other components, I'm hoping your hardware has support for the sesutil command suite, so give the following commands a shot in the shell to find it:

gpart list | grep -B 6 '402f7a69-9df6-11eb-af7d-3cfdfebaf4f0'

This will look for the rawuuid of the FAULTED drive and print the preceding 6 lines as well. Look at the first line for the Name: daXpY to locate the physical drive, then replace daX with your drive and enter:

sesutil fault daX on

If your backplane speaks SES, it'll light up the "faulted drive" LED for that bay. You can also try sesutil locate daX on if your bays only have a "locate" LED.
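
As a side note, if the gpart output is awkward to read, glabel can map that gptid straight to its daXpY provider (a quick sketch using the FAULTED member's gptid from your zpool status; the partition it prints, minus the pY suffix, is the disk):

Code:
# find which partition carries the FAULTED gptid label (e.g. da6p2)
glabel status | grep 402f7a69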

Let me know how this works out!
 
Joined
Sep 12, 2023
Messages
5
Hi HoneyBadger,
I work with Warren, who has asked me to take over this case.
The TrueNAS system we currently use does not support SES.
Can you tell us what alternative to SES is available that will show us where disk da6p2 is currently located?
Kind regards
Steve
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hi @stephen_eaves

If the enclosure doesn't have SES functionality, but does have individual drive bay LEDs, you can use the command line to artificially generate some read traffic to the drive with dd if=/dev/da6p2 of=/dev/null bs=1M count=10000, and then check for the drive with the flickering light.
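
If a single pass finishes before you can get to the rack, a rough variation (still assuming the faulted disk is da6) is to read the whole partition and stop it manually, with gstat confirming that da6 is the device actually doing the work:

Code:
# read until interrupted; stop with Ctrl+C once you've spotted the LED
dd if=/dev/da6p2 of=/dev/null bs=1M
# in a second shell, watch the I/O counters for da6 to confirm it's busy
gstat -f da6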

Alternatively, if there's regular drive activity to the pool already, look for the drive that doesn't have any activity (because it's in FAULTED state) and that's likely your candidate.

If there are no individual bay LEDs, then you may want to count the drive bays in your enclosure, starting from the "first physical bay" being da0, and hazard a guess that the seventh drive is the faulted one.
 
Johnny Fartpants

Joined
Jul 3, 2015
Messages
926
How about sas2ircu or sas3ircu 0 display?
 
Johnny Fartpants

Joined
Jul 3, 2015
Messages
926
smartctl -a /dev/da6 will give you the serial number for the drive and if your system supports sas2ircu or sas3ircu you can do the following:

sas3ircu 0 display | grep SERIALNUMBER -A2 -B8

This will give you the location of the drive and then you can turn the light on with:

sas3ircu 0 locate 2:0 ON
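
To tie those two steps together, something like this should work from an sh-compatible shell (a rough sketch; controller 0 is an assumption and SERIAL is just a placeholder variable):

Code:
# grab the drive's serial number, then look it up in the controller's device list
SERIAL=$(smartctl -i /dev/da6 | awk '/Serial number/ {print $3}')
sas3ircu 0 display | grep -B8 -A2 "$SERIAL"

The Enclosure#/Slot# pair from the matching block is what goes into the locate command.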
 
Joined
Sep 12, 2023
Messages
5
Hi Honeybadger,

We are using a Supermicro SSG-6048R-E1CR24H chassis, which only has a static blue light and shows no disk activity.
The command dd if=/dev/da6p2 of=/dev/null bs=1M count=10000 did run, but nothing was shown on the disk's blue light.
We have contacted Supermicro support, who have told us the position of disk 6 in the chassis.
Should we go with their advice, or is there another way we can confirm where disk 6 is located in the chassis?

Kind regards

Steve
 
Joined
Sep 12, 2023
Messages
5
We now know Disk 6 is in Slot 6. Disk 6 is being used as a Cache Disk. Is it safe to remove Disk 6 and replace it without putting the ESXi Hosts into Maintenance mode?

Kind regards

Steve
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Morning Steve,

Apologies for not getting back to you sooner. Did you try the commands from @Johnny Fartpants above to see if they enable a locator LED on your chassis? The SSG-6048R-E1CR24H should support either sesutil or those commands - however, I did note that the Supermicro page indicates that it ships with a SAS3108 hardware RAID controller instead of an HBA.

Based on your zpool status output the failed device is in the mirror-2 vdev in your pool, not a cache disk, so I would double-check the starting position. TrueNAS counts disks from da0, so if the Supermicro chassis starts from "1" then you'll want the seventh disk, not the sixth. Would that line up with a data drive?
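
If you'd like to sanity-check from the shell rather than relying on bay counting alone, listing the devices CAM sees shows which controller target each daX sits behind (a rough aid only, since slot numbering still depends on the backplane wiring):

Code:
# lists each disk as <vendor model> at scbusX target Y lun Z (passN,daN)
camcontrol devlist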

Your pool uses three-way-mirror vdevs, so you do have the extra assurance that removing the "wrong disk" should not result in data loss.
 
Joined
Sep 12, 2023
Messages
5
[Screenshot showing the failed disk da6p2]

The above shows our failed disk da6p2. I have been asked: if we accidentally remove disk da7p2, would we lose all the data on disk da8p2?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hi Stephen,

Your pool is configured with three-way mirrors, so even with the single FAULTED disk in the mirror-2 vdev, accidentally removing the da7 disk would result in the da8 disk remaining active.

Code:
mirror-2                                     DEGRADED     0     0     0
  gptid/402f7a69-9df6-11eb-af7d-3cfdfebaf4f0 FAULTED     72     0 72.0M too many errors
  gptid/431a58ec-9df6-11eb-af7d-3cfdfebaf4f0 ONLINE       0     0     0
  gptid/435cb0bd-9df6-11eb-af7d-3cfdfebaf4f0 ONLINE       0     0     0


I would suggest having a console session open, and immediately running the zpool status command after removing the drive to verify that the correct drive was removed.
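
If the full output is noisy, a narrower view of just the affected vdev may help (a small sketch using the pool and vdev names from your earlier paste):

Code:
# show just the mirror-2 vdev and its three member disks
zpool status SAS-3-Mirror-SLOG-Cache | grep -A 3 mirror-2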
 
Joined
Sep 12, 2023
Messages
5
Hi Honeybadger,

Happy new year.

We are still having issues after replacing the faulty disk on the 29th December 2023.

We are still seeing errors, which may be from before the disk was replaced.

The new da6 disk appears to be working after doing a test read, as follows:

root@LON-ESXi-NAS-02[~]# dd if=/dev/da6p2 of=/dev/null bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 116.318690 secs (90146820 bytes/sec)
root@LON-ESXi-NAS-02[~]#

I also ran gpart list da6, which says the disk status is OK, as follows:

Geom name: da6
modified: false
state: OK
fwheads: 255
fwsectors: 63
last: 19134414807
first: 40
entries: 128
scheme: GPT

I also ran the following command:

root@LON-ESXi-NAS-02[~]# smartctl -a /dev/da6
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST10000NM0096
Revision: NEB0
Compliance: SPC-4
User Capacity: 9,796,820,402,176 bytes [9.79 TB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
Formatted with type 2 protection
8 bytes of protection information per logical block
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c5009413dfdf
Serial number: ZA21MWAJ0000C7354RX7
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Wed Jan 3 10:54:45 2024 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: SERVO IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=42]

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 15409
Power on minutes since format <not available>
Current Drive Temperature: 38 C
Drive Trip Temperature: 40 C

Accumulated power on time, hours:minutes 46806:30
Manufactured in week 23 of year 2017
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 257
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 2194
Elements in grown defect list: 1561

Vendor (Seagate Cache) information
Blocks sent to initiator = 3647756032
Blocks received from initiator = 623937728
Blocks read from cache and sent to initiator = 1753466531
Number of read and write commands whose size <= segment size = 435804471
Number of read and write commands whose size > segment size = 11743923

Vendor (Seagate/Hitachi) factory information
number of hours powered up = 46806.50
number of minutes until next internal SMART test = 40

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/      errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   2312289692      637         0  2312290329        648     104632.451         4
write:           0        0        11          11        875     137023.945         0
verify:       1888        0         0        1888          0          0.000         0

Non-medium error count: 11975

scsiPrintSelfTest: Failed [scsi response fails sanity test]
root@LON-ESXi-NAS-02[~]#


However, the Dashboard is still saying da6 is FAULTY, and it is showing the date as 24th December 2023, even though I only replaced the disk on the 29th December 2023. We have tried to refresh the Dashboard, but it is still showing old data.

I have also enabled priority resilver.

So I have several questions as follows -

Is the new disk working correctly?
Should the new disk automatically add itself to the Correct pool?
How do I force the Dashboard to update itself?

Kind regards

Steve
 

Attachments: da6.jpg (screenshot, 38.1 KB)

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hello Stephen,

Try a force-refresh (Ctrl+F5) of the dashboard page, clear your browser cache, or use Incognito/Private mode to see if it's a locally cached issue causing the dashboard to be out of date.

The SMART results do have a pre-failure warning on that drive as shown by the lines:

=== START OF READ SMART DATA SECTION ===
SMART Health Status: SERVO IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=42]

Elements in grown defect list: 1561

Are you positive that you've replaced the faulted drive with the serial number in that SMART listing? It appears from the screenshots that the fault count is still growing.
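
A quick way to check from the shell, assuming the replacement drive still enumerates as da6, is to compare the serial it reports now against the one in that listing (ZA21MWAJ0000C7354RX7) and against the label on the drive you pulled:

Code:
# print the drive identification, including the serial number, for da6
smartctl -i /dev/da6 | grep -i 'serial number'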

Once the drive has been physically replaced, you will need to use the Replace option in the pool status on the old/removed da6 to have it take the place of the failed drive in the pool, as shown in:

 