Can't Offline Disk

Warren Ince

Cadet
Joined
Jul 18, 2023
Messages
1
Hiya,

I have a failed disk that I'm trying to offline (following the link below) before replacing it, but nothing happens at the step "Select Confirm to activate the OFFLINE button, then click OFFLINE. The disk should now be offline."


Version: TrueNAS-12.0-U3
Supermicro (Family SMC X10)
Product Name: X10DRH-iT
2 x Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz

Has anyone else experienced this? Zpool status below.

root@LON-ESXi-NAS-02[~]# zpool status
  pool: SAS-3-Mirror-SLOG-Cache
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 08:09:18 with 0 errors on Sun Jul 2 08:09:26 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        SAS-3-Mirror-SLOG-Cache                         DEGRADED     0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/3e17d4f3-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/3f834ec0-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/34d2add4-e3fd-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/3eb510eb-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/41136a76-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/414b449f-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
          mirror-2                                      DEGRADED     0     0     0
            gptid/402f7a69-9df6-11eb-af7d-3cfdfebaf4f0  FAULTED     72     0 72.0M  too many errors
            gptid/431a58ec-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/435cb0bd-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
          mirror-3                                      ONLINE       0     0     0
            gptid/41876105-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/44776db4-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/44c4412d-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
          mirror-4                                      ONLINE       0     0     0
            gptid/4397a148-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/45d7cbf9-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/4738a27d-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
          mirror-5                                      ONLINE       0     0     0
            gptid/46bb93b2-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/488689ed-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/4968a4b5-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
          mirror-6                                      ONLINE       0     0     0
            gptid/4aa8b5fa-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/4b6d8656-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/4baf8a36-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
          mirror-7                                      ONLINE       0     0     0
            gptid/4c49a23a-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/4c2f09be-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
            gptid/4c9b3374-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
        logs
          mirror-8                                      ONLINE       0     0     0
            gptid/4b6a05ba-9df6-11eb-af7d-3cfdfebaf4f0  ONLINE       0     0     0
        cache
          gptid/4bf21e06-9df6-11eb-af7d-3cfdfebaf4f0    ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:03 with 0 errors on Sun Jul 16 03:45:04 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          ada0p2    ONLINE       0     0     0

errors: No known data errors
root@LON-ESXi-NAS-02[~]#
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hi @Warren Ince

My understanding of the UI here is that the FAULTED status will always be displayed over OFFLINE, to indicate the presence of a bad disk or connection, as opposed to "this disk is healthy, but was manually set to OFFLINE by an administrator."

If you have a spare disk installed in the system, you can start the replacement process right away - if you need to remove the faulted disk, you'll want to identify it physically first. Based on the other components, I'm hoping your hardware has support for the sesutil command suite, so give the following commands a shot in the shell to find it:

gpart list | grep -B 6 '402f7a69-9df6-11eb-af7d-3cfdfebaf4f0'

This will look for the rawuuid of the FAULTED drive and print the preceding 6 lines as well. Look at the first line for the Name: daXpY to locate the physical drive, then replace daX with your drive and enter:

sesutil fault daX on

If your backplane speaks SES, it'll light up the "faulted drive" LED for that bay. You can also try sesutil locate daX on if your bays only have a "locate" LED.
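
As a side note, if the gpart output is awkward to read, glabel can map that gptid straight to its daXpY provider (a quick sketch using the FAULTED member's gptid from your zpool status; the partition it prints, minus the pY suffix, is the disk):

Code:
# find which partition carries the FAULTED gptid label (e.g. da6p2)
glabel status | grep 402f7a69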

Let me know how this works out!
 
Joined
Sep 12, 2023
Messages
5
Hi HoneyBadger,
I work with Warren, who has asked me to take over this case.
The TrueNAS system we currently use does not support SES.
Can you tell us what alternative to SES is available that will show us where disk da6p2 is currently located?
Kind regards
Steve
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hi @stephen_eaves

If the enclosure doesn't have SES functionality, but does have individual drive bay LEDs, you can use the command line to artificially generate some read traffic to the drive with dd if=/dev/da6p2 of=/dev/null bs=1M count=10000, and then check for the drive with the flickering light.
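
If a single pass finishes before you can get to the rack, a rough variation (still assuming the faulted disk is da6) is to read the whole partition and stop it manually, with gstat confirming that da6 is the device actually doing the work:

Code:
# read until interrupted; stop with Ctrl+C once you've spotted the LED
dd if=/dev/da6p2 of=/dev/null bs=1M
# in a second shell, watch the I/O counters for da6 to confirm it's busy
gstat -f da6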

Alternatively, if there's regular drive activity to the pool already, look for the drive that doesn't have any activity (because it's in FAULTED state) and that's likely your candidate.

If there are no individual bay LEDs, then you may want to count the drive bays in your enclosure, starting from the "first physical bay" being da0, and hazard a guess that the seventh drive is the faulted one.
 
Johnny Fartpants

Joined
Jul 3, 2015
Messages
926
How about sas2ircu or sas3ircu 0 display?
 
Johnny Fartpants

Joined
Jul 3, 2015
Messages
926
smartctl -a /dev/da6 will give you the serial number for the drive and if your system supports sas2ircu or sas3ircu you can do the following:

sas3ircu 0 display | grep SERIALNUMBER -A2 -B8

This will give you the location of the drive and then you can turn the light on with:

sas3ircu 0 locate 2:0 ON
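
To tie those two steps together, something like this should work from an sh-compatible shell (a rough sketch; controller 0 is an assumption and SERIAL is just a placeholder variable):

Code:
# grab the drive's serial number, then look it up in the controller's device list
SERIAL=$(smartctl -i /dev/da6 | awk '/Serial number/ {print $3}')
sas3ircu 0 display | grep -B8 -A2 "$SERIAL"

The Enclosure#/Slot# pair from the matching block is what goes into the locate command.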
 
Joined
Sep 12, 2023
Messages
5
Hi Honeybadger,

We are using a Supermicro SSG-6048R-E1CR24H chassis, which only has a static blue light and shows no disk activity.
The command dd if=/dev/da6p2 of=/dev/null bs=1M count=10000 did run, but nothing was shown on the disk's blue light.
We have contacted Supermicro support, who have told us the position of disk 6 in the chassis.
Should we go with their advice, or is there another way we can confirm where disk 6 is located in the chassis?

Kind regards

Steve
 
Joined
Sep 12, 2023
Messages
5
We now know Disk 6 is in Slot 6. Disk 6 is being used as a Cache Disk. Is it safe to remove Disk 6 and replace it without putting the ESXi Hosts into Maintenance mode?

Kind regards

Steve
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Morning Steve,

Apologies for not getting back to you sooner. Did you try the commands from @Johnny Fartpants above to see if they enable a locator LED on your chassis? The SSG-6048R-E1CR24H should support either sesutil or those commands - however, I did note that the Supermicro page indicates that it ships with a SAS3108 hardware RAID controller instead of an HBA.

Based on your zpool status output the failed device is in the mirror-2 vdev in your pool, not a cache disk, so I would double-check the starting position. TrueNAS counts disks from da0, so if the Supermicro chassis starts from "1" then you'll want the seventh disk, not the sixth. Would that line up with a data drive?
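
If you'd like to sanity-check from the shell rather than relying on bay counting alone, listing the devices CAM sees shows which controller target each daX sits behind (a rough aid only, since slot numbering still depends on the backplane wiring):

Code:
# lists each disk as <vendor model> at scbusX target Y lun Z (passN,daN)
camcontrol devlist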

Your pool uses three-way-mirror vdevs, so you do have the extra assurance that removing the "wrong disk" should not result in data loss.
 
Joined
Sep 12, 2023
Messages
5
[Screenshot showing the failed disk da6p2]

The above shows our failed disk da6p2. I have been asked: if we accidentally remove disk da7p2, would we lose all the data on disk da8p2?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hi Stephen,

Your pool is configured with three-way mirrors, so even with the single FAULTED disk in the mirror-2 vdev, accidentally removing the da7 disk would result in the da8 disk remaining active.

Code:
mirror-2                                     DEGRADED     0     0     0
  gptid/402f7a69-9df6-11eb-af7d-3cfdfebaf4f0 FAULTED     72     0 72.0M too many errors
  gptid/431a58ec-9df6-11eb-af7d-3cfdfebaf4f0 ONLINE       0     0     0
  gptid/435cb0bd-9df6-11eb-af7d-3cfdfebaf4f0 ONLINE       0     0     0


I would suggest having a console session open, and immediately running the zpool status command after removing the drive to verify that the correct drive was removed.
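
If the full output is noisy, a narrower view of just the affected vdev may help (a small sketch using the pool and vdev names from your earlier paste):

Code:
# show just the mirror-2 vdev and its three member disks
zpool status SAS-3-Mirror-SLOG-Cache | grep -A 3 mirror-2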
 
Joined
Sep 12, 2023
Messages
5
Hi Honeybadger,

Happy new year.

We are still having issues after replacing the faulty disk on the 29th December 2023.

We are still seeing errors, which may be from before the disk was replaced.

The new da6 disk appears to be working after doing a test read, as follows:

root@LON-ESXi-NAS-02[~]# dd if=/dev/da6p2 of=/dev/null bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 116.318690 secs (90146820 bytes/sec)
root@LON-ESXi-NAS-02[~]#

I also ran gpart list da6, which says the disk status is OK, as follows:

Geom name: da6
modified: false
state: OK
fwheads: 255
fwsectors: 63
last: 19134414807
first: 40
entries: 128
scheme: GPT

I also ran the following command:

root@LON-ESXi-NAS-02[~]# smartctl -a /dev/da6
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST10000NM0096
Revision: NEB0
Compliance: SPC-4
User Capacity: 9,796,820,402,176 bytes [9.79 TB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
Formatted with type 2 protection
8 bytes of protection information per logical block
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c5009413dfdf
Serial number: ZA21MWAJ0000C7354RX7
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Wed Jan 3 10:54:45 2024 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: SERVO IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=42]

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 15409
Power on minutes since format <not available>
Current Drive Temperature: 38 C
Drive Trip Temperature: 40 C

Accumulated power on time, hours:minutes 46806:30
Manufactured in week 23 of year 2017
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 257
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 2194
Elements in grown defect list: 1561

Vendor (Seagate Cache) information
Blocks sent to initiator = 3647756032
Blocks received from initiator = 623937728
Blocks read from cache and sent to initiator = 1753466531
Number of read and write commands whose size <= segment size = 435804471
Number of read and write commands whose size > segment size = 11743923

Vendor (Seagate/Hitachi) factory information
number of hours powered up = 46806.50
number of minutes until next internal SMART test = 40

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/      errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   2312289692      637         0  2312290329        648     104632.451         4
write:           0        0        11          11        875     137023.945         0
verify:       1888        0         0        1888          0          0.000         0

Non-medium error count: 11975

scsiPrintSelfTest: Failed [scsi response fails sanity test]
root@LON-ESXi-NAS-02[~]#


However, the Dashboard is still saying da6 is FAULTY, and it is showing the date as 24th December 2023, even though I only replaced the disk on the 29th December 2023. We have tried to refresh the Dashboard, but it is still showing old data.

I have also enabled priority resilver.

So I have several questions as follows -

Is the new disk working correctly?
Should the new disk automatically add itself to the Correct pool?
How do I force the Dashboard to update itself?

Kind regards

Steve
 

Attachments: da6.jpg (screenshot, 38.1 KB)

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hello Stephen,

Try a force-refresh (Ctrl+F5) of the dashboard page, clear your browser cache, or use Incognito/Private mode to see if it's a locally cached issue causing the dashboard to be out of date.

The SMART results do have a pre-failure warning on that drive as shown by the lines:

=== START OF READ SMART DATA SECTION ===
SMART Health Status: SERVO IMPENDING FAILURE DATA ERROR RATE TOO HIGH [asc=5d, ascq=42]

Elements in grown defect list: 1561

Are you positive that you've replaced the faulted drive with the serial number in that SMART listing? It appears from the screenshots that the fault count is still growing.
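
A quick way to check from the shell, assuming the replacement drive still enumerates as da6, is to compare the serial it reports now against the one in that listing (ZA21MWAJ0000C7354RX7) and against the label on the drive you pulled:

Code:
# print the drive identification, including the serial number, for da6
smartctl -i /dev/da6 | grep -i 'serial number'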

Once the drive has been physically replaced, you will need to use the Replace option in the pool status on the old/removed da6 to have it take the place of the failed drive in the pool, as shown in:

 