Pool degraded on Disk4 but SMART says Disk1

ianch

Cadet
Joined
Jan 30, 2022
Messages
5
Hi

I am running TrueNAS Core 12.0-U7 with a RAIDZ1 pool consisting of four 3 TB WD Red NAS drives. The server is a Dell R730XD with a PERC H330. The PERC has a RAID1 of two SSDs used for the TrueNAS OS.

10 days ago the TrueNAS pool reported it was degraded and that Disk4 was the issue.

So I replaced Disk4 and waited 7 days for the resilvering to finish. Seemed rather a long time to me on Dell hardware.

Anyhow..... The resilver has completed and the pool is still showing degraded with Disk4.
Strange because I used HDDScan on my PC to check the disk first and it passed all tests.

Yesterday I did a short SMART test on all four disks and Disk1 failed but Disk 2, 3 & 4 passed. I ran a long test overnight and got the same results.

What should I do? Replace Disk4 again? or replace Disk1?

I have tried moving Disk4 to a different slot in the server and same results.

Additional info...

zpool status

  pool: Vol1
 state: DEGRADED
config:
        NAME                                            STATE     READ WRITE CKSUM
        Vol1                                            DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/bad87657-69f1-11e5-9919-c4e984000afe  ONLINE       0     0     0
            gptid/bbe358e8-69f1-11e5-9919-c4e984000afe  ONLINE       0     0     0
            gptid/bcf2ac3e-69f1-11e5-9919-c4e984000afe  ONLINE       0     0     0
            gptid/86e1e243-7b8e-11ec-ba46-1866da51462d  DEGRADED     0     0     0  too many errors
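
For anyone wanting to pick the unhealthy member out of a longer listing, here is a minimal Python sketch. It is not part of TrueNAS; the function name `unhealthy_members` is made up for illustration, and it assumes the whitespace-separated column layout of the `zpool status` text pasted above.

```python
def unhealthy_members(zpool_status):
    """Return (name, state) for every gptid member whose state is not ONLINE."""
    found = []
    for line in zpool_status.splitlines():
        fields = line.split()
        # Member lines look like: <name> <state> <read> <write> <cksum> [note ...]
        if len(fields) >= 5 and fields[0].startswith("gptid/") and fields[1] != "ONLINE":
            found.append((fields[0], fields[1]))
    return found


# Sample text copied from the zpool status output in this post.
SAMPLE = """\
NAME STATE READ WRITE CKSUM
Vol1 DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
gptid/bad87657-69f1-11e5-9919-c4e984000afe ONLINE 0 0 0
gptid/bbe358e8-69f1-11e5-9919-c4e984000afe ONLINE 0 0 0
gptid/bcf2ac3e-69f1-11e5-9919-c4e984000afe ONLINE 0 0 0
gptid/86e1e243-7b8e-11ec-ba46-1866da51462d DEGRADED 0 0 0 too many errors
"""

print(unhealthy_members(SAMPLE))
```

The sketch only looks at `gptid/` lines, so pools whose members are listed by another label type would need the prefix check adjusted.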


zpool iostat -v Vol1
                                                  capacity     operations    bandwidth
pool                                            alloc   free   read  write   read  write
----------------------------------------------  -----  -----  -----  -----  -----  -----
Vol1                                            7.79T  3.09T      3     37  26.0K   455K
  raidz1                                        7.79T  3.09T      3     37  26.0K   455K
    gptid/bad87657-69f1-11e5-9919-c4e984000afe      -      -      0      9  6.47K   115K
    gptid/bbe358e8-69f1-11e5-9919-c4e984000afe      -      -      0      9  6.18K   112K
    gptid/bcf2ac3e-69f1-11e5-9919-c4e984000afe      -      -      0      9  6.82K   115K
    gptid/86e1e243-7b8e-11ec-ba46-1866da51462d      -      -      0      8  6.55K   113K
----------------------------------------------  -----  -----  -----  -----  -----  -----


smartctl -a /dev/da1

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     55309         22465496
# 2  Short offline       Completed: read failure       90%     55294         22465496

smartctl -a /dev/da2

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     55316         -
# 2  Short offline       Completed without error       00%     55294         -

smartctl -a /dev/da3

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     55316         -
# 2  Short offline       Completed without error       00%     55294         -

smartctl -a /dev/da4

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     187         -
# 2  Short offline       Completed without error       00%     165         -
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
glabel status
Then compare the GPTID vs the /dev/???
Then run SMART on the failing device. You should also be able to determine the S/N of the device.
 

ianch

Cadet
Joined
Jan 30, 2022
Messages
5
Many thanks for your reply.

glabel confirms the pool is talking about disk4 and the SMART report is still talking about disk1.

See below

glabel status
                                      Name  Status  Components
gptid/9617d799-db5b-11eb-9623-1866da51462d     N/A  da0p1
gptid/bad87657-69f1-11e5-9919-c4e984000afe     N/A  da1p2
gptid/bcf2ac3e-69f1-11e5-9919-c4e984000afe     N/A  da2p2
gptid/bbe358e8-69f1-11e5-9919-c4e984000afe     N/A  da3p2
gptid/86e1e243-7b8e-11ec-ba46-1866da51462d     N/A  da4p2
gptid/961ad1d5-db5b-11eb-9623-1866da51462d     N/A  da0p3

Surely if both disks were failing the pool would fall apart. Yet all the data is still there (and yes backed up for days like this)

 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
You are misinterpreting this information - not using the correct process.

From your SMART data you can see that drive da1 is the problem drive.

Now you have to find that drive physically in your TrueNAS box. The information you have so far does not tell you which physical drive it is; for that you need the drive serial number. To get that, go to Storage > Disks and find the HDD serial number for da1.

Open your server and look at the drives to find the drive the OS identifies as da1.

Use the GUI to replace that drive.
 

ianch

Cadet
Joined
Jan 30, 2022
Messages
5
You could be right and I am clearly missing something.

The GUI is showing disk4 degraded. see below

1643565515020.png


Editing disk4 shows the below serial number

1643565578062.png


That serial number matches the disk4 in the SMART information. The SMART information says no errors.

Editing Disk1 shows the below serial number, which likewise matches the SMART information with errors.

1643565692004.png


Below is the Storage->Disk information

1643565800621.png


Any ideas???
 

ianch

Cadet
Joined
Jan 30, 2022
Messages
5
In addition to my last message

If I power down, remove what I believe is disk1, and power up, da1 shows as missing.

Likewise for disk4, which is da4.

So do I once again replace da4 because TrueNAS is showing that disk as degraded, and then, once resilvering is complete, change da1, which is reporting SMART errors?
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177
To be sure, use the serial number from the SMART report, because the "daX" names can change on every reboot.
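
As a rough illustration of keying on the serial rather than the daX name, here is a Python sketch that pulls the `Serial Number:` field out of saved `smartctl -i` output. The function name `serial_number` is made up, and the model and serial in the sample text are placeholders, not values from this thread.

```python
def serial_number(smartctl_info):
    """Return the value of the 'Serial Number:' line, or None if it is absent."""
    for line in smartctl_info.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Serial Number":
            return value.strip()
    return None


# Placeholder text in the shape of `smartctl -i /dev/daX` output (values are invented).
SAMPLE = """\
Device Model:     EXAMPLE-MODEL
Serial Number:    WD-EXAMPLE0001
"""

print(serial_number(SAMPLE))
```

The returned string is what you would match against the label printed on the physical drive before pulling it.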
 

ianch

Cadet
Joined
Jan 30, 2022
Messages
5
So which disk do I change?

da1 points to a disk with SMART errors but looks good in TrueNAS.

da4 points to a disk with no SMART errors but TrueNAS shows it as degraded.

I've double-checked the WD serials against the SMART reports and confirmed they correspond to the daX devices I think they should.

To be honest, I've now lost confidence in both disks and will swap them out, but which first?

I'm guessing da4 as that's what's reporting as degraded, then da1.

I guess worst case I lose the pool, but I've got backups.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Trust serial numbers and disk errors reported by smartctl. Replace failing disks. Also suggest filing a Jira bug before you do so with a debug attached, so someone at iX can look to see if there's some presentational issue in the GUI.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
smartctl -a /dev/da4

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     187         -
# 2  Short offline       Completed without error       00%     165         -
Just to clarify for you, this doesn't indicate there are no issues with that disk...

A completed test doesn't imply no errors/problems found with the disk, just that the test was able to complete in full.

For da1, the test couldn't complete in full, so you see a result to match.

smartctl -A /dev/da4 would show if there are really some things to worry about or not.
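
As a sketch of what to look for in that output, the Python below scans a saved `smartctl -A` attribute table and reports a few sector-related attributes whose raw value is above zero. The choice of attributes in `WATCHED`, the function name `worrying_attributes`, and the numbers in the sample table are all assumptions for illustration; they are not taken from the disks in this thread.

```python
# Attributes commonly checked for failing sectors (an assumed short list).
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def worrying_attributes(smartctl_attrs):
    """Return {attribute_name: raw_value} for watched attributes with a raw value > 0."""
    flagged = {}
    for line in smartctl_attrs.splitlines():
        fields = line.split()
        # Rows: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            try:
                raw = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip it
            if raw > 0:
                flagged[fields[1]] = raw
    return flagged


# Invented sample in the shape of a smartctl -A attribute table.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

print(worrying_attributes(SAMPLE))
```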
 