Pool degraded on Disk4 but SMART says Disk1

ianch · Jan 30, 2022

Hi

I am running TrueNAS Core 12.0-U7 with a RAIDZ1 pool consisting of four 3TB WD Red NAS drives. The server is a Dell R730XD with a PERC H330. The PERC has a RAID1 of two SSDs used for the TrueNAS OS.

10 days ago the TrueNAS pool reported it was degraded and that Disk4 was the issue.

So I replaced Disk4 and waited 7 days for the resilvering to finish. Seemed rather a long time to me on Dell hardware.

Anyhow..... The resilver has completed and the pool is still showing degraded with Disk4.
Strange because I used HDDScan on my PC to check the disk first and it passed all tests.

Yesterday I did a short SMART test on all four disks and Disk1 failed but Disk 2, 3 & 4 passed. I ran a long test overnight and got the same results.

What should I do? Replace Disk4 again, or replace Disk1?

I have tried moving Disk4 to a different slot in the server and got the same results.

Additional info...

zpool status

  pool: Vol1
 state: DEGRADED
config:

  NAME                                            STATE     READ WRITE CKSUM
  Vol1                                            DEGRADED     0     0     0
    raidz1-0                                      DEGRADED     0     0     0
      gptid/bad87657-69f1-11e5-9919-c4e984000afe  ONLINE       0     0     0
      gptid/bbe358e8-69f1-11e5-9919-c4e984000afe  ONLINE       0     0     0
      gptid/bcf2ac3e-69f1-11e5-9919-c4e984000afe  ONLINE       0     0     0
      gptid/86e1e243-7b8e-11ec-ba46-1866da51462d  DEGRADED     0     0     0  too many errors

zpool iostat -v Vol1

                                                  capacity     operations     bandwidth
pool                                            alloc   free   read  write   read  write
----------------------------------------------  -----  -----  -----  -----  -----  -----
Vol1                                            7.79T  3.09T      3     37  26.0K   455K
  raidz1                                        7.79T  3.09T      3     37  26.0K   455K
    gptid/bad87657-69f1-11e5-9919-c4e984000afe      -      -      0      9  6.47K   115K
    gptid/bbe358e8-69f1-11e5-9919-c4e984000afe      -      -      0      9  6.18K   112K
    gptid/bcf2ac3e-69f1-11e5-9919-c4e984000afe      -      -      0      9  6.82K   115K
    gptid/86e1e243-7b8e-11ec-ba46-1866da51462d      -      -      0      8  6.55K   113K
----------------------------------------------  -----  -----  -----  -----  -----  -----

smartctl -a /dev/da1

SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed: read failure        90%            55309  22465496
# 2  Short offline     Completed: read failure        90%            55294  22465496

smartctl -a /dev/da2

SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error        00%            55316  -
# 2  Short offline     Completed without error        00%            55294  -

smartctl -a /dev/da3

SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error        00%            55316  -
# 2  Short offline     Completed without error        00%            55294  -

smartctl -a /dev/da4

SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error        00%              187  -
# 2  Short offline     Completed without error        00%              165  -
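
For anyone who would rather check these logs with a script than by eye, here is a minimal Python sketch that parses self-test log rows like the ones above and lists every test that did not finish with "Completed without error". It assumes the columns are separated by two or more spaces, as smartctl normally prints them; the function names are made up for illustration and are not part of smartctl or TrueNAS.

```python
import re

# Self-test rows for da1, copied from the post above.
SELFTEST_LOG = """\
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed: read failure        90%            55309  22465496
# 2  Short offline     Completed: read failure        90%            55294  22465496
"""


def parse_selftest_log(text):
    """Return one dict per self-test row (rows start with '#')."""
    entries = []
    for line in text.splitlines():
        if not line.startswith("#"):
            continue  # skip the header and any other text
        # Assumption: columns are separated by 2+ spaces.
        cols = re.split(r"\s{2,}", line.strip())
        if len(cols) != 6:
            continue  # unexpected layout, ignore rather than guess
        num, test, status, remaining, hours, lba = cols
        entries.append({
            "num": int(num.lstrip("#").strip()),
            "test": test,
            "status": status,
            "remaining_pct": int(remaining.rstrip("%")),
            "lifetime_hours": int(hours),
            "first_error_lba": None if lba == "-" else int(lba),
        })
    return entries


def failed_tests(entries):
    """Rows whose status is anything other than a clean completion."""
    return [e for e in entries if e["status"] != "Completed without error"]


if __name__ == "__main__":
    for entry in failed_tests(parse_selftest_log(SELFTEST_LOG)):
        print(entry["test"], "->", entry["status"],
              "at LBA", entry["first_error_lba"])
```

Note that a row passing this check only means the self-test ran to the end; it says nothing about the drive's attribute counters.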

NugentS · Jan 30, 2022

glabel status
Then compare the GPTID vs the /dev/???
Then run SMART on the failing device. You should also be able to determine the S/N of the device.

ianch · Jan 30, 2022

Many thanks for your reply.

glabel confirms the pool is talking about disk4 and the smart report is still talking about disk1

See below

glabel status

Name                                        Status  Components
gptid/9617d799-db5b-11eb-9623-1866da51462d  N/A     da0p1
gptid/bad87657-69f1-11e5-9919-c4e984000afe  N/A     da1p2
gptid/bcf2ac3e-69f1-11e5-9919-c4e984000afe  N/A     da2p2
gptid/bbe358e8-69f1-11e5-9919-c4e984000afe  N/A     da3p2
gptid/86e1e243-7b8e-11ec-ba46-1866da51462d  N/A     da4p2
gptid/961ad1d5-db5b-11eb-9623-1866da51462d  N/A     da0p3

Surely if both disks were failing the pool would fall apart. Yet all the data is still there (and yes backed up for days like this)
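
The GPTID-to-device comparison suggested earlier can also be done mechanically. Below is a rough Python sketch that takes the member lines from zpool status and the glabel status rows posted in this thread, and reports which daX device backs each pool member that is not ONLINE. The helper names are hypothetical, and the parsing assumes the whitespace-separated layout shown in this thread.

```python
import re

# Pool member lines from the zpool status output in this thread.
ZPOOL_STATUS = """\
gptid/bad87657-69f1-11e5-9919-c4e984000afe  ONLINE    0 0 0
gptid/bbe358e8-69f1-11e5-9919-c4e984000afe  ONLINE    0 0 0
gptid/bcf2ac3e-69f1-11e5-9919-c4e984000afe  ONLINE    0 0 0
gptid/86e1e243-7b8e-11ec-ba46-1866da51462d  DEGRADED  0 0 0  too many errors
"""

# Rows from the glabel status output in this thread.
GLABEL_STATUS = """\
gptid/9617d799-db5b-11eb-9623-1866da51462d  N/A  da0p1
gptid/bad87657-69f1-11e5-9919-c4e984000afe  N/A  da1p2
gptid/bcf2ac3e-69f1-11e5-9919-c4e984000afe  N/A  da2p2
gptid/bbe358e8-69f1-11e5-9919-c4e984000afe  N/A  da3p2
gptid/86e1e243-7b8e-11ec-ba46-1866da51462d  N/A  da4p2
gptid/961ad1d5-db5b-11eb-9623-1866da51462d  N/A  da0p3
"""


def parse_glabel(text):
    """Map each gptid label to the partition it lives on (e.g. da4p2)."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0].startswith("gptid/"):
            mapping[parts[0]] = parts[2]
    return mapping


def unhealthy_members(text):
    """Map each pool member that is not ONLINE to its reported state."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 5 and parts[0].startswith("gptid/") and parts[1] != "ONLINE":
            result[parts[0]] = parts[1]
    return result


def disk_of(partition):
    """Strip the partition suffix: da4p2 -> da4."""
    match = re.fullmatch(r"([a-z]+\d+)p\d+", partition)
    return match.group(1) if match else partition


if __name__ == "__main__":
    labels = parse_glabel(GLABEL_STATUS)
    for gptid, state in unhealthy_members(ZPOOL_STATUS).items():
        print(gptid, state, "->", disk_of(labels.get(gptid, "unknown")))
```

With the data above this reports the DEGRADED member as da4, which only tells you the device name at that moment, not the physical drive.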

Redcoat · Jan 30, 2022

You are misinterpreting this information - not using the correct process.

From your SMART data you can see that drive da1 is the problem drive.

Now you have to find that drive physically in your TrueNAS box. The information you have so far does not tell you which physical drive it is; for that you need the drive serial number. To get it, go to Storage > Disks and find the HDD serial number for da1.

Open your server and look at the drives to find the drive the OS identifies as da1.

Use the GUI to replace that drive.

ianch · Jan 30, 2022

You could be right and I am clearly missing something.

The GUI is showing disk4 degraded. see below

Editing disk4 shows the below serial number

That serial number matches the disk4 in the smart information. The smart information says no errors.

Editing the Disk1 shows the below serial number, which likewise matches the Smart information with errors.

Below is the Storage->Disk information

Any ideas???

ianch · Jan 30, 2022

In addition to my last message

If I power down, remove what I believe is disk1, and power up, da1 shows as missing.

Likewise for disk4, which is da4.

So do I once again replace da4, because TrueNAS is showing that disk as degraded, and then, once resilvering is complete, change da1, which is reporting SMART errors?

Alecmascot · Jan 30, 2022

To be sure, use the serial number from the SMART report, because the "daX" names can change on every reboot.
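
As a rough illustration of keying on serial numbers instead of daX names, here is a small Python sketch that pulls the "Serial Number:" line out of smartctl identity output (the kind printed by smartctl -i) and looks up which device currently carries a given serial. The sample text, model string, and serial numbers below are invented placeholders, not values from this thread, and the function names are hypothetical.

```python
# Placeholder identity output per device. In practice this text would come
# from running "smartctl -i /dev/daX" for each disk. All values are made up.
SAMPLE_INFO = {
    "da1": "Device Model:     EXAMPLE-MODEL\nSerial Number:    WD-EXAMPLE0001\n",
    "da4": "Device Model:     EXAMPLE-MODEL\nSerial Number:    WD-EXAMPLE0004\n",
}


def serial_number(info_text):
    """Return the value of the 'Serial Number:' line, or None if absent."""
    for line in info_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Serial Number":
            return value.strip()
    return None


def device_for_serial(info_by_device, serial):
    """Find which device name currently reports the given serial number."""
    for device, text in info_by_device.items():
        if serial_number(text) == serial:
            return device
    return None


if __name__ == "__main__":
    for device, text in SAMPLE_INFO.items():
        print(device, serial_number(text))
```

The serial is also printed on the drive label, so it is the one identifier that matches the physical disk regardless of how the OS numbers the devices after a reboot.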

ianch · Jan 30, 2022

So which disk do I change?

da1 points to a disk with SMART errors but looks good in TrueNAS.

da4 points to a disk with no SMART errors but TrueNAS shows it as degraded.

I've double-checked the WD serials against the SMART reports, and they do correspond to the daX devices I think they should.

To be honest, I've now lost confidence in both disks and will swap them out, but which first?

I'm guessing da4 as that's what's reporting as degraded, then da1.

I guess worst case I lose the pool, but I've got backups.

jgreco · Jan 30, 2022

Trust serial numbers and disk errors reported by smartctl. Replace failing disks. Also suggest filing a Jira bug before you do so with a debug attached, so someone at iX can look to see if there's some presentational issue in the GUI.

sretalla · Feb 1, 2022

ianch said:
smartctl -a /dev/da4

SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed without error        00%              187  -
# 2  Short offline     Completed without error        00%              165  -

Just to clarify for you, this doesn't indicate there are no issues with that disk...

A completed test doesn't imply no errors/problems found with the disk, just that the test was able to complete in full.

For da1, the test couldn't complete in full, so you see a result to match.

smartctl -A /dev/da4 would show if there are really some things to worry about or not.
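
To make that concrete, here is a minimal Python sketch that reads an attribute table in the layout smartctl -A prints and flags non-zero raw values for a handful of counters. The sample table is invented for illustration and is not from any disk in this thread, and the list of attributes to watch is only an assumption about which counters are commonly checked, not an official rule.

```python
# Invented sample in the column layout of "smartctl -A" output.
SAMPLE_ATTRIBUTES = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

# Assumption: counters often looked at first when a disk is suspect.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def parse_attributes(text):
    """Map attribute name to integer raw value for rows with a plain number."""
    raw = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID; the raw value is column 10.
        if len(parts) >= 10 and parts[0].isdigit() and parts[9].isdigit():
            raw[parts[1]] = int(parts[9])
    return raw


def worrying(raw_values, watched=WATCHED):
    """Watched attributes whose raw value is above zero."""
    return {name: value for name, value in raw_values.items()
            if name in watched and value > 0}


if __name__ == "__main__":
    print(worrying(parse_attributes(SAMPLE_ATTRIBUTES)))
```

A non-zero UDMA_CRC_Error_Count usually points at cabling or the backplane rather than the disk itself, which is one reason to look at the attributes and not only the self-test result.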
