disk pool throwing alert

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
New alerts:
  • Device: /dev/sde [SAT], 16 Currently unreadable (pending) sectors.
Current alerts:
  • Device: /dev/sdd [SAT], 24 Currently unreadable (pending) sectors.
  • Device: /dev/sdd [SAT], 24 Offline uncorrectable sectors.
  • Device: /dev/sde [SAT], 16 Currently unreadable (pending) sectors.
I keep getting these (always 16 and 24 on the same 2 disks)... they then disappear again. What's the best course of action?
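
For reference, these counts come from SMART attributes 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable); a minimal way to read them directly, assuming /dev/sdd and /dev/sde are the affected devices:

  smartctl -A /dev/sdd | grep -iE 'pending|uncorrect'   # shows attributes 197/198
  smartctl -A /dev/sde | grep -iE 'pending|uncorrect'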

G
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
Is there a way to tag the affected blocks so they won't be used?

G
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Yes.
  1. Remove the disk from the pool - I would recommend using a spare disk to replace one of the suspect disks.
  2. Run badblocks destructively across the disk - any failures should be mapped out by the disk (see the sketch below).
  3. Put the disk back in the pool and resilver.
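
A minimal sketch of step 2, assuming the removed disk shows up as /dev/sdX (double-check the device name first - this wipes the disk):

  # 4-pass destructive write test; -b 4096 is commonly recommended on large drives
  badblocks -ws -b 4096 -o /root/sdX.bad /dev/sdX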
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
You asked for my system config... I implied you should look at my signature, as it is all there:

TrueNAS-SCALE-22.12.3.2
Motherboard: Gigabyte GA-B250M-D3H
CPU: Intel i5-7400 quad core, 3.0GHz
Memory: 2 x Crucial DDR4 8GB (2400) + 2 x Kingston DDR4 8GB (2400) => 32GB RAM
PSU: Corsair RMx CP-9020090 550W
Case: Fractal Design Node 804
Storage:
6 x 4TB Seagate IronWolf HDDs
1 x ADATA 128GB M.2 SSD
Currently configured as:
Pool: Tank: 3 x 4TB - RAIDZ1 - onboard SATA controller (video media, music, photos)
Pool: Bunker: 3 x 4TB - RAIDZ1 - IBM ServeRAID M1015 (LSI 9211-8i in IT mode) (documents, software & Time Machine)
Pool: Apps: 2 x 450GB SSDs (ADATA SU630), hosts Plex, Unifi Controller, etc.
TrueNAS installed on a 128GB Transcend PCIe SSD 110S M.2 NVMe, Gen3 x4

G
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
Hmm, if I had a spare disk I could just have replaced one of the faulty drives with said spare disk...

At this stage my disk pool would be degraded, as it went from 3-wide RAIDZ1 to 2-wide...
So how do I now run badblocks on the removed disk to mark those blocks bad so they aren't used (as per my question)?

G

Yes.
  1. Remove the disk from the pool - I would recommend using a spare disk to replace one of the suspect disks.
  2. Run badblocks destructively across the disk - any failures should be mapped out by the disk.
  3. Put the disk back in the pool and resilver.
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
FYI, last time I ran badblocks over an old WD 4TB it took 3 days or so, but it gave the desired result: SMART 197 Currently Pending Sectors = 0.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
There is nothing quite like it in Windows - HDDScan goes some way towards it, just not nearly as thorough.

Warnings:
1. This will take days, not hours. On a larger disk it can take weeks.
2. This will really stress your disk, so if it's in a state of almost-failed, then it could well move to failed.

You can run it on the TrueNAS SCALE machine - just make sure you run it on the correct disk, for obvious reasons. Remember to run it under tmux or screen to ensure it doesn't get closed down at the wrong time (a quick sketch below).
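
A sketch of that, assuming tmux is available (session name and device are placeholders):

  tmux new -s badblocks            # start a detachable session
  badblocks -ws -b 4096 /dev/sdX   # run the destructive test inside it
  # detach with Ctrl-b d; reattach later with: tmux attach -t badblocks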
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
I'd really make the replacement of the drive my first priority with a RAIDZ1 VDEV.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
You should post the SMART output for these two drives. Does a SMART long test pass?

Normally when a drive fails a media test it's because the media is flaking off/damaged. While badblocks may map out the bad sectors, odds are other nearby sectors will start to fail. I won't say always; I have done this too, on hard drives that were just out of warranty where I had nothing to lose. But if a SMART long test fails after badblocks, dump the drive. Just an opinion; you do not have to take it.
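
A minimal sketch of pulling that output and running the long test (device names assumed from earlier in the thread):

  smartctl -a /dev/sdd           # full SMART output to post here
  smartctl -t long /dev/sdd      # start the long self-test (runs inside the drive)
  smartctl -l selftest /dev/sdd  # check the result once it completes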
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Yes - a better process is:
  1. Remove the first faulty disk from the pool.
  2. Replace it using a spare - which you need to urgently obtain - and resilver.
  3. Remove the second faulty disk from the pool.
  4. Replace it using a spare and resilver.
Then, for each "faulty disk":
  1. Run a long SMART test on the drive (to get a baseline, if you don't have one).
  2. Run badblocks destructively across the disk - any failures should be mapped out by the disk.
  3. Run a long SMART test on the disk again.
  4. If either badblocks or the second long SMART test fails, RMA/discard the disk.
  5. Otherwise keep the disk as a spare, as it's passed the tests.
Oh, and get some spare disks - you have two definitely flaky, potentially (probably) faulty disks that you need to urgently replace and then try to recover, which may or may not work.

Note that if you have a spare SATA/disk port, then insert the new disk and use replace to swap one of the duff disks for the new disk (a sketch follows). This will maintain parity within the pool (if possible) while the resilver occurs. Once the resilver completes, the replaced disk will be ejected from the pool (really cool if it could be ejected from the computer as well - but there are lots of issues with that idea) and you can proceed with the second replacement.
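
From the shell that replace step looks roughly like the below, though the TrueNAS GUI replace flow is the supported route and SCALE pools usually reference members by gptid/partuuid rather than /dev names - a sketch with placeholder device names:

  zpool status Bunker                          # note the id of the duff disk
  zpool replace Bunker <old-disk-id> /dev/sdX  # resilver onto the new disk
  zpool status Bunker                          # watch resilver progress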

Do you have a backup?
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
So...
I have 2 disk pools: Bunker and Tank.

Tank, without warning, went early last week from 100% no errors ever to a drive failed and kicked out of the disk pool, degraded from 3-wide RAIDZ1 to 2-wide... that disk pool is based on 3 x 4TB IronWolf HDD/SATA. I eventually ordered 3 x 8TB IronWolf replacement drives for Tank.

Then on Wednesday, Bunker, which every now and then threw some of the above errors, started stepping up, throwing a lot more errors - and it is throwing them from 2 HDDs; this pool is also based on 3 x 4TB IronWolf HDD/SATA.

I just received my first drive... and I'm thinking of maybe rather replacing one of the HDDs in Bunker, and then doing the same when the next drive arrives, i.e. replacing/rebuilding this disk pool onto the 8TB HDDs.

Then I will move the datasets that are on Tank onto the Bunker disk pool, at which point I have all my datasets on Bunker, based on 8TB HDDs, which will give me enough space.

At that point all my 4TB drives will be unused; I can then see what's good and what's not, and decide if I need to buy one or two more and build a new disk pool out of the good HDDs.

At the moment I'm actually more worried about Bunker than Tank.

Based on this, I need to tell TrueNAS to "retire" one of the error-throwing drives and then introduce the new 8TB I just received, followed probably tomorrow by a 2nd, with the 3rd expected on Monday.

Comments?

@sretalla, as you've been in the loop on the failure on Tank, what's your view? You sent a link on disk replacement, but that was for a disk pool that was originally 3-wide and was reduced to 2-wide by a failure, where I need to add a drive back into the pool; here we're looking at telling TrueNAS to eject one drive and replace it with a new drive.

I will only do the dataset move once the disk pool is completely re-homed onto the 3 x 8TB, as I need the space; that then releases the 2 good 4TB drives from Tank.
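
For the dataset move itself, one option is ZFS replication from the shell (TrueNAS also offers Replication Tasks in the GUI) - a sketch, with Tank/media as a hypothetical dataset name:

  zfs snapshot -r Tank/media@migrate                        # point-in-time copy
  zfs send -R Tank/media@migrate | zfs recv Bunker/media    # replicate to Bunker
  # only destroy the source (zfs destroy -r Tank/media) after verifying the copy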

G
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I need to add a drive back into the pool
I'm a little confused about your pool layout...

Can we start with the output of zpool status Bunker Tank?
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
Shit, I just shut the NAS down.

I have 2 pools:
Bunker, 3-wide RAIDZ1:
1 GOOD: /dev/sda
1 HDD throwing errors: /dev/sdd - SMART shows 0
1 HDD throwing errors: /dev/sde - SMART shows 2

Tank, 3-wide RAIDZ1,
degraded to 2 HDDs:
/dev/sdg
/dev/sdh
/dev/sdf - EJECTED

See attached:

PoA:
As much as Tank is degraded to 2 HDDs, it seems to be stable... whereas Bunker is concerning.
Thinking is to replace the HDDs of Bunker first, then relocate the dataset that sits on Tank atm to Bunker, and then rebuild Tank with the good 4TB HDDs.

G
 

Attachments

  • Screenshot 2023-08-03 at 12.12.24.png (203.1 KB)

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Tank, 3-wide RAIDZ1,
degraded to 2 HDDs:
/dev/sdg
/dev/sdh
/dev/sdf - EJECTED
That's not a pool of 2 disks; that's a pool of 3 with one missing (which needs replacement).

Just follow the replacement process for the missing disk (should be an option in the GUI to replace and select the disk to use for that).
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
NugentS said:
Yes - a better process is:
  1. Remove the first faulty disk from the pool.
  2. Replace it using a spare - which you need to urgently obtain - and resilver.
  3. Remove the second faulty disk from the pool.
  4. Replace it using a spare and resilver.
Then, for each "faulty disk":
  1. Run a long SMART test on the drive (to get a baseline, if you don't have one).
  2. Run badblocks destructively across the disk - any failures should be mapped out by the disk.
  3. Run a long SMART test on the disk again.
  4. If either badblocks or the second long SMART test fails, RMA/discard the disk.
  5. Otherwise keep the disk as a spare, as it's passed the tests.
Oh, and get some spare disks - you have two definitely flaky, potentially (probably) faulty disks that you need to urgently replace and then try to recover, which may or may not work.

Note that if you have a spare SATA/disk port, then insert the new disk and use replace to swap one of the duff disks for the new disk. This will maintain parity within the pool (if possible) while the resilver occurs. Once the resilver completes, the replaced disk will be ejected from the pool (really cool if it could be ejected from the computer as well - but there are lots of issues with that idea) and you can proceed with the second replacement.

Do you have a backup?

This NAS is my backup... I have some of the documents/photos replicated to Google, but there is still a sh4t load here that I don't want to lose.
Also, running a test on the media that's going to take a week or two is simply not viable.
G
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
That's not a pool of 2 disks; that's a pool of 3 with one missing (which needs replacement).

Just follow the replacement process for the missing disk (should be an option in the GUI to replace and select the disk to use for that).
Agreed, correct - but that's Tank, as mentioned, the one where the disk was originally ejected...

As per this thread, I now have Bunker throwing a lot of warnings/alerts, with SMART failures on one disk; this disk group is actually more concerning for me, which is why I'm thinking of replacing the drives in here first, then moving the dataset that's on Tank atm onto Bunker.
At that point I've released all my 4TB HDDs, which then allows me to build a new Tank using the good drives and maybe one additional new one?

G
 