disk pool throwing alert

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
New alerts:
  • Device: /dev/sde [SAT], 16 Currently unreadable (pending) sectors.
Current alerts:
  • Device: /dev/sdd [SAT], 24 Currently unreadable (pending) sectors.
  • Device: /dev/sdd [SAT], 24 Offline uncorrectable sectors.
  • Device: /dev/sde [SAT], 16 Currently unreadable (pending) sectors.
I keep getting these (always 16 and 24 on the same 2 disks)... they then disappear again. What's the best course of action?
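
For reference, these counts come from SMART attributes 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable); a minimal way to read them directly, assuming /dev/sdd and /dev/sde are the affected devices:

  smartctl -A /dev/sdd | grep -iE 'pending|uncorrect'   # shows attributes 197/198
  smartctl -A /dev/sde | grep -iE 'pending|uncorrect'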

G
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
Is there a way to tag the affected blocks so they won't be used?

G
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Yes.
  1. Remove the disk from the pool - I would recommend using a spare disk to replace one of the suspect disks.
  2. Run badblocks destructively across the disk - any failures should be mapped out by the disk (see the sketch below).
  3. Put the disk back in the pool and resilver.
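
A minimal sketch of step 2, assuming the removed disk shows up as /dev/sdX (double-check the device name first - this wipes the disk):

  # 4-pass destructive write test; -b 4096 is commonly recommended on large drives
  badblocks -ws -b 4096 -o /root/sdX.bad /dev/sdX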
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
You asked for my system config... I implied you should look at my signature, as it is all there:

TrueNAS-SCALE-22.12.3.2
Motherboard: Gigabyte GA-B250M-D3H
CPU: Intel i5-7400 quad core, 3.0GHz
Memory: 2 x Crucial DDR4 8GB (2400) + 2 x Kingston DDR4 8GB (2400) => 32GB RAM
PSU: Corsair RMx CP-9020090 550W
Case: Fractal Design Node 804
Storage:
6 x 4TB Seagate IronWolf HDDs
1 x ADATA 128GB M.2 SSD
Currently configured as:
Pool: Tank: 3 x 4TB - RAIDZ1 - onboard SATA controller (video media, music, photos)
Pool: Bunker: 3 x 4TB - RAIDZ1 - IBM ServeRAID M1015 (LSI 9211-8i in IT mode) (documents, software & Time Machine)
Pool: Apps: 2 x 450GB SSDs (ADATA SU630), hosts Plex, Unifi Controller, etc.
TrueNAS installed on a 128GB Transcend PCIe SSD 110S M.2 NVMe, Gen3 x4

G
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
Hmm, if I had a spare disk I could just have replaced one of the faulty drives with said spare disk...

At this stage my disk pool would be degraded, as it went from 3-wide RAIDZ1 to 2-wide...
So how do I now run badblocks on the removed disk to mark those blocks bad so they aren't used (as per my question)?

G

Yes.
  1. Remove the disk from the pool - I would recommend using a spare disk to replace one of the suspect disks.
  2. Run badblocks destructively across the disk - any failures should be mapped out by the disk.
  3. Put the disk back in the pool and resilver.
 

samarium

Contributor
Joined
Apr 8, 2023
Messages
192
FYI, last time I ran badblocks over an old WD 4TB it took 3 days or so, but it gave the desired result: SMART 197 Currently Pending Sectors = 0.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
There is nothing quite like it in Windows - HDDScan goes some way towards it, just not nearly as thorough.

Warnings:
1. This will take days, not hours. On a larger disk it can take weeks.
2. This will really stress your disk, so if it's in a state of almost-failed, then it could well move to failed.

You can run it on the TrueNAS SCALE machine - just make sure you run it on the correct disk, for obvious reasons. Remember to run it under tmux or screen to ensure it doesn't get closed down at the wrong time (a quick sketch below).
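
A sketch of that, assuming tmux is available (session name and device are placeholders):

  tmux new -s badblocks            # start a detachable session
  badblocks -ws -b 4096 /dev/sdX   # run the destructive test inside it
  # detach with Ctrl-b d; reattach later with: tmux attach -t badblocks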
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
I'd really make the replacement of the drive my first priority with a RAIDZ1 VDEV.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
You should post the SMART output for these two drives. Does a SMART long test pass?

Normally when a drive fails a media test it's because the media is flaking off/damaged. While badblocks may map out the bad sectors, odds are other nearby sectors will start to fail. I won't say always; I have done this too, on hard drives that were just out of warranty where I had nothing to lose. But if a SMART long test fails after badblocks, dump the drive. Just an opinion; you do not have to take it.
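
A minimal sketch of pulling that output and running the long test (device names assumed from earlier in the thread):

  smartctl -a /dev/sdd           # full SMART output to post here
  smartctl -t long /dev/sdd      # start the long self-test (runs inside the drive)
  smartctl -l selftest /dev/sdd  # check the result once it completes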
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Yes - a better process is:
  1. Remove the first faulty disk from the pool.
  2. Replace it using a spare - which you need to urgently obtain - and resilver.
  3. Remove the second faulty disk from the pool.
  4. Replace it using a spare and resilver.
Then, for each "faulty disk":
  1. Run a long SMART test on the drive (to get a baseline, if you don't have one).
  2. Run badblocks destructively across the disk - any failures should be mapped out by the disk.
  3. Run a long SMART test on the disk again.
  4. If either badblocks or the second long SMART test fails, RMA/discard the disk.
  5. Otherwise keep the disk as a spare, as it's passed the tests.
Oh, and get some spare disks - you have two definitely flaky, potentially (probably) faulty disks that you need to urgently replace and then try to recover, which may or may not work.

Note that if you have a spare SATA/disk port, then insert the new disk and use replace to swap one of the duff disks for the new disk (a sketch follows). This will maintain parity within the pool (if possible) while the resilver occurs. Once the resilver completes, the replaced disk will be ejected from the pool (really cool if it could be ejected from the computer as well - but there are lots of issues with that idea) and you can proceed with the second replacement.
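
From the shell that replace step looks roughly like the below, though the TrueNAS GUI replace flow is the supported route and SCALE pools usually reference members by gptid/partuuid rather than /dev names - a sketch with placeholder device names:

  zpool status Bunker                          # note the id of the duff disk
  zpool replace Bunker <old-disk-id> /dev/sdX  # resilver onto the new disk
  zpool status Bunker                          # watch resilver progress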

Do you have a backup?
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
So...
I have 2 disk pools: Bunker and Tank.

Tank, without warning, went early last week from 100% no errors ever to a drive failed and kicked out of the disk pool, degraded from 3-wide RAIDZ1 to 2-wide... that disk pool is based on 3 x 4TB IronWolf HDD/SATA. I eventually ordered 3 x 8TB IronWolf replacement drives for Tank.

Then on Wednesday, Bunker, which every now and then threw some of the above errors, started stepping up, throwing a lot more errors - and it is throwing them from 2 HDDs; this pool is also based on 3 x 4TB IronWolf HDD/SATA.

I just received my first drive... and I'm thinking of maybe rather replacing one of the HDDs in Bunker, and then doing the same when the next drive arrives, i.e. replacing/rebuilding this disk pool onto the 8TB HDDs.

Then I will move the datasets that are on Tank onto the Bunker disk pool, at which point I have all my datasets on Bunker, based on 8TB HDDs, which will give me enough space.

At that point all my 4TB drives will be unused; I can then see what's good and what's not, and decide if I need to buy one or two more and build a new disk pool out of the good HDDs.

At the moment I'm actually more worried about Bunker than Tank.

Based on this, I need to tell TrueNAS to "retire" one of the error-throwing drives and then introduce the new 8TB I just received, followed probably tomorrow by a 2nd, with the 3rd expected on Monday.

Comments?

@sretalla, as you've been in the loop on the failure on Tank, what's your view? You sent a link on disk replacement, but that was for a disk pool that was originally 3-wide and was reduced to 2-wide by a failure, where I need to add a drive back into the pool; here we're looking at telling TrueNAS to eject one drive and replace it with a new drive.

I will only do the dataset move once the disk pool is completely re-homed onto the 3 x 8TB, as I need the space; that then releases the 2 good 4TB drives from Tank.
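
For the dataset move itself, one option is ZFS replication from the shell (TrueNAS also offers Replication Tasks in the GUI) - a sketch, with Tank/media as a hypothetical dataset name:

  zfs snapshot -r Tank/media@migrate                        # point-in-time copy
  zfs send -R Tank/media@migrate | zfs recv Bunker/media    # replicate to Bunker
  # only destroy the source (zfs destroy -r Tank/media) after verifying the copy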

G
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
I need to add a drive back into the pool
I'm a little confused about your pool layout...

Can we start with the output of zpool status Bunker Tank?
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
Shit, I just shut the NAS down.

I have 2 pools:
Bunker, 3-wide RAIDZ1:
1 GOOD: /dev/sda
1 HDD throwing errors: /dev/sdd - SMART shows 0
1 HDD throwing errors: /dev/sde - SMART shows 2

Tank, 3-wide RAIDZ1,
degraded to 2 HDDs:
/dev/sdg
/dev/sdh
/dev/sdf - EJECTED

See attached:

PoA:
As much as Tank is degraded to 2 HDDs, it seems to be stable... whereas Bunker is concerning.
Thinking is to replace the HDDs of Bunker first, then relocate the dataset that sits on Tank atm to Bunker, and then rebuild Tank with the good 4TB HDDs.

G
 

Attachments

  • Screenshot 2023-08-03 at 12.12.24.png (203.1 KB)

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Tank, 3-wide RAIDZ1,
degraded to 2 HDDs:
/dev/sdg
/dev/sdh
/dev/sdf - EJECTED
That's not a pool of 2 disks; that's a pool of 3 with one missing (which needs replacement).

Just follow the replacement process for the missing disk (should be an option in the GUI to replace and select the disk to use for that).
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
NugentS said:
Yes - a better process is:
  1. Remove the first faulty disk from the pool.
  2. Replace it using a spare - which you need to urgently obtain - and resilver.
  3. Remove the second faulty disk from the pool.
  4. Replace it using a spare and resilver.
Then, for each "faulty disk":
  1. Run a long SMART test on the drive (to get a baseline, if you don't have one).
  2. Run badblocks destructively across the disk - any failures should be mapped out by the disk.
  3. Run a long SMART test on the disk again.
  4. If either badblocks or the second long SMART test fails, RMA/discard the disk.
  5. Otherwise keep the disk as a spare, as it's passed the tests.
Oh, and get some spare disks - you have two definitely flaky, potentially (probably) faulty disks that you need to urgently replace and then try to recover, which may or may not work.

Note that if you have a spare SATA/disk port, then insert the new disk and use replace to swap one of the duff disks for the new disk. This will maintain parity within the pool (if possible) while the resilver occurs. Once the resilver completes, the replaced disk will be ejected from the pool (really cool if it could be ejected from the computer as well - but there are lots of issues with that idea) and you can proceed with the second replacement.

Do you have a backup?

This NAS is my backup... I have some of the documents/photos replicated to Google, but there is still a sh4t load here that I don't want to lose.
Also, running a test on the media that's going to take a week or two is simply not viable.
G
 

georgelza

Patron
Joined
Feb 24, 2021
Messages
417
That's not a pool of 2 disks; that's a pool of 3 with one missing (which needs replacement).

Just follow the replacement process for the missing disk (should be an option in the GUI to replace and select the disk to use for that).
Agreed, correct - but that's Tank, as mentioned, the one where the disk was originally ejected...

As per this thread, I now have Bunker throwing a lot of warnings/alerts, with SMART failures on one disk; this disk group is actually more concerning for me, which is why I'm thinking of replacing the drives in here first, then moving the dataset that's on Tank atm onto Bunker.
At that point I've released all my 4TB HDDs, which then allows me to build a new Tank using the good drives and maybe one additional new one?

G
 