raidz - degraded - investigation

phier

Patron
Joined
Dec 4, 2012
Messages
400
@Davvo thanks! will do this weekend.

Also, regarding that calculation and UCE:
A couple thoughts:
1. The uncorrectable error spec for the HC550 is 1 in 10^15 bits read, which gives you some headroom for UCE, but I suspect most of us here would advise against RAIDZ1. You're looking at one UCE for roughly every 120 TB read, but for 32 TB of data, that's pretty thin.
Are we talking here about https://jro.io/r2c2/?
I ran a simulation for R5 with 3 drives and R6 with 4 or 6 drives, and here are the results from r2c2.
Or am I mixing apples with pears? :)

1701285894965.png



My use case would be:
NAS: not sure whether to go R5 with 3 drives or R6 with 4 drives,
plus daily replication to NAS2: I'm also not sure what setup would be good for that. R5?


The article ZFS_Storage_Pool_Layout_White_Paper_February_2022.pdf points out that
a 6-wide R6 (6 drives) is better than a 4-wide R6 (4 drives), but based on the calculation above it seems the 4-wide R6 is better...
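
One possible way to see why two sources can rank these layouts differently is to compare them on two separate axes: usable capacity and the chance of losing more drives than the parity level covers. The Python sketch below is only an illustration under loud assumptions: drives fail independently, the per-drive failure probability `P_DRIVE` is a made-up number, and resilver windows are ignored entirely. It is not the model used by r2c2 or by the white paper, and the function names are hypothetical.

```python
from math import comb

def usable_fraction(width: int, parity: int) -> float:
    """Share of raw capacity left for data in a RAIDZ vdev of this width."""
    return (width - parity) / width

def p_too_many_failures(width: int, parity: int, p_drive: float) -> float:
    """Probability that more than `parity` of `width` drives fail.

    Assumes independent failures with the same probability per drive.
    """
    return sum(
        comb(width, k) * p_drive**k * (1 - p_drive) ** (width - k)
        for k in range(parity + 1, width + 1)
    )

P_DRIVE = 0.05  # hypothetical per-drive failure probability, illustration only

layouts = {
    "RAIDZ1 3-wide": (3, 1),
    "RAIDZ2 4-wide": (4, 2),
    "RAIDZ2 6-wide": (6, 2),
}

for name, (width, parity) in layouts.items():
    eff = usable_fraction(width, parity)
    risk = p_too_many_failures(width, parity, P_DRIVE)
    print(f"{name}: usable {eff:.0%}, P(more than {parity} failures) = {risk:.6f}")
```

In this toy model the 6-wide RAIDZ2 comes out ahead on usable capacity, while the 4-wide RAIDZ2 comes out ahead on the failure probability, so "better" depends on which of the two is being measured.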


1701286306358.png





Thanks
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
A quick casual read... I suspect that calculator is addressing complete drive failure leading to pool failure. I was pointing out the odds of a single uncorrectable error per drive, which at 1 in 10^15 bits is about one every 120 TB read by my dinner napkin math. Given you have 32 TB of data, you have a margin of about 3.75 footprints. The problem only occurs when two drives experience an error in the same LBA, and ZFS RAIDZ1, being only single fault tolerant, then can't recover your data to map out the failure. You then lose that file or zvol, but probably not the entire pool, though I don't think that can be excluded. So... no, that's not the same thing; we're mixing apples with pears.
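
For anyone who wants to redo the napkin math, here is a minimal Python sketch. It assumes the spec means one error per 10^15 bits read and counts in decimal terabytes; the function names are invented for illustration. The exact figure lands at 125 TB, which is in line with the "about 120" estimate, and the margin comes out slightly above the 3.75 quoted.

```python
BITS_PER_BYTE = 8
TB = 10**12  # decimal terabyte, the unit drive vendors use

def bytes_per_ure(bits_per_error: float) -> float:
    """Average number of bytes read per unrecoverable read error."""
    return bits_per_error / BITS_PER_BYTE

def footprints(bits_per_error: float, data_tb: float) -> float:
    """How many full reads of the data fit, on average, between two errors."""
    return bytes_per_ure(bits_per_error) / (data_tb * TB)

SPEC_BITS = 1e15  # HC550 figure quoted in the thread
DATA_TB = 32      # amount of data quoted in the thread

print(bytes_per_ure(SPEC_BITS) / TB)             # 125.0
print(round(footprints(SPEC_BITS, DATA_TB), 2))  # 3.91
```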
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Please read the following resource:
 

phier

Patron
Joined
Dec 4, 2012
Messages
400
  1. Check power and data connections for evident damage or not properly latched cables;
  2. Run zpool clear universalsoldier and then run a scrub;
  3. After the scrub finishes, check if any errors were found;

Hello @Davvo, I finally got physical access, so I did steps 1, 2 and 3.

Even before I had physical access for step 1, this notification:
* Pool universalsoldier state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:

  • Disk ST18000NM000J-2TV103 ZR52K9D5 is REMOVED
disappeared a long time ago. I have no clue why, and I'm not sure whether that is a good or a bad sign.


New results are here:
  pool: universalsoldier
 state: ONLINE
  scan: scrub repaired 0B in 1 days 02:25:01 with 0 errors on Wed Dec 20 03:25:06 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        universalsoldier                                ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/a691e97e-ffaa-11ec-aba1-000c29f60559  ONLINE       0     0     0
            gptid/a68cf3cc-ffaa-11ec-aba1-000c29f60559  ONLINE       0     0     0
            gptid/a696a973-ffaa-11ec-aba1-000c29f60559  ONLINE       0     0     0

errors: No known data errors

Previously it showed "scan: resilvered 133G in 00:57:28 with 0 errors on Sun Nov 19 07:06:19 2023",
so I hope those 133G were "corrected" and there is no data corruption.
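
As a rough aid for step 3 (checking whether the scrub found anything), the status text above can be checked mechanically. The following is a minimal Python sketch that parses pasted `zpool status` text like the output in this post; the helper names are hypothetical, and it only understands the simple layout shown here, not every variant the command can print.

```python
SAMPLE = """\
pool: universalsoldier
state: ONLINE
scan: scrub repaired 0B in 1 days 02:25:01 with 0 errors on Wed Dec 20 03:25:06 2023
config:

NAME STATE READ WRITE CKSUM
universalsoldier ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
gptid/a691e97e-ffaa-11ec-aba1-000c29f60559 ONLINE 0 0 0
gptid/a68cf3cc-ffaa-11ec-aba1-000c29f60559 ONLINE 0 0 0
gptid/a696a973-ffaa-11ec-aba1-000c29f60559 ONLINE 0 0 0

errors: No known data errors
"""

def parse_zpool_status(text: str) -> dict:
    """Pull the pool state, the device rows and the errors line out of the text."""
    result = {"state": None, "errors": None, "devices": []}
    in_config = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("state:"):
            result["state"] = line.split(":", 1)[1].strip()
        elif line.startswith("errors:"):
            result["errors"] = line.split(":", 1)[1].strip()
            in_config = False
        elif line.startswith("NAME"):
            in_config = True  # device rows follow the column header
        elif in_config and line:
            parts = line.split()
            if len(parts) >= 5:
                # Counters stay strings because zpool may abbreviate large values.
                result["devices"].append(tuple(parts[:5]))
    return result

def problems(status: dict) -> list:
    """Device rows that are not ONLINE or show a non-zero error counter."""
    return [d for d in status["devices"]
            if d[1] != "ONLINE" or any(c != "0" for c in d[2:])]

status = parse_zpool_status(SAMPLE)
print(status["state"], "|", status["errors"], "|", problems(status))
```

An empty problem list together with "No known data errors" matches what the scrub summary already says: nothing was repaired and no errors were found.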



The smartctl long test results are attached as well, but the test for the drive that was previously removed multiple times:
  • Disk ST18000NM000J-2TV103 ZR52K9D5 is REMOVED
was interrupted again :( I am going to re-run the long test.
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)  00%        11629            -
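
To spot interrupted runs across several attached reports without reading each one by eye, a self-test log row like the one above can be parsed. This is only a sketch in Python: the regular expression is an assumption fitted to the row shown in this post, and the helper name is made up. It treats a run as passed only when the status text contains "Completed without error".

```python
import re

# Row shape assumed from the log line in this post:
# "# <num> <description and status> <remaining>% <lifetime hours> <LBA or ->"
ROW = re.compile(r"^#\s*(\d+)\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")

def parse_selftest_row(line: str):
    """Return the fields of one self-test log row, or None if it does not match."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    num, text, remaining, hours, lba = m.groups()
    return {
        "num": int(num),
        "text": text,  # test description plus status, left unsplit
        "remaining_pct": int(remaining),
        "lifetime_hours": int(hours),
        "lba_of_first_error": None if lba == "-" else lba,
        "passed": "Completed without error" in text,
    }

row = parse_selftest_row(
    "# 1 Extended offline Interrupted (host reset) 00% 11629 -"
)
print(row["text"], "| passed:", row["passed"])
```

For the row in this post it reports the run as not passed, with no LBA recorded, which fits an interrupted test rather than one that found a bad sector.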




Based on the provided articles, I think it might be better to go with a Z2 configuration with 4x 18 TB
instead of my current setup, i.e. 3x 18 TB (Z1). I hope my assumption is correct and it's not overkill.

Thanks a lot
 

Attachments

  • ada0_050124.txt
    5.1 KB · Views: 133
  • ada1_050124.txt
    5 KB · Views: 129
  • ada2_050124.txt
    5.9 KB · Views: 149

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
I wouldn't consider a 4-wide RAIDZ2 VDEV overkill in a remote system; some folks use it in standard "at hand" systems too.
 

phier

Patron
Joined
Dec 4, 2012
Messages
400
@Davvo got it. Regarding the issue with the drive
(the interruptions during the smartctl long test, the removal of the drive from the pool...),
is it not possible to say anything?

thanks
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Well I'm not clear about your hardware and how you are running the tests, so I won't comment much about it beyond suggesting to check the cables.
 

phier

Patron
Joined
Dec 4, 2012
Messages
400
@Davvo a Supermicro board, with the drives attached directly to its SATA ports.

What do you mean by how I run the tests? I execute smartctl -t long /dev/adaX directly on TrueNAS via SSH.

I'm running ESXi with TrueNAS as a VM, but the Supermicro's native SATA controller is passed through to the TrueNAS VM.


I can try to replace the data and power cables, but it sounds strange that it all stopped without any manual intervention. I would assume that if the cable were faulty, the problem would keep recurring.
Thanks
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
If you do not provide a complete hardware list as per the forum's rules, which was requested a few times, you are not helping US help you.
The same goes for not making it immediately clear that this is a virtualized instance.

Troubleshooting a virtualized system requires specialized knowledge usually not found in "common" users.

To future readers:
DO SPECIFY RIGHT AWAY IF YOU ARE VIRTUALIZING: NOT DOING SO MIGHT RESULT IN DATA LOSS!
 

phier

Patron
Joined
Dec 4, 2012
Messages
400
Why data loss? I am confused.

SETUP:
https://www.supermicro.com/en/products/motherboard/x11ssl-f
Pentium processor
64 GB ECC RAM

ESXi, with TrueNAS in a VM (16 GB RAM);
X11SSL-F SATA ports passed through to the VM;
3x 18 TB drives as mentioned: 2x WD, 1x Seagate

and a few other VMs running on ESXi.
ESXi is installed on an SSD plugged into a PCI slot (using an adapter).

thanks
 

phier

Patron
Joined
Dec 4, 2012
Messages
400
The smartctl long test report from the problematic drive (the Seagate) is finished.


@Davvo do I need to provide any additional info?
Thanks
 

Attachments

  • ada2_080124.txt
    6 KB · Views: 141

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
I have very little knowledge about virtualization and honestly little clue of what to do in order to troubleshoot the issue.
 

phier

Patron
Joined
Dec 4, 2012
Messages
400
Hello @jgreco,
do you mind chiming in? I mean, is it somehow possible to find out which Supermicro boards with VT-d are safe to run with ESXi?

thanks!
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
All the other values are perfectly fine; I wouldn't be so quick to mark it as broken over command timeout errors.
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
Just a thought... And I admit I haven't gone back to the first page to check, but... Have you checked your TLER settings for this drive?

 