Permanent errors have been detected in the following files

gilgedje (Dabbler, joined Jun 14, 2023, 14 messages)
Hi everyone!

My system:
TrueNAS-SCALE-22.12.3.1
CPU - E5-2680 v3
RAM - 128 GB non-ECC (my ASRock ECC-supported motherboard died last month and I'm going to get a new one soon)
Disks - 10 Toshiba MG04SCA60EA units (6 TB 7200 rpm SAS disks) - 1 disk in the pool is still a 4 TB WD Red Plus that I still need to replace
SAS card - LSI 9300-16i in IT mode

I'm having a problem with a file that I cannot see in the shell or in WinSCP.

The error is:

root@truenas[~]# zpool status -v HDD_Pool
  pool: HDD_Pool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 157M in 00:00:06 with 0 errors on Sat Jul 22 20:59:08 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        HDD_Pool                                  ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            4328d9e3-8f7e-4d55-b28f-7a0a146ec02c  ONLINE       0     0    46
            8fd88e5e-1684-4ee0-ad97-10c8f2ab1038  ONLINE       0     0    46
            5d55fb37-ea03-40d7-bac3-27f9670976d7  ONLINE       0     0    46
            0a3cca80-6630-4221-8143-cc3f715d3188  ONLINE       0     0    46
            f7913639-235e-4d45-b280-c70efbe6a631  ONLINE       0     0    46
          raidz1-1                                ONLINE       0     0     0
            07b1ddd1-2de9-412c-ade9-829ae8036c14  ONLINE       0     0     0
            64fbdf56-418a-4b77-bee7-838b11c4b90f  ONLINE       0     0     0
            21f5b275-84b8-40e8-8add-a50bb222e927  ONLINE       0     0     0
            2a728731-9dc7-46bc-9920-4f7f31a421e7  ONLINE       0     0     0
            3f34555f-f2ca-45f5-b517-b65f8c14ce10  ONLINE       0     0     0
        cache
          0328e86a-2234-4a15-a9c5-6a2f321576cc    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        HDD_Pool/downloads/qbit:<0x9ff0>
root@truenas[~]#


Now, the shell ll command shows this:

root@truenas[/mnt/HDD_Pool/downloads]# ll
total 54
drwxrwx--- 5 apps 5 Apr 25 00:07 ./
drwxr-xr-x 9 root 9 Jun 21 18:49 ../
drwxrwx--- 2 apps 2 Apr 25 00:07 deluge/
drwxrwx--- 2 apps 2 Apr 25 00:07 nzb/
drwxrwxrwx 5 apps 5 Jun 25 17:51 qbit/
root@truenas[/mnt/HDD_Pool/downloads]#


After some digging I understand that it's a metadata error, and the only fix is to start over..

Now, that's a problem, since I have 18 TB of data...

Is there any other way? Any suggestions?

Is making a snapshot now a smart thing to do? The last snapshot won't recover much, since I just added a lot of data this week..

Thanks in advance!
 

Arwen (MVP, joined May 17, 2014, 3,611 messages)
Anything in the "HDD_Pool/downloads/qbit" path could be faulty. Short of going down the rabbit hole with the ZFS debug utility (zdb), I don't know of a way to find out what the faulty info is.

This is somewhat strange, because metadata is redundant by default, regardless of any vDev redundancy. Meaning the metadata for "HDD_Pool/downloads/qbit" should have been stored on both "raidz1-0" AND "raidz1-1". ZFS attempts to store metadata's second copy on a different vDev than the first, if possible.
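
If you really do want to peek at it, a rough zdb sketch (assumption on my part: the <0x9ff0> after the colon in the error is the object number within that dataset, which zdb takes in decimal - 0x9ff0 is 40944):

zdb -dddd HDD_Pool/downloads/qbit 40944   # dump the object record behind <0x9ff0>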

If it were my pool, I would:
  • Run a scrub.
  • If the problem still exists, make sure I have a backup of "HDD_Pool/downloads/qbit".
  • Wipe that path out.
  • Re-run the scrub to make sure the problem is gone.
  • If gone, restore that path from backups.
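
A rough command sketch of that sequence (assuming qbit is its own dataset, as the dataset:<object> form of the error suggests; if it's just a directory, an rm -rf of the path would replace the zfs destroy):

zpool scrub HDD_Pool                     # scrub and wait for it to finish
zpool status -v HDD_Pool                 # does the permanent error persist?
# ...back up /mnt/HDD_Pool/downloads/qbit somewhere safe first...
zfs destroy -r HDD_Pool/downloads/qbit   # destructive: wipes that path and its data
zpool scrub HDD_Pool                     # re-scrub to confirm the error is gone
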
But, before you go and do that, wait and see if someone else has a better suggestion!
 

Apollo (Wizard, joined Jun 13, 2013, 1,458 messages)
Some years ago I had a similar result, where <0xNNN> numbers would appear in the errors reported by ZFS.
As far as I understand, the hex number represents something like a pointer used by iocage jails to link to shares/mounted datasets/files.
So I would look into the iocage jail that uses the qbit dataset to find the error.
 

NugentS (MVP, joined Apr 16, 2020, 2,947 messages)
@gilgedje
Looking at your zpool status - there are lots of chksum errors in vdev1 and none on vdev2.
Chksum errors are often (but not always) caused by a dodgy cable, maybe one not seated properly. Given that you have an LSI card and 4 disks are reporting errors, I would be suspicious of the cable connecting those 4 drives (I am assuming the same cable is in use).

Some more clarity about how the drives are connected to the NAS would be useful.
 

gilgedje (Dabbler, joined Jun 14, 2023, 14 messages)
NugentS said: "Chksum errors are often (but not always) caused by a dodgy cable ... Some more clarity about how the drives are connected to the NAS would be useful."
So... actually, about that..

I started with 5 WD 4 TB disks - I needed more storage, so I bought 10 new Toshiba disks.

I added another vdev with 5 of the 6 TB Toshiba drives and then started swapping disks out of the original raid (one of the original 4 TB drives is left, as you can see).

So does that mean the problem occurred during or before I added the extra vdev to the pool?

If I remove all the data from before the vdev was added - will that resolve the problem?
 

NugentS (MVP, joined Apr 16, 2020, 2,947 messages)
You have an issue with those 4 disks, I think. Or, more likely, the cable attaching them (assuming it's the same cable).

"Some more clarity about how the drives are connected to the NAS would be useful." is what I wrote - you didn't answer - so I am forced to make assumptions.

Is it the same cable connecting the disks?
 

gilgedje (Dabbler, joined Jun 14, 2023, 14 messages)
NugentS said: "Is it the same cable connecting the disks?"
Same SAS cable.
I've switched cabling 3 times with brand new cables - probably not that..

And I have errors on all 5 disks - the 4 Toshibas and the 1 WD.

Why would I have a problem with just 4 drives if all 5 of them show errors?
 

NugentS (MVP, joined Apr 16, 2020, 2,947 messages)
OK - let me ask in a clearer manner.

How is each disk attached to the NAS - all the disks, every single one? Which cables are common to which disks? That LSI is a -16i, so you could have 4 drives attached to each cable, with 4 cables = 16 drives (without SAS expanders). Be careful to note which disks are generating errors.

Also - what case is this in, and how hot does the main chip on the LSI get under use? Use the touch test, and check that it gets plenty of airflow. If it's too hot to touch, it's too hot (which may be the cause of issues).

Also - are the chksum errors increasing or not? If not, then you may have fixed the issue already. Run a zpool clear and watch for new errors.

Basically, if multiple disks are generating errors, then you need to look upstream (cables and HBA).
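
For example, a minimal sketch (substituting your pool name):

zpool clear HDD_Pool        # reset the READ/WRITE/CKSUM counters to zero
zpool status -v HDD_Pool    # re-check over the next few days; counters should stay at 0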
 

gilgedje (Dabbler, joined Jun 14, 2023, 14 messages)
NugentS said: "How is each disk attached to the NAS ... are the chksum errors increasing or not? If not, then you may have fixed the issue already. Run a zpool clear and watch for new errors."
Now that you mention it, the number of checksum errors doesn't increase any more.
It might have already repaired itself.
Should I run a scrub then?

As to how it's all connected:
1) The case I'm using is the iStarUSA D-400 4U.
2) All the drives sit in BPN-DE350HD drive cages (I have 2 of those, 10 drives in total).
2.1) I changed the cooling on those, which still isn't enough, so I use an extra fan to help, which helps a lot, and I don't get temp errors anymore.
2.2) I have also come up with a permanent solution for this - so no worries.
3) The LSI -16i uses 3 ports and breaks out into 12 SATA connectors; 10 of them are connected to the drive cages, and no other drive is connected to this SAS controller.

Lastly, the SAS controller runs a little hot, but not that much. I will add a Noctua fan to it to keep it cooler.

Thanks for your help!
Hope I answered everything!
 

NugentS (MVP, joined Apr 16, 2020, 2,947 messages)
A chksum error is a detected but fixed error.

Run a zpool clear to clear any errors, and then, after all error counters are zeroed, run a scrub - the scrub will test the pool and you can watch for errors increasing - hopefully they won't.

Comments: The case in question is designed for fewer drives than you have, so you are pushing more heat into it than it was designed for. Most large rack-mounted cases have a somewhat noisy fan wall that pulls air through the drives and through the case. You do not have a fan wall, so you probably have insufficient air moving through the case. I note that you have added another fan and don't get temp warnings any more - but you may still be short on airflow.
[Yeah, I know - it's all "may" and "probably"]

I suggest running @joeschmuck's multi-report script as a cron job on a regular basis; it will give you a lot of detail about the state of your disks. Keep an eye on temperatures (min, max and actual).

I also hope you are running:
  1. Regular scrubs
  2. Regular short and long SMART tests
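
TrueNAS can schedule all of this from the GUI, but the manual CLI equivalents, if you want to spot-check, look roughly like this (/dev/sda is a placeholder for one of your disks):

zpool clear HDD_Pool && zpool scrub HDD_Pool   # zero the counters, then test the whole pool
smartctl -t short /dev/sda                     # quick (~2 min) self-test
smartctl -t long /dev/sda                      # full-surface read test (hours on a 6 TB disk)
smartctl -a /dev/sda                           # review the self-test log and SMART attributes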
 

gilgedje (Dabbler, joined Jun 14, 2023, 14 messages)
NugentS said: "Run a zpool clear to clear any errors, and then, after all error counters are zeroed, run a scrub - the scrub will test the pool and you can watch for errors increasing..."
Ok so..

After clearing the errors and running the scrub, I still have some errors.
The same 5 disks have them, and it's the same error as shown before.
Here is the output:

root@truenas[~]# zpool status -v HDD_Pool
  pool: HDD_Pool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 1 days 21:04:11 with 3 errors on Wed Jul 26 15:39:39 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        HDD_Pool                                  ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            4328d9e3-8f7e-4d55-b28f-7a0a146ec02c  ONLINE       0     0     6
            8fd88e5e-1684-4ee0-ad97-10c8f2ab1038  ONLINE       0     0     6
            5d55fb37-ea03-40d7-bac3-27f9670976d7  ONLINE       0     0     6
            0a3cca80-6630-4221-8143-cc3f715d3188  ONLINE       0     0     6
            f7913639-235e-4d45-b280-c70efbe6a631  ONLINE       0     0     6
          raidz1-1                                ONLINE       0     0     0
            07b1ddd1-2de9-412c-ade9-829ae8036c14  ONLINE       0     0     0
            64fbdf56-418a-4b77-bee7-838b11c4b90f  ONLINE       0     0     0
            21f5b275-84b8-40e8-8add-a50bb222e927  ONLINE       0     0     0
            2a728731-9dc7-46bc-9920-4f7f31a421e7  ONLINE       0     0     0
            3f34555f-f2ca-45f5-b517-b65f8c14ce10  ONLINE       0     0     0
        cache
          0328e86a-2234-4a15-a9c5-6a2f321576cc    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        HDD_Pool/downloads/qbit:<0x9ff0>


Any suggestion on what my next move should be?
I'm trying to save the data, since I have about 18 TB..

Regarding the case having more heat/disks than it should - I know, but I do what I can to keep it cool; I made some mods to the case and the HDD bays to that end, and it is working fine.
I will add an active fan to the SAS controller to keep it as cool as possible (I saw that it's common to do so).
 

NugentS (MVP, joined Apr 16, 2020, 2,947 messages)
OK - that is a metadata error, which usually means you have to destroy the pool and start again. However, in this case, simply deleting HDD_Pool/downloads/qbit - or maybe the whole downloads dataset - may fix the issue.

Let's look at the common parts that could be causing the issue:
Motherboard - no reason to think that's an issue.
LSI card - for the moment I don't think that's the issue - but more cooling is good.
SAS/SATA breakout cables - the issue is across 2 of them, but not all ports - so I don't think that's it.

So we are down to the BPN-DE350HD.

Is the following possible?
1. Power down.
2. Remove all disks from the affected BPN-DE350HD.
3. Attach the disks via the SAS/SATA breakout cable and SATA power connectors, direct to the PSU, each on its own power connector.
4. Put the disks on top of the case.
5. Get a big fan (house style) and point it at the disks - so they get some airflow.
6. Power up.
7. Make sure the pool has appeared.
8. Zpool clear & scrub.

See what happens.

Question - what PSU do you have?

My current thought (educated guess) is that 2 SATA power connectors for 5 3.5" HDDs is borderline not enough (those connectors are not nearly as good as Molex) - so an issue inside the cage MIGHT be causing the problem - so eliminate the cage and test.

Other comment:
I am guessing you don't have a backup - cue the Chorus Line (consisting of @jgreco @Davvo @winnielinnie @Arwen @danb35 and @joeschmuck amongst others):
"RAID is not a backup"
[image: achorusline0981.jpg]

That's @jgreco at the front - I shall let the others keep their anonymity.
 

gilgedje (Dabbler, joined Jun 14, 2023, 14 messages)
NugentS said: "Is the following possible? ... Remove all disks from the affected BPN-DE350HD ... attach the disks via the SAS/SATA breakout cable and SATA power connectors, direct to the PSU ... an issue inside the cage MIGHT be causing the problem - so eliminate the cage and test."
Ok, so:

I don't have the right cables to do a straight connection to the SAS drives, since I'm using a drive bay.

But I can try to move the disks between the drive bays:

Move the 5 disks that don't show problems to the bay where the drives with the errors are, and the ones that have errors to where the no-error disks were.

Then we can tell whether the problem is the drive bay or not.. am I right?

The PSU is 650W; I have enough headroom, in my opinion.

What do you think - should we try swapping the drives' locations?
 

NugentS (MVP, joined Apr 16, 2020, 2,947 messages)
That works.
 

gilgedje (Dabbler, joined Jun 14, 2023, 14 messages)
NugentS said: "That works."
OK SO....
IT WORKS! Everything is great!
It seems you actually caught the problem.
I actually modded my entire case, adding multiple fans:
I switched the fans that come with the cages, added 2 front fans right before the HDD cages to intake air,
and added a fan just above the PCIe lanes and on the SAS card.
The scrub ran last night with no issues.
I don't get any more temp issues,
and everything is actually working great.
(In the end I didn't need to change the drives' places - I had kind of messed up the power from the PSU and shared the boot drive's power with the 5-bay that had the problem - removing that probably helped too - but I think it was the SAS controller that was too warm.)
Thank you so much for your suggestions and help!
 

gilgedje (Dabbler, joined Jun 14, 2023, 14 messages)
Some pics for reference:
I used 12x15 Noctua fans in front and modded the panel a bit for them to fit, adding holes in the process,
and another fan where the PCIe lanes are, to remove heat better.
I also switched the cages' fans with larger Noctua fans.



[Attached photos: WhatsApp Image 2023-08-03 at 18.02.01.jpeg, 18.02.03.jpeg, 18.16.33.jpeg, 18.16.48.jpeg]
 

gilgedje (Dabbler, joined Jun 14, 2023, 14 messages)
NugentS said: "I suggest running @joeschmuck's multi-report script as a cron job ... I also hope you are running: 1. Regular scrubs 2. Regular short and long SMART tests"
Another question regarding regular SMART tests:
I usually run short SMART tests, and do a manual long test every 3 days.
I want to automate the long tests too.
What is your recommendation for scheduling the tests? I have 10 disks in total - 2 different raids of 5 in RAIDZ1.
Thanks in advance!
 

NugentS (MVP, joined Apr 16, 2020, 2,947 messages)
I do a short test every day, with a long test every two weeks.
I also run @joeschmuck's multi-report script, which emails me a daily report on disk status and a copy of the config file.
 

joeschmuck (Old Man, Moderator, joined May 28, 2011, 10,994 messages)
@gilgedje I'm impressed that someone out there is willing to make case modifications, great job!

I do have some advice... The fan mounted to the case lid exhausts air out of the case; if that fan is over the PCIe cards, I'd flip it over to blow air down onto the PCIe cards. This will not make your hard drive temps go up, but it will keep the PCIe cards cooler. From an exhaust perspective, you have two case fans already moving air out of the case, and could have a third fan in the power supply, but you would need to flip the power supply over so its fan is facing the CPU. This is just a thought, and if you are happy the way things are, that is great.

As for SMART tests, I run a short SMART test every day at 00:05 and a long SMART test once a week at 00:15. The short test takes less than 2 minutes, so there is no conflict even if the short test takes longer due to a failure. Everyone will have a different opinion, and it's up to you to decide what you want to do. Hey, I know people that do a long test once a month. Not that I would.
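
For reference, the raw cron equivalent of that schedule, as a sketch for a single hypothetical disk /dev/sda (TrueNAS normally handles this for you under Data Protection > Periodic S.M.A.R.T. Tests):

# m   h   dom mon dow   command
5     0   *   *   *     /usr/sbin/smartctl -t short /dev/sda   # daily short test at 00:05
15    0   *   *   0     /usr/sbin/smartctl -t long /dev/sda    # weekly long test, Sundays at 00:15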
 