New Ryzen Build, migrated pools have file errors on newly written files

garpharm13

Cadet
Joined
Jan 31, 2023
Messages
6
Hello,

Long-time user, first post. I just migrated two pools to a new Ryzen build. The import went fine, and I configured new shares without incident. Hardware: Ryzen 5 2600, 64 GB RAM, Asus TUF X570 PRO motherboard. I have read a few posts about instabilities on Ryzen, so I have disabled ErP Ready and Global C-states. AMD Cool & Quiet doesn't seem to be listed in my BIOS. I am now also experiencing random reboots. I attempted to delete the files with errors, and hex entries then showed up under the listed file errors in their stead. (File names obscured for privacy reasons.) I will mention that one of these pools was originally migrated from FreeNAS, but it had been working in the previous TrueNAS system for some time. Any suggestions would be greatly appreciated. I'm at a loss and ready to put the old motherboard and CPU back in this system.


root@truenas[/]# zpool status -v
  pool: FreeNAS1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 05:15:51 with 0 errors on Sun Jan 1 03:15:52 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        FreeNAS1                                        ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/21e8bdff-fc4c-11e9-ba05-000c2955f9f8  ONLINE       0     0     0
            gptid/2b0ec511-fc4c-11e9-ba05-000c2955f9f8  ONLINE       0     0     0
            gptid/342e8678-fc4c-11e9-ba05-000c2955f9f8  ONLINE       0     0     0
            gptid/3d516fe5-fc4c-11e9-ba05-000c2955f9f8  ONLINE       0     0     0
            gptid/4689ced1-fc4c-11e9-ba05-000c2955f9f8  ONLINE       0     0     0
            gptid/47fcf0dd-fc4c-11e9-ba05-000c2955f9f8  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        FreeNAS1/FreeNAS1:<0x8755>
        /mnt/FreeNAS1/FreeNAS1/NzbDownload/Main/nzbget.lock

  pool: TrueNAS1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 15:31:31 with 0 errors on Sun Jan 22 13:31:39 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        TrueNAS1                                        ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/678239e6-70c8-11eb-a6f4-00259077b6da  ONLINE       0     0     0
            gptid/67d8315b-70c8-11eb-a6f4-00259077b6da  ONLINE       0     0     0
            gptid/6a8c0628-70c8-11eb-a6f4-00259077b6da  ONLINE       0     0     0
            gptid/6b485785-70c8-11eb-a6f4-00259077b6da  ONLINE       0     0     0
            gptid/6aeed4a8-70c8-11eb-a6f4-00259077b6da  ONLINE       0     0     0
            gptid/6d13de12-70c8-11eb-a6f4-00259077b6da  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        TrueNAS1/TrueNAS1:<0x8000>
        /mnt/TrueNAS1/TrueNAS1/Media/Series/File123.mkv
        TrueNAS1/TrueNAS1:<0x8002>
        /mnt/TrueNAS1/TrueNAS1/Media/Movies/File124.mkv

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:01 with 0 errors on Sun Jan 29 03:45:01 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nvd0p2  ONLINE       0     0     0
            nvd1p2  ONLINE       0     0     0

errors: No known data errors
 

garpharm13

Cadet
Joined
Jan 31, 2023
Messages
6
Now the pools are showing as degraded, with checksum errors on all drives. I find it hard to believe that all 12 drives are failing at once. Any help understanding why moving my pools from my old Xeon build to a Ryzen build would cause this would be greatly appreciated.

Could this be an HBA compatibility problem with Ryzen boards? Are there any commands I can run to see why this is happening? If I can fix the underlying problem, can I fix the checksum errors?
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
You didn't mention what kind of HBA you have.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
garpharm13 said:
Perc 200 flashed to SAS2008 and Dell SAS3008

Errors showing up on all drives at once are often a symptom of inadequate HBA cooling. A PERC H200 (I'm assuming) is designed to have airflow crossing it and blowing out the back; however, the H200 often comes with a solid PCIe slot cover. When the crappy little CPU on the PERC is not adequately cooled, its temperature can shoot up and the card can become unreliable. Consider the airflow you have in your chassis, and whether the heatsink on the RoC on your PERC card is getting insufficient airflow. It does not need to be a ton of air, but it does need SOME air. I have a number of machines where dislocation of a card is not a problem, and there I have simply added a well-ventilated slot blank. Not an endorsement, but something like


or the Supermicro ones. There are also specific vented brackets you can get for the card, which may be a better idea if there is any chance the card could come loose.
 

garpharm13

Cadet
Joined
Jan 31, 2023
Messages
6
Thank you. I have been using these cards for a few years in the previous build without issues. There is a fan on the side of the case, but it was unplugged during the system upgrade. That fan is now working and blowing directly on the cards.

If I can get the cards cool, is there any way to bring the pool back from the degraded state?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
If the data has actually been written to the drives corrupted, then the only hope is that there is enough redundancy that it can be recovered that way.

If, on the other hand, the data was merely being misread, that is, corrupted by the HBA during read operations, it is most likely out on disk in a perfectly usable format, and once you start cooling the HBA, it is very likely to be just fine.

The real question is whether or not lots of stuff was written to the pool, especially stuff like metadata, atime updates, etc.
 

garpharm13

Cadet
Joined
Jan 31, 2023
Messages
6
Not a lot was written to the pool. Will the degraded status resolve on its own once the overheating card issue is fixed, assuming the data was only corrupted during read operations?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The degraded status is likely due to errors detected in the data being read. You can try a "zpool clear" to see if it resolves.
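A rough sketch of that sequence, assuming a root shell and that the HBA cooling has already been sorted out: clear the counters, start a scrub so every block is re-read and verified, then look at the status again. The scrub goes a step beyond a bare `zpool clear`; it is included on the assumption that you want to confirm the data on disk is intact. `clear_and_scrub` is just a made-up helper name.

```sh
# Hypothetical helper: reset a pool's error counters, then re-verify its data.
# Needs root privileges; pass the pool name as the only argument.
clear_and_scrub() {
    pool="$1"
    # Reset the READ/WRITE/CKSUM counters and the logged errors for the pool.
    zpool clear "$pool" || return 1
    # Start a scrub; it runs in the background and re-reads every block.
    zpool scrub "$pool" || return 1
    # Show pool state, scrub progress, and any files with permanent errors.
    zpool status -v "$pool"
}
```

Call it with the pool name, for example `clear_and_scrub FreeNAS1`. If the counters stay at zero once the scrub has finished, the earlier errors were most likely on the read side only.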
 

garpharm13

Cadet
Joined
Jan 31, 2023
Messages
6
Somewhere along the way I ended up with checksum errors on one pool, although it is no longer showing as degraded. I 3D printed fan mounts for the cards and have a side case fan blowing directly on them. The pool status still shows an orange triangle, and I am unable to clear it with zpool clear <pool>. Is there any way to clear the errors or repair the checksums? I'm not too worried about corrupt data; none of this data is critical.

truenas% zpool status -v
  pool: FreeNAS1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 3.53M in 04:04:11 with 0 errors on Thu Feb 9 23:14:21 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        FreeNAS1                                        ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/21e8bdff-fc4c-11e9-ba05-000c2955f9f8  ONLINE       0     0    16
            gptid/2b0ec511-fc4c-11e9-ba05-000c2955f9f8  ONLINE       0     0    18
            gptid/342e8678-fc4c-11e9-ba05-000c2955f9f8  ONLINE       0     0    21
            gptid/3d516fe5-fc4c-11e9-ba05-000c2955f9f8  ONLINE       0     0    18
            gptid/4689ced1-fc4c-11e9-ba05-000c2955f9f8  ONLINE       0     0    17
            gptid/47fcf0dd-fc4c-11e9-ba05-000c2955f9f8  ONLINE       0     0    23

errors: No known data errors
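
To keep an eye on whether those CKSUM counts keep growing, the rows with non-zero counts can be totalled with a small awk filter. This is only a sketch: `cksum_report` is a hypothetical helper name, it assumes the five-column NAME STATE READ WRITE CKSUM layout shown above, and it would skip any counter that zpool prints in abbreviated form (for example with a K suffix).

```sh
# Hypothetical helper: read `zpool status` output on stdin and print every
# row with a non-zero CKSUM count, followed by the sum of those counts.
# Pool- and vdev-level rows are included too if their own counter is non-zero.
cksum_report() {
    awk '
        # Config rows have exactly five fields with numeric error counters.
        NF == 5 && $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ {
            if ($5 > 0) {
                printf "%s %d\n", $1, $5
                total += $5
            }
        }
        END { printf "total %d\n", total + 0 }
    '
}
```

Typical use would be `zpool status -v FreeNAS1 | cksum_report`. For the FreeNAS1 output above, that prints the six gptid rows followed by `total 113`.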
 