Jay Gardner (Cadet; joined Feb 24, 2014; messages: 4)
TrueNAS Core 12.0-U8
32 GB RAM originally (16 GB Crucial and 16 GB Kingston, both 1333). Now running on 16 GB, the Kingston, after the system wouldn't boot following 7 years of running fine.
Supermicro X10SL7-F
Xeon E3-1265L V4
8 x 3TB drives configured as 4 mirrored pairs (all Seagate NAS drives)
1 x 60 GB OCZ Vertex SSD boot drive
The system has been running for 7 years. Originally the CPU was a Core i3-4160; about 1.5 years ago I bought the Xeon, used. The boot SSD has also been in use for about 1.5 years; prior to that the system booted from flash. Drives have changed over time, and smartctl seems to show that the health of all drives is fine.
A few days ago, I found the system off. When I tried to reboot, it wouldn't start and showed boot code 55 on the screen, which I found indicates a memory issue. I began removing memory sticks until I found it would boot with memory installed only in channel A. I originally thought it was a board problem (and it may still be). But after some more work, I was able to get it to boot with all 4 sticks of RAM. Channel B seems to be more temperamental about which sticks will work in it: the Kingston memory didn't want to work in channel B. I don't remember which memory was in which channel prior to these issues.
But after a few hours, the system was off again. So I removed the memory from channel B and used the Crucial memory in channel A. This time, while it did boot, it began displaying memory errors on the console:
APEI Corrected Memory Error
Node: 0
Device: 0
Memory Error Type 2
I also had the same error for Device: 1. If I understand this correctly, this suggests both sticks of the Crucial memory are bad. Correct?
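One way to check whether the errors really come from more than one device is to tally the Device: lines from a saved copy of the console log. This is only a sketch: the function name count_apei_devices is made up for illustration, and it assumes each error is logged in the same four-line shape shown above.

```shell
# Hypothetical helper: count "Device: N" lines in console/log text on stdin.
# Assumes the APEI records look like the ones pasted above.
count_apei_devices() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "Device:") n[$(i + 1)]++ }
       END { for (d in n) print "Device " d ": " n[d] " error(s)" }' | sort
}

# Demo on sample text; on the live system the input could be a saved log file.
count_apei_devices <<'EOF'
APEI Corrected Memory Error
Node: 0
Device: 0
Memory Error Type 2
APEI Corrected Memory Error
Node: 0
Device: 1
Memory Error Type 2
EOF
```

How a Device number maps to a physical slot is an assumption that would need checking against the board manual, so a tally like this only narrows things down.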
So I then swapped out the Crucial memory for the Kingston memory, still in channel A. Now the system stays up, but it is reporting data corruption errors on both the boot and data pools:
root@freenas[~]# zpool status -v
  pool: freenas-boot
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:01:07 with 0 errors on Thu Apr 14 03:46:07 2022
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          ada2p2      ONLINE       0     0     1

errors: Permanent errors have been detected in the following files:

        /var/db/system/rrd-f62aea877c404489b2f9c112843f66c1/localhost/disk-ada0/disk_ops.rrd

  pool: zfs3t
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 03:56:50 with 0 errors on Sun Apr 17 03:56:50 2022
config:

        NAME                                            STATE     READ WRITE CKSUM
        zfs3t                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/f6094f20-078a-11e4-90d0-6805ca1a8b59  ONLINE       0     0     0
            gptid/c8edf3e5-ad98-11ea-a1a1-0cc47aaa6584  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/6052474d-8f5b-11e6-8fca-6805ca1a8b59  ONLINE       0     0     0
            gptid/60ee4b8b-8f5b-11e6-8fca-6805ca1a8b59  ONLINE       0     0     0
          mirror-2                                      ONLINE       0     0     0
            gptid/9c0f128f-8f5b-11e6-8fca-6805ca1a8b59  ONLINE       0     0     0
            gptid/9cbd6a3c-8f5b-11e6-8fca-6805ca1a8b59  ONLINE       0     0     0
          mirror-3                                      ONLINE       0     0     0
            gptid/cbd30b4b-b042-11ea-b3f9-0cc47aaa6584  ONLINE       0     0     1
            gptid/ccb738ed-b042-11ea-b3f9-0cc47aaa6584  ONLINE       0     0     1

errors: Permanent errors have been detected in the following files:

        zfs3t/jftp@auto-2022-04-19_15-04:/jcam/LR/2022Y04M19D08H236764-video.mp4
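With this many devices, it can help to filter the status output down to the rows whose READ, WRITE or CKSUM counters are non-zero. The following is a rough sketch, not a TrueNAS feature: nonzero_errors is a hypothetical helper name, and it assumes the counters are printed as plain integers (large counts that zpool abbreviates, such as 1.2K, would be skipped).

```shell
# Hypothetical helper: list devices with a non-zero READ, WRITE or CKSUM counter.
# Reads `zpool status` text on stdin; only rows with three integer counters match.
nonzero_errors() {
  awk 'NF == 5 && $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ && ($3 + $4 + $5) > 0 {
         print $1, "read=" $3, "write=" $4, "cksum=" $5
       }'
}

# Typical use on the live system would be:
#   zpool status | nonzero_errors
# Demo on a sample taken from the boot pool output above:
nonzero_errors <<'EOF'
NAME STATE READ WRITE CKSUM
freenas-boot ONLINE 0 0 0
ada2p2 ONLINE 0 0 1
EOF
```

On the output above this would single out ada2p2 on the boot pool and both members of mirror-3 on the data pool.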
On the data pool, I've swapped out the referenced file.
QUESTION - with the error on the boot pool, do I essentially need to reinstall TrueNAS from scratch and import my data pools?
There was another error indicating that a ZFS replication task had failed; it completed only partially. (I have a TrueNAS Scale system on another machine that I was replicating certain datasets to.) The message indicates "A resuming stream can be generated on the sending system by running: zfs send -t a-really-long-string". But I can't figure out how to use this command. I tried using (from the TrueNAS Core system):
zfs send -t "the really long string" | ssh root@mytruenas-scale-system -s name-of-dataset-that-receives-replication-from-core
but it didn't work:
warning: cannot send 'zfs3t/jftp@auto-2022-04-19_15-04': Input/output error
cannot receive resume stream: checksum mismatch or incomplete stream.
QUESTION - am I using this command correctly?
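For reference, the resume form described in the zfs man pages pipes zfs send -t into zfs receive -s on the destination, with the token read from the receive_resume_token property of the partially received dataset. The sketch below only prints that pipeline instead of running it; resume_cmd and the dataset name tank/backup/jftp are hypothetical placeholders, not names from my systems.

```shell
# On the receiving (Scale) side, the saved token can be read with:
#   zfs get -H -o value receive_resume_token tank/backup/jftp
# (tank/backup/jftp is a placeholder for the real destination dataset.)

# Hypothetical helper: print the resume pipeline for review (dry run only).
resume_cmd() {
  token="$1"    # value of receive_resume_token
  host="$2"     # ssh destination, e.g. root@mytruenas-scale-system
  target="$3"   # destination dataset on the receiving system
  printf 'zfs send -t %s | ssh %s zfs receive -s %s\n' "$token" "$host" "$target"
}

resume_cmd "the-really-long-string" "root@mytruenas-scale-system" "tank/backup/jftp"
```

If the snapshot itself references a damaged block, the send may still stop with an Input/output error no matter how the command is written.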
If I can't get this to work, I suppose I can just rerun the snapshot, ignore or delete the original snapshot, and rerun the replication, which is fine.
Finally, I'm trying to determine whether I have a board problem or a CPU problem in addition to what now looks like a memory problem. To me the Crucial memory looks like it has gone bad, given the memory errors above. I've ordered a replacement X10 board and I still have the i3-4160 CPU...
QUESTIONS
- Does it make sense to first swap out the Xeon for the Core i3 and see if I can get the system restarted with memory in both channel A and B before I receive the new memory?
- If I still have problems using channel B on the X10 board with the Core i3 CPU, does it follow that the X10 board is likely the problem?
- Any other suggestions or guidance on what to do from here?
Thanks in advance.