Pool degraded no SMART failure

Nighteyes · Nov 2, 2021

For the past week I have had the following problem. My pool works fine for a few days, and then it is marked as degraded with a message that a disk is faulty.
The device in question always shows up with 1-10 write errors and 100-300 read errors. Each time it is a different disk; I checked the device ID with smartctl.

The first two times I ran a long SMART test on the disk in question, which it passed with no problems. Since the test found nothing, I took the disk offline and online again, resilvered, and ran a scrub. The pool then functions with no problems for a day, and then the same thing happens again.

Now 2 different disks are showing this and I'm sure they are not faulty:

Code:

    NAME                                            STATE     READ WRITE CKSUM
    Volume1                                         DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        gptid/97216be1-177a-11e9-8c11-6805ca843b8a  ONLINE       0     0     0
        gptid/9806778a-177a-11e9-8c11-6805ca843b8a  ONLINE       0     0     0
        gptid/98dac192-177a-11e9-8c11-6805ca843b8a  ONLINE       0     0     0
        gptid/99b4a619-177a-11e9-8c11-6805ca843b8a  FAULTED     10   298     0  too many errors
        gptid/3b44fea7-23f9-11e9-9f54-6805ca843b8a  ONLINE       0     0     0
        gptid/9b68622e-177a-11e9-8c11-6805ca843b8a  FAULTED      8   160     0  too many errors
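
For anyone who wants to pull the affected members out of a listing like this without reading it by eye, here is a minimal Python sketch. It assumes the column layout shown above (name, state, then plain integer READ/WRITE/CKSUM counters); the names `ROW` and `leaf_errors` are made up for illustration, and the sample is a shortened copy of the listing.

```python
import re

# Shortened copy of the `zpool status` listing above.
STATUS = """\
NAME                                            STATE     READ WRITE CKSUM
Volume1                                         DEGRADED     0     0     0
  raidz2-0                                      DEGRADED     0     0     0
    gptid/97216be1-177a-11e9-8c11-6805ca843b8a  ONLINE       0     0     0
    gptid/99b4a619-177a-11e9-8c11-6805ca843b8a  FAULTED     10   298     0  too many errors
    gptid/9b68622e-177a-11e9-8c11-6805ca843b8a  FAULTED      8   160     0  too many errors
"""

# name, state, then three integer counters; the header row does not match.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)")


def leaf_errors(text):
    """Return {device: (state, read, write, cksum)} for gptid members with errors."""
    found = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        name, state = match.group(1), match.group(2)
        read, write, cksum = (int(match.group(i)) for i in (3, 4, 5))
        # Only leaf devices (gptid/...) with at least one non-zero counter.
        if name.startswith("gptid/") and (read or write or cksum):
            found[name] = (state, read, write, cksum)
    return found


for device, counters in leaf_errors(STATUS).items():
    print(device, counters)
```

The sketch only reads text that has already been captured; it does not call `zpool` itself, and it would need adjusting if the counters are abbreviated (large counts are not always printed as plain integers).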

Code:

<INTEL SSDSCKJB150G7 N2010121>     at scbus2 target 0 lun 0 (pass0,ada0)
<ATA ST10000VN0004-1Z SC60>        at scbus4 target 6 lun 0 (pass1,da0)
<ATA ST10000VN0004-1Z SC60>        at scbus4 target 7 lun 0 (pass2,da1)
<ATA ST10000VN0004-1Z SC60>        at scbus4 target 9 lun 0 (pass3,da2)
<ATA ST10000VN0004-1Z SC60>        at scbus4 target 10 lun 0 (pass4,da3)
<ATA ST10000VN0004-1Z SC60>        at scbus4 target 11 lun 0 (pass5,da4)
<ATA ST10000VN0004-1Z SC60>        at scbus4 target 12 lun 0 (pass6,da5)

I'm running FreeNAS-11.3-U5.
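
The device list above (it has the shape of `camcontrol devlist` output) ties each `daN` name to a bus and target number, which is handy when the kernel log only mentions a target ID. A minimal Python sketch of that lookup follows; the names `LINE` and `targets` are illustrative, and the sample is a shortened copy of the listing.

```python
import re

# Shortened copy of the device list above.
DEVLIST = """\
<INTEL SSDSCKJB150G7 N2010121>     at scbus2 target 0 lun 0 (pass0,ada0)
<ATA ST10000VN0004-1Z SC60>        at scbus4 target 6 lun 0 (pass1,da0)
<ATA ST10000VN0004-1Z SC60>        at scbus4 target 7 lun 0 (pass2,da1)
"""

LINE = re.compile(
    r"^<(?P<model>[^>]+)>\s+at scbus(?P<bus>\d+) target (?P<target>\d+) "
    r"lun (?P<lun>\d+) \((?P<names>[^)]+)\)"
)


def targets(text):
    """Return {disk_name: (bus, target)} for every line that parses."""
    table = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            continue
        # Each entry lists a passthrough device (passN) and the disk device.
        disk = next(
            (n for n in match["names"].split(",") if not n.startswith("pass")),
            None,
        )
        if disk:
            table[disk] = (int(match["bus"]), int(match["target"]))
    return table


print(targets(DEVLIST))
```

With this listing, `da0` sits at target 6 on bus 4.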

This started happening after I went into my BIOS and disabled the C6 state (since I have a Zen+ CPU) and enabled ECC (which was set to Auto before); before this it had been running for about 2 years without any problems. I had to turn C6 off because after the upgrade to 11.3-U5 my system would hang after running for a month, with no indication in the logs that something went wrong; the only thing to do was to power it off. I never had this problem before the upgrade (I had the last version with the old UI).
Could it be that I lost data after a hard crash (the system would not react to anything but a ping)?

Before I start doing more things that could potentially harm my pool even more, I want some feedback on how to proceed. My first thought was setting ECC back to Auto.
Yesterday I did the pool upgrade suggested by FreeNAS.

I got 6 of the following disks:
Seagate IronWolf - ST10000VN0004

Rest of the system:
AMD Ryzen 5 2600
ASRock B450M Pro4
IBM ServeRAID M1015 SAS/SATA Controller for System x (flashed it to only use it for the extra sata ports)
Intel Gigabit CT Desktop Adapter
2x Kingston ValueRAM KVR24E17S8K2/16I (for a total of 32GB ECC non-buffered ram)
Intel DC S3520 M.2 150GB (has protection against power failures)
Cooler Master V Series V550

This is my first post, so I hope I didn't forget anything. I have searched around the forum but could not find anything similar; I might lack the right terminology for an effective search.

Samuel Tai · Nov 2, 2021

The likeliest explanation is either your HBA or your backplane is going bad. I had a similar situation with my pool, where I would have errors at a particular backplane slot when I ran a pool scrub, but the disks were fine in another slot. Once I replaced the backplane, the errors went away. So you have a hardware issue of some sort.

Nighteyes · Nov 2, 2021

I just noticed that I mislabeled read and write; it is the other way around.
This would mean that my IBM ServeRAID M1015 SAS/SATA Controller might be at fault, or the PCI-e slot where it is seated? I'll have a look at it.
What is the best course of action (after checking that the card is seated correctly): first replace the controller and/or motherboard, or first resilver the pool?

Samuel Tai · Nov 2, 2021

HBAs also tend to run hot, and this card may be overheating. I doubt the slot would be the issue. Try moving the card to a different slot to get more spacing around it for cooling, and see if you can get some airflow to it. Once it's running cooler, resilver your pool.

Nighteyes · Nov 2, 2021

Well, the case is filled with 5 Noctua fans, it has been open for the last 4 days, and the room temperature is 16 degrees. It ran perfectly at a room temperature of 34 degrees in the summer with a closed case.
The slot might be an issue, as I seated a GPU in another slot to change the BIOS settings (it normally runs headless); I might have hit something?

Thanks for your posts, I really appreciate the help!
In this case I just have to get a new HBA. Which isn't that bad. Motherboard is a bit more work.
I will let you know tonight if I get the time to have a look at the NAS.

Nighteyes · Nov 2, 2021

Code:

Nov  2 00:13:02 freenas ZFS: vdev state changed, pool_guid=11153755991132659217 vdev_guid=11428268283530093663
Nov  2 04:07:26 freenas     (da0:mps0:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 72 Aborting command 0xfffffe0001202e80
Nov  2 04:07:26 freenas mps0: Sending reset from mpssas_send_abort for target ID 6
Nov  2 04:07:26 freenas     (pass1:mps0:0:6:0): LOG SENSE. CDB: 4d 00 4d 00 00 00 00 00 40 00 length 64 SMID 654 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Nov  2 04:07:26 freenas     (da0:mps0:0:6:0): WRITE(10). CDB: 2a 00 45 ae 75 08 00 00 40 00 length 32768 SMID 813 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Nov  2 04:07:26 freenas     (da0:mps0:0:6:0): READ(16). CDB: 88 00 00 00 00 03 10 ca 17 60 00 00 00 40 00 00 length 32768 SMID 319 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): WRITE(10). CDB: 2a 00 45 ae 75 08 00 00 40 00
Nov  2 04:07:26 freenas     (da0:mps0:0:6:0): READ(10). CDB: 28 00 5b 73 26 f8 00 00 40 00 length 32768 SMID 640 terminated ioc 804b loginfo 31130000 scs(da0:mps0:0:6:0): CAM status: CCB request completed with an error
Nov  2 04:07:26 freenas i 0 state c xfer 0
Nov  2 04:07:26 freenas     (da0:mps0:0:6:0): READ(10). CDB: 28 00 5b 72 52 70 00 00 40 00 length 32768 SMID 1048 terminated ioc 804b loginfo 31130000 sc(da0:si 0 state c xfer 0
Nov  2 04:07:26 freenas mps0:0:6:0): Retrying command
Nov  2 04:07:26 freenas     (da0:mps0:0:6:0): READ(10). CDB: 28 00 5b 72 52 30 00 00 40 00 length 32768 SMID 836 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): READ(16). CDB: 88 00 00 00 00 03 10 ca 17 60 00 00 00 40 00 00
Nov  2 04:07:26 freenas mps0: (da0:mps0:0:6:0): CAM status: CCB request completed with an error
Nov  2 04:07:26 freenas Unfreezing devq for target ID 6
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): Retrying command
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): READ(10). CDB: 28 00 5b 73 26 f8 00 00 40 00
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): CAM status: CCB request completed with an error
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): Retrying command
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): READ(10). CDB: 28 00 5b 72 52 70 00 00 40 00
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): CAM status: CCB request completed with an error
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): Retrying command
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): READ(10). CDB: 28 00 5b 72 52 30 00 00 40 00
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): CAM status: CCB request completed with an error
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): Retrying command
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): CAM status: Command timeout
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): Retrying command
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): CAM status: SCSI Status Error
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): SCSI status: Check Condition
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): Error 6, Retries exhausted
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): Invalidating pack
Nov  2 04:07:26 freenas ZFS: vdev state changed, pool_guid=11153755991132659217 vdev_guid=14638836539066199327

This is part of the current log. Since both the CAM status and the SCSI status lines report errors, it might indeed be the controller, if I understand correctly.
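
To see at a glance which device and which kind of CAM error dominate a log like this, the lines can be tallied. Below is a minimal Python sketch; it assumes the `(daN:mpsN:bus:target:lun): CAM status: ...` shape seen above, and the names `CAM` and `tally` are made up for illustration. The sample lines are copied from the log.

```python
import re
from collections import Counter

# A few lines copied from the log above.
LOG = """\
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): CAM status: CCB request completed with an error
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): Retrying command
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): CAM status: CCB request completed with an error
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): CAM status: Command timeout
Nov  2 04:07:26 freenas (da0:mps0:0:6:0): CAM status: SCSI Status Error
"""

# device : controller : bus : target : lun, followed by the CAM status text.
CAM = re.compile(r"\((da\d+):(mps\d+):\d+:(\d+):\d+\): CAM status: (.+)$")


def tally(text):
    """Count CAM status messages per (device, controller, target, status)."""
    counts = Counter()
    for line in text.splitlines():
        # search() rather than match(), so interleaved console output
        # that glues two messages together can still be picked up.
        match = CAM.search(line)
        if match:
            device, controller, target, status = match.groups()
            counts[(device, controller, int(target), status)] += 1
    return counts


for key, count in tally(LOG).most_common():
    print(count, key)
```

Run over several days of logs, a tally like this shows whether the errors stay on one `daN`/target pair or wander across devices.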

Samuel Tai · Nov 2, 2021

Shuffle your disks around. If the errors follow the disk, then it's the disk that's failing. If the error stays at that port, it's that port, or cable, or the controller. If the errors randomly move, then it's the controller.
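
The rule of thumb above can be written down as a small decision function. This is only a sketch of that reasoning in Python: the function name `classify`, the serials, and the port labels are all hypothetical, and it assumes you have logged which disk serial sat on which port for each error, across at least one reshuffle.

```python
def classify(events):
    """events: list of (disk_serial, port) pairs, one per logged error."""
    if not events:
        return "no errors observed"
    serials = {serial for serial, _ in events}
    ports = {port for _, port in events}
    if len(serials) == 1 and len(ports) > 1:
        # Same disk keeps failing wherever it is plugged in.
        return "disk"
    if len(ports) == 1 and len(serials) > 1:
        # Different disks fail, but always in the same position.
        return "port, cable, or controller"
    if len(serials) > 1 and len(ports) > 1:
        # Errors move around between disks and positions.
        return "controller"
    return "inconclusive: shuffle the disks and observe again"


# Hypothetical observations: the error followed SERIAL-A from port0 to port3.
print(classify([("SERIAL-A", "port0"), ("SERIAL-A", "port3")]))
```

It is a heuristic, not a proof: two genuinely bad disks, for example, would also look like errors that move around.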

Spearfoot · Nov 2, 2021

Nighteyes said:

This is part of the current log. Since both the CAM status and the SCSI status lines report errors, it might indeed be the controller, if I understand correctly.

Yes, it might be either a bad drive or a bad controller.

All of those log errors seem to be for disk da0; so try moving that particular drive to a different location/slot/port. You may have to swap drives around. If the error follows the drive, then it follows that the drive itself is bad.

Otherwise the problem lies with your HBA. @Samuel Tai gave some good advice about cooling and re-seating the card. If you've tried that and the problem persists, you'll need to replace the HBA.

Nighteyes · Nov 5, 2021

I reseated the IBM ServeRAID M1015 SAS/SATA Controller. It might have helped, but it is a bit too soon to tell. All disks are connected to that controller, and each time it is a different disk (4 so far), so this probably rules out the two SAS-to-SATA cables.
I will probably order a new controller card and see if that solves the problem. I also got my new single-slot GPU, so I can enter the BIOS without too much hassle.

I will let you know when I know more :).

Nighteyes · Nov 10, 2021

I replaced my IBM ServeRAID M1015 and I'm still getting the CAM errors. Next step is replacing the motherboard :(.

I found the following topic:

CAM status: SCSI status error - what does it means?

Hello guys, during a scrub that started yesterday afternoon (and according to zpool status is scheduled to complete within 4 hours from now), I got an email from Freenas (security run output) with the following content that worries me a lot: Nov 9 16:47:24 freenas kernel: (da2:mps0:0:6:0)...

www.truenas.com

The suggestion there is to update the firmware of the SAS card. The firmware I used is perhaps older and might not be compatible with my FreeNAS setup? Would you agree that updating the firmware of either of the two cards is a good idea? I need the card to be running in IT mode.

Nighteyes · Nov 21, 2021

I also replaced my cables with sturdier ones; sadly I am still getting the dreaded:
SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)

Next is to switch the cables to the new controller. If this does not work, I might reset the changed BIOS settings to see if that helps, and/or order a new motherboard.

Code:

Nov 21 09:40:00 freenas     (da0:mps1:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 669 Aborting command 0xfffffe00012c7e10
Nov 21 09:40:00 freenas mps1: Sending reset from mpssas_send_abort for target ID 6
Nov 21 09:40:00 freenas     (da0:mps1:0:6:0): WRITE(10). CDB: 2a 00 30 51 a7 e8 00 00 08 00 length 4096 SMID 85 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Nov 21 09:40:00 freenas mps1: Unfreezing devq for target ID 6
Nov 21 09:40:00 freenas (da0:mps1:0:6:0): WRITE(10). CDB: 2a 00 30 51 a7 e8 00 00 08 00
Nov 21 09:40:00 freenas (da0:mps1:0:6:0): CAM status: CCB request completed with an error
Nov 21 09:40:00 freenas (da0:mps1:0:6:0): Retrying command
Nov 21 09:40:00 freenas (da0:mps1:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Nov 21 09:40:00 freenas (da0:mps1:0:6:0): CAM status: Command timeout
Nov 21 09:40:00 freenas (da0:mps1:0:6:0): Retrying command
Nov 21 09:40:00 freenas (da0:mps1:0:6:0): WRITE(10). CDB: 2a 00 30 51 a7 e8 00 00 08 00
Nov 21 09:40:00 freenas (da0:mps1:0:6:0): CAM status: SCSI Status Error
Nov 21 09:40:00 freenas (da0:mps1:0:6:0): SCSI status: Check Condition
Nov 21 09:40:00 freenas (da0:mps1:0:6:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Nov 21 09:40:00 freenas (da0:mps1:0:6:0): Retrying command (per sense data)
Nov 21 09:40:01 freenas (da0:mps1:0:6:0): WRITE(10). CDB: 2a 00 30 41 2f 50 00 00 40 00
Nov 21 09:40:01 freenas (da0:mps1:0:6:0): CAM status: SCSI Status Error
Nov 21 09:40:01 freenas (da0:mps1:0:6:0): SCSI status: Check Condition
Nov 21 09:40:01 freenas (da0:mps1:0:6:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Nov 21 09:40:01 freenas (da0:mps1:0:6:0): Retrying command (per sense data)
Nov 21 12:59:36 freenas     (da0:mps1:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 228 Aborting command 0xfffffe00012a3b40
Nov 21 12:59:36 freenas mps1: Sending reset from mpssas_send_abort for target ID 6
Nov 21 12:59:36 freenas     (da0:mps1:0:6:0): WRITE(10). CDB: 2a 00 30 41 7b f0 00 00 40 00 length 32768 SMID 485 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Nov 21 12:59:36 freenas     (da0:mps1:0:6:0): WRITE(10). CDB: 2a 00 30 56 fa 80 00 00 20 00 length 16384 SMID 577 terminated ioc 804b loginfo 31130000 scsi 0 state c xfer 0
Nov 21 12:59:36 freenas mps1: Unfreezing devq for target ID 6
Nov 21 12:59:36 freenas (da0:mps1:0:6:0): WRITE(10). CDB: 2a 00 30 41 7b f0 00 00 40 00
Nov 21 12:59:36 freenas (da0:mps1:0:6:0): CAM status: CCB request completed with an error
Nov 21 12:59:36 freenas (da0:mps1:0:6:0): Retrying command
Nov 21 12:59:36 freenas (da0:mps1:0:6:0): WRITE(10). CDB: 2a 00 30 56 fa 80 00 00 20 00
Nov 21 12:59:36 freenas (da0:mps1:0:6:0): CAM status: CCB request completed with an error
Nov 21 12:59:36 freenas (da0:mps1:0:6:0): Retrying command
Nov 21 12:59:36 freenas (da0:mps1:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Nov 21 12:59:36 freenas (da0:mps1:0:6:0): CAM status: Command timeout
Nov 21 12:59:36 freenas (da0:mps1:0:6:0): Retrying command
Nov 21 12:59:36 freenas (da0:mps1:0:6:0): WRITE(10). CDB: 2a 00 30 56 fa 80 00 00 20 00
Nov 21 12:59:36 freenas (da0:mps1:0:6:0): CAM status: SCSI Status Error
Nov 21 12:59:36 freenas (da0:mps1:0:6:0): SCSI status: Check Condition
Nov 21 12:59:36 freenas (da0:mps1:0:6:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Nov 21 12:59:36 freenas (da0:mps1:0:6:0): Retrying command (per sense data)
Nov 21 12:59:37 freenas (da0:mps1:0:6:0): WRITE(10). CDB: 2a 00 30 41 7c 30 00 00 40 00
Nov 21 12:59:37 freenas (da0:mps1:0:6:0): CAM status: SCSI Status Error
Nov 21 12:59:37 freenas (da0:mps1:0:6:0): SCSI status: Check Condition
Nov 21 12:59:37 freenas (da0:mps1:0:6:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Nov 21 12:59:37 freenas (da0:mps1:0:6:0): Retrying command (per sense data)

Spearfoot · Nov 21, 2021

Nighteyes said:

I also replaced my cables with sturdier ones; sadly I am still getting the dreaded:
SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)

Next is to switch the cables to the new controller. If this does not work, I might reset the changed BIOS settings to see if that helps, and/or order a new motherboard.

Still failing on da0, so I wonder if it's just a bad disk...

Nighteyes · Nov 29, 2021

Spearfoot said:
Still failing on da0, so I wonder if it's just a bad disk...

Sadly no, it has been failing on all disks (just to make sure, I checked the serial number each time after a reboot). I have ordered a new motherboard; the last thing I suspect is that the PCI-E traces were damaged when I plugged in my temporary video card. I had to use a bit more force than I was used to, since the case has a flaw.
I also changed the BIOS settings back to what they used to be, to no avail.

The weird part is that the whole NAS works just fine, writing and reading everything. It just kicks out the disks after a while. If it were some TLER bug, I would have expected it 2 years ago when I built it, or perhaps after upgrading to 11.3, not now.
It can't be that all disks decided to die on the same day... they aren't even from the same batch, since I had a DOA the first week after building it.

rvassar · Nov 29, 2021

I ran into a similar problem with a flashed PERC H310. I use a gamer case, and in the end the fix was ordering replacement 8087 breakout cables with locking 90 deg ends. The disks mount sideways in a tray, and the side of the case was imposing stress on two of the drives' cables. So it appeared to be drive affinity, but was actually cable/position affinity.

Jailer · Nov 29, 2021

Nighteyes said:
The firmware I used is perhaps older and might not be compatible with my Freenas setup?

You need to have the right firmware for the version of TrueNAS (FreeNAS) that you are running. What version of TrueNAS are you running and what firmware version is your HBA running?

Nighteyes · Dec 5, 2021

I used this guide:

IBM ServeRAID M1015 Part 4: Cross flashing to a LSI9211-8i in IT or IR mode

Guide for the IBM ServeRAID M1015 Part 4: Cross flashing to a LSI9211-8i in IT or IR mode

www.servethehome.com

With the following rom:
2118it
And since I got an UEFI bios:

Flashing IT Firmware to the IBM-ServeRAID-M1015 SAS HBA

The IBM M1015 is the go to Host Bus Adapter (HBA) for enthusiasts wanting a reliable and reasonably priced HBA for systems using advanced filesystems such as ZFS. I acquired a number of these cards for my file-server upgrade. I needed cards to support my new Norco RPC-4224 Chassis. There is a...

opticpow.io

Sadly, I'm not quite sure where to find the newer firmware. I searched for it but I'm not sure what to get.

I also replaced the motherboard, and the problem still persists; again with a different drive. Also, since the new motherboard has a newer BIOS version, I might have lost ECC support. Sadly I know of no good way to check that: booting Ubuntu, making my memory unstable, and then running edac-utils seems like a stupid solution. Is there a way to check whether ECC is enabled?

I currently run FreeNAS 11.3-U5 with the upgraded pool; and to be honest, the newer GUI looks nice, but I would rather not have upgraded, since it has given me only problems.

Nighteyes · Dec 5, 2021

rvassar said:
I ran into a similar problem with a flashed PERC H310. I use a gamer case, and in the end the fix was ordering replacement 8087 breakout cables with locking 90 deg ends. The disks mount sideways in a tray, and the side of the case was imposing stress on two of the drives' cables. So it appeared to be drive affinity, but was actually cable/position affinity.

I replaced all cables (the "luck" required for 4 cables to be dead seems a bit too high to me :P), I replaced the IBM ServeRAID M1015, and now I have also replaced the motherboard. I'm quite sure now that it isn't a hardware problem. CPU and memory pass Memtest X86+.

joeschmuck · Dec 6, 2021

So let me see if I have this correct: the system was basically fine except for locking up periodically, about once a month. You made two BIOS changes and all hell broke loose. I would recommend doing these in order and noting whether anything helps. I know you replaced the motherboard too, but if you install the original and put the system back to its original condition... here are some things to try.

1. Have you changed the BIOS settings back to the original settings to see if the new problems go away? If not, I would recommend changing the ECC setting first and then the C6 setting. Figure out which one is causing the issue. If that doesn't help, reset the BIOS to factory defaults; maybe you changed something else that you didn't realize.

2. You stated you are running 6 hard drives. If not already done, plug four of your drives into the motherboard SATA connectors, then use two ports from the M1015 card. See if this reduces the errors, and move the two drives around to different connectors on the M1015 to see if this helps or hurts. Maybe run one hard drive per cable set? But I don't understand how the BIOS change you said you made could have caused this.

3. If you find that the problem is the M1015, you could try a simple SATA board (see my signature) that could solve this issue and it doesn't need to be loaded with unique firmware.

Good Luck

Nighteyes · Dec 6, 2021

Well yes, and I added a graphics card for a short period of time.

1. I have since changed the settings back, and that didn't help. The ECC setting should not have changed anything; Auto behaves the same as Enabled (I looked that up). The C6 setting is on again as well.

2. I have tried 2 SAS cards with 4 different cables now. Also, I'm quite sure it can't be the cable, since I have now tracked some of the behaviour:

Code:

Disk layout
    SAS Con1    Sas Con2
1    ZA29MJ**    ZA2866**
2    ZA29MA**    ZA27ZL**
3    ZA29MK**    ZA29MJ**

04-12-2021 da0 ZA29MJ** SAS error
05-12-2021 da4 ZA29MK** SAS error
06-12-2021 da1 ZA27GL** SAS error

Also, on the 6th, da0 and da4 gave the error again. The pool has not degraded yet, no idea why. Perhaps there are not enough errors to kick a disk out.

Might the SAS errors have been caused by upgrading my ZFS pool?

Sadly I noticed the following errors:

Code:

Dec  5 11:36:29 freenas MCA: Bank 15, Status 0xd42040000000011b
Dec  5 11:36:29 freenas MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000
Dec  5 11:36:29 freenas MCA: Vendor "AuthenticAMD", ID 0x800f82, APIC ID 0
Dec  5 11:36:29 freenas MCA: CPU 0 COR OVER GCACHE LG RD error
Dec  5 11:36:29 freenas MCA: Address 0x400000050203100

It seems like ECC is working as intended, but this is not the best way to find out. I think this might have to do with the new motherboard and timing settings that are not correct.
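
The `Status` value in those MCA lines can be unpacked by hand. The Python sketch below is not a full decoder: it assumes the common x86 machine-check layout for the top status bits (VAL, OVER, UC, EN, MISCV, ADDRV, PCC) and the "memory hierarchy" form of the low 16-bit error code, and the name `decode` is made up for illustration.

```python
# Top flag bits of an x86 machine-check status register (assumed layout).
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}
# Fields of the memory-hierarchy error code: 000F 0001 RRRR TTLL.
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
KIND = {0: "I", 1: "D", 2: "G"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}


def decode(status):
    """Return (set flag names, corrected?, (request, kind, level) or None)."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    corrected = "VAL" in flags and "UC" not in flags
    code = status & 0xFFFF
    detail = None
    if (code & 0xEF00) == 0x0100:  # memory-hierarchy error code
        detail = (REQUEST.get((code >> 4) & 0xF, "?"),
                  KIND.get((code >> 2) & 0x3, "?"),
                  LEVEL[code & 0x3])
    return flags, corrected, detail


# Status value copied from the MCA log above.
print(decode(0xD42040000000011B))
```

For the value in the log this yields a valid, corrected (UC clear) error with the overflow flag set and a generic-cache, generic-level read, which lines up with the kernel's own summary `COR OVER GCACHE LG RD`. It does not say which hardware unit bank 15 corresponds to on this CPU.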

joeschmuck · Dec 7, 2021

Provide the output of "sas2flash.efi -list" (I think that is the command) so we can verify that the M1015 was flashed correctly. Not that we don't believe you, but it's easy to mess this up. Been there myself a long time ago.

Also, the other CPU error you got looks like an ECC issue or an unsupported CPU. I'm not sure and I didn't look into it very hard, but there are postings about that error message. I would also recommend that you burn in your CPU and RAM again to verify that you have no issues. You have made a number of changes and you may have induced an error.
