Pool keeps degrading.

TheUsD · Dec 28, 2021

TrueNAS Setup:
OS: TrueNAS-12.0-U7
MB: GIGABYTE H370M DS3H
CPU: Intel i5 8500
RAM: 4x8GB (32GB) DDR4 2400 (NON-ECC)
HBA: SAS9211-8I 8PORT Int 6GB (in IT Mode) <- in a PCIe 4x slot
NIC: dual Intel 82599 SFP+ 10Gbps DACs (in LACP)
Case: SilverStone RM21-308
PSU: 600watt

A few months ago, one of my RAIDZ1 pools of four 4TB drives (da4-da7) went degraded after two drives (da5 and da7) started having W/R errors. Without putting a lot of thought into the issue, I figured it was time for new drives since they were from 2014.

I replaced them with new 8TB IronWolfs (ATA ST8000VN004-2M2). After creating a new pool and dataset, the same two disk bays reported that the new drives were having W/R errors. I ran long, short, conveyance, and offline S.M.A.R.T. tests on each drive and they all came back successful. I reached out to SilverStone to see if I could get a replacement backplane. The first backplane was DOA. Now I have a second backplane, and any disk in da4 reports W/R errors as soon as I try to transfer data via the NFS share attached to ESXi hosts. The data travels over the SFP+ card.

If I try to export/destroy the degraded pool that contains disks da4-da7 and there is no data on the pool, it always hangs at 60% and the GUI becomes unresponsive. If I try to shut down or reboot with PuTTY or the shell, TrueNAS hangs at a random spot in the shutdown process. I've let it sit like this for about an hour before forcing a shutdown.

I tried swapping cables from the backplane to see if the issue would travel, in hopes it was either a bad cable or a bad HBA card, but I think I might have screwed that up because I did not export the pool before moving the cable. I just shut down the TrueNAS box, moved the cable and booted up.

Before I cause harm or damage to my other two pools that are working perfectly, what are some good troubleshooting steps to take?

Some of the errors I see when creating pools and datasets and trying to transfer data:

Code:

Dec 28 16:39:42 TrueNAS-Container (da7:mps0:0:15:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Dec 28 16:39:42 TrueNAS-Container (da7:mps0:0:15:0): Retrying command (per sense data)
Dec 28 16:39:42 TrueNAS-Container (da7:mps0:0:15:0): READ(6). CDB: 08 00 00 80 01 00
Dec 28 16:39:42 TrueNAS-Container (da7:mps0:0:15:0): CAM status: SCSI Status Error
Dec 28 16:39:42 TrueNAS-Container (da7:mps0:0:15:0): SCSI status: Check Condition
Dec 28 16:39:42 TrueNAS-Container (da7:mps0:0:15:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Dec 28 16:39:42 TrueNAS-Container (da7:mps0:0:15:0): Retrying command (per sense data)
Dec 28 16:39:42 TrueNAS-Container (da7:mps0:0:15:0): READ(6). CDB: 08 00 00 80 01 00
Dec 28 16:39:42 TrueNAS-Container (da7:mps0:0:15:0): CAM status: SCSI Status Error
Dec 28 16:39:42 TrueNAS-Container (da7:mps0:0:15:0): SCSI status: Check Condition
Dec 28 16:39:42 TrueNAS-Container (da7:mps0:0:15:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Dec 28 16:39:42 TrueNAS-Container (da7:mps0:0:15:0): Retrying command (per sense data)
Dec 28 16:39:42 TrueNAS-Container (da7:mps0:0:15:0): READ(6). CDB: 08 00 00 80 01 00
Dec 28 16:39:42 TrueNAS-Container (da7:mps0:0:15:0): CAM status: SCSI Status Error
Dec 28 16:39:42 TrueNAS-Container (da7:mps0:0:15:0): SCSI status: Check Condition
Dec 28 16:39:42 TrueNAS-Container (da7:mps0:0:15:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Dec 28 16:39:42 TrueNAS-Container (da7:mps0:0:15:0): Error 5, Retries exhausted
Dec 28 16:39:42 TrueNAS-Container GEOM_ELI: Device mirror/swap4.eli created.
Dec 28 16:39:42 TrueNAS-Container GEOM_ELI: Encryption: AES-XTS 128
Dec 28 16:39:42 TrueNAS-Container GEOM_ELI:

Code:

Dec 28 17:00:07 TrueNAS-Container (da7:mps0:0:15:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Dec 28 17:00:07 TrueNAS-Container (da7:mps0:0:15:0): Error 5, Retries exhausted
Dec 28 17:00:07 TrueNAS-Container (da7:mps0:0:15:0): READ(10). CDB: 28 00 00 3f ff 80 00 01 00 00
Dec 28 17:00:07 TrueNAS-Container (da7:mps0:0:15:0): CAM status: SCSI Status Error
Dec 28 17:00:07 TrueNAS-Container (da7:mps0:0:15:0): SCSI status: Check Condition
Dec 28 17:00:07 TrueNAS-Container (da7:mps0:0:15:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Dec 28 17:00:07 TrueNAS-Container (da7:mps0:0:15:0): Error 5, Retries exhausted
Dec 28 17:00:07 TrueNAS-Container GEOM_MIRROR: Device mirror/swap1 launched (2/2).
Dec 28 17:00:07 TrueNAS-Container GEOM_ELI: Device mirror/swap0.eli created.
Dec 28 17:00:07 TrueNAS-Container GEOM_ELI: Encryption: AES-XTS 128
Dec 28 17:00:07 TrueNAS-Container GEOM_ELI:     Crypto: hardware
Dec 28 17:00:07 TrueNAS-Container GEOM_ELI: Device mirror/swap1.eli created.
Dec 28 17:00:07 TrueNAS-Container GEOM_ELI: Encryption: AES-XTS 128
Dec 28 17:00:07 TrueNAS-Container GEOM_ELI:

As of this last test, da5 and da7 are back to showing bad W/Rs, but all day Sunday and Monday it was only da4.
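
To see at a glance which device is throwing the errors, a log excerpt like the ones above can be tallied per disk. The following is only a minimal Python sketch: the function name `count_exhausted_retries` and the regular expression are illustrative assumptions based on the line format shown in this post, not part of TrueNAS.

```python
import re
from collections import Counter

# Matches the CAM device prefix seen in the excerpts, e.g. "(da7:mps0:0:15:0):".
# The pattern is an assumption derived from those lines only.
DEVICE_RE = re.compile(r"\((da\d+):mps\d+:\d+:\d+:\d+\):\s*(.*)")


def count_exhausted_retries(lines):
    """Count 'Retries exhausted' events per disk device name."""
    counts = Counter()
    for line in lines:
        match = DEVICE_RE.search(line)
        if match and "Retries exhausted" in match.group(2):
            counts[match.group(1)] += 1
    return counts


# A few lines copied from the second excerpt above.
sample = [
    "Dec 28 17:00:07 TrueNAS-Container (da7:mps0:0:15:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)",
    "Dec 28 17:00:07 TrueNAS-Container (da7:mps0:0:15:0): Error 5, Retries exhausted",
    "Dec 28 17:00:07 TrueNAS-Container (da7:mps0:0:15:0): READ(10). CDB: 28 00 00 3f ff 80 00 01 00 00",
    "Dec 28 17:00:07 TrueNAS-Container (da7:mps0:0:15:0): Error 5, Retries exhausted",
]

for device, total in count_exhausted_retries(sample).items():
    print(f"{device}: {total} exhausted-retry events")
```

Run against a longer log, a tally like this would show whether the errors stay on one device or spread across several.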

If you need more info or believe I have forgotten something, please feel free to inquire.

JohnDigital · Dec 29, 2021

TheUsD said:
I tried swapping cables from the backplane to see if the issue would travel, in hopes it was either a bad cable or a bad HBA card, but I think I might have screwed that up because I did not export the pool before moving the cable. I just shut down the TrueNAS box, moved the cable and booted up.

These types of errors are usually associated with cabling of some sort. However, I tend to take them as some of the earliest signs you'll get that a disk is going bad or was always bad, which has happened to me out of the box, knowing the cabling was brand new and properly connected.

I've had faulty power supplies/splitters cause these types of things.

Try reseating all connectors (power and data), physically reseating the HBA and the cables, and trying new cabling wherever possible.

If the issue persists, try running those 4 disks off a different port on the HBA or directly off SATA. Do the errors stay the same or follow to the new ports? Be advised that you should be able to connect disks to whatever ports you have available; this should not harm anything, as TrueNAS uses GPTIDs for disk import. If it finds the disks, there's no issue.

TheUsD · Jan 3, 2022

Moved my other pool, which was having no issues, to the top backplane, cable, port and power source the degrading pool was using. No issues with the existing pool.

Obtained 4 new IronWolf disks and placed them on the same bottom backplane, cable, port and power source the healthy pool came from, and experienced the same issues.

Found 4 old 4TB WD Reds and placed them on the same bottom backplane, cable, port and power source as the degrading pool in question. One of the drives has a SMART failure; however, I was able to move data to and from the pool without any degrading. Looking at the differences between the 4TB WDs and the 8TB IronWolfs, I noticed a difference in power draw:
4TB WD: +5VDC: 0.60A, +12VDC: 0.45A
8TB IW: +5VDC: 0.85A, +12VDC: 0.99A

Could my issues be a result of insufficient power? If so, would this not have affected the other 4 8TB Exos?
Their general power consumption is: +5VDC: 0.75A, +12VDC: 0.99A

The PSU is a 530 watt unit, powering 8 spinning disks, 6 SATA drives, 4 RAM slots, 3 60mm fans, 2 M.2, 1 HBA 8i, 1 dual SFP+ NIC, an Intel i5 8500, and a partridge in a pear tree.
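
As a rough sanity check on those label ratings, the per-drive wattage can be worked out as volts times amps on each rail. This is a minimal Python sketch, assuming the label figures quoted above are accurate; the helper name `label_watts` is made up for illustration, and label ratings say nothing about start-up surges or the rest of the system.

```python
# Label ratings quoted in this post: (amps on +5 V, amps on +12 V).
DRIVES = {
    "4TB WD Red": (0.60, 0.45),
    "8TB IronWolf": (0.85, 0.99),
    "8TB Exos": (0.75, 0.99),
}


def label_watts(amps_5v, amps_12v):
    """Power implied by the label: P = V * I, summed over both rails."""
    return 5.0 * amps_5v + 12.0 * amps_12v


for name, (amps_5v, amps_12v) in DRIVES.items():
    print(f"{name}: {label_watts(amps_5v, amps_12v):.2f} W per drive")

# Combined +12 V label current for four IronWolfs plus four Exos.
total_12v_amps = 4 * DRIVES["8TB IronWolf"][1] + 4 * DRIVES["8TB Exos"][1]
print(f"+12 V label current for the eight 8TB drives: {total_12v_amps:.2f} A")
```

By this arithmetic each 8TB IronWolf is rated at roughly twice the 4TB WD Red (about 16.1 W versus 8.4 W), with most of the gap on the +12V rail.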

jgreco · Jan 3, 2022

TheUsD said:
(in IT Mode)

Also make sure you're running firmware 20.00.07.00. Random errors can be a symptom of old firmware.

TheUsD · Jan 3, 2022

jgreco said:
Also make sure you're running firmware 20.00.07.00. Random errors can be a symptom of old firmware.

Yes, I have confirmed that.

Code:

 Adapter Selected is a LSI SAS: SAS2008(B2)

        Controller Number              : 0
        Controller                     : SAS2008(B2)
        PCI Address                    : 00:03:00:00
        SAS Address                    : 500605b-0-013c-a580
        NVDATA Version (Default)       : 14.01.00.08
        NVDATA Version (Persistent)    : 14.01.00.08
        Firmware Product ID            : 0x2213 (IT)
        Firmware Version               : 20.00.07.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9211-8i
        BIOS Version                   : 07.39.02.00
        UEFI BSD Version               : N/A
        FCODE Version                  : N/A
        Board Name                     : SAS9211-8i
        Board Assembly                 : N/A
        Board Tracer Number            : N/A
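
For anyone wanting to check the same thing from a saved listing, the fields can be pulled out as key/value pairs. This is a minimal Python sketch, assuming output in the "Name : Value" layout pasted above; `parse_adapter_listing` is a hypothetical helper, not part of any LSI tool.

```python
def parse_adapter_listing(text):
    """Turn 'Name : Value' lines into a dict; other lines are ignored."""
    fields = {}
    for line in text.splitlines():
        # Split on the first " : " only, so values such as a PCI
        # address that contain colons stay intact.
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# A few lines copied from the listing above.
listing = """
 Adapter Selected is a LSI SAS: SAS2008(B2)

        Controller                     : SAS2008(B2)
        PCI Address                    : 00:03:00:00
        Firmware Product ID            : 0x2213 (IT)
        Firmware Version               : 20.00.07.00
"""

fields = parse_adapter_listing(listing)
expected = "20.00.07.00"  # the version recommended earlier in this thread
status = "matches" if fields.get("Firmware Version") == expected else "differs from"
print(f"Firmware {fields.get('Firmware Version')} {status} {expected}")
```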

JohnDigital · Jan 3, 2022

JohnDigital said:
I've had faulty power supplies/splitters cause these types of things

Yes, power can cause these types of errors too. Maybe not in this case, though, as these errors mostly seem to be a mystery.

It's a very good observation and IMO something to look into further.

TheUsD · Jan 6, 2022

Insufficient power. Replaced with a 750watt gold and the pool has no longer degraded.

jgreco · Jan 6, 2022

TheUsD said:
Insufficient power. Replaced with a 750watt gold and the pool has no longer degraded.

Ah crud. That was actually obvious; I shoulda read your hardware manifest and followup message more closely (or at all?)

Proper Power Supply Sizing Guidance

I've seen about 1,000 threads like this one where people decide that they can power a dozen hard drives off a 360 watt supply. DO NOT DO THIS. I've seen another 1,000 threads where people decide to buy the cheapest power supply that they can find. DO NOT DO THIS. Your NAS lives or dies by...

www.truenas.com

Your original message said

PSU: 600watt

and I was seeing a single-drive pattern in your HBA errors, and a 600W PSU would be in the "that could be a bit tight but should work" range. The 530 you mention later feels uncomfortably small.

TheUsD · Jan 7, 2022

jgreco said:
Your original message said

PSU: 600watt

It did, because honestly I thought I didn't choose anything less.
When I originally built the NAS box, I put a small 430W PSU in there for the 6 SSDs. When I added more storage (4 8TB drives and 4 4TB drives), I opted to get a larger PSU. I was pretty sure I had purchased a 600W. Because of the issues I was having related to this post, I investigated further and found out I was incorrect.

Apologies for posting a misleading and incorrect H/W specification.

I do appreciate your time and help, though!

Now, with this resolved, I have to figure out what I am going to do with the spare LSI 9210 and 9300 I purchased in case the HBA card was going bad, haha.
