Failed and Degraded disks

sash

Dabbler
Joined
Jun 12, 2017
Messages
14
Hello. Brand new to TrueNAS. I built a system to host all my personal data, Plex, VMs, and more. I acquired 12 white-label Seagate 20TB SAS drives from https://shop.digitalspaceport.com
Before putting any data on them I decided to run some tests. I just finished a long SMART test on all of them, which took about 1.5 days, and then ran a very quick scrub (about 3 seconds, since the drives are empty). Now, under the Devices section, I see two faulted and two degraded disks. Below is the command output for the drives; the info is practically identical for all four, so I am giving the first drive with FAULTED status. Please help me figure out whether the drives should be sent back for replacement, as I have only 30 days to return them.
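
For reference, this is roughly the test sequence I ran (a sketch; it assumes the twelve data disks are sda through sdl, so adjust the device names to match your system):

Code:
# start a background long (extended) self-test on every disk
for d in /dev/sd[a-l]; do sudo smartctl -t long "$d"; done
# check the self-test logs once they finish (took ~1.5 days here)
for d in /dev/sd[a-l]; do sudo smartctl -l selftest "$d"; done
# then scrub the (empty) pool
sudo zpool scrub RAIDZ2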

Code:
xxxx@truenas ~ % sudo smartctl -a /dev/sdd
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.107+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:             
Product:              OOS20000G
Revision:             OOS1
Compliance:           SPC-5
User Capacity:        20,000,588,955,648 bytes [20.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500da75356f
Serial number:        0007QPYR0000W319HVE7
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Sep 22 21:32:50 2023 EDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     38 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 278:31
Manufactured in week 31 of year 2023
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  8
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2495
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0          0.354           0
write:         0        0         0         0          0         14.879           0

Non-medium error count:        0


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -     276                 - [-   -    -]
# 2  Background short  Completed                   -     213                 - [-   -    -]

Long (extended) Self-test duration: 65535 seconds [1092.2 minutes]


Got an email as well:

TrueNAS @ truenas

New alert:
  • Pool RAIDZ2 state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
    The following devices are not healthy:
    • Disk OOS20000G 0007QPYR0000W319HVE7 is FAULTED
    • Disk OOS20000G 0007BLPZ0000W319JLD6 is DEGRADED
    • Disk OOS20000G 0008A0LF0000W320NMSU is FAULTED
    • Disk OOS20000G 0007W6WJ0000C32431HJ is DEGRADED
 

sash

Dabbler
Joined
Jun 12, 2017
Messages
14
Code:
xxxx@truenas ~ % sudo dmesg | grep error
[14395.589225] blk_update_request: I/O error, dev sdh, sector 48981192 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[14395.590350] zio pool=RAIDZ2 vdev=/dev/disk/by-partuuid/a370fbd9-5922-4948-85e1-38144ccf95d4 error=5 type=2 offset=22930821120 size=4096 flags=180880
[98947.270098] blk_update_request: I/O error, dev sdm, sector 46586656 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[98947.271221] zio pool=RAIDZ2 vdev=/dev/disk/by-partuuid/a243bfcd-b2e1-4e7b-bc8c-400629c0e344 error=5 type=2 offset=21704818688 size=4096 flags=180880
[107649.877674] blk_update_request: I/O error, dev sdd, sector 7001896 op 0x0:(READ) flags 0x4700 phys_seg 128 prio class 0
[107649.879039] zio pool=RAIDZ2 vdev=/dev/disk/by-partuuid/4ac6b945-7dae-44c9-a2e0-3bd4ee4eb2ea error=5 type=1 offset=1437421568 size=1044480 flags=40080cb0
[107650.391658] blk_update_request: I/O error, dev sdj, sector 12603256 op 0x0:(READ) flags 0x4700 phys_seg 128 prio class 0
[107650.393145] zio pool=RAIDZ2 vdev=/dev/disk/by-partuuid/fe590e54-f4f5-4754-8d8e-a4f34621abd2 error=5 type=1 offset=4305317888 size=1044480 flags=40080cb0
[107650.660955] blk_update_request: I/O error, dev sdm, sector 18195152 op 0x0:(READ) flags 0x700 phys_seg 80 prio class 0
[107650.660973] zio pool=RAIDZ2 vdev=/dev/disk/by-partuuid/a243bfcd-b2e1-4e7b-bc8c-400629c0e344 error=5 type=1 offset=7168368640 size=413696 flags=40080cb0
[107650.877876] blk_update_request: I/O error, dev sdk, sector 23798136 op 0x0:(READ) flags 0x4700 phys_seg 128 prio class 0
[107650.880055] zio pool=RAIDZ2 vdev=/dev/disk/by-partuuid/553050d9-971f-442e-af10-ed851efb06c1 error=5 type=1 offset=10037096448 size=1048576 flags=40080cb0
[107651.078077] blk_update_request: I/O error, dev sdg, sector 23812888 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 0
[107651.079300] zio pool=RAIDZ2 vdev=/dev/disk/by-partuuid/22ef77fa-5b61-4e36-a89e-f92ba118228d error=5 type=1 offset=10044649472 size=4096 flags=1808b0
[107651.265975] blk_update_request: I/O error, dev sdf, sector 23828504 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 0
[107651.268354] zio pool=RAIDZ2 vdev=/dev/disk/by-partuuid/d4f531bc-a8c2-4d64-805e-fc4177921e94 error=5 type=1 offset=10052644864 size=12288 flags=1808b0
[107651.553306] blk_update_request: I/O error, dev sdj, sector 41105264 op 0x0:(READ) flags 0x700 phys_seg 24 prio class 0
[107651.556220] zio pool=RAIDZ2 vdev=/dev/disk/by-partuuid/fe590e54-f4f5-4754-8d8e-a4f34621abd2 error=5 type=1 offset=18898345984 size=163840 flags=40080cb0
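
To map the partuuid values in those zio lines back to device names and serial numbers, something along these lines works (a sketch; the SERIAL/PARTUUID columns depend on your util-linux version of lsblk):

Code:
sudo lsblk -o NAME,SERIAL,PARTUUID
# or resolve a single vdev directly, e.g. the one that turned out to be sdd:
readlink -f /dev/disk/by-partuuid/4ac6b945-7dae-44c9-a2e0-3bd4ee4eb2ea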


Code:
xxxx@truenas ~ % sudo dmesg | grep sdd
[    4.780752] sd 0:0:3:0: [sdd] 39063650304 512-byte logical blocks: (20.0 TB/18.2 TiB)
[    4.780785] sd 0:0:3:0: [sdd] 4096-byte physical blocks
[    4.781751] sd 0:0:3:0: [sdd] Write Protect is off
[    4.781764] sd 0:0:3:0: [sdd] Mode Sense: df 00 10 08
[    4.783610] sd 0:0:3:0: [sdd] Write cache: enabled, read cache: enabled, supports DPO and FUA
[    4.840398]  sdd: sdd1 sdd2
[    4.891917] sd 0:0:3:0: [sdd] Attached SCSI disk
[107649.872326] sd 0:0:3:0: [sdd] tag#202 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[107649.874408] sd 0:0:3:0: [sdd] tag#202 Sense Key : Aborted Command [current] [descriptor]
[107649.875466] sd 0:0:3:0: [sdd] tag#202 Add. Sense: Nak received
[107649.876522] sd 0:0:3:0: [sdd] tag#202 CDB: Read(16) 88 00 00 00 00 00 00 6a d7 28 00 00 07 a8 00 00
[107649.877674] blk_update_request: I/O error, dev sdd, sector 7001896 op 0x0:(READ) flags 0x4700 phys_seg 128 prio class 0


Code:
xxxx@truenas ~ % sudo dmesg | grep sdj
[    4.780731] sd 0:0:9:0: [sdj] 39063650304 512-byte logical blocks: (20.0 TB/18.2 TiB)
[    4.780777] sd 0:0:9:0: [sdj] 4096-byte physical blocks
[    4.781737] sd 0:0:9:0: [sdj] Write Protect is off
[    4.781752] sd 0:0:9:0: [sdj] Mode Sense: df 00 10 08
[    4.783702] sd 0:0:9:0: [sdj] Write cache: enabled, read cache: enabled, supports DPO and FUA
[    4.848479]  sdj: sdj1 sdj2
[    4.903063] sd 0:0:9:0: [sdj] Attached SCSI disk
[107650.386473] sd 0:0:9:0: [sdj] tag#218 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[107650.387797] sd 0:0:9:0: [sdj] tag#218 Sense Key : Aborted Command [current] [descriptor]
[107650.389080] sd 0:0:9:0: [sdj] tag#218 Add. Sense: Nak received
[107650.390392] sd 0:0:9:0: [sdj] tag#218 CDB: Read(16) 88 00 00 00 00 00 00 c0 4f 78 00 00 05 f0 00 00
[107650.391658] blk_update_request: I/O error, dev sdj, sector 12603256 op 0x0:(READ) flags 0x4700 phys_seg 128 prio class 0
[107651.542418] sd 0:0:9:0: [sdj] tag#196 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[107651.545409] sd 0:0:9:0: [sdj] tag#196 Sense Key : Aborted Command [current] [descriptor]
[107651.548015] sd 0:0:9:0: [sdj] tag#196 Add. Sense: Nak received
[107651.550641] sd 0:0:9:0: [sdj] tag#196 CDB: Read(16) 88 00 00 00 00 00 02 73 37 70 00 00 01 40 00 00
[107651.553306] blk_update_request: I/O error, dev sdj, sector 41105264 op 0x0:(READ) flags 0x700 phys_seg 24 prio class 0
 

sash

Dabbler
Joined
Jun 12, 2017
Messages
14
Code:
xxxx@truenas ~ % sudo zpool status
  pool: RAIDZ2
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
    repaired.
  scan: scrub repaired 3.18M in 00:00:03 with 0 errors on Fri Sep 22 21:29:21 2023
config:

    NAME                                      STATE     READ WRITE CKSUM
    RAIDZ2                                    DEGRADED     0     0     0
      raidz2-0                                DEGRADED     0     0     0
        4ac6b945-7dae-44c9-a2e0-3bd4ee4eb2ea  FAULTED    119     0     0  too many errors
        d4f531bc-a8c2-4d64-805e-fc4177921e94  ONLINE       1     0     0
        2ca8f0f8-680c-475d-9a6f-ea7c7afadcef  ONLINE       0     0     0
        4e5883bc-1672-47c1-9c94-9fbfec45d32a  ONLINE       0     0     0
        553050d9-971f-442e-af10-ed851efb06c1  DEGRADED   162     0     0  too many errors
        9ef53eff-6728-4518-ab32-d434303e1876  ONLINE       0     0     0
        b89b3d3f-74b0-415c-960f-31c1631381c2  ONLINE       0     0     0
        4867a2ba-119c-4955-af52-9e72e463d239  ONLINE       0     0     0
        fe590e54-f4f5-4754-8d8e-a4f34621abd2  FAULTED    159     0     0  too many errors
        a243bfcd-b2e1-4e7b-bc8c-400629c0e344  DEGRADED    68     1     0  too many errors
        a370fbd9-5922-4948-85e1-38144ccf95d4  ONLINE       0     1     0
        22ef77fa-5b61-4e36-a89e-f92ba118228d  ONLINE       1     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
    The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:02 with 0 errors on Wed Sep 20 06:45:04 2023
config:

    NAME         STATE     READ WRITE CKSUM
    boot-pool    ONLINE       0     0     0
      nvme0n1p3  ONLINE       0     0     0

errors: No known data errors
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Please use code tags in the future when posting output, so the text formatting (indentation) is retained.

I tried to find some specs for drive model OOS20000G but was not successful.

The drive data you listed raises only one real concern for me, which is how many load-unload cycles there are for so few hours on the drive. It's not your issue, but I just wanted to mention it.

Can you post the output of smartctl -x /dev/sdd to provide a little more information about the drive? That said, I don't think it's the drive, as it would have to be several drives, and unless they are all faulty at the same time, eh, probably not.
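
If it's easier, a quick loop along these lines should capture that for every drive at once (a sketch; adjust the /dev/sd? glob to your device names):

Code:
for d in /dev/sd?; do sudo smartctl -x "$d" > "smart_$(basename "$d").txt"; done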

Lastly, and most important: having been a member since 2017, I'd think you realize that you need to post all of your system specs. We have no idea of the CPU, RAM, motherboard, HBA (and firmware), whether you are using a drive cage/backplane, or the TrueNAS version beyond SCALE. Go into as much detail as possible; it's the little things that help. Also, do you sleep the drives?
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
The CODE tag:

[screenshot: CODE_tag.png]
 

sash

Dabbler
Joined
Jun 12, 2017
Messages
14
Please use code tags in the future when posting output, so the text formatting (indentation) is retained.

I tried to find some specs for drive model OOS20000G but was not successful.

The drive data you listed raises only one real concern for me, which is how many load-unload cycles there are for so few hours on the drive. It's not your issue, but I just wanted to mention it.

Can you post the output of smartctl -x /dev/sdd to provide a little more information about the drive? That said, I don't think it's the drive, as it would have to be several drives, and unless they are all faulty at the same time, eh, probably not.

Lastly, and most important: having been a member since 2017, I'd think you realize that you need to post all of your system specs. We have no idea of the CPU, RAM, motherboard, HBA (and firmware), whether you are using a drive cage/backplane, or the TrueNAS version beyond SCALE. Go into as much detail as possible; it's the little things that help. Also, do you sleep the drives?
You are right. I've hidden the output behind code tags and added my system specs. The drive is actually the ST20000NM002D model.
Here is the direct link: https://shop.digitalspaceport.com/p...12gb-s-256mb-cache-3-5-white-label-hard-drive
Funny thing is, after running the long test and scrub I saw the following:

/dev/sdd FAULTED
/dev/sdj FAULTED
/dev/sdk DEGRADED
/dev/sdm DEGRADED

I cleared the status with the sudo zpool clear command and ran another scrub. After that I got these statuses:

/dev/sdd DEGRADED
/dev/sde FAULTED
/dev/sdi FAULTED
/dev/sdl DEGRADED
/dev/sdm DEGRADED

I did yet another sudo zpool clear, and now only one drive is marked as faulted:

/dev/sdm FAULTED
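
(For reference, each clear/scrub cycle was essentially the following; RAIDZ2 is the pool name from zpool status:)

Code:
sudo zpool clear RAIDZ2
sudo zpool scrub RAIDZ2
sudo zpool status -v RAIDZ2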

Below is the command output.
Could the problem be backplane-related, since both the chassis and the backplane were bought used? I actually replaced the backplane that came with the chassis with this one: BPN-SAS3-826EL1.
Code:
sash@truenas ~ % sudo smartctl -x /dev/sdd
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.107+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:
Product:              OOS20000G
Revision:             OOS1
Compliance:           SPC-5
User Capacity:        20,000,588,955,648 bytes [20.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500da75356f
Serial number:        0007QPYR0000W319HVE7
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Sat Sep 23 09:36:59 2023 EDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     38 C
Drive Trip Temperature:        60 C

Manufactured in week 31 of year 2023
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  8
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2500
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0          0.643           0
write:         0        0         0         0          0         19.645           0

Non-medium error count:        0

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -     276                 - [-   -    -]
# 2  Background short  Completed                   -     213                 - [-   -    -]

Long (extended) Self-test duration: 65535 seconds [1092.2 minutes]

Background scan results log
  Status: no scans active
    Accumulated power on time, hours:minutes 290:35 [17435 minutes]
    Number of background scans performed: 0,  scan progress: 0.00%
    Number of background medium scans performed: 0

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 0
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: SMP phy control function
    reason: loss of dword synchronization
    negotiated logical link rate: phy enabled; 12 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000c500da75356d
    attached SAS address = 0x500304800911433f
    attached phy identifier = 1
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 2
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 2
     Phy reset problem count: 0
relative target port id = 2
  generation code = 0
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000c500da75356e
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0

sash@truenas ~ % sudo smartctl -x /dev/sdn
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.107+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:
Product:              OOS20000G
Revision:             OOS1
Compliance:           SPC-5
User Capacity:        20,000,588,955,648 bytes [20.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500d7347a67
Serial number:        000011NV0000C2521JPK
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Sat Sep 23 09:37:32 2023 EDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     34 C
Drive Trip Temperature:        60 C

Manufactured in week 31 of year 2023
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  9
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2522
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0          0.210           0
write:         0        0         0         0          0         13.242           0

Non-medium error count:        0

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -     276                 - [-   -    -]
# 2  Background short  Completed                   -     213                 - [-   -    -]

Long (extended) Self-test duration: 65535 seconds [1092.2 minutes]

Background scan results log
  Status: no scans active
    Accumulated power on time, hours:minutes 290:18 [17418 minutes]
    Number of background scans performed: 0,  scan progress: 0.00%
    Number of background medium scans performed: 0

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 0
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: hard reset
    reason: hard reset
    negotiated logical link rate: phy enabled; 12 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000c500d7347a65
    attached SAS address = 0x500304800911433f
    attached phy identifier = 11
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 2
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 2
     Phy reset problem count: 0
relative target port id = 2
  generation code = 0
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000c500d7347a66
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
The drive data appears to be okay. Yes, it could be the backplane; however, list all your hardware if you would like solid advice rather than guesses.
 

sash

Dabbler
Joined
Jun 12, 2017
Messages
14
The drive data appears to be okay. Yes, it could be the backplane; however, list all your hardware if you would like solid advice rather than guesses.
Here are my full specs:
ASRock SPC741D8UD-2T/X550 (Xeon Sapphire Rapids 6416H), 512 GB ECC RDIMM
HBA: LSI 9305-16i (flashed latest firmware 16.00.12.00)
Samsung SSD 970 PRO 512GB boot-pool
12*20TB ST20000NM002D-WL-FR White Label Drives in RAIDZ2
2*118GB Intel Optane SSD P1600X NVMe - not assigned
2*380GB Optane 905p NVMe - not assigned
2*3.84TB Samsung PM1643a SAS SSD - not assigned
Backplane: Supermicro SAS/SATA 12 Bay BPN-SAS3-826EL1 12Gbps
Chassis: Supermicro CSE-826BE16-R920LPB 2U

The motherboard has three 8-pin power connectors, but the PSU has only two cables, so I'm using only two of the three power inputs on the motherboard. All four PCIe slots are populated. Closest to the CPU is a Linkreal PCIe x16 to 4-port M.2 NVMe SSD adapter that hosts the four NVMe Optane drives; then comes a 100Gb ConnectX-4 card, then the HBA, and then the Nvidia RTX2000 GPU.
Sometimes after a reboot the two P1600X drives disappear from the system; another reboot brings them back. I also had a hard time getting all the memory recognized by the system: I ended up plugging in each DIMM separately and booting, then installed all 8 DIMMs and got the full capacity.
Each time I clear the pool errors and run a scrub, I get different drives marked as either degraded or faulted.
The HBA has 4 ports: three are connected to the backplane and one to the back of the chassis for the two Samsung PM1643a SAS SSDs. The backplane has 4 ports total: two are for the HBA (or an upper backplane) and two go to a lower backplane in a cascaded setup. Should I use only two cables between the HBA and the backplane?
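
In case it helps separate a cabling/backplane problem from the drives themselves, I can also pull the SAS PHY error counters from every drive after a scrub with something like this (a rough sketch; the grep patterns match the counter names in the smartctl -x output above):

Code:
for d in /dev/sd?; do
    echo "== $d"
    sudo smartctl -x "$d" | grep -E 'Invalid DWORD|Running disparity|Loss of DWORD|Phy reset problem'
done

Growing invalid-DWORD or disparity counts would generally point at the link (cables, backplane, expander) rather than the drives.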
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
The motherboard has three 8-pin power connectors, but the PSU has only two cables, so I'm using only two of the three power inputs on the motherboard.
I would recommend you purchase a 4-pin Molex to 8-pin adapter, or two 4-pin Molex to 4-pin adapters. I doubt this is your issue, but for stability you should provide the additional power required, unless the user manual states otherwise.

Nvidia RTX2000 GPU
Why do you have this power-hungry video card? Just curious. If this is just for the NAS screen and you really do not need a high-powered graphics card, I'd opt for a lighter-weight card. Again, likely not your issue.

I also had a hard time getting all the memory recognized by the system: I ended up plugging in each DIMM separately and booting, then installed all 8 DIMMs and got the full capacity.
This could be a result of the missing 8-pin power connector.

On the SAS/SATA backplane, do you have all three Molex power connectors attached?
The HBA has 4 ports: three are connected to the backplane and one to the back of the chassis for the two Samsung PM1643a SAS SSDs. The backplane has 4 ports total.
I'm not very familiar with these backplanes, but I can download and read the manual. You should ONLY have ports 1 and 2 connected to the HBA; nothing should be connected to ports 3 and 4 unless you are driving another backplane with them. I do think this is your problem, or at least it's not a good situation.

Should I use only two cables between the HBA and the backplane?
Yes. Go to the Supermicro website and grab a copy of the manual.

EDIT: Thanks for providing all that information, I think it really helped.
 

sash

Dabbler
Joined
Jun 12, 2017
Messages
14
joeschmuck
Thank you for taking the time. The GPU is there for future Plex transcoding. The backplane does have all three Molex power connectors connected.
I will unplug one SAS cable as you suggest, so only two SAS cables connect the HBA and the backplane. The third SAS cable from the HBA connects to an MCP-220-82609-0N rear 2 x 2.5" HDD drive kit with the two Samsung SAS SSDs in it. So the HBA drives two backplanes.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
I would recommend you purchase a 4-pin Molex to 8-pin adapter, or two 4-pin Molex to 4-pin adapters. I doubt this is your issue, but for stability you should provide the additional power required, unless the user manual states otherwise.
Actually, fully populated but under-powered PCIe slots could explain why some of the power-hungry Optane drives occasionally disappear.
The motherboard has no ATX power connector (no real estate for it!), so all three 8-pin connectors are likely required.

What are these drives intended for, by the way?
 

sash

Dabbler
Joined
Jun 12, 2017
Messages
14
Actually, fully populated but under-powered PCIe slots could explain why some of the power-hungry Optane drives occasionally disappear.
The motherboard has no ATX power connector (no real estate for it!), so all three 8-pin connectors are likely required.

What are these drives intended for, by the way?
The idea was to have storage large enough for years to come: host all my personal data, run some VMs that I currently have on an ESXi server, and host an ever-growing Plex library. The drives keep changing status after every scrub; I am beginning to think it is the backplane... The chassis is the only used part I got; all the rest of the components are brand new. With the current PSU there is no way to get the third 8-pin motherboard power connector operational. I would have to change the PSU/chassis for that, which I would like to avoid. Too much money already spent)
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The third SAS cable from the HBA connects to an MCP-220-82609-0N rear 2 x 2.5" HDD drive kit with the two Samsung SAS SSDs in it. So the HBA drives two backplanes.
Wait, SAS SSDs? Why on Earth? They only make sense if you're stuck with a system that won't easily take NVMe SSDs, which is not the case. There's an NVMe rear cage kit as well as a front backplane with some U.2 bays, plus lots of space for alternative solutions.
 

sash

Dabbler
Joined
Jun 12, 2017
Messages
14
Wait, SAS SSDs? Why on Earth? They only make sense if you're stuck with a system that won't easily take NVMe SSDs, which is not the case. There's an NVMe rear cage kit as well as a front backplane with some U.2 bays, plus lots of space for alternative solutions.
I'm not aware of 20TB NVMe drives for the price of the SAS drives that I got. For the rear SAS cage, I could not find an NVMe alternative. Please point me in the right direction if you know of one.

UPD: A Google search gives me this part. I am confused as to how an "NVMe" kit can say it supports SATA/SAS drives.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I assume the SAS/SATA part is a typo and that these are simple U.2 PCIe-only bays, but I do not have hands-on experience.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
With the current PSU there is no way to get the third 8-pin motherboard power connector operational.
Then I'm afraid there's a fatal mismatch between the Supermicro PSU and the AsRockRack motherboard, with its non-standard power intake.
There's no point trying to run TrueNAS if the hardware is not stable.

I assume the SAS/SATA part is a typo and that these are simple U.2 PCIe-only bays, but I do not have hands-on experience.
Post #10 links to Supermicro's website, which states "SAS3/SATA3". I suspect that Wiredzone uses "NVMe" to mean "SSD".
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Post #10 links to Supermicro's website, which states "SAS3/SATA3". I suspect that Wiredzone uses "NVMe" to mean "SSD".
Supermicro does/did have an NVMe version, but I'm not finding it on their website at the moment. Doesn't mean much; it's always a bit of a pain to find things there...
 

sash

Dabbler
Joined
Jun 12, 2017
Messages
14
Then I'm afraid there's a fatal mismatch between the Supermicro PSU and the AsRockRack motherboard, with its non-standard power intake.
There's no point trying to run TrueNAS if the hardware is not stable.


Post #10 links to Supermicro's website, which states "SAS3/SATA3". I suspect that Wiredzone uses "NVMe" to mean "SSD".
Could a splitter like this work?

As for NVMe, the part numbers for the SATA/SAS and NVMe drive kits are different, so there is a good chance that it is indeed an NVMe enclosure. I cannot find it on the Supermicro store, though. Weird...
MCP-220-82616-0N vs MCP-220-82619-0N

Another option is to replace the motherboard with this one: SPC741D8-2L2T/BCM. It has only two 8-pin power connectors, and my PSU will work for sure. But that is another 700+ USD, which I would like to avoid as much as I can)
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Could a splitter like this work?
If and only if the PSU can safely supply enough power through a single connector.

As for NVMe, the part numbers for the SATA/SAS and NVMe drive kits are different, so there is a good chance that it is indeed an NVMe enclosure. I cannot find it on the Supermicro store, though. Weird...
MCP-220-82616-0N vs MCP-220-82619-0N
Oh! We had '609, '619, and now '616…
To clarify, you may want to ask Supermicro support directly about these part numbers.

Another option is to replace the motherboard with this one: SPC741D8-2L2T/BCM. It has only two 8-pin power connectors, and my PSU will work for sure. But that is another 700+ USD, which I would like to avoid as much as I can)
This looks like a safer option, but I understand the pain. (TrueNAS may not like the Broadcom NIC… which you won't use anyway.)
An alternative would be to swap the PSU for whatever AsRockRack uses in rackmount servers with the Deep Micro-ATX board… but the ASR PSU likely does NOT fit well (if at all!) in the SM chassis.
 

sash

Dabbler
Joined
Jun 12, 2017
Messages
14
If and only if the PSU can safely supply enough power through a single connector.


Oh! We had '609, '619, and now '616…
To clarify, you may want to ask Supermicro support directly about these part numbers.


This looks like a safer option, but I understand the pain. (TrueNAS may not like the Broadcom NIC… which you won't use anyway.)
An alternative would be to swap the PSU for whatever AsRockRack uses in rackmount servers with the Deep Micro-ATX board… but the ASR PSU likely does NOT fit well (if at all!) in the SM chassis.
What about using the built-in MCIO connectors on the motherboard? Can I try to connect them to the backplane with this cable, for example?
Will this work with my SAS HDDs?

UPD: There is also an MCP-220-82620-0N in existence)
 