Error creating pool on Dell R740 - CAM status: CCB request completed with an error

club79

Dabbler
Joined
Jan 13, 2019
Messages
14
Hi,

I would really appreciate some help with this.

I've just installed TrueNAS 12.0-U4, basic server specs are below. I'm getting random IO errors when trying to create a new pool. Each time I try to create the pool, a different disk will fail and I get the following error.

Code:
[EFAULT] Failed to wipe disk da5: [EFAULT] Command dd if=/dev/zero of=/dev/da5p2 bs=1M count=32 failed (code 1): dd: /dev/da5p2: Input/output error 19+0 records in 18+0 records out 18874368 bytes transferred in 0.335980 secs (56177141 bytes/sec)


In the console logs I have the following messages:

Code:
Jun  7 10:39:04 truenas (da1:mpr0:0:9:0): WRITE(10). CDB: 2a 00 00 40 00 80 00 08 00 00
Jun  7 10:39:04 truenas (da1:mpr0:0:9:0): CAM status: CCB request completed with an error
Jun  7 10:39:04 truenas (da1:mpr0:0:9:0): Error 5, Retries exhausted
Jun  7 10:39:04 truenas mpr0: Controller reported scsi ioc terminated tgt 9 SMID 439 loginfo 3112010c
Jun  7 10:39:05 truenas mpr0: Controller reported scsi ioc terminated tgt 12 SMID 510 loginfo 31120302


I've tried running the command dd if=/dev/zero of=/dev/da5p2 bs=1M count=32 in the shell; sometimes a disk will pass, and the next time the same disk will fail. The same thing happens if I use the Wipe feature in the GUI.
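
A rough loop like the one below can repeat that test across all the disks in one go (a sketch only: the da0-da24 numbering matches this box, and it overwrites the start of each data partition, so only run it against disks being wiped anyway):

Code:
# Write 32 MiB of zeros to each disk's data partition and report pass/fail.
# Adjust the sequence to match the devices shown by camcontrol devlist.
for n in $(seq 0 24); do
    if dd if=/dev/zero of=/dev/da${n}p2 bs=1M count=32 > /dev/null 2>&1; then
        echo "da${n}: OK"
    else
        echo "da${n}: FAILED"
    fi
done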

Server Specs:
Dell R740xd2
2 x Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
320GB ECC memory, 2.4GHz
HBA330 running firmware 16.17.01
Storage drives are WD 18TB SAS, dell part number TH0R20GGHGT0012R0AWSA00, rev US05

We have another identical R740 running TrueNAS 12.0 with no issues; the only differences seem to be that it has 12TB disks and its HBA330 is on firmware 16.17.00.05.

Dell support hasn't been much help here yet. If anyone has some suggestions, it would be greatly appreciated.
 

club79

Dabbler
Joined
Jan 13, 2019
Messages
14
I've also noticed something odd, maybe someone can comment on this.

When trying to create a new pool, the GUI shows the failing command I previously posted. If I run that command in the shell, it fails with an input/output error and the console log shows a CAM status error.

Command dd if=/dev/zero of=/dev/da5p2 bs=1M count=32

If I change the device from /dev/da5p2 to just da5 (dd if=/dev/zero of=da5 bs=1M count=32), it works every time, with no errors in the console.
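
One thing worth double-checking here (just a guess on my part, and assuming the shell isn't sitting in /dev): without the /dev/ prefix, dd creates an ordinary file named da5 in the current directory instead of writing to the disk, which would make that form succeed no matter what the hardware is doing. A quick way to check after running it:

Code:
# If this shows a ~32 MiB regular file, the second command never touched the disk at all
ls -lh da5
# To test the whole device instead of the p2 partition, use the full path (this
# overwrites the start of the disk, including the partition table):
dd if=/dev/zero of=/dev/da5 bs=1M count=32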
 

phebert

Dabbler
Joined
Jun 4, 2021
Messages
15
Have you checked the SMART stats on da5? Post results here:

Code:
smartctl -a /dev/da5
 

club79

Dabbler
Joined
Jan 13, 2019
Messages
14
Here are the results

Code:
root@truenas[~]# smartctl -a /dev/da5
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               WDC
Product:              WUH721818AL5200
Revision:             US05
Compliance:           SPC-5
User Capacity:        18,000,207,937,536 bytes [18.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca2af095fc4
Serial number:        3GG54UBE
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Mon Jun  7 13:11:31 2021 PDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     30 C
Drive Trip Temperature:        50 C

Accumulated power on time, hours:minutes 112:24
Manufactured in week 08 of year 2021
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  20
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  23
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          3          0.072  0
write:         0        0         0         0          5          0.322  0
verify:        0        0         0         0         10          0.004  0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Reserved(7)       Completed                  80      18                 - [-   -    -]
# 2  Background short  Completed                  96      16                 - [-   -    -]
 

phebert

Dabbler
Joined
Jun 4, 2021
Messages
15
The drive looks OK to me. It's only a year old, so it's likely still under warranty. This issue would confuse me too.
Maybe someone with more drive/Linux knowledge can assist.
 

club79

Dabbler
Joined
Jan 13, 2019
Messages
14
My da5 drive was just one example; it's the same deal for all of the drives. This server is brand new; it went into the rack yesterday.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Hi,

I would really appreciate some help with this.

I've just installed TrueNAS 12.0-U4, basic server specs are below. I'm getting random IO errors when trying to create a new pool. Each time I try to create the pool, a different disk will fail and I get the following error.

Code:
[EFAULT] Failed to wipe disk da5: [EFAULT] Command dd if=/dev/zero of=/dev/da5p2 bs=1M count=32 failed (code 1): dd: /dev/da5p2: Input/output error 19+0 records in 18+0 records out 18874368 bytes transferred in 0.335980 secs (56177141 bytes/sec)


In the console logs I have the following messages:

Code:
Jun  7 10:39:04 truenas (da1:mpr0:0:9:0): WRITE(10). CDB: 2a 00 00 40 00 80 00 08 00 00
Jun  7 10:39:04 truenas (da1:mpr0:0:9:0): CAM status: CCB request completed with an error
Jun  7 10:39:04 truenas (da1:mpr0:0:9:0): Error 5, Retries exhausted
Jun  7 10:39:04 truenas mpr0: Controller reported scsi ioc terminated tgt 9 SMID 439 loginfo 3112010c
Jun  7 10:39:05 truenas mpr0: Controller reported scsi ioc terminated tgt 12 SMID 510 loginfo 31120302


I've tried running the command dd if=/dev/zero of=/dev/da5p2 bs=1M count=32 in the shell; sometimes a disk will pass, and the next time the same disk will fail. The same thing happens if I use the Wipe feature in the GUI.

Server Specs:
Dell R740xd2
2 x Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
320GB ECC memory, 2.4GHz
HBA330 running firmware 16.17.01
Storage drives are WD 18TB SAS, dell part number TH0R20GGHGT0012R0AWSA00, rev US05

We have another identical R740 running TrueNAS 12.0 with no issues; the only differences seem to be that it has 12TB disks and its HBA330 is on firmware 16.17.00.05.

Dell support hasn't been much help here yet. If anyone has some suggestions, it would be greatly appreciated.
You may be experiencing this bug (which is not just limited to Seagate drives):


FWIW, there's a lengthy thread on servethehome.com with details about the HBA330:


Version 16.17.01.00 is the latest firmware I could find on Dell's support site.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I don't know that the "HBA330" is a true HBA.
It is, LSI SAS3008 with semi-custom Dell firmware to support the iDrac stuff. Of course, standard disclaimers apply since Dell has fiddled with it, but it's a step up from most such things.
 

club79

Dabbler
Joined
Jan 13, 2019
Messages
14
It is, LSI SAS3008 with semi-custom Dell firmware to support the iDrac stuff. Of course, standard disclaimers apply since Dell has fiddled with it, but it's a step up from most such things.

This server does have the SAS3008, and this is flashed with the non raid firmware.

I'm pretty sure these 18TB WD drives use SMR; could that be the cause of all these CAM status errors?

The SSD I have attached to the HBA is fine, with no issues at all creating a pool. It seems to be only the 18TB drives that won't cooperate.

I downgraded the firmware on the HBA to the previous version, which is working successfully for us on another identical server with 12TB drives. The firmware version made no difference, so I flashed it back to the latest version.
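
For reference, the controller model and the firmware level the mpr(4) driver actually sees can be checked from the TrueNAS shell; mprutil ships in the FreeBSD base system (exact output fields vary between releases):

Code:
# Report the HBA's firmware and driver versions as seen by the mpr(4) driver
mprutil show adapter

# The boot log also records the firmware version and any IOC error messages
dmesg | grep -i mpr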

I've also updated the system BIOS and the backplane firmware; that made no difference either.

I decided to try TrueNAS 12.0 (the original release), and with it I'm able to create a pool, but the console logs are full of CAM status errors and retries while the pool is being created. I updated the system back to 12.0-U4, and pool creation immediately fails again.


Code:
root@truenas[~]# camcontrol devlist
<AHCI SGPIO Enclosure 2.00 0001>   at scbus6 target 0 lun 0 (pass0,ses0)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus15 target 0 lun 0 (pass1,ses1)
<WDC WUH721818AL5200 US05>         at scbus16 target 8 lun 0 (pass2,da0)
<WDC WUH721818AL5200 US05>         at scbus16 target 9 lun 0 (pass3,da1)
<WDC WUH721818AL5200 US05>         at scbus16 target 10 lun 0 (pass4,da2)
<WDC WUH721818AL5200 US05>         at scbus16 target 11 lun 0 (pass5,da3)
<WDC WUH721818AL5200 US05>         at scbus16 target 12 lun 0 (pass6,da4)
<WDC WUH721818AL5200 US05>         at scbus16 target 13 lun 0 (pass7,da5)
<WDC WUH721818AL5200 US05>         at scbus16 target 14 lun 0 (pass8,da6)
<WDC WUH721818AL5200 US05>         at scbus16 target 15 lun 0 (pass9,da7)
<WDC WUH721818AL5200 US05>         at scbus16 target 16 lun 0 (pass10,da8)
<WDC WUH721818AL5200 US05>         at scbus16 target 17 lun 0 (pass11,da9)
<WDC WUH721818AL5200 US05>         at scbus16 target 18 lun 0 (pass12,da10)
<WDC WUH721818AL5200 US05>         at scbus16 target 19 lun 0 (pass13,da11)
<WDC WUH721818AL5200 US05>         at scbus16 target 20 lun 0 (pass14,da12)
<WDC WUH721818AL5200 US05>         at scbus16 target 21 lun 0 (pass15,da13)
<WDC WUH721818AL5200 US05>         at scbus16 target 22 lun 0 (pass16,da14)
<WDC WUH721818AL5200 US05>         at scbus16 target 23 lun 0 (pass17,da15)
<WDC WUH721818AL5200 US05>         at scbus16 target 24 lun 0 (pass18,da16)
<WDC WUH721818AL5200 US05>         at scbus16 target 25 lun 0 (pass19,da17)
<WDC WUH721818AL5200 US05>         at scbus16 target 26 lun 0 (pass20,da18)
<WDC WUH721818AL5200 US05>         at scbus16 target 27 lun 0 (pass21,da19)
<WDC WUH721818AL5200 US05>         at scbus16 target 28 lun 0 (pass22,da20)
<WDC WUH721818AL5200 US05>         at scbus16 target 29 lun 0 (pass23,da21)
<WDC WUH721818AL5200 US05>         at scbus16 target 30 lun 0 (pass24,da22)
<WDC WUH721818AL5200 US05>         at scbus16 target 31 lun 0 (pass25,da23)
<WDC WUH721818AL5200 US05>         at scbus16 target 32 lun 0 (pass26,da24)
<ATA SSDSC2KB019T8R DL69>          at scbus16 target 33 lun 0 (pass27,da25)
<DP BP14G+EXP 2.52>                at scbus16 target 34 lun 0 (pass28,ses2)
<DELLBOSS VD MV.R00-0>             at scbus17 target 0 lun 0 (pass29,ada0)
<Marvell Console 1.01>             at scbus19 target 0 lun 0 (pass30)
root@truenas[~]# zpool list -v
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
boot-pool   428G  1.41G   427G        -         -     0%     0%  1.00x    ONLINE  -
  ada0p2    428G  1.41G   427G        -         -     0%  0.32%      -    ONLINE
root@truenas[~]# glabel status
                                      Name  Status  Components
gptid/bc372921-c897-11eb-9b99-2cea7f932a92     N/A  ada0p1
gptid/3d24d754-c542-11eb-a009-2cea7f932a92     N/A  da9p2
gptid/fd99edb7-c542-11eb-a009-2cea7f932a92     N/A  da10p2
gptid/3d09f81d-c542-11eb-a009-2cea7f932a92     N/A  da11p2
gptid/fef5c180-c542-11eb-a009-2cea7f932a92     N/A  da12p2
gptid/ffee57f8-c542-11eb-a009-2cea7f932a92     N/A  da13p2
gptid/0040a9f8-c543-11eb-a009-2cea7f932a92     N/A  da14p2
gptid/ff5df44d-c542-11eb-a009-2cea7f932a92     N/A  da15p2
gptid/0048f48e-c543-11eb-a009-2cea7f932a92     N/A  da16p2
gptid/00c481ea-c543-11eb-a009-2cea7f932a92     N/A  da17p2
gptid/00fccedc-c543-11eb-a009-2cea7f932a92     N/A  da18p2
gptid/014eb031-c543-11eb-a009-2cea7f932a92     N/A  da19p2
gptid/01c15361-c543-11eb-a009-2cea7f932a92     N/A  da20p2
gptid/01ec9e56-c543-11eb-a009-2cea7f932a92     N/A  da22p2
gptid/023a7cc5-c543-11eb-a009-2cea7f932a92     N/A  da23p2
gptid/4dbb6642-c542-11eb-a009-2cea7f932a92     N/A  da24p2
gptid/f0ce84a3-c7f1-11eb-9e3d-2cea7f932a92     N/A  da25p2
gptid/f0c749fd-c7f1-11eb-9e3d-2cea7f932a92     N/A  da25p1
gptid/4dafe058-c542-11eb-a009-2cea7f932a92     N/A  da24p1
gptid/022e5b68-c543-11eb-a009-2cea7f932a92     N/A  da23p1
gptid/01c9654b-c543-11eb-a009-2cea7f932a92     N/A  da22p1
gptid/01a0f9bd-c543-11eb-a009-2cea7f932a92     N/A  da20p1
gptid/011cf4b5-c543-11eb-a009-2cea7f932a92     N/A  da19p1
gptid/00c700b1-c543-11eb-a009-2cea7f932a92     N/A  da18p1
gptid/00a1f844-c543-11eb-a009-2cea7f932a92     N/A  da17p1
gptid/00327de3-c543-11eb-a009-2cea7f932a92     N/A  da16p1
gptid/ff3e6ba7-c542-11eb-a009-2cea7f932a92     N/A  da15p1
gptid/0015536b-c543-11eb-a009-2cea7f932a92     N/A  da14p1
gptid/ffae5963-c542-11eb-a009-2cea7f932a92     N/A  da13p1
gptid/fee7923a-c542-11eb-a009-2cea7f932a92     N/A  da12p1
gptid/3c977dce-c542-11eb-a009-2cea7f932a92     N/A  da11p1
gptid/fd6f2e98-c542-11eb-a009-2cea7f932a92     N/A  da10p1
gptid/3c7eb068-c542-11eb-a009-2cea7f932a92     N/A  da9p1
gptid/fdacdf2c-c542-11eb-a009-2cea7f932a92     N/A  da7p1
gptid/23ad478d-c894-11eb-9cc9-2cea7f932a92     N/A  da6p1
gptid/23719ab7-c894-11eb-9cc9-2cea7f932a92     N/A  da5p1
gptid/22ea20fc-c894-11eb-9cc9-2cea7f932a92     N/A  da4p1
gptid/232accd7-c894-11eb-9cc9-2cea7f932a92     N/A  da3p1
gptid/23fc18f7-c894-11eb-9cc9-2cea7f932a92     N/A  da2p1
gptid/23843c40-c894-11eb-9cc9-2cea7f932a92     N/A  da1p1
gptid/225a857a-c894-11eb-9cc9-2cea7f932a92     N/A  da0p1
gptid/2270bbd0-c894-11eb-9cc9-2cea7f932a92     N/A  da0p2
gptid/244d28b5-c894-11eb-9cc9-2cea7f932a92     N/A  da2p2
gptid/235baadb-c894-11eb-9cc9-2cea7f932a92     N/A  da3p2
gptid/23efbdb4-c894-11eb-9cc9-2cea7f932a92     N/A  da6p2
gptid/23032c4a-c894-11eb-9cc9-2cea7f932a9

 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
I'm pretty sure these 18TB WD drives use SMR; could that be the cause of all these CAM status errors?
According to Western Digital's spec sheet for the HC550 series, these are CMR drives:

There are several threads here on the forum where users are having similar problems w/ larger disks on LSI-2000 and 3000 series HBAs. I'm wondering if this isn't a FreeBSD bug in the related driver code.
 

club79

Dabbler
Joined
Jan 13, 2019
Messages
14
According to Western Digital's spec sheet for the HC550 series, these are CMR drives:

There are several threads here on the forum where users are having similar problems w/ larger disks on LSI-2000 and 3000 series HBAs. I'm wondering if this isn't a FreeBSD bug in the related driver code.

I literally just found this as well; I think the 18-20TB drives are the DC HC650 series. https://documents.westerndigital.co...c600-series/data-sheet-ultrastar-dc-hc650.pdf

But it's the same thing: it says they are CMR drives, so at least I can rule that out.

I imagine a fix for this could take a very long time, so I'm starting to think I'll need to find a different storage platform. I'm not happy about this; FreeNAS/TrueNAS has been awesome for us.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
I literally just found this as well; I think the 18-20TB drives are the DC HC650 series. https://documents.westerndigital.co...c600-series/data-sheet-ultrastar-dc-hc650.pdf

But it's the same thing: it says they are CMR drives, so at least I can rule that out.

I imagine a fix for this could take a very long time, so I'm starting to think I'll need to find a different storage platform. I'm not happy about this; FreeNAS/TrueNAS has been awesome for us.
Not that it matters, but your WUH721818AL5200 drives seem to be OEM HC550's:


And they have the latest firmware available from Dell (US05), near as I can tell:


Odd, but I have a pair of 14TB DC HC530 (WUH721414AL4204) drives attached to an LSI-3008 HBA and I don't see any of these errors. But I'm running an old version of FreeNAS: 11.2-U8
 

club79

Dabbler
Joined
Jan 13, 2019
Messages
14
Not that it matters, but your WUH721818AL5200 drives seem to be OEM HC550's:


And they have the latest firmware available from Dell (US05), near as I can tell:


Odd, but I have a pair of 14TB DC HC530 (WUH721414AL4204) drives attached to an LSI-3008 HBA and I don't see any of these errors. But I'm running an old version of FreeNAS: 11.2-U8

Very interesting; I'll try FreeNAS 11.2-U8 as soon as I get a chance and see if it makes any difference.

I doubt it will make any difference, but I'm going to have someone pull out the HBA, re-seat it, and check the cables, etc. So far that's the only suggestion from Dell support.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Very interesting; I'll try FreeNAS 11.2-U8 as soon as I get a chance and see if it makes any difference.

I doubt it will make any difference, but I'm going to have someone pull out the HBA, re-seat it, and check the cables, etc. So far that's the only suggestion from Dell support.
It can't hurt!
You might compare BIOS settings between this new machine and your other, working server, too.
 

club79

Dabbler
Joined
Jan 13, 2019
Messages
14
This issue was finally resolved. It turns out the SAS cables in the server were at fault. Dell replaced the cables, and it's working normally now.

To troubleshoot this, I ended up installing Ubuntu on the server. I tested each disk individually and found that only the 12 drives in the front bay were producing the error; the 12 drives in bay 2 and the drives in the server's rear flex bays were fine. At that point I figured the backplane on bay 1 was at fault, so I had Dell replace the backplane, which didn't resolve the problem. The cables were replaced next, and that solved the problem.
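
For anyone hitting something similar, one way to run that kind of per-disk check under Linux is a small direct write to each drive while watching the kernel log; a rough sketch (illustrative device names only, and it overwrites the start of each target disk):

Code:
# Write a little data directly to the start of each disk and note which ones error out,
# then check the kernel log for the matching SCSI/transport errors.
for d in /dev/sd?; do
    echo "=== $d ==="
    dd if=/dev/zero of="$d" bs=1M count=32 oflag=direct || echo "$d FAILED"
done
dmesg | grep -i 'I/O error' | tail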

Dell provided almost no support or troubleshooting assistance. We spent hours and hours on the phone with nothing happening, and watching their techs work remotely over the iDRAC was painful. They were polite and professional, but overall it was a very disappointing experience. This server was brand new, we have NBD on-site part replacement, and it still took two weeks to get this resolved.

Thanks to everyone who contributed here, and hopefully this will help someone with a similar issue.
 

Spearfoot

He of the long foot
Moderator
Joined
May 13, 2015
Messages
2,478
Glad you got it working!
 