scrub causes errormsgs for multiple disks

deepblue · Mar 17, 2015

Hi all,

I am new to FreeNAS but I have some knowledge on ZFS. I have a SuperMicro 847E16-R1K28LPB Chassis with X9SRL-F Board (INTEL Xeon E5-2620 V, 64GB ECC Registrered) and a LSI 8-Port 9207-8i HBA (IT Mode, Firmware Version 19). The LSI in connected to a 36 Port SAS Expander (LSI SAS2X36). I am running FreeNAS-9.3-STABLE-201503071634.

I have 11x2TB SATA Disks (2x5 HDD RaidZ plus one hotspare) making up one ZFS pool and another 8 SAS drives for a second ZFS pool. During scrubs I see these messages in the logs:

Code:

freenasbox  (da4:mps0:0:12:0): READ(10). CDB: 28 00 05 6d 10 d8 00 01 00 00 length 131072 SMID 613 command timeout cm 0xffffff8000f73168 ccb 0xfffffe00161b0800
freenasbox  (noperiph:mps0:0:4294967295:0): SMID 3 Aborting command 0xffffff8000f73168
freenasbox  (da4:mps0:0:12:0): READ(10). CDB: 28 00 05 6d 11 d8 00 00 98 00 length 77824 SMID 270 command timeout cm 0xffffff8000f579f0 ccb 0xfffffe0dc435c800
freenasbox  (da4:mps0:0:12:0): READ(10). CDB: 28 00 05 6d 11 d8 00 00 98 00 length 77824 SMID 270 terminated ioc 804b scsi 0 state c xfer 0
freenasbox (da4:mps0:0:12:0): READ(10). CDB: 28 00 05 6d 10 d8 00 01 00 00 
freenasbox (da4:mps0:0:12:0): CAM status: Command timeout
freenasbox (da4:mps0:0:12:0): Retrying command
freenasbox (da4:mps0:0:12:0): READ(10). CDB: 28 00 05 6d 11 d8 00 00 98 00 
freenasbox (da4:mps0:0:12:0): CAM status: SCSI Status Error
freenasbox (da4:mps0:0:12:0): SCSI status: Check Condition
freenasbox (da4:mps0:0:12:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
freenasbox (da4:mps0:0:12:0): Retrying command (per sense data)
freenasbox  (da10:mps0:0:18:0): READ(10). CDB: 28 00 05 6d 49 60 00 00 40 00 length 32768 SMID 347 command timeout cm 0xffffff8000f5dc98 ccb 0xfffffe09291eb800
freenasbox  (noperiph:mps0:0:4294967295:0): SMID 4 Aborting command 0xffffff8000f5dc98
freenasbox  (da10:mps0:0:18:0): READ(10). CDB: 28 00 05 6d 49 20 00 00 40 00 length 32768 SMID 659 command timeout cm 0xffffff8000f76c58 ccb 0xfffffe001617f000
freenasbox  (da10:mps0:0:18:0): READ(10). CDB: 28 00 05 6d 4f 80 00 00 08 00 length 4096 SMID 609 command timeout cm 0xffffff8000f72c48 ccb 0xfffffe00161a7800
freenasbox  (da10:mps0:0:18:0): READ(10). CDB: 28 00 05 6d 4f 80 00 00 08 00 length 4096 SMID 609 terminated ioc 804b scsi 0 state c xfer 0
freenasbox  (da10:mps0:0:18:0): READ(10). CDB: 28 00 05 6d 49 20 00 00 40 00 length 32768 SMID 659 terminated ioc 804b scsi 0 state c xfer 0
freenasbox (da10:mps0:0:18:0): READ(10). CDB: 28 00 05 6d 49 60 00 00 40 00 
freenasbox (da10:mps0:0:18:0): CAM status: Command timeout
freenasbox (da10:mps0:0:18:0): Retrying command
freenasbox (da10:mps0:0:18:0): READ(10). CDB: 28 00 05 6d 4f 80 00 00 08 00 
freenasbox (da10:mps0:0:18:0): CAM status: SCSI Status Error
freenasbox (da10:mps0:0:18:0): SCSI status: Check Condition
freenasbox (da10:mps0:0:18:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
freenasbox (da10:mps0:0:18:0): Retrying command (per sense data)
freenasbox  (da1:mps0:0:9:0): READ(10). CDB: 28 00 bb 5a 46 b8 00 00 88 00 length 69632 SMID 820 command timeout cm 0xffffff8000f83aa0 ccb 0xfffffe001617f000
freenasbox  (noperiph:mps0:0:4294967295:0): SMID 5 Aborting command 0xffffff8000f83aa0
freenasbox  (da1:mps0:0:9:0): READ(10). CDB: 28 00 bb 5a 44 38 00 00 c0 00 length 98304 SMID 797 command timeout cm 0xffffff8000f81d28 ccb 0xfffffe004de2e800
freenasbox  (da1:mps0:0:9:0): READ(10). CDB: 28 00 bb 5a 5a e0 00 00 08 00 length 4096 SMID 778 command timeout cm 0xffffff8000f804d0 ccb 0xfffffe0db10ac000
freenasbox  (da1:mps0:0:9:0): READ(10). CDB: 28 00 bb 5a 5b f0 00 00 08 00 length 4096 SMID 817 command timeout cm 0xffffff8000f836c8 ccb 0xfffffe0aaffa0800
freenasbox  (da1:mps0:0:9:0): READ(10). CDB: 28 00 bb 5a 64 40 00 00 08 00 length 4096 SMID 136 command timeout cm 0xffffff8000f4ce40 ccb 0xfffffe02d5cdf000
freenasbox  (da1:mps0:0:9:0): READ(10). CDB: 28 00 bb 5a 64 40 00 00 08 00 length 4096 SMID 136 terminated ioc 804b scsi 0 state c xfer 0
freenasbox  (da1:mps0:0:9:0): READ(10). CDB: 28 00 bb 5a 5b f0 00 00 08 00 length 4096 SMID 817 terminated ioc 804b scsi 0 state c xfer 0
freenasbox  (da1:mps0:0:9:0): READ(10). CDB: 28 00 bb 5a 5a e0 00 00 08 00 length 4096 SMID 778 terminated ioc 804b scsi 0 state c xfer 0
freenasbox  (da1:mps0:0:9:0): READ(10). CDB: 28 00 bb 5a 44 38 00 00 c0 00 length 98304 SMID 797 terminated ioc 804b scsi 0 state c xfer 0
freenasbox (da1:mps0:0:9:0): READ(10). CDB: 28 00 bb 5a 46 b8 00 00 88 00 
freenasbox (da1:mps0:0:9:0): CAM status: Command timeout
freenasbox (da1:mps0:0:9:0): Retrying command
freenasbox (da1:mps0:0:9:0): READ(10). CDB: 28 00 bb 5a 64 40 00 00 08 00 
freenasbox (da1:mps0:0:9:0): CAM status: SCSI Status Error
freenasbox (da1:mps0:0:9:0): SCSI status: Check Condition
freenasbox (da1:mps0:0:9:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
freenasbox (da1:mps0:0:9:0): Retrying command (per sense data)

As you can see this does not boil down to one specific drive, but as far as I can see it, only the SATA disks are affected. This happens only if the ZFS pool with the SATA disks is scrubbed. The scrub finishes with no errors.
Disks in the pool are 10xWD + 1xSeagate (the Hot-Spare is a WD drive, the Seagate is in use).

My question is: What is most likely causing these errors? Could this be a general problem of one or more SATA disks running on a SAS expander? I heard rumors that this is not the best setup one could think of.
What could I do to debug this?

Thanx a lot for any hints...

marbus90 · Mar 17, 2015

Flash your HBA to the P16 version, it's supported better on FreeNAS up to 9.3.

deepblue · Mar 17, 2015

Thanx for the input. Although I doubt that downgrading the Firmware to P16 is so much of an improvement I will give it a try.

Ericloewe · Mar 17, 2015

deepblue said:
Thanx for the input. Although I doubt that downgrading the Firmware to P16 is so much of an improvement I will give it a try.

It is the only supported option.

deepblue · Mar 18, 2015

Hi,

Ok, ok. I downgraded the Firmware to P16 yesterday (which was not to easy as the only option for downgrading that worked for me was through UEFI Shell and the Server is some miles away). I also upgraded to the latest FreeNAS Release (FreeNAS-9.3-STABLE-201503170439).

This morning I started a scrub on the pool. After ~1hour I see this message in the Logs:

Code:

freenasbox kernel:  (da10:mps0:0:18:0): READ(10). CDB: 28 00 bc 7f fc 88 00 00 c0 00 length 98304 SMID 964 command timeout cm 0xffffff8000f8f320 ccb 0xfffffe01590ac800
freenasbox kernel:  (noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xffffff8000f8f320
freenasbox kernel:  (da10:mps0:0:18:0): READ(10). CDB: 28 00 bc 7f fd c8 00 01 00 00 length 131072 SMID 913 command timeout cm 0xffffff8000f8b1c8 ccb 0xfffffe0018391000
freenasbox kernel:  (da10:mps0:0:18:0): READ(10). CDB: 28 00 bc 7f fd c8 00 01 00 00 length 131072 SMID 913 terminated ioc 804b scsi 0 state c xfer 0
freenasbox kernel: (da10:mps0:0:18:0): READ(10). CDB: 28 00 bc 7f fc 88 00 00 c0 00
freenasbox kernel: (da10:mps0:0:18:0): CAM status: Command timeout
freenasbox kernel: (da10:mps0:0:18:0): Retrying command
freenasbox kernel: (da10:mps0:0:18:0): READ(10). CDB: 28 00 bc 7f fd c8 00 01 00 00
freenasbox kernel: (da10:mps0:0:18:0): CAM status: SCSI Status Error
freenasbox kernel: (da10:mps0:0:18:0): SCSI status: Check Condition
freenasbox kernel: (da10:mps0:0:18:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
freenasbox kernel: (da10:mps0:0:18:0): Retrying command (per sense data)

I will wait for the scrub to finish (this will take another 8 hours), but my impression is, that there are fewer logmessages and up to now there is only one disk being complained about.... Lets see...

marbus90 · Mar 18, 2015

After that scrub is done, please pastebin the smartctl -x outputs of all disks which had such errors.

deepblue · Mar 18, 2015

The scrub is finished and I see a lot of errors. Besides the controller errors in /var/log/messages, the zpool scrub reported one CRC error on one disk.

Here are the controller messages in /var/log/messages:
http://pastebin.com/BWngQtuV

Here is the output of smartctl -x for every affected disk:
http://pastebin.com/BvpB12bJ

For 7 out of 10 SATA disks in the scrubbed pool there are controller messages in the log. And there are also messages for 3 SAS disks (da14, da17, da18) that are in another pool which was not scrubbed....

Any ideas?

Ericloewe · Mar 18, 2015

At this point, I'd bet on the expander or the controller. I'd suggest you try a few drives directly attached to see what to happens. Or replace either the controller or expander and see what happens, of course.

deepblue · Mar 18, 2015

Great Input, thanks. I assume there is no way to directly attach disks to the controller as the expander is mounted on the backplane, but I will investigate on that.... And I will talk to the hardware vendor and RMA the Controller and cabeling as a first try. If this does not help, I will change expander/backplane...

Thanks

deepblue · Apr 20, 2015

Hi all,

I took me some weeks, but after replacing the LSI Controller the problems persisted. I decided, that the probleme was caused by one or more old SATA-2 disks (3 Gb/s) connected to the backplane/expander :)

So I replaced all the SATA disks by 6 new 4TB SAS drives (RAID-Z + 1 hotspare). Now I have ~7.5 TB data on the pool and started a new scrub. The scrub is running now for ~4 hours at 425MBytes/sec (very impressive) and there are no errors related to disk read/write errors/problems. The only thing in the logs is this:

Code:

ses0: da4,pass4: Element descriptor: 'Slot 05'
ses0: da4,pass4: SAS Device Slot Element: 1 Phys at Slot 4
ses0:  phy 0: SAS device type 1 id 0
ses0:  phy 0: protocols: Initiator( None ) Target( SSP )
ses0:  phy 0: parent 50030480013518bf addr 50000c0f01faa972
ses0: da0,pass0: Element descriptor: 'Slot 06'
ses0: da0,pass0: SAS Device Slot Element: 1 Phys at Slot 5
ses0:  phy 0: SAS device type 1 id 0
ses0:  phy 0: protocols: Initiator( None ) Target( SSP )
ses0:  phy 0: parent 50030480013518bf addr 50000c0f01fa8172
ses0: da5,pass5: Element descriptor: 'Slot 11'
ses0: da5,pass5: SAS Device Slot Element: 1 Phys at Slot 10
ses0:  phy 0: SAS device type 1 id 0
ses0:  phy 0: protocols: Initiator( None ) Target( SSP )
ses0:  phy 0: parent 50030480013518bf addr 50000c0f01fac052
ses0: da1,pass1: Element descriptor: 'Slot 12'
ses0: da1,pass1: SAS Device Slot Element: 1 Phys at Slot 11
ses0:  phy 0: SAS device type 1 id 0
ses0:  phy 0: protocols: Initiator( None ) Target( SSP )
ses0:  phy 0: parent 50030480013518bf addr 50000c0f01fac616
ses0: da2,pass2: Element descriptor: 'Slot 18'
ses0: da2,pass2: SAS Device Slot Element: 1 Phys at Slot 17
ses0:  phy 0: SAS device type 1 id 0
ses0:  phy 0: protocols: Initiator( None ) Target( SSP )
ses0:  phy 0: parent 50030480013518bf addr 50000c0f01f1e09a
ses0: da3,pass3: Element descriptor: 'Slot 24'
ses0: da3,pass3: SAS Device Slot Element: 1 Phys at Slot 23
ses0:  phy 0: SAS device type 1 id 0
ses0:  phy 0: protocols: Initiator( None ) Target( SSP )
ses0:  phy 0: parent 50030480013518bf addr 50000c0f01faacc6

Is think this is not a error message, but I am not really sure... da0 - da5 are the 6 new drives I installed. I configured smart tests for these drives in the freenas UI about at the time when these messages appeared in the logs...

If the errors are really gone, this is another example of "you should not use crappy SATA drives on SAS expanders..."

Regards

marbus90 · Apr 20, 2015

Eugh... why did you not use raidz2?

deepblue · Apr 20, 2015

I wanted a spare disk as there are 2 more pools in this machine without spares at the moment, so 5+1 seemed the better choice for me than a raidz2. This pool holds backupdata only...

Important Announcement for the TrueNAS Community.

scrub causes errormsgs for multiple disks

deepblue

Cadet

marbus90

Guru

deepblue

Cadet

Ericloewe

Server Wrangler

deepblue

Cadet

marbus90

Guru

deepblue

Cadet

Ericloewe

Server Wrangler

deepblue

Cadet

deepblue

Cadet

marbus90

Guru

deepblue

Cadet

Similar threads