SOLVED Can't tell if one or two drives failed, or worse...

mrichardson03 · Aug 8, 2016

I've got 6x WD Red 3TB drives in a RAID-Z2 running off the onboard LSI 2308 controller (20.00.04.00 IT firmware from SuperMicro) on an X10SL7-F, and it's running FreeNAS-9.10-STABLE-201606270534 (before the LSI driver nonsense). I've just had one drive fail, I know. However, I apparently had swap space in use on that drive while it was failing, and in order to get the Web UI back to do the drive replacement, I had to reboot the box. During that reboot, the drive that FreeNAS was telling me had failed changed. Now I'm not sure if I've got two failed drives, or a disk controller on the motherboard that's going bad.

Before Reboot

Code:

Aug  8 12:54:45 freenas mps0: mpssas_prepare_remove: Sending reset for target ID 1
Aug  8 12:54:45 freenas da1 at mps0 bus 0 scbus0 target 1 lun 0
Aug  8 12:54:45 freenas da1: <ATA WDC WD30EFRX-68A 0A80> s/n      WD-WMC1T1145005 detached
Aug  8 12:54:46 freenas devd: Executing '[ -e /tmp/.sync_disk_done ] && LD_LIBRARY_PATH=/usr/local/lib /usr/local/bin/python /usr/local/www/freenasUI/tools/sync_disks.py && LD_LIBRARY_PATH=/usr/local/lib /usr/local/bin/python /usr/local/www/freenasUI/tools/smart_alert.py -d da1'
Aug  8 12:54:47 freenas         (da1:mps0:0:1:0): WRITE(10). CDB: 2a 00 26 c4 33 f0 00 00 40 00 length 32768 SMID 629 terminated ioc 804b scsi 0 state c xfer 0
Aug  8 12:54:47 freenas         (da1:mps0:0:1:0): WRITE(10). CDB: 2a 00 26 c7 1e b8 00 00 18 00 length 12288 SMID 792 terminated ioc 804b scsi 0 state c xfer(da1:mps0:0:1:0): WRITE(10). CDB: 2a 00 26 c4 33 f0 00 00 40 00 
Aug  8 12:54:47 freenas 0
Aug  8 12:54:47 freenas (da1:mps0:0:1:0): CAM status: CCB request completed with an error
Aug  8 12:54:47 freenas         (da1:mps0:0:1:0): WRITE(10). CDB: 2a 00 26 c7 23 58 00 00 18 00 length 12288 SMID 382 terminated ioc 804b scsi 0 state c xfer(da1: 0
Aug  8 12:54:47 freenas mps0:0:1:0): mps0: Error 5, Periph was invalidated
Aug  8 12:54:47 freenas Unfreezing devq for target ID 1
Aug  8 12:54:47 freenas (da1:mps0:0:1:0): WRITE(10). CDB: 2a 00 26 c7 1e b8 00 00 18 00 
Aug  8 12:54:47 freenas (da1:mps0:0:1:0): CAM status: CCB request completed with an error
Aug  8 12:54:47 freenas (da1:mps0:0:1:0): Error 5, Periph was invalidated
Aug  8 12:54:47 freenas (da1:mps0:0:1:0): WRITE(10). CDB: 2a 00 26 c7 23 58 00 00 18 00 
Aug  8 12:54:47 freenas (da1:mps0:0:1:0): CAM status: CCB request completed with an error
Aug  8 12:54:47 freenas (da1:mps0:0:1:0): Error 5, Periph was invalidated
Aug  8 12:54:47 freenas GEOM_ELI: Device da1p1.eli destroyed.
Aug  8 12:54:47 freenas GEOM_ELI: Detached da1p1.eli on last close.
Aug  8 12:54:47 freenas (da1:mps0:0:1:0): Periph destroyed
Aug  8 12:54:47 freenas devd: Executing 'logger -p kern.notice -t ZFS 'vdev is removed, pool_guid=5301505203498458122 vdev_guid=13611148881476578051''
Aug  8 12:54:47 freenas ZFS: vdev is removed, pool_guid=5301505203498458122 vdev_guid=13611148881476578051
Aug  8 12:54:48 freenas swap_pager: I/O error - pagein failed; blkno 532148,size 12288, error 6
Aug  8 12:54:48 freenas vm_fault: pager read error, pid 15364 (zfsd)
Aug  8 12:54:48 freenas swap_pager: I/O error - pagein failed; blkno 532148,size 12288, error 6
Aug  8 12:54:48 freenas vm_fault: pager read error, pid 15364 (zfsd)

The vm_fault messages go on for quite some time. This tells me that da1 is toast.

After Reboot

Code:

Aug  8 13:33:43 freenas da0 at mps0 bus 0 scbus0 target 0 lun 0
Aug  8 13:33:43 freenas da1 at mps0 bus 0 scbus0 target 1 lun 0
Aug  8 13:33:43 freenas da1: <ATA WDC WD30EFRX-68E 0A80> Fixed Direct Access SPC-4 SCSI device
Aug  8 13:33:43 freenas da1: Serial Number      WD-WCC4N0048718
Aug  8 13:33:43 freenas da1: 600.000MB/s transfers
Aug  8 13:33:43 freenas da1: Command Queueing enabled
Aug  8 13:33:43 freenas da1: 2861588MB (5860533168 512 byte sectors)
Aug  8 13:33:43 freenas da1: quirks=0x8<4K>
Aug  8 13:33:43 freenas da3 at mps0 bus 0 scbus0 target 3 lun 0
Aug  8 13:33:43 freenas da0: <ATA WDC WD30EFRX-68E 0A82> Fixed Direct Access SPC-4 SCSI device
Aug  8 13:33:43 freenas da0: Serial Number      WD-WMC4N0F4VZUV
Aug  8 13:33:43 freenas da0: 600.000MB/s transfers
Aug  8 13:33:43 freenas da0: Command Queueing enabled
Aug  8 13:33:43 freenas da0: 2861588MB (5860533168 512 byte sectors)
Aug  8 13:33:43 freenas da0: quirks=0x8<4K>
Aug  8 13:33:43 freenas da2 at mps0 bus 0 scbus0 target 2 lun 0
Aug  8 13:33:43 freenas da2: <ATA WDC WD30EFRX-68E 0A80> Fixed Direct Access SPC-4 SCSI device
Aug  8 13:33:43 freenas da2: Serial Number      WD-WCC4NDA2NCC9
Aug  8 13:33:43 freenas da2: 600.000MB/s transfers
Aug  8 13:33:43 freenas da2: Command Queueing enabled
Aug  8 13:33:43 freenas da2: 2861588MB (5860533168 512 byte sectors)
Aug  8 13:33:43 freenas da2: quirks=0x8<4K>
Aug  8 13:33:43 freenas da4 at mps0 bus 0 scbus0 target 4 lun 0
Aug  8 13:33:43 freenas da3: <ATA WDC WD30EFRX-68A 0A80> Fixed Direct Access SPC-4 SCSI device
Aug  8 13:33:43 freenas da4: <ATA WDC WD30EFRX-68E 0A80> Fixed Direct Access SPC-4 SCSI device
Aug  8 13:33:43 freenas da4: Serial Number      WD-WCC4N0299267
Aug  8 13:33:43 freenas da4: 600.000MB/s transfers
Aug  8 13:33:43 freenas da4: Command Queueing enabled
Aug  8 13:33:43 freenas da4: 2861588MB (5860533168 512 byte sectors)
Aug  8 13:33:43 freenas da4: quirks=0x8<4K>
Aug  8 13:33:43 freenas da3: Serial Number      WD-WMC1T0860631
Aug  8 13:33:43 freenas da3: 600.000MB/s transfers
Aug  8 13:33:43 freenas da3: Command Queueing enabled
Aug  8 13:33:43 freenas da3: 2861588MB (5860533168 512 byte sectors)
Aug  8 13:33:43 freenas da3: quirks=0x8<4K>

After the reboot, da1 is back, but now da5 is gone. I've got a replacement disk for da5 getting resilvered in an eSATA dock, and the serial of the disk that was failing before the reboot isn't in the logs anymore. I'm not sure if the controller would rename the drives in this scenario. However, a short SMART test on da1 hasn't come back after 20+ minutes...
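
For anyone retracing this from the logs: a minimal sketch, assuming the boot messages keep the "daN: Serial Number ..." shape shown above, of pulling the device-to-serial mapping out of a saved log. The function name `serials_by_device` is made up for illustration, and the sample lines are copied from the post-reboot log.

```python
import re

# Matches lines such as:
#   "Aug  8 13:33:43 freenas da1: Serial Number      WD-WCC4N0048718"
# Assumption: device names look like daN or adaN.
SERIAL_RE = re.compile(r"\b(a?da\d+): Serial Number\s+(\S+)")


def serials_by_device(log_text):
    """Return {device name: serial} for every 'Serial Number' line found."""
    found = {}
    for line in log_text.splitlines():
        match = SERIAL_RE.search(line)
        if match:
            device, serial = match.groups()
            found[device] = serial
    return found


# Lines copied from the post-reboot log in this thread.
sample = """\
Aug  8 13:33:43 freenas da1: Serial Number      WD-WCC4N0048718
Aug  8 13:33:43 freenas da0: Serial Number      WD-WMC4N0F4VZUV
Aug  8 13:33:43 freenas da2: Serial Number      WD-WCC4NDA2NCC9
Aug  8 13:33:43 freenas da4: Serial Number      WD-WCC4N0299267
Aug  8 13:33:43 freenas da3: Serial Number      WD-WMC1T0860631
"""

for device, serial in sorted(serials_by_device(sample).items()):
    print(device, serial)
```

Running the same function over the log from an earlier boot gives a second mapping to compare against, which is less error prone than reading the interleaved lines by eye.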

m0nkey_ · Aug 8, 2016

Can you run the following and post the result using the [ code ] tags?

Code:

camcontrol devlist

mrichardson03 · Aug 8, 2016

Code:

<ATA WDC WD30EFRX-68E 0A82>        at scbus0 target 0 lun 0 (pass0,da0)
<ATA WDC WD30EFRX-68E 0A80>        at scbus0 target 1 lun 0 (pass1,da1)
<ATA WDC WD30EFRX-68E 0A80>        at scbus0 target 2 lun 0 (pass2,da2)
<ATA WDC WD30EFRX-68A 0A80>        at scbus0 target 3 lun 0 (pass3,da3)
<ATA WDC WD30EFRX-68E 0A80>        at scbus0 target 4 lun 0 (pass4,da4)
<SATA SSD S9FM02.1>                at scbus1 target 0 lun 0 (ada0,pass5)
<WDC WD30EFRX-68EUZN0 80.00A80>    at scbus2 target 0 lun 0 (ada1,pass6)

mrichardson03 · Aug 8, 2016

So it looks like the controller renamed the disks during the last boot of the system, and not in any order that makes sense. Is this normal? I could understand them shifting down to fill the gap left by da1, but that's not what happened.

Boot 8/4 @ 12:11:26 (pre-failure)

da0: WMC4N0F4VZUV
da1: WMC1T1145005
da2: WCC4NDA2NCC9
da3: WCC4N0048718
da4: WMC1T0860631
da5: WCC4N0299267

Boot 8/4 @ 12:23:28 (pre-failure)

All drives the same.

Boot 8/8 @ 13:33:43

da0: WMC4N0F4VZUV (same)
da1: WCC4N0048718 (old da3)
da2: WCC4NDA2NCC9 (same)
da3: WMC1T0860631 (old da4)
da4: WCC4N0299267 (old da5)
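
The comparison above can be sketched in a few lines: ignore the daX names after the reboot entirely and only ask which pre-failure serial no longer shows up. This is a sketch under that assumption; the "before" mapping is the 8/4 list, the "after" set is the serials from the 8/8 boot log, and `missing_drives` is a hypothetical helper name.

```python
# Device-to-serial mapping from the 8/4 boots (pre-failure).
before = {
    "da0": "WMC4N0F4VZUV",
    "da1": "WMC1T1145005",
    "da2": "WCC4NDA2NCC9",
    "da3": "WCC4N0048718",
    "da4": "WMC1T0860631",
    "da5": "WCC4N0299267",
}

# Serials seen in the 8/8 boot log. Device names are deliberately dropped,
# since they can change between boots.
after_serials = {
    "WMC4N0F4VZUV",
    "WCC4N0048718",
    "WCC4NDA2NCC9",
    "WMC1T0860631",
    "WCC4N0299267",
}


def missing_drives(before, after_serials):
    """Return the pre-failure entries whose serial is no longer present."""
    return {dev: sn for dev, sn in before.items() if sn not in after_serials}


print(missing_drives(before, after_serials))
# prints {'da1': 'WMC1T1145005'}
```

Only one serial is missing, and it is the drive that was logged as detached before the reboot, so the "missing da5" is a renumbering rather than a second failure.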

Stux · Aug 8, 2016

The drive names are, for all intents and purposes, random.

CraigD · Aug 8, 2016

Run "zpool status"; it will tell you.

Yours will show a degraded pool, and the status of every pool and drive will be listed.

I just ran the command and here are my results (I am currently resilvering myself).

Have Fun

Code:

[root@freenas ~]# zpool status
  pool: RaidA
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Aug  9 10:02:27 2016
        631G scanned out of 10.6T at 252M/s, 11h34m to go
        77.0G resilvered, 5.80% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        RaidA                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/72ca6241-27df-11e6-aef1-0cc47aab6f2a  ONLINE       0     0     0
            gptid/736fcc94-27df-11e6-aef1-0cc47aab6f2a  ONLINE       0     0     0
            gptid/74b9712e-27df-11e6-aef1-0cc47aab6f2a  ONLINE       0     0     0
            gptid/762ab4e9-27df-11e6-aef1-0cc47aab6f2a  ONLINE       0     0     0
            gptid/c16a9405-5db3-11e6-95c1-0cc47aab6f2a  ONLINE       0     0     0  (resilvering)
            gptid/786b161e-27df-11e6-aef1-0cc47aab6f2a  ONLINE       0     0     0
            gptid/79002fa7-27df-11e6-aef1-0cc47aab6f2a  ONLINE       0     0     0
            gptid/7830392b-5bbb-11e6-94a0-0cc47aab6f2a  ONLINE       0     0     0

errors: No known data errors
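
If you'd rather not scan that table by eye, here is a rough sketch of picking out anything that is not ONLINE from saved "zpool status" text. It assumes the config section has the column layout shown above. The function names are made up, and the degraded sample is invented for illustration (the pool name "tank" and the short gptid values are not from a real system).

```python
def device_states(zpool_status_text):
    """Return [(name, state)] for each row in the config section(s)."""
    rows = []
    in_config = False
    for line in zpool_status_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines, including the one after "config:"
        if fields[0] == "config:":
            in_config = True
            continue
        if fields[0] == "errors:":
            in_config = False
            continue
        # Skip the column header; every other row starts with name and state.
        if in_config and fields[0] != "NAME" and len(fields) >= 2:
            rows.append((fields[0], fields[1]))
    return rows


def unhealthy(rows):
    """Keep only the rows whose state is something other than ONLINE."""
    return [(name, state) for name, state in rows if state != "ONLINE"]


# Invented example of a degraded pool, for illustration only.
sample = """\
  pool: tank
 state: DEGRADED
config:

        NAME            STATE     READ WRITE CKSUM
        tank            DEGRADED     0     0     0
          raidz2-0      DEGRADED     0     0     0
            gptid/aaaa  ONLINE       0     0     0
            gptid/bbbb  REMOVED      0     0     0

errors: No known data errors
"""

for name, state in unhealthy(device_states(sample)):
    print(name, state)
```

The pool and vdev rows show up as well as the member disk, which is intentional here: it tells you which vdev the problem disk belongs to.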

SweetAndLow · Aug 8, 2016

the daX drive labels are meaningless for identifying a drive. They are randomly assigned during boot. No need to mess with the cli to replace your drive. Just go into the gui and use volume status and view disks tabs to find your disk and replace it. You shouldn't be using the cli to replace a disk or find out what disk needs replacement.

mrichardson03 · Aug 9, 2016

So what's the best way to know what drive you're replacing? Manually record all the serial numbers ahead of time? Is the description field in Storage -> View Disks for a disk tied to that specific disk (i.e. serial number) or drive label? I'd like to use the description field to specify "Slot 1" to "Slot 6" to correspond to their physical location in the case, since apparently you can't go off of what the disk controller thinks is plugged in to any specific port at any time.

Z300M · Aug 9, 2016

SweetAndLow said:
the daX drive labels are meaningless for identifying a drive. They are randomly assigned during boot. No need to mess with the cli to replace your drive. Just go into the gui and use volume status and view disks tabs to find your disk and replace it. You shouldn't be using the cli to replace a disk or find out what disk needs replacement.

Are you sure the daX drive labels are "assigned randomly during boot"? My drives always seem to be associated with the same daX label.

mrichardson03 · Aug 9, 2016

Z300M said:
Are you sure the daX drive labels are "assigned randomly during boot"? My drives always seem to be associated with the same daX label.

That's my question as well. Mine were consistent until I rebooted with a failed drive, then they went all screwy.

SweetAndLow · Aug 9, 2016

100% positive they are assigned randomly.

philhu · Aug 9, 2016

Randomly, the same, if the hardware is the same. Mostly. I have 24 disks and MOSTLY, they stay the same.

The question though is then how do you know which disk went bad, if the slots do not correspond to names?
With 24 disks, in 2 vdevs, I can know which vdev failed, but how do you know which physical disk?

Stux · Aug 9, 2016

The disks have their serials on them on a barcode, and on the label. The serial matches the serial in FreeNAS. Bit rubbish.

In the past I've started a multi-drive read and pulled the drive that didn't have an activity light lighting.

Or alternatively, started a read from the device I wanted to pull, and pulled the one that was lighting up solidly.

Helps if you have activity lights ;)

SweetAndLow · Aug 9, 2016

You use the serial numbers in the gui and match that with what is written on the disk. It's very simple and I'm confused with what the problem is. You go to Volume Status and see what disk has failed then you go to View Disks and get the serial number for that disk. Finally go into your server and find that disk so you can remove it.

Z300M · Aug 9, 2016

Stux said:
The disks have their serials on them on a barcode, and on the label. The serial matches the serial in FreeNAS. Bit rubbish.

In the past I've started a multi-drive read and pulled the drive that didn't have an activity light lighting.

Or alternatively, started a read from the device I wanted to pull, and pulled the one that was lighting up solidly.

Helps if you have activity lights ;)

It also helps if you have the drives labeled in places where the serial numbers can be read without pulling the drive. I have labels with the serial numbers on the front of my drive trays. In some mounting arrangements, there is an exposed part of the edge of the drive; write the serial number there -- silver or gold "sharpie" marker works.

mrichardson03 · Aug 9, 2016

SweetAndLow said:
You use the serial numbers in the gui and match that with what is written on the disk. It's very simple and I'm confused with what the problem is. You go to Volume Status and see what disk has failed then you go to View Disks and get the serial number for that disk. Finally go into your server and find that disk so you can remove it.

It's not a question of how someone who knows what they're looking for can replace a drive. I'm looking for the best way to explain to someone who doesn't necessarily know what they're doing (be it a less technical significant other or hands in a remote data center) how to replace a failed drive that is a little less error prone than "pull drive with serial X". They might not see the digits on the bar code, or the enclosure might prevent them from reading it, so they might pull the wrong drive by mistake. If I could say "Pull drive with serial X, I know for a fact it's in slot 2" without physically being there to read the serial off the disk, that's less error prone.

Additionally, I'm catching some grief here because I didn't just go into the GUI and get the S/N of the failed disk. After I rebooted the box, it was gone from the GUI entirely. So it wasn't a question of go grab this disk out of the server, it was trying to figure out which serials I had before the failure, compare them to the ones I had after, and determine which actually failed. If I didn't go down to the CLI and manually go through logs, I wouldn't have had a serial number of a failed drive at all.
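
One way to get there, sketched under assumptions: record serial-to-slot once while the box is open, keep the record somewhere off the pool, and look up the failed serial later. The slot assignments below are invented for illustration (the thread never says which drive sits in which bay), and `slot_for` is a hypothetical helper.

```python
import json

# Hypothetical inventory recorded while the case is open.
# The serials are from this thread; the slot labels are made up.
inventory = {
    "WMC4N0F4VZUV": "Slot 1",
    "WMC1T1145005": "Slot 2",
    "WCC4NDA2NCC9": "Slot 3",
    "WCC4N0048718": "Slot 4",
    "WMC1T0860631": "Slot 5",
    "WCC4N0299267": "Slot 6",
}


def slot_for(serial, inventory):
    """Look up a slot by serial, accepting the 'WD-' prefix seen in the logs."""
    if serial.startswith("WD-"):
        serial = serial[3:]
    return inventory.get(serial, "unknown")


# Serial of the drive that was logged as detached.
print(slot_for("WD-WMC1T1145005", inventory))

# Serialise the inventory so a copy can be kept outside the NAS.
print(json.dumps(inventory, indent=2, sort_keys=True))
```

Because the lookup is keyed on the serial rather than the daX name, it keeps working after a renumbering, and it still has the failed drive's serial when the GUI no longer lists it.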

Z300M · Aug 9, 2016

SweetAndLow said:
100% positive they are assigned randomly.

Doesn't the daX designation go with a particular SATA port and therefore with a particular slot in the drive cage(s) as long as the connections aren't changed around?

danb35 · Aug 9, 2016

Z300M said:
Doesn't the daX designation go with a particular SATA port

No. They aren't assigned randomly (as @SweetAndLow states), but they aren't entirely predictable either. The easy counterexample to your suggestion is if there is no drive attached to a given port--adaN isn't held for that port if there's no drive there.

The drive numbers are assigned in the order in which the OS sees the drives, which is ordinarily in the order in which the OS sees the SATA/SAS ports, which is ordinarily in the order in which they're labeled on the motherboard/backplane. But note the "ordinarily" in there a couple of times. If a drive is slow to appear, it might get a later drive letter. If a drive is hot-plugged in, it will definitely get a higher drive number than it otherwise would. And no doubt there are plenty of other reasons for discrepancies.
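
As a toy model of that description (not the actual FreeBSD enumeration code, just first-seen-first-numbered), the sketch below shows how a missing or late drive shifts the numbers. The drive labels A to D and the function name `assign_names` are hypothetical.

```python
def assign_names(probe_order, prefix="da"):
    """Number drives in the order they are seen, leaving no gaps."""
    return {f"{prefix}{i}": drive for i, drive in enumerate(probe_order)}


# Hypothetical drives, one per port, listed in port order.
ports = ["A", "B", "C", "D"]

# Normal boot: names follow port order.
print(assign_names(ports))

# Drive B is absent: nothing is reserved for its port, so C and D move down.
print(assign_names([d for d in ports if d != "B"]))

# Drive B is slow to appear: it ends up with the highest number.
print(assign_names(["A", "C", "D", "B"]))
```

In all three cases the hardware and cabling are the same, yet the name attached to drive C differs, which is why the serial is the safer identifier.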

Z300M · Aug 9, 2016

danb35 said:
No. They aren't assigned randomly (as @SweetAndLow states), but they aren't entirely predictable either. The easy counterexample to your suggestion is if there is no drive attached to a given port--adaN isn't held for that port if there's no drive there.

The drive numbers are assigned in the order in which the OS sees the drives, which is ordinarily in the order in which the OS sees the SATA/SAS ports, which is ordinarily in the order in which they're labeled on the motherboard/backplane. But note the "ordinarily" in there a couple of times. If a drive is slow to appear, it might get a later drive letter. If a drive is hot-plugged in, it will definitely get a higher drive number than it otherwise would. And no doubt there are plenty of other reasons for discrepancies.

OK. That makes sense.

Jailer · Aug 9, 2016

mrichardson03 said:
I'm looking for the best way to explain to someone who doesn't necessarily know what they're doing (be it a less technical significant other or hands in a remote data center) how to replace a failed drive that is a little less error prone than "pull drive with serial X". They might not see the digits on the bar code, or the enclosure might prevent them from reading it, so they might pull the wrong drive by mistake.

Proper planning solves this problem. ;)
