Rubbing My Head - Pool Went Degraded

xCatalystx · Mar 15, 2020

So I had a pool degrade (raidz2) during my fortnightly scrub. However, I am scratching my head as to why. Error below.

Code:

> (da1:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da1:mps0:0:1:0): CAM status: SCSI Status Error
> (da1:mps0:0:1:0): SCSI status: Check Condition
> (da1:mps0:0:1:0): SCSI sense: HARDWARE FAILURE asc:44,0 (Internal target failure)
> (da1:mps0:0:1:0): Error 5, Unretryable error

This is a snippet; the same error was repeated a lot before the disk was dropped.

I have since rebooted, and the disk was picked back up and resilvered. I then ran smartctl short and long tests on the disk as well as a scrub (currently at 90%), and so far there are no errors (or rewritten data) or anything else out of the ordinary.

smartctl reported no CRC errors, no unreadable sectors, and no pending or reallocated sectors.

Do you think it is safe to ignore? The disk is just over 2 years old (WD Red 6TB). The important data is backed up, but the rest currently isn't (I was just about to order new drives before the C-VIRUS became an issue where I live); losing it would be more painful than devastating.
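For anyone wanting to see how often a message like this repeats and for which device, here is a minimal Python sketch that tallies the sense reports in a saved copy of the console log. The function name `summarize_cam_errors` and the regular expressions are my own assumptions based only on the line format pasted above; other drivers or FreeBSD versions may format these messages differently.

```python
import re
from collections import Counter

# Matches CAM messages shaped like "(da1:mps0:0:1:0): <message>".
# The pattern is inferred from the snippet in this thread, not from a spec.
CAM_LINE = re.compile(r"\((?P<dev>\w+):(?P<ctrl>\w+):\d+:\d+:\d+\):\s*(?P<msg>.+)")
SENSE = re.compile(
    r"SCSI sense: (?P<key>.+?) asc:(?P<asc>[0-9a-fA-F]+),(?P<ascq>[0-9a-fA-F]+)"
)


def summarize_cam_errors(log_text):
    """Count sense reports per (device, controller, sense key, asc/ascq)."""
    counts = Counter()
    for line in log_text.splitlines():
        cam = CAM_LINE.search(line)
        if not cam:
            continue  # not a CAM message
        sense = SENSE.search(cam.group("msg"))
        if sense:
            code = f"{sense.group('asc')}/{sense.group('ascq')}"
            counts[(cam.group("dev"), cam.group("ctrl"), sense.group("key"), code)] += 1
    return counts


# The five lines quoted in the post, used here as sample input.
sample = """\
(da1:mps0:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da1:mps0:0:1:0): CAM status: SCSI Status Error
(da1:mps0:0:1:0): SCSI status: Check Condition
(da1:mps0:0:1:0): SCSI sense: HARDWARE FAILURE asc:44,0 (Internal target failure)
(da1:mps0:0:1:0): Error 5, Unretryable error
"""

for (dev, ctrl, key, code), n in summarize_cam_errors(sample).items():
    print(f"{dev} on {ctrl}: {key} (asc/ascq {code}) x{n}")
```

If several devices on the same controller show up in the tally, that would point more towards the controller, cabling, or power than towards a single disk; a single device only narrows it to that disk or its own link.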

I have also checked smart on the other disks, they're all fine as well.

hervon · Mar 15, 2020

Have you searched? This has already been reported quite a few times.

Google Search

SOLVED - CAM status: SCSI Status Error ABORTED COMMAND NAK received

My newly assembled FreeNAS server gave up tonight and stopped providing storage to my VMware vSphere environment. The server remained powered on but didn't appear to have signs of life at the console and the HTTP UI page wasn't responding. I powered the box off and then back on. Upon reboot, I...

www.ixsystems.com

Might have more to do with the interface (cables, etc) than the drive itself.

xCatalystx · Mar 15, 2020

I did see that and was hoping it's not the interface. I checked the cables; they are all seated correctly and haven't moved. They are lightly cable-tied to prevent them from ending up in the CPU fan. There isn't any stress on the cables or the ports. I did take the chance (during shutdown/power-on) to clean out the system and reseat the connectors (just in case).

I guess my sanity check is to make sure the drive is OK for now, as I do not want to be swapping hardware out while I am about to go interstate for a week or two for work (leaving tomorrow afternoon). The plan was to purchase 4x 12TB drives in a few weeks for a local backup.

Just trying to work out my risk so it can be weighed.

sretalla · Mar 15, 2020

SMART output for the drive?

smartctl -a /dev/da1

xCatalystx · Mar 15, 2020

Here you go, seems ok to me. Cut/paste from iPad so just included the main bits.

Edit: just checked, zpool status shows the scrub completed with no errors and nothing repaired. Only the old message remains because I've yet to clear it.

Code:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   197   197   021    Pre-fail  Always       -       9141
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       36
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       22434
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       36
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       25
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       4636
194 Temperature_Celsius     0x0022   116   105   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     22422         -
# 2  Short offline       Completed without error       00%     22303         -
# 3  Short offline       Completed without error       00%     22183         -
# 4  Extended offline    Completed without error       00%     21981         -
# 5  Short offline       Completed without error       00%     21728         -
# 6  Short offline       Completed without error       00%     21608         -
# 7  Short offline       Completed without error       00%     21488         -
# 8  Short offline       Completed without error       00%     21344         -
# 9  Extended offline    Completed without error       00%     21238         -
#10  Short offline       Completed without error       00%     20984         -
#11  Short offline       Completed without error       00%     20864         -
#12  Short offline       Completed without error       00%     20745         -
#13  Short offline       Completed without error       00%     20601         -
#14  Extended offline    Completed without error       00%     20494         -
#15  Short offline       Completed without error       00%     20241         -
#16  Short offline       Completed without error       00%     20121         -
#17  Short offline       Completed without error       00%     20001         -
#18  Short offline       Completed without error       00%     19881         -
#19  Extended offline    Completed without error       00%     19775         -
#20  Short offline       Completed without error       00%     19522         -
#21  Short offline       Completed without error       00%     19402         -

sretalla · Mar 15, 2020

Looks healthy, I would continue to monitor and think about the cabling to the disk in addition to power supply reliability, but otherwise, the disk is good.
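As a rough way to "continue to monitor", the Python sketch below parses an attribute table in the format pasted above and reports any watched counter whose raw value is not zero. The shortlist of IDs (5, 197, 198, 199) is an assumption chosen to match the counters mentioned in this thread (reallocated, pending, uncorrectable, CRC), and the helper names are hypothetical. It expects the text of the table to be supplied, for example from a saved `smartctl -a /dev/da1` run.

```python
# Attribute IDs to watch: an assumed shortlist matching the counters
# discussed in this thread, not an official or complete list.
WATCHED_IDS = (5, 197, 198, 199)


def parse_smart_attributes(text):
    """Map attribute ID -> (name, raw value string) from a SMART attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # header and self-test log rows do not, so they are skipped.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], fields[9])
    return attrs


def nonzero_watched(attrs, watched=WATCHED_IDS):
    """Return {id: (name, raw)} for watched attributes whose raw value is not "0"."""
    flagged = {}
    for attr_id in watched:
        if attr_id in attrs:
            name, raw = attrs[attr_id]
            if raw != "0":
                flagged[attr_id] = (name, raw)
    return flagged


# A few rows copied from the table posted above.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

flagged = nonzero_watched(parse_smart_attributes(sample))
if flagged:
    for attr_id, (name, raw) in flagged.items():
        print(f"{attr_id} {name}: raw value {raw}")
else:
    print("No watched counters above zero")
```

Comparing the output between runs (say, after each scrub) is the useful part: a CRC count that starts climbing would fit the cabling theory, while reallocated or pending sectors would point back at the disk.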

Scharbag · Mar 15, 2020

Yeah, I had some SAS cables go bad a while back. Threw some really strange errors and took a bit to nail that down as the problem. Cables are cheap and easy to replace. :D

xCatalystx · Mar 15, 2020

Thanks, guys. I've got spares, so I'll try to replace the cables today before I leave.
