dewhite04 · Cadet · Joined: Jan 24, 2018 · Messages: 7
Hello all,
I'm new to TrueNAS (SCALE), coming from ten+ years of keeping my files on a Gentoo Linux mdadm RAID5 with ext4 over LUKS. Still feeling my way around, but I hit a little bump last night...
Before I unload the information I have so far, my basic system specs are in the signature below.
I received an email alert yesterday evening just after 8pm. For reference, I haven't physically touched the hardware or changed any configuration or settings in several days.
Code:
New alerts: • Pool data state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
I looked at the pool status:
Code:
root@TrueNAS[~]# zpool status -v
  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:05 with 0 errors on Thu May 12 05:45:06 2022
config:

        NAME         STATE     READ WRITE CKSUM
        boot-pool    ONLINE       0     0     0
          nvme0n1p3  ONLINE       0     0     0

errors: No known data errors

  pool: data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 06:36:11 with 0 errors on Thu May 12 00:06:57 2022
config:

        NAME                                      STATE     READ WRITE CKSUM
        data                                      ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            634627f6-3f8c-4030-aae9-df54d9a7e079  ONLINE       0     0     0
            53276fbb-7c18-41b4-a0cb-fec9ecc6d6ff  ONLINE       0     0     0
            44141a7c-2183-4b54-80d6-5f03a4fd73fc  ONLINE       0     0     0
            f5e73d43-423c-4c61-926a-ef831547a4e3  ONLINE       0     0     0
            00b03b46-fb24-4cd5-bb1d-75f1d411a781  ONLINE       0     0     0
            7639837d-962a-469f-b170-4ce4a146ee81  ONLINE       0     0     0
            b76dc0ce-9f34-491c-8118-b3fbd55080bf  ONLINE      11     0     0
            74ee8d93-5fd7-43ab-97dc-24ef7b092635  ONLINE       0     0     0

errors: No known data errors
I see that there are 11 read errors on the disk whose partition gpt-id ends in 80bf, so I looked at the list of disks and determined that it was /dev/sdd. Checking the output of smartctl -a /dev/sdd | less showed no Reallocated, Current_Pending, or Offline_Uncorrectable sectors, etc., so I ran a -t long test overnight and found the same results this morning.
Code:
root@TrueNAS[~]# smartctl -a /dev/sdd
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.93+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA MD04ACA500
Serial Number:    35DGK54JFS9A
LU WWN Device Id: 5 000039 62be8146c
Firmware Version: FP2A
User Capacity:    5,000,981,078,016 bytes [5.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed May 18 10:15:09 2022 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 578) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       505
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       19509
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   094   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   001   001   000    Old_age   Always       -       58776
 10 Spin_Retry_Count        0x0033   253   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       185
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       78
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       146
193 Load_Cycle_Count        0x0032   096   096   000    Old_age   Always       -       40038
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       51 (Min/Max 6/57)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   012   012   000    Old_age   Always       -       35348
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       209
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      58774        -
# 2  Short offline       Completed without error       00%      58762        -
# 3  Short offline       Completed without error       00%      58584        -
# 4  Extended offline    Completed without error       00%      58572        -
# 5  Extended offline    Aborted by host               70%      27109        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
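In case it helps anyone follow along, this is the kind of thing that confirms the gpt-id-to-device mapping (hypothetical commands, not a transcript from my session; the UUID is the one ending in 80bf from the zpool status output):

```shell
# Resolve the by-partuuid symlink that udev maintains for each partition
# back to the underlying block device node.
uuid="b76dc0ce-9f34-491c-8118-b3fbd55080bf"
readlink -f "/dev/disk/by-partuuid/${uuid}"

# Or list every disk with its partition UUIDs in one shot:
lsblk -o NAME,SIZE,PARTUUID
```

Either route gets you from the UUIDs that zpool status prints to the /dev/sdX name that smartctl wants.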
I went back to the system logs, looked for events around the time of the email alert, and found a pair of closely spaced resets, on this disk only, around that time.
Code:
May 14 00:00:17 TrueNAS syslog-ng[20741]: Configuration reload finished;
May 17 20:13:21 TrueNAS kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
May 17 20:13:21 TrueNAS kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
May 17 20:13:21 TrueNAS kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
May 17 20:13:21 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#2407 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
May 17 20:13:21 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#2407 CDB: Read(16) 88 00 00 00 00 00 5c 38 8c d0 00 00 00 28 00 00
May 17 20:13:21 TrueNAS kernel: zio pool=data vdev=/dev/disk/by-partuuid/b76dc0ce-9f34-491c-8118-b3fbd55080bf error=5 type=1 offset=790023938048 size=20480 flags=180980
May 17 20:13:21 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#2411 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
May 17 20:13:21 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#2411 CDB: Read(16) 88 00 00 00 00 00 5c 38 9d 38 00 00 00 88 00 00
May 17 20:13:21 TrueNAS kernel: zio pool=data vdev=/dev/disk/by-partuuid/b76dc0ce-9f34-491c-8118-b3fbd55080bf error=5 type=1 offset=790026088448 size=69632 flags=40080c80
May 17 20:13:21 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#2404 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
May 17 20:13:21 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#2404 CDB: Read(16) 88 00 00 00 00 00 5c 38 9c b0 00 00 00 58 00 00
May 17 20:13:21 TrueNAS kernel: zio pool=data vdev=/dev/disk/by-partuuid/b76dc0ce-9f34-491c-8118-b3fbd55080bf error=5 type=1 offset=790026018816 size=45056 flags=40080c80
May 17 20:13:21 TrueNAS kernel: sd 0:0:13:0: Power-on or device reset occurred
May 17 20:13:22 TrueNAS kernel: sd 0:0:13:0: Power-on or device reset occurred
May 17 20:23:58 TrueNAS kernel: sd 0:0:13:0: Power-on or device reset occurred
May 17 20:23:58 TrueNAS kernel: sd 0:0:13:0: Power-on or device reset occurred
May 17 20:24:02 TrueNAS kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
May 17 20:24:02 TrueNAS kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
May 17 20:24:02 TrueNAS kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
May 17 20:24:02 TrueNAS kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
May 17 20:24:02 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#3043 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
May 17 20:24:02 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#3043 CDB: Read(16) 88 00 00 00 00 02 46 30 cd a0 00 00 00 08 00 00
May 17 20:24:02 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#3008 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=2s
May 17 20:24:02 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#3008 CDB: Read(16) 88 00 00 00 00 00 5d f9 47 d8 00 00 00 28 00 00
May 17 20:24:02 TrueNAS kernel: zio pool=data vdev=/dev/disk/by-partuuid/b76dc0ce-9f34-491c-8118-b3fbd55080bf error=5 type=1 offset=805080838144 size=20480 flags=180980
May 17 20:24:02 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#3017 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=3s
May 17 20:24:02 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#3017 CDB: Read(16) 88 00 00 00 00 00 5d f9 56 18 00 00 00 28 00 00
May 17 20:24:02 TrueNAS kernel: zio pool=data vdev=/dev/disk/by-partuuid/b76dc0ce-9f34-491c-8118-b3fbd55080bf error=5 type=1 offset=805082705920 size=20480 flags=180980
May 17 20:24:02 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#3011 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=3s
May 17 20:24:02 TrueNAS kernel: sd 0:0:13:0: [sdd] tag#3011 CDB: Read(16) 88 00 00 00 00 00 5d f9 48 08 00 00 00 28 00 00
May 17 20:24:02 TrueNAS kernel: zio pool=data vdev=/dev/disk/by-partuuid/b76dc0ce-9f34-491c-8118-b3fbd55080bf error=5 type=1 offset=805080862720 size=20480 flags=180980
May 17 20:24:02 TrueNAS kernel: sd 0:0:13:0: Power-on or device reset occurred
May 17 20:24:02 TrueNAS kernel: sd 0:0:13:0: Power-on or device reset occurred
May 18 06:26:40 TrueNAS kernel: loop: module loaded
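For anyone who wants to pull the same slice out of their own logs, something like this should do it (hypothetical command, assuming the log file path TrueNAS SCALE uses on my box):

```shell
# Filter the kernel log down to just the controller messages, command
# failures, and reset events for the suspect disk.
grep -E 'reset occurred|mpt2sas|\[sdd\]' /var/log/messages | less
```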
The first and last lines in the log are hours or days away from the event, so this does not appear to be ongoing.
I can't make any sense of these messages from a hardware perspective. The drive in question is in one of two hot-swap drive cages holding the 8x SATA drives of various sizes in this RAIDZ2 pool (a mix of 3T, 5T, and 6T; planning to migrate to all 5T and 6T in the next few weeks), and they are all connected to an IBM ServeRAID M1015 (LSI 9220-8i flashed to IT mode) through a pair of SFF-8087-to-SATA cables. The drive obviously has some hours on it, but I beat the hell out of it before replacing a 2T with it (a full 4-pass run of badblocks, followed by a long SMART test).
I don't really believe it's a genuine power delivery problem, because the other two drives in that cage should have also been affected (all drives in the cage share a common supply). It's hard to believe it's a faulty cable, because I have been using these cables for a few years without problems. Also, I stood this server up on April 28th and haven't had an issue like this since the initial boot. In fact, I haven't yet restarted the NAS since its first boot.
All the data I care about is backed up in a couple of other places, so I don't have a gun to my head. My thinking right now is that I should issue a zpool clear and see what happens next. Conceptually, is there some value to doing a scrub before doing that? Any additional troubleshooting or interpretation of the information already available that I should be considering?
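If I do go the zpool clear route, my rough plan for watching the counters afterward looks like this (a sketch, not something I've run yet; the awk field positions assume the NAME STATE READ WRITE CKSUM column layout shown in the zpool status output above):

```shell
# After `zpool clear data`, periodically extract the error counters for the
# suspect vdev so any recurrence is obvious at a glance.
zpool status data \
  | awk '/b76dc0ce-9f34-491c-8118-b3fbd55080bf/ {print "read=" $3, "write=" $4, "cksum=" $5}'
```

Right now that would print read=11 write=0 cksum=0; after a clear and a scrub, anything nonzero would point the finger back at the drive or its data path.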
Any thoughts, questions, advice, references, rants, etc will be deeply appreciated.