GrimmReaperNL (Explorer, joined Jan 24, 2022, 58 messages)
Hi Everybody,
Earlier this week my boot-pool suddenly went 'degraded'. Running 'zpool status -v' showed the pool had both read and write errors, and one of the drives didn't show up. In 'manage disks' it still showed, so I tried to run SMART tests, but those came back failed.
I pulled the SSD from the server, hooked it up to my PC with one of those SATA-to-USB thingies and looked up the SMART values in Speccy. It showed all green/good.
Plugged it back into the server and cleared it with 'zpool clear'. It resilvered, and I ran a short SMART test. All looked good.
Then the pool went degraded again. So I pulled it again, hooked it up to my PC again and used 'diskcheckup' to run a short and a long SMART test.
They both came back with no issues. I shrugged and remembered reading somewhere that a SATA cable can sometimes cause issues.
So I plugged the drive back into the server with a different cable, same port. Cleared the pool, resilvered, ran a short SMART test. Again, all good.
Now today it's gone degraded again. @joeschmuck's Multi-Report script email shows this right now, giving both SMART and zpool information (I cut all the SMART info except for the 'failed' drive):
sdj is supposed to be the bad drive. I mean, yeah, there's now a SATA_Phy_Error_Count and a CRC_Error_Count of 1 each. But I feel like that's likely from just pulling the drive from the server without powering it down, as you can't 'offline' drives in your boot-pool.
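One way to check whether those two counters come only from hot-pulling the drive is to record them before and after each reseat and see if they keep climbing while the drive stays plugged in. Below is a minimal Python sketch, assuming the attribute table is laid out the way the sdj section of the report prints it. The function names and the choice of which attributes to watch are hypothetical, not part of Multi-Report or smartctl.

```python
import re

# Link-level counters named in the sdj report; which ones to watch is an
# assumption made for this sketch.
LINK_ATTRS = {"SATA_Phy_Error_Count", "SATA_CRC_Error_Count", "CRC_Error_Count"}

# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ATTR_LINE = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+\d+\s+\d+\s+\d+\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)

def link_error_counts(smart_text):
    """Return {attribute_name: raw_value} for the link-level counters."""
    counts = {}
    for line in smart_text.splitlines():
        m = ATTR_LINE.match(line)
        if m and m.group("name") in LINK_ATTRS:
            counts[m.group("name")] = int(m.group("raw"))
    return counts

def grew(before, after):
    """Return how much each counter increased between two snapshots."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }

# Three rows copied from the sdj section of the report.
SAMPLE = """\
168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always       -       1
199 SATA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
218 CRC_Error_Count         0x0032   100   100   000    Old_age   Always       -       1
"""

if __name__ == "__main__":
    print(link_error_counts(SAMPLE))
```

If a later snapshot shows the counters growing without the drive having been touched, a hot-pull would not explain the increase.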
Does anyone have any insight into what to do now? As SMART looks okay, I can't really go back to the retailer (it's a pretty new drive) and say it's faulty.
Thanks for your wisdom.
Code:
Multi-Report v2.0.10 dtd:2023-03-06 (TrueNAS Scale 22.12.1)
Report Run 11-Mar-2023 @ 16:28:28.53

*ZPool/ZFS Status Report Summary
Pool Name | Status | Pool Size | Free Space | Used Space | Frag | Read Errors | Write Errors | Cksum Errors | Scrub Repaired Bytes | Scrub Errors | Last Scrub Age | Last Scrub Duration
TrueNAS | ONLINE | 102T | 46.0T | 55.6T (54%) | 2% | 0 | 0 | 0 | 0 | 0 | 30 | 15:50:43
boot-pool | DEGRADED | 107.73G | 105G | 2.73G (2%) | 2% | 0 | 22 | 0 | --- | --- | Resilvered | ---
nvme-pool | ONLINE | 899.70G | 874G | 25.7G (2%) | 0% | 0 | 0 | 0 | 0 | 0 | 7 | 00:00:25
*Data obtained from zpool and zfs commands.

Spinning Rust Summary Report
Device ID | Serial Number | Model Number | HDD Capacity | RPM | SMART Status | Curr Temp | Temp Min | Temp Max | Power On Time | Start Stop Count | Load Cycle Count | Spin Retry Count | Re-alloc Sects | Re-alloc Evnt | Curr Pend Sects | Offl Unc Sects | UDMA CRC Error | Read Error Rate | Seek Error Rate | Multi Zone Error | He Level | Last Test Age | Last Test Type
/dev/sda | 7130A0CVFVJG | TOSHIBA MG08ACA14TE | 14.0TB | 7200 | PASSED | 30*C | 24*C | 35*C | 8769 | 50 | 56 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | --- | --- | 0 | Short
/dev/sdb | 7130A0D6FVJG | TOSHIBA MG08ACA14TE | 14.0TB | 7200 | PASSED | 30*C | 24*C | 35*C | 8770 | 50 | 56 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | --- | --- | 0 | Short
/dev/sdc | 7130A09EFVJG | TOSHIBA MG08ACA14TE | 14.0TB | 7200 | PASSED | 30*C | 23*C | 34*C | 8769 | 50 | 55 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | --- | --- | 0 | Short
/dev/sdd | 7130A0CRFVJG | TOSHIBA MG08ACA14TE | 14.0TB | 7200 | PASSED | 28*C | 23*C | 33*C | 8759 | 30 | 34 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | --- | --- | 0 | Short
/dev/sde | 42U0A0MYF94G | TOSHIBA MG07ACA14TE | 14.0TB | 7200 | PASSED | 31*C | 25*C | 35*C | 1241 | 7 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | --- | --- | 0 | Short
/dev/sdf | 7130A0BEFVJG | TOSHIBA MG08ACA14TE | 14.0TB | 7200 | PASSED | 30*C | 23*C | 34*C | 8770 | 50 | 54 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | --- | --- | 0 | Short
/dev/sdg | 7130A0CBFVJG | TOSHIBA MG08ACA14TE | 14.0TB | 7200 | PASSED | 31*C | 23*C | 35*C | 8769 | 50 | 55 | 0 | 0(3) | 0(1) | 0 | 0 | 0 | 0 | 0 | --- | --- | 0 | Short
/dev/sdi | 7130A0BWFVJG | TOSHIBA MG08ACA14TE | 14.0TB | 7200 | PASSED | 31*C | 23*C | 36*C | 8769 | 50 | 56 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | --- | --- | 0 | Short
/dev/sdk | 7130A05WFVJG | TOSHIBA MG08ACA14TE | 14.0TB | 7200 | PASSED | 30*C | 22*C | 34*C | 8771 | 52 | 56 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | --- | --- | 0 | Short
/dev/sdl | 7130A037FVJG | TOSHIBA MG08ACA14TE | 14.0TB | 7200 | PASSED | 28*C | 22*C | 32*C | 8771 | 52 | 57 | 0 | 0(3) | 0(2) | 0 | 0 | 0 | 0 | 0 | --- | --- | 0 | Short
/dev/sdm | 42U0A0BCF94G | TOSHIBA MG07ACA14TE | 14.0TB | 7200 | PASSED | 32*C | 23*C | 35*C | 4415 | 8 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | --- | --- | 0 | Short

SSD Summary Report
Device ID | Serial Number | Model Number | HDD Capacity | SMART Status | Curr Temp | Temp Min | Temp Max | Power On Time | Wear Level | Re-alloc Sects | Re-alloc Evnt | Curr Pend Sects | Offl Unc Sects | UDMA CRC Error | Last Test Age | Last Test Type
/dev/sdh | 50026B7381A308DD | KINGSTON SA400S37120G | 120GB | PASSED | 26*C | 19*C | 37*C | 329 | 99 | --- | 0 | --- | --- | 0 | 0 | Short
/dev/sdj | 50026B7381A31481 | KINGSTON SA400S37120G | 120GB | PASSED | 28*C | 21*C | 38*C | 311 | 99 | --- | 0 | --- | --- | 0 | 0 | Short

NVMe Summary Report
Device ID | Serial Number | Model Number | HDD Capacity | SMART Status | Critical Warning | Curr Temp | Power On Time | Wear Level
/dev/nvme0n1 | 2235E65C03BB | CT1000P3PSSD8 | 1.00TB | PASSED | GOOD | 31*C | 327 | 100

Multi-Report Text Section
External Configuration File in use dtd:2023-01-22
Statistical Export Log Located: /mnt/TrueNAS/scripts/statisticalsmartdata.csv
Emailed every: Mon

CRITICAL LOG FILE
boot-pool - Scrub Online Error
END

WARNING LOG FILE
boot-pool - Scrub Write Errors
END

########## ZPool status report for TrueNAS ##########
  pool: TrueNAS
 state: ONLINE
  scan: scrub repaired 0B in 15:50:43 with 0 errors on Thu Feb 9 12:25:26 2023
config:

    NAME                                      STATE     READ WRITE CKSUM
    TrueNAS                                   ONLINE       0     0     0
      raidz3-0                                ONLINE       0     0     0
        0a38cf8b-a583-11ec-9714-3cecef8c44fa  ONLINE       0     0     0
        09737924-a583-11ec-9714-3cecef8c44fa  ONLINE       0     0     0
        d9fd11cf-9761-11ed-abfe-3cecef8c44fa  ONLINE       0     0     0
        0a014274-a583-11ec-9714-3cecef8c44fa  ONLINE       0     0     0
        0b3f1b68-a583-11ec-9714-3cecef8c44fa  ONLINE       0     0     0
        0ae27c9a-a583-11ec-9714-3cecef8c44fa  ONLINE       0     0     0
        56f3655d-2ed5-11ed-ab7a-3cecef8c44fa  ONLINE       0     0     0
        0a775747-a583-11ec-9714-3cecef8c44fa  ONLINE       0     0     0
        0dcc0444-a583-11ec-9714-3cecef8c44fa  ONLINE       0     0     0
        0a474b61-a583-11ec-9714-3cecef8c44fa  ONLINE       0     0     0
        0cf68994-a583-11ec-9714-3cecef8c44fa  ONLINE       0     0     0

errors: No known data errors

Drives for this pool are listed below:
0a38cf8b-a583-11ec-9714-3cecef8c44fa -> sdl2
09737924-a583-11ec-9714-3cecef8c44fa -> sdb2
d9fd11cf-9761-11ed-abfe-3cecef8c44fa -> sde2
0a014274-a583-11ec-9714-3cecef8c44fa -> sdf2
0b3f1b68-a583-11ec-9714-3cecef8c44fa -> sdi2
0ae27c9a-a583-11ec-9714-3cecef8c44fa -> sdd2
56f3655d-2ed5-11ed-ab7a-3cecef8c44fa -> sdm2
0a775747-a583-11ec-9714-3cecef8c44fa -> sdk2
0dcc0444-a583-11ec-9714-3cecef8c44fa -> sdc2
0a474b61-a583-11ec-9714-3cecef8c44fa -> sda2
0cf68994-a583-11ec-9714-3cecef8c44fa -> sdg2

########## ZPool status report for boot-pool ##########
  pool: boot-pool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 28.2M in 00:00:00 with 0 errors on Thu Mar 9 00:55:06 2023
config:

    NAME         STATE     READ WRITE CKSUM
    boot-pool    DEGRADED     0     0     0
      mirror-0   DEGRADED     0     0     0
        sdh3     ONLINE       0     0     0
        sdj3     FAULTED      0    22     0  too many errors

errors: No known data errors

########## ZPool status report for nvme-pool ##########
  pool: nvme-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:25 with 0 errors on Sun Mar 5 00:00:26 2023
config:

    NAME                                    STATE     READ WRITE CKSUM
    nvme-pool                               ONLINE       0     0     0
      9516f97e-fb7c-4c3a-9b54-3d949db386db  ONLINE       0     0     0

errors: No known data errors

Drives for this pool are listed below:
9516f97e-fb7c-4c3a-9b54-3d949db386db -> nvme0n1p2

########## SMART status report for sdj drive (Phison Driven SSDs : 50026B7381A31481) ##########
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   000    Old_age   Always       -       100
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       311
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       11
148 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
149 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
167 Write_Protect_Mode      0x0000   100   100   000    Old_age   Offline      -       0
168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always       -       1
169 Bad_Block_Rate          0x0000   100   100   000    Old_age   Offline      -       0
170 Bad_Blk_Ct_Lat/Erl      0x0000   100   100   010    Old_age   Offline      -       0/0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 MaxAvgErase_Ct          0x0000   100   100   000    Old_age   Offline      -       0
181 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0000   100   100   000    Old_age   Offline      -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Unsafe_Shutdown_Count   0x0012   100   100   000    Old_age   Always       -       8
194 Temperature_Celsius     0x0022   029   038   000    Old_age   Always       -       29 (Min/Max 21/38)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
199 SATA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
218 CRC_Error_Count         0x0032   100   100   000    Old_age   Always       -       1
231 SSD_Life_Left           0x0000   099   099   000    Old_age   Offline      -       99
233 Flash_Writes_GiB        0x0032   100   100   000    Old_age   Always       -       284
241 Lifetime_Writes_GiB     0x0032   100   100   000    Old_age   Always       -       310
242 Lifetime_Reads_GiB      0x0032   100   100   000    Old_age   Always       -       36
244 Average_Erase_Count     0x0000   100   100   000    Old_age   Offline      -       10
245 Max_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       21
246 Total_Erase_Count       0x0000   100   100   000    Old_age   Offline      -       10399

Warning: ATA error count 0 inconsistent with error log pointer 1

ATA Error Count: 0
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error -4 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 00 00 00 00 00 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  b0 d0 01 00 4f c2 00 08      00:00:00.000  SMART READ DATA
  b0 d1 01 01 4f c2 00 08      00:00:00.000  SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
  b0 da 00 00 4f c2 00 08      00:00:00.000  SMART RETURN STATUS
  b0 d5 01 00 4f c2 00 08      00:00:00.000  SMART READ LOG
  b0 d5 01 01 4f c2 00 08      00:00:00.000  SMART READ LOG

Num Test_Description (Most recent Short & Extended Tests - Listed by test number)
# 1  Short offline       Completed without error       00%       298         -
# 5  Extended offline    Completed without error       00%       246         -

SCT Error Recovery Control:
SCT Commands not supported

End of data section

Attachment: multi_report_config.txt (20K)
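To pull just the affected device out of a status dump like the one in the report, the config table can be scanned for rows with non-zero READ/WRITE/CKSUM counters. This is a minimal Python sketch, assuming the usual `NAME STATE READ WRITE CKSUM` layout. The function name is hypothetical, and counts that zpool abbreviates (such as `1.2K`) are skipped rather than parsed.

```python
def devices_with_errors(status_text):
    """Return (name, state, read, write, cksum) for rows with non-zero counters."""
    found = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True          # header row of a pool's config table
            continue
        if fields[0] == "errors:":
            in_config = False         # end of that pool's config table
            continue
        if not in_config or len(fields) < 5:
            continue
        name, state, read, write, cksum = fields[:5]
        # Abbreviated counts (e.g. "1.2K") are not handled in this sketch.
        if not (read.isdigit() and write.isdigit() and cksum.isdigit()):
            continue
        if int(read) or int(write) or int(cksum):
            found.append((name, state, int(read), int(write), int(cksum)))
    return found

# boot-pool section copied from the report.
SAMPLE = """\
  pool: boot-pool
 state: DEGRADED
config:

    NAME         STATE     READ WRITE CKSUM
    boot-pool    DEGRADED     0     0     0
      mirror-0   DEGRADED     0     0     0
        sdh3     ONLINE       0     0     0
        sdj3     FAULTED      0    22     0  too many errors

errors: No known data errors
"""

if __name__ == "__main__":
    for row in devices_with_errors(SAMPLE):
        print(row)
```

For the boot-pool section this picks out only sdj3 with its 22 write errors. The pool and mirror rows are DEGRADED but carry zero counters of their own, so they are left out.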