Chip28
Cadet
- Joined
- May 5, 2014
- Messages
- 6
--edit, looks like I forgot to add the question tag... just imagine it's there please...
Afternoon All!
Apologies in advance for the long post. (Mods, if I'm in the wrong place, please let me know.) I have both questions and a problem with my build. I've lurked on the forums for a while, there's lots of awesome help here, and I'm hoping someone can help me out a bit as well. I fully expect to get a bit of grief for how it's set up, but I did the best I could at the time:
Setup:
Dell Precision R5400
2 x Intel(R) Xeon(R) CPU E5440 @ 2.83GHz
12(ish)GB of ECC RAM
SanDisk Cruzer Fit 16GB
2 x Seagate ST3500418AS - Jails/Transcoding etc
Monoprice 103581 PCI Express Serial ATA II
Sans Digital TowerRAID TR8M+B
6 x ST3000DM 3T
2 x WD RED 3T
The 3T drives are in a 2 x RaidZ2 (as recommended by the "creation" wizard): --edit, added below and changed to raidz2
Code:
NAME                                            STATE     READ WRITE CKSUM
mediastore                                      ONLINE       0     0     0
  raidz2-0                                      ONLINE       0     0     0
    gptid/1cfdfa5d-9de6-11e3-a6e9-00219b58171e  ONLINE       0     0     0
    gptid/1dd5426a-9de6-11e3-a6e9-00219b58171e  ONLINE       0     0     0
    gptid/1e381275-9de6-11e3-a6e9-00219b58171e  ONLINE       0     0     0
    gptid/1ea087d9-9de6-11e3-a6e9-00219b58171e  ONLINE       0     0     0
  raidz2-1                                      ONLINE       0     0     0
    gptid/1f01fd79-9de6-11e3-a6e9-00219b58171e  ONLINE       0     0     0
    gptid/1f6b3224-9de6-11e3-a6e9-00219b58171e  ONLINE       0     0     0
    gptid/2049a02f-9de6-11e3-a6e9-00219b58171e  ONLINE       0     0     0
    gptid/21453905-9de6-11e3-a6e9-00219b58171e  ONLINE       0     0     0
The problems:
Before I get started, I would like to say I DO have SMART checks and scrubs set up AND the emails come through just fine. I get one every boot >.> (part of the problem, mentioned below).
When writing files, it will stall out, moving to a creeping pace, eventually picking back up (sometimes). The error log shows things like:
Code:
May 5 17:09:15 eve kernel: (ada4:siisch1:0:0:0): CAM status: ATA Status Error
May 5 17:09:15 eve kernel: (ada4:siisch1:0:0:0): ATA status: 41 (DRDY ERR), error: 84 (ICRC ABRT )
May 5 17:09:15 eve kernel: (ada4:siisch1:0:0:0): RES: 41 84 80 f3 9f 00 4a 00 00 00 01
May 5 17:09:15 eve kernel: (ada4:siisch1:0:0:0): Retrying command
May 5 17:09:16 eve kernel: (ada5:siisch1:0:1:0): WRITE_FPDMA_QUEUED. ACB: 61 88 80 24 a0 40 4a 00 00 00 00 00
May 5 17:09:16 eve kernel: (ada5:siisch1:0:1:0): CAM status: ATA Status Error
May 5 17:09:16 eve kernel: (ada5:siisch1:0:1:0): ATA status: 41 (DRDY ERR), error: 84 (ICRC ABRT )
May 5 17:09:16 eve kernel: (ada5:siisch1:0:1:0): RES: 41 84 18 f7 9f 00 4a 00 00 88 00
May 5 17:09:16 eve kernel: (ada5:siisch1:0:1:0): Retrying command
May 5 17:09:16 eve kernel: siisch0: Error while READ LOG EXT
May 5 17:09:17 eve kernel: (ada3:siisch0:0:3:0): WRITE_FPDMA_QUEUED. ACB: 61 88 58 0e a0 40 4a 00 00 00 00 00
May 5 17:09:17 eve kernel: (ada3:siisch0:0:3:0): CAM status: ATA Status Error
May 5 17:09:17 eve kernel: (ada3:siisch0:0:3:0): ATA status: 00 ()
May 5 17:09:17 eve kernel: (ada3:siisch0:0:3:0): RES: 00 00 00 00 00 00 00 00 00 00 00
May 5 17:09:17 eve kernel: (ada3:siisch0:0:3:0): Retrying command
May 5 17:09:17 eve kernel: (ada3:siisch0:0:3:0): WRITE_FPDMA_QUEUED. ACB: 61 00 68 0f a0 40 4a 00 00 01 00 00
May 5 17:09:17 eve kernel: (ada3:siisch0:0:3:0): CAM status: ATA Status Error
May 5 17:09:17 eve kernel: (ada3:siisch0:0:3:0): ATA status: 00 ()
May 5 17:09:17 eve kernel: (ada3:siisch0:0:3:0): RES: 00 00 00 00 00 00 00 00 00 00 00
May 5 17:09:17 eve kernel: (ada3:siisch0:0:3:0): Retrying command
May 5 17:09:17 eve kernel: (ada3:siisch0:0:3:0): WRITE_FPDMA_QUEUED. ACB: 61 88 e0 0e a0 40 4a 00 00 00 00 00
May 5 17:09:17 eve kernel: (ada3:siisch0:0:3:0): CAM status: ATA Status Error
May 5 17:09:17 eve kernel: (ada3:siisch0:0:3:0): ATA status: 00 ()
May 5 17:09:17 eve kernel: (ada3:siisch0:0:3:0): RES: 00 00 00 00 00 00 00 00 00 00 00
May 5 17:09:17 eve kernel: (ada3:siisch0:0:3:0): Retrying command
May 5 17:09:17 eve kernel: (ada2:siisch0:0:2:0): WRITE_FPDMA_QUEUED. ACB: 61 88 40 1b a0 40 4a 00 00 00 00 00
May 5 17:09:17 eve kernel: (ada2:siisch0:0:2:0): CAM status: ATA Status Error
May 5 17:09:17 eve kernel: (ada2:siisch0:0:2:0): ATA status: 41 (DRDY ERR), error: 84 (ICRC ABRT )
May 5 17:09:17 eve kernel: (ada2:siisch0:0:2:0): RES: 41 84 6f 1b a0 40 4a 00 00 00 00
May 5 17:09:17 eve kernel: (ada2:siisch0:0:2:0): Retrying command
May 5 17:09:18 eve kernel: (ada4:siisch1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 88 78 32 a0 40 4a 00 00 00 00 00
May 5 17:09:18 eve kernel: (ada4:siisch1:0:0:0): CAM status: ATA Status Error
May 5 17:09:18 eve kernel: (ada4:siisch1:0:0:0): ATA status: 41 (DRDY ERR), error: 84 (ICRC ABRT )
May 5 17:09:18 eve kernel: (ada4:siisch1:0:0:0): RES: 41 84 78 32 a0 00 4a 00 00 88 00
May 5 17:09:18 eve kernel: (ada4:siisch1:0:0:0): Retrying command
May 5 17:09:18 eve kernel: (ada5:siisch1:0:1:0): WRITE_FPDMA_QUEUED. ACB: 61 88 70 41 a0 40 4a 00 00 00 00 00
May 5 17:09:18 eve kernel: (ada5:siisch1:0:1:0): CAM status: ATA Status Error
May 5 17:09:18 eve kernel: (ada5:siisch1:0:1:0): ATA status: 41 (DRDY ERR), error: 84 (ICRC ABRT )
May 5 17:09:18 eve kernel: (ada5:siisch1:0:1:0): RES: 41 84 70 41 a0 00 4a 00 00 88 00
If you notice, it's typically a set of drives, not one in particular. Drives 0-3 and 4-7, respectively, are each on an eSATA cable. I replaced the eSATA cables with well-recommended ones from Amazon, thinking that would solve the problem, but it didn't.
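To check the "set of drives" pattern a bit more systematically, here is a rough Python sketch that tallies the ICRC lines per channel and device. It assumes log lines in the same format as the excerpt above; the function name `count_icrc_errors` and the four-line sample are just for illustration. As far as I understand it, ICRC means an interface CRC error on the link, which would point at cabling, the enclosure, or the controller rather than at the platters.

```python
import re
from collections import Counter

# Matches the "(adaN:siischM:" prefix the kernel puts on each CAM message.
DEVICE_RE = re.compile(r"\((ada\d+):(siisch\d+):")


def count_icrc_errors(log_lines):
    """Count lines mentioning ICRC, keyed by (channel, device)."""
    counts = Counter()
    for line in log_lines:
        if "ICRC" not in line:
            continue
        match = DEVICE_RE.search(line)
        if match:
            device, channel = match.groups()
            counts[(channel, device)] += 1
    return counts


# A few lines copied from the excerpt above (not the full log).
sample = [
    "May 5 17:09:15 eve kernel: (ada4:siisch1:0:0:0): ATA status: 41 (DRDY ERR), error: 84 (ICRC ABRT )",
    "May 5 17:09:16 eve kernel: (ada5:siisch1:0:1:0): ATA status: 41 (DRDY ERR), error: 84 (ICRC ABRT )",
    "May 5 17:09:17 eve kernel: (ada3:siisch0:0:3:0): ATA status: 00 ()",
    "May 5 17:09:17 eve kernel: (ada2:siisch0:0:2:0): ATA status: 41 (DRDY ERR), error: 84 (ICRC ABRT )",
]

for (channel, device), n in sorted(count_icrc_errors(sample).items()):
    print(channel, device, n)
```

Running it over the whole syslog instead of the sample should show whether the errors cluster on one channel or are spread across both.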
The email mentioned above states:
Code:
Device: /dev/ada2, 41 Currently unreadable (pending) sectors
Device info: WDC WD30EFRX-68EUZN0, S/N:WD-WCCXXXXXXXXX, WWN:5-0014ee-XXXXXXXXX, FW:80.00A80, 3.00 TB
The WD drive tools verify that this one particular drive is failing. Seagate SeaTools BSODs both of my Windows computers when trying to scan one of the Seagate drives.
SMART Data
Drive0 (seagate)
Code:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 114   099   006    Pre-fail Always  -           61103656
  3 Spin_Up_Time            0x0003 092   092   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032 100   100   020    Old_age  Always  -           79
  5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000f 072   060   030    Pre-fail Always  -           19881043
  9 Power_On_Hours          0x0032 088   088   000    Old_age  Always  -           11248
 10 Spin_Retry_Count        0x0013 100   100   097    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   020    Old_age  Always  -           85
183 Runtime_Bad_Block       0x0032 001   001   000    Old_age  Always  -           104
184 End-to-End_Error        0x0032 100   100   099    Old_age  Always  -           0
187 Reported_Uncorrect      0x0032 094   094   000    Old_age  Always  -           6
188 Command_Timeout         0x0032 100   003   000    Old_age  Always  -           39 82 2765
189 High_Fly_Writes         0x003a 100   100   000    Old_age  Always  -           0
190 Airflow_Temperature_Cel 0x0022 058   051   045    Old_age  Always  -           42 (Min/Max 37/44)
191 G-Sense_Error_Rate      0x0032 100   100   000    Old_age  Always  -           0
192 Power-Off_Retract_Count 0x0032 100   100   000    Old_age  Always  -           62
193 Load_Cycle_Count        0x0032 100   100   000    Old_age  Always  -           1754
194 Temperature_Celsius     0x0022 042   049   000    Old_age  Always  -           42 (0 15 0 0 0)
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e 200   092   000    Old_age  Always  -           23326
240 Head_Flying_Hours       0x0000 100   253   000    Old_age  Offline -           9058h+56m+29.965s
241 Total_LBAs_Written      0x0000 100   253   000    Old_age  Offline -           64461077885046
242 Total_LBAs_Read         0x0000 100   253   000    Old_age  Offline -           139361752603292
...
Error 4 occurred at disk power-on lifetime: 11063 hours (460 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00   2d+09:33:19.740  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00   2d+09:33:19.740  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00   2d+09:33:19.249  READ LOG EXT
  60 00 00 ff ff ff 4f 00   2d+09:33:16.501  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00   2d+09:33:16.501  READ FPDMA QUEUED
Drive1 (seagate)
Code:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 114   099   006    Pre-fail Always  -           61103656
  3 Spin_Up_Time            0x0003 092   092   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032 100   100   020    Old_age  Always  -           79
  5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000f 072   060   030    Pre-fail Always  -           19881043
  9 Power_On_Hours          0x0032 088   088   000    Old_age  Always  -           11248
 10 Spin_Retry_Count        0x0013 100   100   097    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   020    Old_age  Always  -           85
183 Runtime_Bad_Block       0x0032 001   001   000    Old_age  Always  -           104
184 End-to-End_Error        0x0032 100   100   099    Old_age  Always  -           0
187 Reported_Uncorrect      0x0032 094   094   000    Old_age  Always  -           6
188 Command_Timeout         0x0032 100   003   000    Old_age  Always  -           39 82 2765
189 High_Fly_Writes         0x003a 100   100   000    Old_age  Always  -           0
190 Airflow_Temperature_Cel 0x0022 058   051   045    Old_age  Always  -           42 (Min/Max 37/44)
191 G-Sense_Error_Rate      0x0032 100   100   000    Old_age  Always  -           0
192 Power-Off_Retract_Count 0x0032 100   100   000    Old_age  Always  -           62
193 Load_Cycle_Count        0x0032 100   100   000    Old_age  Always  -           1754
194 Temperature_Celsius     0x0022 042   049   000    Old_age  Always  -           42 (0 15 0 0 0)
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e 200   092   000    Old_age  Always  -           23326
240 Head_Flying_Hours       0x0000 100   253   000    Old_age  Offline -           9059h+00m+43.746s
241 Total_LBAs_Written      0x0000 100   253   000    Old_age  Offline -           64461077885046
242 Total_LBAs_Read         0x0000 100   253   000    Old_age  Offline -           139361752603292
...
Error 4 occurred at disk power-on lifetime: 11063 hours (460 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00   2d+09:33:19.740  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00   2d+09:33:19.740  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00   2d+09:33:19.249  READ LOG EXT
  60 00 00 ff ff ff 4f 00   2d+09:33:16.501  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00   2d+09:33:16.501  READ FPDMA QUEUED
Error 3 occurred at disk power-on lifetime: 11063 hours (460 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00   2d+09:33:16.501  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00   2d+09:33:16.501  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00   2d+09:33:16.107  READ LOG EXT
  60 00 00 ff ff ff 4f 00   2d+09:33:13.379  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00   2d+09:33:13.379  READ FPDMA QUEUED
Drive2 (WD) (verified failing)
Code:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 199   109   051    Pre-fail Always  -           6320
  3 Spin_Up_Time            0x0027 173   172   021    Pre-fail Always  -           6316
  4 Start_Stop_Count        0x0032 100   100   000    Old_age  Always  -           47
  5 Reallocated_Sector_Ct   0x0033 175   175   140    Pre-fail Always  -           758
  7 Seek_Error_Rate         0x002e 100   253   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 097   097   000    Old_age  Always  -           2847
 10 Spin_Retry_Count        0x0032 100   253   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032 100   253   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           45
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           34
193 Load_Cycle_Count        0x0032 185   185   000    Old_age  Always  -           46395
194 Temperature_Celsius     0x0022 111   104   000    Old_age  Always  -           39
196 Reallocated_Event_Count 0x0032 199   199   000    Old_age  Always  -           1
197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           41
198 Offline_Uncorrectable   0x0030 100   253   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032 200   001   000    Old_age  Always  -           24372
200 Multi_Zone_Error_Rate   0x0008 200   200   000    Old_age  Offline -           58
...
Error 1598 occurred at disk power-on lifetime: 1287 hours (53 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 d0 8f 1d e4  Error: UNC 8 sectors at LBA = 0x041d8fd0 = 69046224
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 d0 8f 1d e4 00   2d+04:03:30.791  READ DMA
  ec 00 00 00 00 00 a0 00   2d+04:03:30.790  IDENTIFY DEVICE
  ef 03 44 00 00 00 a0 00   2d+04:03:30.790  SET FEATURES [Set transfer mode]
...
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed: read failure  90%        2697             55757464
# 2  Extended offline    Completed: read failure  70%        1421             58783072
# 3  Extended offline    Completed: read failure  90%        1415             55764744
# 4  Extended offline    Completed: read failure  70%        1387             55757464
# 5  Extended offline    Completed: read failure  90%        1383             58477104
# 6  Extended offline    Completed: read failure  90%        1368             58470232
# 7  Extended offline    Completed: read failure  60%        1349             61200168
# 8  Extended offline    Completed: read failure  70%        1346             63843408
# 9  Extended offline    Completed: read failure  90%        1148             55768184
Drive3 (WD) - No logged SMART Errors --edit, changed from seagate to WD
Code:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 200   200   051    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0027 180   179   021    Pre-fail Always  -           5958
  4 Start_Stop_Count        0x0032 100   100   000    Old_age  Always  -           81
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e 200   200   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 092   092   000    Old_age  Always  -           6024
 10 Spin_Retry_Count        0x0032 100   253   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032 100   253   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           79
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           55
193 Load_Cycle_Count        0x0032 200   200   000    Old_age  Always  -           25
194 Temperature_Celsius     0x0022 110   105   000    Old_age  Always  -           40
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0030 100   253   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032 200   001   000    Old_age  Always  -           26238
200 Multi_Zone_Error_Rate   0x0008 200   200   000    Old_age  Offline -           0
The rest of the drives don't have anything interesting AFAIK in the SMART Logs yet STILL get mentioned in the syslog.
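For skimming tables like the ones above, here is a minimal Python sketch that pulls out a handful of attributes and reports the ones with a nonzero raw value. It assumes rows in the same ten-column attribute layout shown above. The `WATCHED` list is only a common rule of thumb (reallocated, pending, and uncorrectable sectors plus CRC errors), not a vendor specification, and the function names are made up for this example.

```python
# Attribute IDs often watched for media or link trouble.
# This selection is an assumption (rule of thumb), not an official list.
WATCHED = {5, 197, 198, 199}


def parse_row(row):
    """Return (id, name, raw_value) for one attribute row, or None if it is not a data row."""
    fields = row.split(None, 9)  # 10 columns; the raw value may itself contain spaces
    if len(fields) < 10 or not fields[0].isdigit():
        return None
    raw_first = fields[9].split()[0]
    raw = int(raw_first) if raw_first.isdigit() else None
    return int(fields[0]), fields[1], raw


def flag_nonzero(rows):
    """List the watched attributes whose raw value is above zero."""
    flagged = []
    for row in rows:
        parsed = parse_row(row)
        if parsed is None:
            continue
        attr_id, name, raw = parsed
        if attr_id in WATCHED and raw:
            flagged.append((attr_id, name, raw))
    return flagged


# Rows copied from the Drive2 (WD) table above.
drive2_rows = [
    "5 Reallocated_Sector_Ct 0x0033 175 175 140 Pre-fail Always - 758",
    "194 Temperature_Celsius 0x0022 111 104 000 Old_age Always - 39",
    "197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 41",
    "198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0",
    "199 UDMA_CRC_Error_Count 0x0032 200 001 000 Old_age Always - 24372",
]

for attr_id, name, raw in flag_nonzero(drive2_rows):
    print(attr_id, name, raw)
```

On the Drive2 rows this reports attributes 5, 197, and 199. Attribute 199 is nonzero on every drive posted here, including Drive3, which otherwise looks clean.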
Questions:
1) Are all my drives throwing errors because the WD (and possibly the 2 Seagates) is failing?
2) Are WD Reds the best way to go? Anyone want to comment on what drives you're using?
3) I've had a Linux server for media and such for the past 3ish years, typically running Seagate drives (because they're cheap) and on 24x7. I've RMA'd 3 of them so far out of this machine alone, and a couple of others from other machines. The computer rarely physically moves. Am I just having awful luck, or am I doing something completely wrong?
4) Is my ZFS setup OK? Is there a better config that I could/should use that wouldn't burn so many of my drives (1/2) as redundancy?
Last question, Promise
5) Any suggestions for a relatively inexpensive rackmount JBOD enclosure? (I'm a college student doing this more or less as a hobby, so I can't be super extravagant.)
As a pseudo-question, any suggestions for resources on understanding the SMART data?
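On question 4, the "1/2 as redundancy" figure can be checked with some back-of-the-envelope arithmetic. The sketch below is only an estimate: it assumes each RAIDZ2 vdev gives up two disks' worth of space to parity and ignores ZFS metadata, padding, and TB-vs-TiB differences. The single 8-disk layout is a hypothetical comparison, not something from my current pool, and the helper name is made up.

```python
def raidz_usable_tb(vdevs, disks_per_vdev, parity, disk_tb):
    """Rough usable capacity: every vdev loses `parity` disks' worth of space."""
    return vdevs * (disks_per_vdev - parity) * disk_tb


DISK_TB = 3  # the 3T drives from the setup list
RAW_TB = 8 * DISK_TB

layouts = {
    "2 x 4-disk RAIDZ2 (current)": raidz_usable_tb(2, 4, 2, DISK_TB),
    "1 x 8-disk RAIDZ2 (hypothetical)": raidz_usable_tb(1, 8, 2, DISK_TB),
}

for name, usable in layouts.items():
    print(f"{name}: {usable} TB usable of {RAW_TB} TB raw, {RAW_TB - usable} TB to parity")
```

The trade-off is that the current layout can lose up to two drives in each vdev, while a single 8-disk RAIDZ2 can lose only two drives in total.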
Feel free to bash away. I have read MOST of the manual.. but it's long, and it's finals week ;) ...
Should I attach/post the smart results for all 8 drives? Nothing like a monster wall of text to discourage help :(