badblocks errors

saurav

Contributor · Joined Jul 29, 2012 · Messages: 139
I'm burn-in testing a new Seagate IronWolf 4TB with badblocks, following the guide. The short, conveyance and long SMART tests I ran before badblocks all came back without errors.

The disk is attached to my backup server (HP N36L, specs in sig) over eSATA, sitting in an Anker external HDD dock.

Badblocks now has this in its output:

Code:
[saurav@freenas-backup ~/newdisk]$ sudo badblocks -b 4096 -ws /dev/ada4
Testing with pattern 0xaa: set_o_direct: Inappropriate ioctl for device
done                                       
Reading and comparing: done                 
Testing with pattern 0x55: done             
Reading and comparing: 923335392one, 28:42:47 elapsed. (0/0/0 errors)
923335393                                   
923335394                                   
923335395                                   
...
923335486                                   
923335487                                   
923347651one, 28:42:48 elapsed. (0/0/96 errors)                     
923347652                                   
923347653                                   
...
923348157                                   
923348158                                   
923348159                                   
done                                       
Testing with pattern 0xff: done           
Reading and comparing: done                 
Testing with pattern 0x00: 48.74% done, 46:57:38 elapsed. (0/0/605 errors)           
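For reference, the three numbers in the badblocks status line, e.g. `(0/0/96 errors)`, are (as I read the e2fsprogs source) read errors, write errors and data-corruption errors, in that order. Here the read and write counts stay at 0, so badblocks saw no failed read or write calls; the 96 blocks (605 by the 0x00 pass) are ones where the data read back did not match the test pattern. The bash sketch below only splits such a line into its three counters; `parse_badblocks_counts` is a hypothetical helper, not part of badblocks.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split a badblocks progress line such as
# "... elapsed. (0/0/96 errors)" into its three counters.
# Assumed order: read errors / write errors / corruption (compare) errors.
parse_badblocks_counts() {
  local line=$1
  local re='\(([0-9]+)/([0-9]+)/([0-9]+) errors\)'
  if [[ $line =~ $re ]]; then
    printf 'read=%s write=%s corruption=%s\n' \
      "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
  else
    echo "no error counter found" >&2
    return 1
  fi
}

parse_badblocks_counts "923347651one, 28:42:48 elapsed. (0/0/96 errors)"
# prints: read=0 write=0 corruption=96
```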


This morning I also got this email:
Code:
freenas-backup.local kernel log messages:
(ada4:ata0:0:0:0): WRITE_DMA48. ACB: 35 00 00 64 49 40 b8 01 00 00 00 01
(ada4:ata0:0:0:0): CAM status: Command timeout
(ada4:ata0:0:0:0): Retrying command
(ada4:ata0:0:0:0): READ_DMA48. ACB: 25 00 00 cb 06 40 49 01 00 00 00 01
(ada4:ata0:0:0:0): CAM status: ATA Status Error
(ada4:ata0:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 84 (ICRC ABRT )
(ada4:ata0:0:0:0): RES: 51 84 d0 cb 06 49 49 01 00 30 00
(ada4:ata0:0:0:0): Retrying command

-- End of security output --
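The status and error bytes in that kernel message are ATA register bitmasks. Error 0x84 is ICRC (0x80, an interface CRC error during an Ultra DMA transfer) plus ABRT (0x04, command aborted), which usually points at the link between host and drive (cable, eSATA port, dock) rather than the platters; the `UDMA_CRC_Error_Count` raw value of 1 in the SMART output posted further down is the drive-side counter for that kind of event. A minimal bash decoder, assuming the bit labels FreeBSD uses in the log above (labels for bits that do not appear in this log are my assumption):

```bash
#!/usr/bin/env bash
# Sketch: decode ATA status/error register bytes (hex, as printed in the log).
STATUS_BITS=(BSY DRDY DF SERV DRQ CORR IDX ERR)   # bit 7 .. bit 0
ERROR_BITS=(ICRC UNC MC IDNF MCR ABRT NM ILI)     # bit 7 .. bit 0

decode_bits() {
  local value=$(( 16#$1 ))
  shift
  local names=("$@")
  local out=()
  local i
  for i in 0 1 2 3 4 5 6 7; do
    # names[0] describes bit 7, names[7] describes bit 0
    if (( value & (0x80 >> i) )); then
      out+=("${names[i]}")
    fi
  done
  echo "${out[*]}"
}

echo "status 51: $(decode_bits 51 "${STATUS_BITS[@]}")"   # DRDY SERV ERR
echo "error  84: $(decode_bits 84 "${ERROR_BITS[@]}")"    # ICRC ABRT
```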

badblocks isn't done yet, but the disk is definitely bad, right? Should I abort it and start the RMA process, or wait for badblocks to finish so that the next long SMART test actually fails?

This disk is meant to replace one that is failing fast (273 pending sectors, and FreeNAS says that disk is not capable of running SMART tests).

Thanks,
Saurav.
 

Chris Moore

Hall of Famer · Joined May 2, 2015 · Messages: 10,080
saurav said:
"The disk is attached to my backup server (hp n36L, specs in sig) by eSATA and hosted in an Anker external HDD dock."
Do you have a fan blowing on that disk to keep it cool? A disk can easily fail from overheating if it is running badblocks with no airflow to pull the heat out. The badblocks test is very demanding.
 

saurav

No, I don't have a fan blowing directly on it, but the room is air-conditioned for half the day.

badblocks just finished, and the SMART attributes all look normal. I've started a long SMART test and will post the results once it's done.

Code:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   066   044    Pre-fail  Always       -       174167238
  3 Spin_Up_Time            0x0003   098   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   073   060   045    Pre-fail  Always       -       21287493
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       79 (78 48 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       3
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   051   040    Old_age   Always       -       33 (Min/Max 32/47)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       31
194 Temperature_Celsius     0x0022   033   049   000    Old_age   Always       -       33 (0 31 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       78 (25 237 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       31153822656
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       31256153965
 

saurav

The long SMART test didn't detect any errors:
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   066   044    Pre-fail  Always       -       174167238
  3 Spin_Up_Time            0x0003   098   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   074   060   045    Pre-fail  Always       -       23675071
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       89 (89 243 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       3
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   062   051   040    Old_age   Always       -       38 (Min/Max 32/47)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       41
194 Temperature_Celsius     0x0022   038   049   000    Old_age   Always       -       38 (0 31 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       89 (20 162 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       31153822656
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       31256153965

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        86         -
# 2  Extended offline    Completed without error       00%         7         -
# 3  Conveyance offline  Completed without error       00%         0         -
# 4  Short offline       Completed without error       00%         0         -

Now I'm really confused. badblocks definitely found some of the sectors to be unreliable on this disk, and yet SMART says the disk is fine. Which should I trust?
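One way to check whether the badblocks hits and the kernel messages concern the same part of the disk is to pull the LBA out of the ACB bytes in the kernel log and convert it to badblocks' 4096-byte block numbers. This sketch assumes FreeBSD CAM prints the 12 ACB bytes in the order command, features, lba_low, lba_mid, lba_high, device, lba_low_exp, lba_mid_exp, lba_high_exp, features_exp, sector_count, sector_count_exp, and that the drive uses 512-byte logical sectors (8 LBAs per `-b 4096` block). Under those assumptions the timed-out WRITE_DMA48 decodes to LBA 7386784768, i.e. block 923348096, which falls inside the 923347651 to 923348159 run that badblocks reported. `acb_decode` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: extract the 48-bit LBA and sector count from a FreeBSD CAM "ACB:" line.
# ASSUMED byte order: command features lba_low lba_mid lba_high device
#                     lba_low_exp lba_mid_exp lba_high_exp features_exp
#                     sector_count sector_count_exp
acb_decode() {
  local b=("$@")
  if (( ${#b[@]} != 12 )); then
    echo "expected 12 ACB bytes, got ${#b[@]}" >&2
    return 1
  fi
  local hi=$(( 16#${b[8]} << 40 | 16#${b[7]} << 32 | 16#${b[6]} << 24 ))
  local lo=$(( 16#${b[4]} << 16 | 16#${b[3]} << 8 | 16#${b[2]} ))
  local lba=$(( hi | lo ))
  local count=$(( 16#${b[11]} << 8 | 16#${b[10]} ))
  # Assumes 512-byte logical sectors: 8 LBAs per 4096-byte badblocks block.
  echo "lba=$lba sectors=$count block4k=$(( lba / 8 ))"
}

acb_decode 35 00 00 64 49 40 b8 01 00 00 00 01
# prints: lba=7386784768 sectors=256 block4k=923348096
```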
 

JaimieV

Guru · Joined Oct 12, 2012 · Messages: 742
My policy is not to trust SMART if it says all is well, but definitely trust it if it says things are bad.

On the other hand, with zeros for Reallocated_Sector_Ct, Current_Pending_Sector and Offline_Uncorrectable, you don't have much to RMA on.
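To keep an eye on those three raw counters between tests, a small filter over `smartctl -A` output could print just them, plus 199 UDMA_CRC_Error_Count (the counter tied to the interface CRC errors in the kernel log). This is a sketch: `smart_rma_attrs` is a hypothetical name, and it assumes the column layout shown in this thread, where RAW_VALUE is the 10th whitespace-separated field.

```bash
#!/usr/bin/env bash
# Sketch: print NAME=RAW_VALUE for attributes 5, 197, 198 and 199
# from smartctl -A style output on stdin.
# Assumes the standard layout above: ID NAME FLAG VALUE WORST THRESH
# TYPE UPDATED WHEN_FAILED RAW_VALUE (RAW_VALUE = field 10).
smart_rma_attrs() {
  awk '$1 ~ /^(5|197|198|199)$/ { print $2 "=" $10 }'
}

# Usage (device name as in this thread):
#   smartctl -A /dev/ada4 | smart_rma_attrs
```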
 

joeschmuck

Old Man · Moderator · Joined May 28, 2011 · Messages: 10,994
Don't use eSATA; use an internal SATA cable instead. It's possible your interface is causing the issue, but given the specific grouping of sectors, I'm not ruling out a real failure. Also, when you retest you can specify a range of sectors. I'd focus first on 923335393 to 923348159; there's no need to run the entire disk surface when you already have locations that failed. I have an example of how to specify the sector range in the Hard Drive Troubleshooting Guide (see link in signature).
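The guide's own example isn't reproduced in this thread, so here is only a sketch of the idea. badblocks takes an optional range as trailing arguments in the order `last_block first_block`, counted in `-b` units (4096 bytes here, as in the original run). The hypothetical helper below only prints the command (a dry run) so the numbers can be checked before anything touches the disk; note that `-w` is a destructive write test. It uses 923335392, the first block in the badblocks output above (printed fused with the progress text as `923335392one`).

```bash
#!/usr/bin/env bash
# Sketch: build (but do not run) a badblocks command limited to a block range.
# badblocks syntax: badblocks [options] device [last_block] [first_block]
# WARNING: -w overwrites the tested blocks; only use it on a disk with no data.
badblocks_range_cmd() {
  local dev=$1 first=$2 last=$3 bs=${4:-4096}
  if (( first > last )); then
    echo "first block ($first) is after last block ($last)" >&2
    return 1
  fi
  printf 'badblocks -b %s -ws %s %s %s\n' "$bs" "$dev" "$last" "$first"
}

# Range taken from the badblocks output earlier in this thread.
badblocks_range_cmd /dev/ada4 923335392 923348159
# prints: badblocks -b 4096 -ws /dev/ada4 923348159 923335392
```

Once the printed command looks right, it can be run with `sudo` as in the original post, ideally with the disk on an internal SATA port.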

If it still fails, I'm not sure how you would RMA the drive. If you're storing mission-critical data, don't use that drive in your system if it keeps failing.

Good Luck!
 

saurav

Thanks @joeschmuck and @JaimieV. As @JaimieV pointed out, unless a SMART test fails I can't RMA it. So I've put it in the box, where it will get regular SMART tests. If any of the long tests fail, I'll RMA it immediately. Thanks for your advice.
 