SMART service reports CRITICAL errors, but disks seem fine.

Marcet · Feb 10, 2016

Hi all,

I've been using several FreeNAS servers over the years without any major problems.
Running out of space on one of my servers, I've decided to build a new NAS.

Everything seems to work fine except the SMART service, which keeps reporting CRITICAL errors:

Code:

Device: /dev/ada2, Read SMART Self-Test Log Failed
Device: /dev/ada2, failed to read SMART Attribute Data
Device: /dev/ada3, Read SMART Self-Test Log Failed
Device: /dev/ada4, failed to read SMART Attribute Data
Device: /dev/ada5, Read SMART Error Log Failed
Device: /dev/ada5, Read SMART Self-Test Log Failed

Those errors seem to be communication errors between smartd and the disks that occur during intensive writing (I'm copying all of my old server's content to this NAS, and the process will take a very long time).
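To see at a glance which devices are affected, the alert lines above can be tallied per device. Here is a minimal Python sketch; the helper name `tally_smartd_alerts` is made up for illustration, and the pattern only assumes the `Device: /dev/adaN, message` shape shown in the alerts above.

```python
import re
from collections import defaultdict

# Matches alert lines shaped like: "Device: /dev/ada2, <message>".
ALERT_RE = re.compile(r"^Device:\s+(?P<dev>/dev/[^,\s]+),\s+(?P<msg>.+)$")

def tally_smartd_alerts(lines):
    """Group alert messages by device path (hypothetical helper)."""
    by_device = defaultdict(list)
    for line in lines:
        match = ALERT_RE.match(line.strip())
        if match:
            by_device[match.group("dev")].append(match.group("msg"))
    return dict(by_device)

# The six alert lines quoted in the post.
ALERTS = """\
Device: /dev/ada2, Read SMART Self-Test Log Failed
Device: /dev/ada2, failed to read SMART Attribute Data
Device: /dev/ada3, Read SMART Self-Test Log Failed
Device: /dev/ada4, failed to read SMART Attribute Data
Device: /dev/ada5, Read SMART Error Log Failed
Device: /dev/ada5, Read SMART Self-Test Log Failed
""".splitlines()

if __name__ == "__main__":
    for dev, msgs in sorted(tally_smartd_alerts(ALERTS).items()):
        print(dev, len(msgs), "; ".join(msgs))
```

All of the messages are "failed to read" style errors rather than failing attributes, which fits the idea of a communication problem rather than a disk fault.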

Here's the config of my NAS:

Code:

Chassis     : Lian Li PC-Q26
PSU         : Cooler Master G550M 80PLUS Bronze
MotherBoard : ASRock C2750D4I
Memory      : Kingston ValueRAM 32 GB (4 x 8 GB) DDR3 1600 MHz ECC CL11
Storage     : 10 x WD Red Desktop 6 TB SATA 6Gb/s
Cache       : Crucial BX200 SSD 2.5" 240 GB SATA III

The disks are configured in one RaidZ2 vdev with the SSD for L2ARC.

I haven't run a long SMART test on the reported disks, as it takes 12 hours on 6TB disks.
But I've run several short tests on each disk without major problems.

This is the first time I've used 6TB Red drives (my older NAS had 3TB ones). Is there anything special to do with these disks in FreeNAS?
Am I in trouble here?

Thanks in advance for your help.

Here is the output of smartctl -a for each disk:

/dev/ada2

Code:

SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       51
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       3
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       5
194 Temperature_Celsius     0x0022   122   119   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        47         -
# 2  Short offline       Completed without error       00%        44         -
# 3  Short offline       Completed without error       00%        26         -
# 4  Extended offline    Aborted by host               10%        26         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

/dev/ada3

Code:

SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       51
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       3
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       5
194 Temperature_Celsius     0x0022   122   119   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        47         -
# 2  Short offline       Completed without error       00%        44         -
# 3  Short offline       Completed without error       00%        26         -
# 4  Short offline       Aborted by host               80%        26         -
# 5  Extended offline    Interrupted (host reset)      10%        26         -
# 6  Short offline       Completed without error       00%         6         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

/dev/ada4

Code:

SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       51
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       3
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       5
194 Temperature_Celsius     0x0022   121   118   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        47         -
# 2  Short offline       Completed without error       00%        44         -
# 3  Short offline       Completed without error       00%        34         -
# 4  Short offline       Interrupted (host reset)      80%        33         -
# 5  Short offline       Completed without error       00%        26         -
# 6  Extended offline    Aborted by host               10%        26         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

/dev/ada5

Code:

SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       51
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       3
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       5
194 Temperature_Celsius     0x0022   122   118   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        47         -
# 2  Short offline       Completed without error       00%        44         -
# 3  Short offline       Completed without error       00%        26         -
# 4  Short offline       Aborted by host               90%        26         -
# 5  Extended offline    Interrupted (host reset)      10%        25         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
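Comparing four of these tables by eye is tedious, so here is a small Python sketch that parses the attribute rows and flags anything suspicious. The function names are made up for illustration. The checks encode two common conventions, stated here as assumptions: an attribute counts as failing when its normalized VALUE is at or below a non-zero THRESH, and the raw counters for IDs 5, 197, 198 and 199 are the ones people usually watch for non-zero values.

```python
def parse_smart_attributes(text):
    """Parse rows of the SMART attribute table into dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit():
            rows.append({
                "id": int(fields[0]),
                "name": fields[1],
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "raw": fields[9],
            })
    return rows

def failing_attributes(rows):
    """Names of attributes whose normalized value is at or below a non-zero threshold."""
    return [r["name"] for r in rows if r["thresh"] > 0 and r["value"] <= r["thresh"]]

# Assumption: these IDs are the commonly watched error counters.
WATCHED_IDS = {5, 197, 198, 199}

def nonzero_counters(rows):
    """Names of watched counters whose raw value is not zero."""
    return [r["name"] for r in rows if r["id"] in WATCHED_IDS and r["raw"] != "0"]

# A few rows copied from the /dev/ada2 output above.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   122   119   000    Old_age   Always       -       30
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    rows = parse_smart_attributes(SAMPLE)
    print("failing:", failing_attributes(rows))
    print("non-zero counters:", nonzero_counters(rows))
```

On the rows shown, both lists come back empty, which matches the PASSED self-assessment.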

Best regards
--
Marc

Marcet · Feb 10, 2016

Oops. I forgot to specify the FreeNAS build: FreeNAS-9.3-STABLE-201602031011

Bidule0hm · Feb 10, 2016

Thanks for following the forum rules, it's rare these days so a thanks is necessary I think :)

I don't see anything wrong apart from the interrupted long test.

NB: short tests are great for frequent/fast checks but not for debugging/real checks, they can miss a lot of things a long test will see.

Have the drives been burned in?

These errors don't say the drives are failing but that the system is failing at reading SMART attributes/logs. It's a problem, but you're not the first one to have those errors. I recommend searching the forum to see if there's an answer (I can't recall the answer, or even if there is one, sorry) ;)

ChriZ · Feb 10, 2016

Could the failures be because of the Marvell ports?
Many advise against using those ports.
Is that disk perhaps connected to one of those?
I guess the only thing you can do is to change the SATA port and see if the error follows the drive.
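The logic of the swap test can be sketched in a few lines of Python. This is only an illustration of the reasoning: the function name and the drive and port labels are made up, and in practice you would fill in the pairs from whatever alerts show up after the swap.

```python
def classify_after_swap(suspect_drive, suspect_port, errors_after_swap):
    """Say whether errors followed the drive or stayed on the port.

    errors_after_swap is an iterable of (drive_id, port_id) pairs that
    reported errors once the suspect drive was moved to a different port.
    """
    errors = list(errors_after_swap)
    follows_drive = any(drive == suspect_drive for drive, _ in errors)
    stays_on_port = any(port == suspect_port for _, port in errors)
    if follows_drive and not stays_on_port:
        return "drive"
    if stays_on_port and not follows_drive:
        return "port or controller"
    # No errors at all, or errors on both: the swap did not settle it.
    return "inconclusive"

if __name__ == "__main__":
    # Hypothetical labels: disk-A was on port-7 and has been moved to port-2,
    # while disk-B took its place on port-7.
    print(classify_after_swap("disk-A", "port-7", [("disk-B", "port-7")]))
    print(classify_after_swap("disk-A", "port-7", [("disk-A", "port-2")]))
```

If the errors stay on the original port with a different drive attached, the port or its controller is the likely culprit; if they move with the drive, look at the drive or its cable.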

Marcet · Feb 10, 2016

Bidule0hm said:
Thanks for following the forum rules, it's rare these days so a thanks is necessary I think :)

I know what it's like to moderate a forum, so... ;)

I don't see anything wrong apart from the interrupted long test.

The interrupted long test is due to a reboot. So not an issue at all.

NB: short tests are great for frequent/fast checks but not for debugging/real checks, they can miss a lot of things a long test will see.

I'm planning to schedule long tests, but since I'm populating the drives right now, I don't want to slow down the process too much.

Have the drives been burned in?

Do you mean: did I test the drives before use? The answer is no (I know I might be wrong).
What would you recommend for future NAS builds?

These errors don't say the drives are failing but that the system is failing at reading SMART attributes/logs.

I know, that's why I'm not in panic mode right now :D

It's a problem, but you're not the first one to have those errors. I recommend searching the forum to see if there's an answer (I can't recall the answer, or even if there is one, sorry) ;)

I did search the forum before posting. I'll dig deeper.
But if you recall something, please let me know.

Thanks a lot for your answer, by the way.

Marcet · Feb 10, 2016

ChriZ said:
Could the failures be because of the Marvell ports?
Many advise against using those ports.

It's hard to identify (even when reading the manual) which disks are connected to the Marvell ports.
But thanks for pointing that out. I hadn't thought about it.

Is that disk perhaps connected to one of those?
I guess the only thing you can do is to change the SATA port and see if the error follows the drive.

Good advice. But that raises two questions:

1) Can I switch ports without problems? Will FreeNAS recognize the zpool even with mixed SATA ports?

2) If I buy a SATA controller to replace the Marvell one, what would you recommend?

Bidule0hm · Feb 10, 2016

NB: I'm not a moderator, just a very active member :) I'm an admin on another forum though, so I know what it's like.

Marcet said:
Do you mean: did I test the drives before use? The answer is no (I know I might be wrong).
What would you recommend for future NAS builds?

Yes. That's a bad thing, drives can suffer from infant mortality. Follow what is said in this thread: https://forums.freenas.org/index.php?threads/building-burn-in-and-testing-your-freenas-system.17750/ ;)

Marcet said:
I did search the forum before posting. I'll dig deeper.

Yeah, the search feature on this forum is easily in the top 3 of the worst forum search features I have seen... Maybe use Google or another search engine and add FreeNAS to your keywords.

Marcet said:
1) Can I switch ports without problems? Will FreeNAS recognize the zpool even with mixed SATA ports?

Yes, the system doesn't care about that.

Marcet said:
2) If I buy a SATA controller to replace the Marvell one, what would you recommend?

Please read the hardware recommendation sticky (link is in my signature).

Marcet · Feb 10, 2016

Here are some updates...

I took a close look at the manual, using a magnifier as I'm past 40 ;) And I was able to identify the ports.
All the supposedly faulty disks are connected to a Marvell controller. Moreover, they are all on the same 4-port SE9230.

I've found a link on ASRock Rack website pointing to a firmware update, but the link is dead :(

Marvell SE9230: 4 x SATA3 6.0 Gb/s, support RAID 0, 1, 10 (To update Marvell SE 9230 FW, please click here)

The most important thing is that now I know. Thanks, guys, for that.

I'll keep digging until I overcome this issue.
I'll keep you informed.

Marcet · Feb 10, 2016

@Bidule0hm By the way, I hadn't noticed... that you were French ;)

Bidule0hm · Feb 10, 2016

Marcet said:
All the supposedly faulty disks are connected to a Marvell controller. Moreover, they are all on the same 4-port SE9230.

Yeah... Marvell controllers are like Realtek NICs: crap.

Marcet said:
I hadn't noticed... that you were French

Yep :)

ChriZ · Feb 10, 2016

Cool... Seems like you found the root cause.
Perhaps the controller was causing some kind of timeouts, hence the errors.
And the thing to really consider here is that maybe it would not have happened if the system was not under heavy load. You were going to have a system that was perhaps not safe for your data and you wouldn't even know about it, until.....

Marcet · Feb 10, 2016

ChriZ said:
Cool... Seems like you found the root cause.

Yes, you were right about it. Thanks for pointing it out.

Perhaps the controller was causing some kind of timeouts, hence the errors.

Right.

And the thing to really consider here is that maybe it would not have happened if the system was not under heavy load.

Probably.

You were going to have a system that was perhaps not safe for your data and you wouldn't even know about it, until.....

Overall, I do not think my data are in any danger. At least I strongly hope.
I will have to keep an eye on those disks but I think I can manage.
The Marvell controller's misbehavior seems to be limited to reading SMART data.

Now I see three potential solutions:

1) I find a firmware update for the motherboard's Marvell controller.
2) Some future FreeNAS software update solves the problem.
3) I find the right 4-port SATA PCI-E card.

JDCynical · Feb 10, 2016

Marcet said:
I've found a link on ASRock Rack website pointing to a firmware update, but the link is dead :(

Behold the power of the Internet! ;)

Marvell 9230 FW update Procedure
https://web.archive.org/web/20150927015706/ (which is a cached copy of http://www.asrockrack.com/support/ipmi.asp)

Using that info, I was able to dig around on the asrock web site and find the page under the 'download' option on the main board page, which has a link under the 'how to update' column and another freaking javascript thing to view the Marvell info:
http://www.asrockrack.com/support/faq.asp#InstantFlash

Marcet · Feb 10, 2016

Justin The Cynical said:
Behold the power of the Internet! ;)

Marvell 9230 FW update Procedure
https://web.archive.org/web/20150927015706/ (which is a cached copy of http://www.asrockrack.com/support/ipmi.asp)

Using that info, I was able to dig around on the asrock web site and find the page under the 'download' option on the main board page, which has a link under the 'how to update' column and another freaking javascript thing to view the Marvell info:
http://www.asrockrack.com/support/faq.asp#InstantFlash

Thanks. I will check that as soon as I can shut down the NAS.

For the record, I've also found these related threads:
https://forums.freenas.org/index.php?threads/enabling-s-m-a-r-t-and-viewing-test-results.26734/
https://forums.freenas.org/index.ph...t-getting-ahcichx-timeout-on-xx-port-0.26468/

I've also found some additional clues in this morning's log mail:

Code:

> ahcich2: Timeout on slot 19 port 0
> ahcich2: is 00000000 cs 00080000 ss 00000000 rs 00080000 tfd 50 serr 00000000 cmd 10009317
> ahcich4: Timeout on slot 24 port 0
> ahcich4: is 00000000 cs 01000000 ss 00000000 rs 01000000 tfd 40 serr 00000000 cmd 10009817
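For anyone collecting these from the daily mails, here is a minimal Python sketch that pairs each timeout line with its register dump and decodes the `cs` field. Treating `cs` as a bitmask of outstanding command slots is an assumption on my part, though it is consistent with both entries above (bit 19 and bit 24 match the reported slots). The function names are made up for illustration.

```python
import re

TIMEOUT_RE = re.compile(r"(?P<chan>ahcich\d+): Timeout on slot (?P<slot>\d+) port (?P<port>\d+)")
STATE_RE = re.compile(r"(?P<chan>ahcich\d+): is [0-9a-f]+ cs (?P<cs>[0-9a-f]+) ")

def set_bits(mask):
    """Return the positions of the bits set in a 32-bit mask."""
    return [bit for bit in range(32) if (mask >> bit) & 1]

def parse_timeouts(lines):
    """Pair each 'Timeout on slot' line with the following register dump."""
    events = []
    pending = {}  # channel name -> event still waiting for its register dump
    for line in lines:
        line = line.lstrip("> ").strip()  # drop the mail quoting prefix
        timeout = TIMEOUT_RE.match(line)
        if timeout:
            event = {"channel": timeout.group("chan"), "slot": int(timeout.group("slot"))}
            events.append(event)
            pending[event["channel"]] = event
            continue
        state = STATE_RE.match(line)
        if state and state.group("chan") in pending:
            # Assumption: cs is a bitmask of command slots still outstanding.
            pending.pop(state.group("chan"))["cs_slots"] = set_bits(int(state.group("cs"), 16))
    return events

# The four log lines quoted above.
LOG = """\
> ahcich2: Timeout on slot 19 port 0
> ahcich2: is 00000000 cs 00080000 ss 00000000 rs 00080000 tfd 50 serr 00000000 cmd 10009317
> ahcich4: Timeout on slot 24 port 0
> ahcich4: is 00000000 cs 01000000 ss 00000000 rs 01000000 tfd 40 serr 00000000 cmd 10009817
""".splitlines()

if __name__ == "__main__":
    for event in parse_timeouts(LOG):
        print(event)
```

Counting events per `ahcich` channel over a few days should show whether the timeouts are confined to the channels behind the Marvell controller.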

ChriZ · Feb 11, 2016

Marcet said:
Overall, I do not think my data are in any danger. At least I strongly hope.
I will have to keep an eye on those disks but I think I can manage.

The thing with not-so-trusted hardware is that you can't know where it will impact the system.
Is anyone giving you a guarantee that this is the only problem you are going to have using this configuration? Nope...
Perhaps that's the only issue you will ever encounter with the Marvell ports (I sure hope so), perhaps not...
Long story short, when I bump into issues like that, I am always hesitant to keep using the hardware that is causing these issues. Especially when I need to trust my data with it.
(But that's just me..)

Marcet · Feb 11, 2016

For better illustration, I've marked up the manual to make it clear which port is where.

Marcet · Feb 11, 2016

ChriZ said:
The thing with not so trusted hardware is that you can't know where it will impact the system.
Is anyone giving you a guarantee that this is the only problem you are going to have using this configuration? Nope...
Perhaps that's the only issue you will ever encounter with the Marvell ports (I sure hope so), perhaps not...
Long story short, when I bump into issues like that, I am always hesitant to keep using it. (But that's just me..)

You're right. I have to find a durable solution.
For now, the NAS is still receiving data. As soon as I can stop it, I'll give the FW update a try.

But if I understand you correctly, I should not trust Marvell anymore and should get a trustworthy 6-port SATA controller?

danb35 · Feb 11, 2016

Marcet said:
But if I understand you correctly, I should not trust Marvell anymore and should get a trustworthy 6-port SATA controller?

How many drives do you have? The board layout you posted shows six Intel-driven SATA ports, which should be perfectly safe. If you have more drives than that, then yes, you'd be best getting an HBA for those. The go-to recommendation around here is the LSI 9211-8i or one of its variants.

Marcet · Feb 11, 2016

danb35 said:
How many drives do you have? The board layout you posted shows six Intel-driven SATA ports, which should be perfectly safe. If you have more drives than that, then yes, you'd be best getting an HBA for those. The go-to recommendation around here is the LSI 9211-8i or one of its variants.

I've got 10x 6TB WD RED + 1x 256GB SSD for L2ARC.

I confirm the 6x Intel SATA ports are working well.
The 2x Marvell 9172 SATA ports are also working well.

I'm starting to consider getting one of these LSI boards.
Do they all work with 6TB drives?

danb35 · Feb 11, 2016

Marcet said:
I've got 10x 6TB WD RED + 1x 256GB SSD for L2ARC.

I'm sorry, you did mention that in your OP--I missed it. I'll note that it's unlikely your L2ARC is doing any good with only 32 GB of RAM, but that's a separate issue.

I'm using my LSI card with 4 TB disks, and it's working just fine, but I don't have any 6 TB disks. I don't think there's any difference there (there was something around the 2 TB mark with older controllers), but I still can't personally validate 6 TB disks for you.
