Busy drive slows down network performance

michen

Dabbler
Joined
Jan 17, 2019
Messages
10
Hello everyone,

I've come across the following problem: one of the drives in my ZFS pool intermittently shows much higher activity than the rest. I took a screenshot of the condition:

FreeNAS_busy.PNG


As you can see, ada1 seems to be at maximum load, while the others (ada0, [ada2, .., ada12]) are all at roughly the same level. As a consequence, network throughput drops to about a tenth of normal during file transfers. When I reboot FreeNAS, the problem is gone at first, but then it reoccurs at some point during a later network file transfer.

The FreeNAS version I use is FreeNAS-11.1-U6.3, with a total capacity of ~20 TB. It's running on an Intel Q9650 with 8 GB of RAM, and it is connected to a Gigabit LAN.

smartctl -a /dev/ada1 reports:

smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD30EFRX-68EUZN0
Serial Number: WD-WCC4N4ALNCTU
LU WWN Device Id: 5 0014ee 261d0f6c7
Firmware Version: 82.00A82
User Capacity: 3.000.592.982.016 bytes [3,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Thu Jan 17 13:36:38 2019 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 241) Self-test routine in progress...
10% of test remaining.
Total time to complete Offline
data collection: (39720) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 399) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x703d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 179 179 021 Pre-fail Always - 6008
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 22
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 064 064 000 Old_age Always - 26577
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 22
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 12
193 Load_Cycle_Count 0x0032 193 193 000 Old_age Always - 23093
194 Temperature_Celsius 0x0022 123 115 000 Old_age Always - 27
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 26574 -
# 2 Short offline Interrupted (host reset) 90% 26573 -
# 3 Short offline Completed without error 00% 26467 -
# 4 Short offline Completed without error 00% 26466 -
# 5 Short offline Completed without error 00% 26465 -
# 6 Short offline Completed without error 00% 26464 -
# 7 Short offline Completed without error 00% 26463 -
# 8 Short offline Completed without error 00% 26462 -
# 9 Short offline Completed without error 00% 26461 -
#10 Short offline Completed without error 00% 26460 -
#11 Short offline Completed without error 00% 26459 -
#12 Short offline Completed without error 00% 26458 -
#13 Short offline Completed without error 00% 26457 -
#14 Short offline Completed without error 00% 26456 -
#15 Short offline Completed without error 00% 26455 -
#16 Short offline Completed without error 00% 26454 -
#17 Short offline Completed without error 00% 26453 -
#18 Short offline Completed without error 00% 26452 -
#19 Short offline Completed without error 00% 26451 -
#20 Short offline Completed without error 00% 26450 -
#21 Short offline Completed without error 00% 26449 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
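Side note: the handful of SMART attributes that usually betray a dying disk can be pulled out of that wall of output with an awk one-liner. A sketch, run against a few sample lines copied from the report above (on a live system you would pipe `smartctl -a /dev/ada1` into the same awk instead of the heredoc); all four are zero here, so SMART alone gives no obvious culprit:

```shell
# Pull the usual failure indicators out of a smartctl report.
# Sample lines are copied from the output above; on a live system,
# pipe `smartctl -a /dev/ada1` into awk instead of using the heredoc.
awk '/Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count/ {print $2 " = " $NF}' <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
EOF
```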




Any idea why this is happening?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
I have seen this before. That's a drive that is having trouble mechanically. If it is still in warranty, you should have it replaced.
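One way to see that kind of trouble from the console is to compare per-disk load and latency while a transfer is running. A sketch of the commands I would reach for (printed via a heredoc here rather than executed, since they need the live box):

```shell
# FreeBSD/FreeNAS commands for comparing per-disk load and latency
# (printed as a sketch; run them directly on the console):
cat <<'EOF'
gstat -p            # live per-disk %busy and per-op latency (ms/r, ms/w)
iostat -x -w 5      # extended device statistics in 5-second windows
zpool iostat -v 5   # per-disk throughput as ZFS sees it, broken out by vdev
EOF
```

A drive that is failing mechanically typically shows per-op latencies far above its siblings while sitting near 100 % busy, which is exactly the pattern in the screenshot.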
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
What is your smart test schedule? It seems very frequent.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
PS: Please use code tags (written without the spaces) to enclose listings like this: [ code ] [ /code ]

Code:
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N4ALNCTU
LU WWN Device Id: 5 0014ee 261d0f6c7
Firmware Version: 82.00A82
User Capacity:    3.000.592.982.016 bytes [3,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Jan 17 13:36:38 2019 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 241)    Self-test routine in progress...
                    10% of test remaining.
Total time to complete Offline
data collection:         (39720) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 399) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   179   179   021    Pre-fail  Always       -       6008
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       22
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   064   064   000    Old_age   Always       -       26577
10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       22
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       12
193 Load_Cycle_Count        0x0032   193   193   000    Old_age   Always       -       23093
194 Temperature_Celsius     0x0022   123   115   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     26574         -
# 2  Short offline       Interrupted (host reset)      90%     26573         -
# 3  Short offline       Completed without error       00%     26467         -
# 4  Short offline       Completed without error       00%     26466         -
# 5  Short offline       Completed without error       00%     26465         -
# 6  Short offline       Completed without error       00%     26464         -
# 7  Short offline       Completed without error       00%     26463         -
# 8  Short offline       Completed without error       00%     26462         -
# 9  Short offline       Completed without error       00%     26461         -
#10  Short offline       Completed without error       00%     26460         -
#11  Short offline       Completed without error       00%     26459         -
#12  Short offline       Completed without error       00%     26458         -
#13  Short offline       Completed without error       00%     26457         -
#14  Short offline       Completed without error       00%     26456         -
#15  Short offline       Completed without error       00%     26455         -
#16  Short offline       Completed without error       00%     26454         -
#17  Short offline       Completed without error       00%     26453         -
#18  Short offline       Completed without error       00%     26452         -
#19  Short offline       Completed without error       00%     26451         -
#20  Short offline       Completed without error       00%     26450         -
#21  Short offline       Completed without error       00%     26449         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


It makes it easier to read. Thanks
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
What is your smart test schedule? It seems very frequent.
Yes, that is a test every hour.
@michen , that is probably way too often. Once a day should be fine; that's what I do, plus a long test once a week.
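For illustration, that cadence expressed as a smartd.conf test schedule (FreeNAS sets this up through the GUI under Tasks → S.M.A.R.T. Tests, so treat this as a sketch of the same idea rather than something to paste in):

```shell
# smartd.conf sketch (T/MM/DD/d/HH regex syntax):
# short self-test daily at 03:00, long self-test every Saturday at 04:00
/dev/ada1 -a -s (S/../.././03|L/../../6/04)
```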
 

michen

Dabbler
Joined
Jan 17, 2019
Messages
10
I exchanged the presumably faulty drive for a new one of the same model; I happened to have another two in stock. FreeNAS is now in the process of resilvering. I will report back as soon as it has finished.
 

michen

Dabbler
Joined
Jan 17, 2019
Messages
10
Update, approx. 24 h later: the resilvering is taking longer than expected. I take it that ada1 being written to while all the others are being read from means it hasn't finished yet...

FreeNAS_resilver.PNG


Anyway, in the meantime I had to reboot because the server became unresponsive from the outside (web interface, SSH, SMB). I took a photo of the last thing I saw on the monitor before issuing a reboot from the console.


20190118_150946.jpg


Will keep you posted.
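(Aside: resilver progress, rate, and the estimated time remaining can be read without the GUI. A sketch, with the command printed rather than run since it needs the live pool; `data` is this pool's name:)

```shell
# Print the scan section of zpool status, which carries resilver
# progress, rate, and estimated time to go (sketch; run on the server):
cat <<'EOF'
zpool status data | sed -n '/scan:/,/config:/p'
EOF
```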
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Do you have jails or VMs running on this machine? It looks like you are swapping out memory.
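A quick way to check (sketch; the commands are printed via heredoc here, since they need the live system):

```shell
# FreeBSD commands to confirm memory pressure / swapping
cat <<'EOF'
swapinfo -h                 # swap devices and how much of each is in use
top -b -o res | head -15    # one-shot top, sorted by resident memory
vmstat 5                    # watch the pi/po (page-in/page-out) columns
EOF
```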
 

michen

Dabbler
Joined
Jan 17, 2019
Messages
10
Monday morning update: the resilvering finished on Friday evening. Drive activity seems back to normal, except that the alert is still flashing red: "Critical status. Pool is in a degraded state [...]".

I suppose that after putting the new drive in, I made a mistake adding it to the pool. It didn't happen automatically, so I turned to the volume manager and presumably chose the wrong option. Anyway, here's the output of zpool status:

Code:
root@aoslonas:~ # zpool status
  pool: data
state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 1,61T in 1 days 05:28:00 with 0 errors on Fri Jan 18 19:53:35 2019
config:

        NAME                                              STATE     READ WRITE CKSUM
        data                                              DEGRADED     0     0     0
          raidz3-0                                        DEGRADED     0     0     0
            gptid/78a8d143-8ed0-11e6-9f4d-001e8c745c80    ONLINE       0     0     0
            spare-1                                       DEGRADED     0     0     0
              9513749733849535178                         UNAVAIL      0     0     0  was /dev/gptid/79d4b769-8ed0-11e6-9f4d-001e8c745c80
              gptid/4dc08a0d-1a5b-11e9-83d6-001e8c745c80  ONLINE       0     0     0
            gptid/7b038b10-8ed0-11e6-9f4d-001e8c745c80    ONLINE       0     0     0
            gptid/7be32282-8ed0-11e6-9f4d-001e8c745c80    ONLINE       0     0     0
            gptid/7d0b8edf-8ed0-11e6-9f4d-001e8c745c80    ONLINE       0     0     0
            gptid/7e356607-8ed0-11e6-9f4d-001e8c745c80    ONLINE       0     0     0
            gptid/7f437239-8ed0-11e6-9f4d-001e8c745c80    ONLINE       0     0     0
            gptid/804e146d-8ed0-11e6-9f4d-001e8c745c80    ONLINE       0     0     0
            gptid/8159a565-8ed0-11e6-9f4d-001e8c745c80    ONLINE       0     0     0
            gptid/82613ec8-8ed0-11e6-9f4d-001e8c745c80    ONLINE       0     0     0
            gptid/836b0c26-8ed0-11e6-9f4d-001e8c745c80    ONLINE       0     0     0
        spares
          15578953262426563672                            INUSE     was /dev/gptid/4dc08a0d-1a5b-11e9-83d6-001e8c745c80

errors: No known data errors

  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0 days 00:02:16 with 0 errors on Fri Jan 18 03:47:16 2019
config:

        NAME                                          STATE     READ WRITE CKSUM
        freenas-boot                                  ONLINE       0     0     0
          gptid/c8f1958d-89ed-11e5-9527-0015609d8e50  ONLINE       0     0     0

errors: No known data errors


What can I do to make the drive listed under "spares" a regular part of the degraded pool?
 
Last edited:

michen

Dabbler
Joined
Jan 17, 2019
Messages
10

Helped myself by detaching the spare INUSE. Everything's back to normal now.
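For anyone who finds this thread later, the command-line equivalent is a `zpool detach`, and which device you detach matters: detaching the stale UNAVAIL guid promotes the in-use spare to a permanent pool member, while detaching the spare itself returns it to the spares list and leaves the slot missing again. A sketch (the command is echoed rather than executed, and the guid is the UNAVAIL entry from the zpool status output above):

```shell
# Sketch only: echo the command instead of running it against a live pool.
POOL=data
STALE_GUID=9513749733849535178   # the UNAVAIL vdev under spare-1 above
echo zpool detach "$POOL" "$STALE_GUID"
```

Running the echoed line on the server (without the `echo`) is what completes the replacement.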

Thanks to @Chris Moore and @SweetAndLow for their input.

Cheers!
michen
 