Scrub cause WebGUI /SSH hang Normal?

DaveY · Aug 31, 2015

One of our production FN iSCSI servers was inaccessible this morning. Both WebGUI and SSH hang during login attempts. The WebGUI doesn't even reach the login screen. At first I thought the whole box was hung up, but tried one last effort of just doing a ssh root@host 'zpool status -v' and that worked fine. Turns out the system is still finishing up a scrub from the weekend.

We have a second iSCSI server on identical hardware and doesn't seem to have issues. I know disk scrubs affect system performance, but should it hang up the GUI and ssh logins? Anyone else experience similar behavior on their box? A little worried it may be something more serious.

Hardware
DELL Precision R5500
Dual Intel E5645 @ 2.40GHz
32GB RAM
PERC 6i with 4 x Seagate 2TB drives in RAIDZ-1

DifferentStrokes · Aug 31, 2015

Do you run regular SMART test? If so, anything in the logs look out of place?

DaveY · Aug 31, 2015

DifferentStrokes said:
Do you run regular SMART test? If so, anything in the logs look out of place?

Yes, they run twice a month. No errors on any of the drives.

cyberjock · Sep 1, 2015

Well, Perc 6i really scares me. That's a RAID card, and if that thing has problems (and you won't know you are having problems since FreeNAS isn't designed to work with RAID cards) that can hang up the zpool, then it'll hang your .system dataset on the zpool, and that *will* cause all of the behavior you are experiencing.

In any case, this sounds like some kind of storage subsystem issue, but you'll have to figure out what the issue is exactly via trial and error.

DaveY · Sep 2, 2015

@cyberjock you might be on to something.

Was finally able to log in after the scrub finished. Checked the SMART data on all my drives and noticed the load_cycle_count is quite high for the age of the drive. SMART tests still came back with no warnings or errors.

Is there an 8 second bug similar to the WD Reds on these 2.5" Samsung SpinPoints and is there a way to fix it? Weird thing is I have another identical R5500 with same hardware, controller, and drives, but none of the drives have such a high cycle count (114609 vs 14228)

Troubling...

SMART data from server that locked up during SCRUB:

Code:

=== START OF INFORMATION SECTION ===
Model Family:  Seagate Samsung SpinPoint M9T
Device Model:  ST2000LM003 HN-M201RAD
Serial Number:  ---
LU WWN Device Id: ---
Firmware Version: 2BC10006
User Capacity:  2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:  512 bytes logical, 4096 bytes physical
Rotation Rate:  5400 rpm
Form Factor:  2.5 inches
Device is:  In smartctl database [for details use: -P show]
ATA Version is:  ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:  Wed Sep  2 16:19:05 2015 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)   Offline data collection activity
           was never started.
           Auto Offline Data Collection: Disabled.
Self-test execution status:  (  0)   The previous self-test routine completed
           without error or no self-test has ever
           been run.
Total time to complete Offline
data collection:      (22200) seconds.
Offline data collection
capabilities:         (0x5b) SMART execute Offline immediate.
           Auto Offline data collection on/off support.
           Suspend Offline collection upon new
           command.
           Offline surface scan supported.
           Self-test supported.
           No Conveyance Self-test supported.
           Selective Self-test supported.
SMART capabilities:  (0x0003)   Saves SMART data before entering
           power-saving mode.
           Supports SMART auto save timer.
Error logging capability:  (0x01)   Error logging supported.
           General Purpose Logging supported.
Short self-test routine
recommended polling time:     (  1) minutes.
Extended self-test routine
recommended polling time:     ( 370) minutes.
SCT capabilities:     (0x003f)   SCT Status supported.
           SCT Error Recovery Control supported.
           SCT Feature Control supported.
           SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x002f  100  100  051  Pre-fail  Always  -  1
  2 Throughput_Performance  0x0026  252  252  000  Old_age  Always  -  0
  3 Spin_Up_Time  0x0023  088  088  025  Pre-fail  Always  -  3830
  4 Start_Stop_Count  0x0032  100  100  000  Old_age  Always  -  20
  5 Reallocated_Sector_Ct  0x0033  252  252  010  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x002e  252  252  051  Old_age  Always  -  0
  8 Seek_Time_Performance  0x0024  252  252  015  Old_age  Offline  -  0
  9 Power_On_Hours  0x0032  100  100  000  Old_age  Always  -  5036
10 Spin_Retry_Count  0x0032  252  252  051  Old_age  Always  -  0
12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  21
191 G-Sense_Error_Rate  0x0022  100  100  000  Old_age  Always  -  52
192 Power-Off_Retract_Count 0x0022  252  252  000  Old_age  Always  -  0
194 Temperature_Celsius  0x0002  064  064  000  Old_age  Always  -  24 (Min/Max 19/31)
195 Hardware_ECC_Recovered  0x003a  100  100  000  Old_age  Always  -  0
196 Reallocated_Event_Count 0x0032  252  252  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  252  252  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  252  252  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x0036  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x002a  100  100  000  Old_age  Always  -  8
223 Load_Retry_Count  0x0032  100  100  000  Old_age  Always  -  7
225 Load_Cycle_Count  0x0032  089  089  000  Old_age  Always  -  114609

...

SMART data from an identical server, controller, and drives

Code:

=== START OF INFORMATION SECTION ===
Model Family:  Seagate Samsung SpinPoint M9T
Device Model:  ST2000LM003 HN-M201RAD
Serial Number:  ---
LU WWN Device Id: ---
Firmware Version: 2BC10006
User Capacity:  2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:  512 bytes logical, 4096 bytes physical
Rotation Rate:  5400 rpm
Form Factor:  2.5 inches
Device is:  In smartctl database [for details use: -P show]
ATA Version is:  ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:  Wed Sep  2 16:32:03 2015 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)   Offline data collection activity
           was never started.
           Auto Offline Data Collection: Disabled.
Self-test execution status:  (  0)   The previous self-test routine completed
           without error or no self-test has ever
           been run.
Total time to complete Offline
data collection:      (22080) seconds.
Offline data collection
capabilities:         (0x5b) SMART execute Offline immediate.
           Auto Offline data collection on/off support.
           Suspend Offline collection upon new
           command.
           Offline surface scan supported.
           Self-test supported.
           No Conveyance Self-test supported.
           Selective Self-test supported.
SMART capabilities:  (0x0003)   Saves SMART data before entering
           power-saving mode.
           Supports SMART auto save timer.
Error logging capability:  (0x01)   Error logging supported.
           General Purpose Logging supported.
Short self-test routine
recommended polling time:     (  1) minutes.
Extended self-test routine
recommended polling time:     ( 368) minutes.
SCT capabilities:     (0x003f)   SCT Status supported.
           SCT Error Recovery Control supported.
           SCT Feature Control supported.
           SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x002f  100  100  051  Pre-fail  Always  -  0
  2 Throughput_Performance  0x0026  054  054  000  Old_age  Always  -  20949
  3 Spin_Up_Time  0x0023  085  085  025  Pre-fail  Always  -  4604
  4 Start_Stop_Count  0x0032  100  100  000  Old_age  Always  -  15
  5 Reallocated_Sector_Ct  0x0033  252  252  010  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x002e  252  252  051  Old_age  Always  -  0
  8 Seek_Time_Performance  0x0024  252  252  015  Old_age  Offline  -  0
  9 Power_On_Hours  0x0032  100  100  000  Old_age  Always  -  1632
 10 Spin_Retry_Count  0x0032  252  252  051  Old_age  Always  -  0
 12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  15
191 G-Sense_Error_Rate  0x0022  252  252  000  Old_age  Always  -  0
192 Power-Off_Retract_Count 0x0022  252  252  000  Old_age  Always  -  0
194 Temperature_Celsius  0x0002  064  064  000  Old_age  Always  -  24 (Min/Max 19/29)
195 Hardware_ECC_Recovered  0x003a  100  100  000  Old_age  Always  -  0
196 Reallocated_Event_Count 0x0032  252  252  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  252  252  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  252  252  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x0036  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x002a  100  100  000  Old_age  Always  -  5
223 Load_Retry_Count  0x0032  100  100  000  Old_age  Always  -  5
225 Load_Cycle_Count  0x0032  099  099  000  Old_age  Always  -  14228

...

Robert Trevellyan · Sep 2, 2015

DaveY said:
I have another identical R5500

The first drive has about 23 load cycles per hour. The second has about 9 per hour. Both seem too high to me. Do you have anything other than Always On in the Power Management settings for those drives?

Robert Trevellyan · Sep 2, 2015

Just checked my drives and they have less than 0.02 load cycles per hour.

cyberjock · Sep 3, 2015

Even if we assumed it was the drives with an 8 second timer similar to WD's "Intellipark" it would not result in the halting of the zpool as I described. Likewise, you said you were doing a scrub when the system went down. There's no way that more than 1-2 seconds wound have elapsed before the scrub I/O would have kicked in. Of course, your RAID controller throws many of the timing values out of wack because first ZFS tries to manage I/O and then your RAID controller does, so you've double managed your disks, which is bad joo-joo for reasons you probably wouldn't understand very well if I explained it anyway.

Long story short.. you're still stuck with ditching that RAID controller if you want to truly remove the most likely culprit. We keep telling people not to use them. They keep ignoring us, and so they end up with odd problems, lost data, damaged zpools, etc and there's no way to point the finger firmly at the RAID controller.

Robert Trevellyan · Sep 3, 2015

Never mind, I was thinking of a different controller.

Important Announcement for the TrueNAS Community.

Scrub cause WebGUI /SSH hang Normal?

DaveY

Contributor

DifferentStrokes

Patron

DaveY

Contributor

cyberjock

Inactive Account

DaveY

Contributor

Robert Trevellyan

Pony Wrangler

Robert Trevellyan

Pony Wrangler

cyberjock

Inactive Account

Robert Trevellyan

Pony Wrangler

Similar threads

Important Announcement for the TrueNAS Community.

Scrub cause WebGUI /SSH hang Normal?

Contributor

Patron

Contributor

Inactive Account

Contributor

Pony Wrangler

Pony Wrangler

Inactive Account

Pony Wrangler

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Scrub cause WebGUI /SSH hang Normal?"

Similar threads