Disk error - gui says "an error occurred"

Osiris · Apr 10, 2015

Hi lads.
Could I have some advice please.

Hardware involved (also in my sig) :
Nas5: MB: ASRock C2750D4I mini ITX 2 x Gigabit LAN 12 x SATA - CPU: (onboard) Intel Atom C2750 Octa Core - RAM: 16GB 1333 4 x 4GB Corsair Vengeance DDR3 - Gfx: (onboard) ASPEED AST2300 - HD: 1 x 1.5 TB, 4 x 2TB WD Red, 3 x 3TB WD Red SATA-600 - Boot: Crucial M500 Solid state drive 120 GB SATA-600 - Pools: 1 x JBoD (on 1 x 1.5TB) : 1.3 TB available, 1 x ZFS1 (on 4 x 2TB) : 5.2 TB available, 1 x ZFS1 (on 3 x 3TB) : 5.3 TB available - PSU: LC Power LC200SFX V3.21 SFX12V 230 Volt AC V 200 Watt - Case: SilverStone DS380

Code:

[root@ra] ~# uname -a
FreeBSD ra.local 9.3-RELEASE-p12 FreeBSD 9.3-RELEASE-p12 #2 r275790+e006356: Thu Mar 26 17:28:49 PDT 2015  root@build3.ixsystems.com:/tank/home/jkh/build/FN/objs/os-base/amd64/tank/home/jkh/build/FN/FreeBSD/src/sys/FREENAS.amd64  amd64

Yesterday my FreeNAS machine rebooted, because of what I believe was one disk in a pool (3 x 3TB in RAIDZ1) becoming too corrupt.
Well ... it didn't reboot ... nginx went down and I didn't know what was going on, so I tried to restart django.
Then it rebooted.
After a reboot the gui was up & I saw a degraded volume.

I can no longer show you the smart results but yesterday it was showing that ada1 had too many uncorrectable offline sectors.
I told myself to replace the disk a day later.
The volume isn't that critical, but it is holding my jails.

Now, one day later, the gui has become inaccessible again.
The corrupted disk has become inaccessible even from CLI.

Code:

smartctl -q noserial -a /dev/ada1

returns

Code:

/dev/ada1: Unable to detect device type

I then tried with -d ata, -d sat, ... and every time I get

Code:

Smartctl open device: /dev/ada1 failed: No such file or directory

There's more.

Code:

smartctl -h /dev/

is showing the other 2 disks of the pool correctly: ada2 & ada3 (and ada3p1, ada3p1.eli & ada3p2)

gpart, however, is not showing ada2?

Code:

[root@ra] ~# gpart show
=>  34  234441581  ada0  GPT  (111G)
  34  6  - free -  (3.0k)
  40  1024  1  bios-boot  (512k)
  1064  234440544  2  freebsd-zfs  (111G)
  234441608  7  - free -  (3.5k)

=>  34  5860533101  ada3  GPT  (2.7T)
  34  94  - free -  (47k)
  128  4194304  1  freebsd-swap  (2.0G)
  4194432  5856338696  2  freebsd-zfs  (2.7T)
  5860533128  7  - free -  (3.5k)

=>  34  3907029101  ada4  GPT  (1.8T)
  34  94  - free -  (47k)
  128  4194304  1  freebsd-swap  (2.0G)
  4194432  3902834696  2  freebsd-zfs  (1.8T)
  3907029128  7  - free -  (3.5k)

=>  34  3907029101  ada5  GPT  (1.8T)
  34  94  - free -  (47k)
  128  4194304  1  freebsd-swap  (2.0G)
  4194432  3902834696  2  freebsd-zfs  (1.8T)
  3907029128  7  - free -  (3.5k)

=>  34  2930274988  ada6  GPT  (1.4T)
  34  94  - free -  (47k)
  128  4194304  1  freebsd-swap  (2.0G)
  4194432  2926080590  2  freebsd-zfs  (1.4T)

=>  34  3907029101  ada7  GPT  (1.8T)
  34  94  - free -  (47k)
  128  4194304  1  freebsd-swap  (2.0G)
  4194432  3902834696  2  freebsd-zfs  (1.8T)
  3907029128  7  - free -  (3.5k)

=>  34  3907029101  ada8  GPT  (1.8T)
  34  94  - free -  (47k)
  128  4194304  1  freebsd-swap  (2.0G)
  4194432  3902834696  2  freebsd-zfs  (1.8T)
  3907029128  7  - free -  (3.5k)

I can still see the contents of the volume with an ls command.

The disk that is missing from gpart (ada2) isn't showing anything bad in SMART.

Code:

smartctl -q noserial -a /dev/ada2
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:  Western Digital Red
Device Model:  WDC WD30EFRX-68EUZN0
Firmware Version: 82.00A82
User Capacity:  3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:  512 bytes logical, 4096 bytes physical
Rotation Rate:  5400 rpm
Device is:  In smartctl database [for details use: -P show]
ATA Version is:  ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:  Fri Apr 10 09:50:03 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
  was never started.
  Auto Offline Data Collection: Disabled.
Self-test execution status:  (  0) The previous self-test routine completed
  without error or no self-test has ever
  been run.
Total time to complete Offline
data collection:  (41760) seconds.
Offline data collection
capabilities:  (0x7b) SMART execute Offline immediate.
  Auto Offline data collection on/off support.
  Suspend Offline collection upon new
  command.
  Offline surface scan supported.
  Self-test supported.
  Conveyance Self-test supported.
  Selective Self-test supported.
SMART capabilities:  (0x0003) Saves SMART data before entering
  power-saving mode.
  Supports SMART auto save timer.
Error logging capability:  (0x01) Error logging supported.
  General Purpose Logging supported.
Short self-test routine
recommended polling time:  (  2) minutes.
Extended self-test routine
recommended polling time:  ( 418) minutes.
Conveyance self-test routine
recommended polling time:  (  5) minutes.
SCT capabilities:  (0x703d) SCT Status supported.
  SCT Error Recovery Control supported.
  SCT Feature Control supported.
  SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  0
  3 Spin_Up_Time  0x0027  100  253  021  Pre-fail  Always  -  0
  4 Start_Stop_Count  0x0032  100  100  000  Old_age  Always  -  1
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x002e  200  200  000  Old_age  Always  -  0
  9 Power_On_Hours  0x0032  095  095  000  Old_age  Always  -  3731
10 Spin_Retry_Count  0x0032  100  253  000  Old_age  Always  -  0
11 Calibration_Retry_Count 0x0032  100  253  000  Old_age  Always  -  0
12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  1
192 Power-Off_Retract_Count 0x0032  200  200  000  Old_age  Always  -  0
193 Load_Cycle_Count  0x0032  200  200  000  Old_age  Always  -  2727
194 Temperature_Celsius  0x0022  107  105  000  Old_age  Always  -  43
196 Reallocated_Event_Count 0x0032  200  200  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  100  253  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  1
200 Multi_Zone_Error_Rate  0x0008  100  253  000  Old_age  Offline  -  0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
  1  0  0  Not_testing
  2  0  0  Not_testing
  3  0  0  Not_testing
  4  0  0  Not_testing
  5  0  0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Code:

[root@ra] ~# camcontrol devlist
<Crucial CT120M500SSD1 MU03>  at scbus0 target 0 lun 0 (pass0,ada0)
<WDC WD30EFRX-68EUZN0 82.00A82>  at scbus3 target 0 lun 0 (pass2,ada2)
<ST3000DM001-9YN166 CC82>  at scbus4 target 0 lun 0 (pass3,ada3)
<Marvell Console 1.01>  at scbus9 target 0 lun 0 (pass4)
<WDC WD20EFRX-68EUZN0 80.00A80>  at scbus10 target 0 lun 0 (pass5,ada4)
<WDC WD20EFRX-68EUZN0 80.00A80>  at scbus11 target 0 lun 0 (pass6,ada5)
<ST31500341AS CC1H>  at scbus12 target 0 lun 0 (pass7,ada6)
<WDC WD20EFRX-68AX9N0 80.00A80>  at scbus14 target 0 lun 0 (pass8,ada7)
<WDC WD20EFRX-68AX9N0 80.00A80>  at scbus15 target 0 lun 0 (pass9,ada8)

Code:

[root@ra] ~# glabel status
  Name  Status  Components
gptid/d9111221-a0af-11e4-9d6e-d05099192cef  N/A  ada0p1
gptid/2a3bb34e-eb0c-11e3-bbde-d05099192cef  N/A  ada3p2
gptid/4de54f87-eb0c-11e3-bbde-d05099192cef  N/A  ada4p2
gptid/4eb81cbe-eb0c-11e3-bbde-d05099192cef  N/A  ada5p2
gptid/8f32a4c8-eab5-11e3-86b4-d05099192cef  N/A  ada6p2
gptid/4f39d254-eb0c-11e3-bbde-d05099192cef  N/A  ada7p2
gptid/4fbe08bf-eb0c-11e3-bbde-d05099192cef  N/A  ada8p2

Code:

[root@ra] ~# dmesg | grep ada1
(ada1:ahcich0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 e0 8a 95 40 0d 00 00 00 00 00
(ada1:ahcich0:0:0:0): CAM status: Command timeout
(ada1:ahcich0:0:0:0): Retrying command
(ada1:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada1:ahcich0:0:0:0): CAM status: Command timeout
(ada1:ahcich0:0:0:0): Retrying command
(ada1:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada1:ahcich0:0:0:0): CAM status: Command timeout
(ada1:ahcich0:0:0:0): Retrying command
(ada1:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
(ada1:ahcich0:0:0:0): CAM status: Command timeout
(ada1:ahcich0:0:0:0): Retrying command
ada1 at ahcich0 bus 0 scbus2 target 0 lun 0
ada1: <ST3000DM001-9YN166 CC4B> s/n W1F0G8L1 detached
GEOM_ELI: Device ada1p1.eli destroyed.
GEOM_ELI: Detached ada1p1.eli on last close.
(ada1:ahcich0:0:0:0): Periph destroyed
[root@ra] ~# dmesg | grep ada2
[root@ra] ~# dmesg | grep ada3

I just rebooted and got back into the GUI.
There are about 200 more checksum errors for ada1.

So ... what is this?
Why is the GUI going down?
Am I supposed to smell that there's a disk issue? :-D

Osiris · Apr 10, 2015

I ran smartctl a minute later and am seeing 1600+ Offline_Uncorrectable sectors.
Ok, ok, the disk is failing.
Still, the web GUI shouldn't be inaccessible.
What is happening? Nginx is still running at that fail-moment, so this is a riddle to me.
One of the main points of FreeNAS should be, imho, that these kinds of situations are manageable.
Now I'm back at CLI handling of failing raidsets.

Code:

[root@ra] ~# smartctl -q noserial -a /dev/ada1
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:  Seagate Barracuda 7200.14 (AF)
Device Model:  ST3000DM001-9YN166
Firmware Version: CC4B
User Capacity:  3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:  512 bytes logical, 4096 bytes physical
Rotation Rate:  7200 rpm
Device is:  In smartctl database [for details use: -P show]
ATA Version is:  ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:  Fri Apr 10 10:57:45 2015 CEST

==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
  was completed without error.
  Auto Offline Data Collection: Enabled.
Self-test execution status:  (  0) The previous self-test routine completed
  without error or no self-test has ever
  been run.
Total time to complete Offline
data collection:  (  575) seconds.
Offline data collection
capabilities:  (0x7b) SMART execute Offline immediate.
  Auto Offline data collection on/off support.
  Suspend Offline collection upon new
  command.
  Offline surface scan supported.
  Self-test supported.
  Conveyance Self-test supported.
  Selective Self-test supported.
SMART capabilities:  (0x0003) Saves SMART data before entering
  power-saving mode.
  Supports SMART auto save timer.
Error logging capability:  (0x01) Error logging supported.
  General Purpose Logging supported.
Short self-test routine
recommended polling time:  (  1) minutes.
Extended self-test routine
recommended polling time:  ( 333) minutes.
Conveyance self-test routine
recommended polling time:  (  2) minutes.
SCT capabilities:  (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x000f  096  095  006  Pre-fail  Always  -  33116472
  3 Spin_Up_Time  0x0003  093  092  000  Pre-fail  Always  -  0
  4 Start_Stop_Count  0x0032  100  100  020  Old_age  Always  -  63
  5 Reallocated_Sector_Ct  0x0033  097  097  036  Pre-fail  Always  -  4280
  7 Seek_Error_Rate  0x000f  075  060  030  Pre-fail  Always  -  40939031
  9 Power_On_Hours  0x0032  079  079  000  Old_age  Always  -  18724
10 Spin_Retry_Count  0x0013  100  100  097  Pre-fail  Always  -  0
12 Power_Cycle_Count  0x0032  100  100  020  Old_age  Always  -  72
183 Runtime_Bad_Block  0x0032  100  100  000  Old_age  Always  -  0
184 End-to-End_Error  0x0032  100  100  099  Old_age  Always  -  0
187 Reported_Uncorrect  0x0032  086  086  000  Old_age  Always  -  14
188 Command_Timeout  0x0032  100  097  000  Old_age  Always  -  7 7 7
189 High_Fly_Writes  0x003a  098  098  000  Old_age  Always  -  2
190 Airflow_Temperature_Cel 0x0022  052  047  045  Old_age  Always  -  48 (Min/Max 36/51)
191 G-Sense_Error_Rate  0x0032  100  100  000  Old_age  Always  -  0
192 Power-Off_Retract_Count 0x0032  100  100  000  Old_age  Always  -  50
193 Load_Cycle_Count  0x0032  095  095  000  Old_age  Always  -  10193
194 Temperature_Celsius  0x0022  048  053  000  Old_age  Always  -  48 (0 20 0 0 0)
197 Current_Pending_Sector  0x0012  090  090  000  Old_age  Always  -  1672
198 Offline_Uncorrectable  0x0010  090  090  000  Old_age  Offline  -  1672
199 UDMA_CRC_Error_Count  0x003e  200  200  000  Old_age  Always  -  0
240 Head_Flying_Hours  0x0000  100  253  000  Old_age  Offline  -  18534h+09m+06.701s
241 Total_LBAs_Written  0x0000  100  253  000  Old_age  Offline  -  49351461412495
242 Total_LBAs_Read  0x0000  100  253  000  Old_age  Offline  -  113992645361

SMART Error Log Version: 1
ATA Error Count: 4
  CR = Command Register [HEX]
  FR = Features Register [HEX]
  SC = Sector Count Register [HEX]
  SN = Sector Number Register [HEX]
  CL = Cylinder Low Register [HEX]
  CH = Cylinder High Register [HEX]
  DH = Device/Head Register [HEX]
  DC = Device Command Register [HEX]
  ER = Error register [HEX]
  ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 4 occurred at disk power-on lifetime: 18707 hours (779 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 80 72 98 0d  Error: UNC at LBA = 0x0d987280 = 228094592

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 d0 a8 cf da 45 00  5d+20:26:10.281  READ FPDMA QUEUED
  60 00 08 80 72 98 4d 00  5d+20:26:10.281  READ FPDMA QUEUED
  60 00 f8 b0 ce da 45 00  5d+20:26:10.274  READ FPDMA QUEUED
  60 00 30 20 b1 9b 45 00  5d+20:26:10.274  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  5d+20:26:10.208  READ FPDMA QUEUED

Error 3 occurred at disk power-on lifetime: 18707 hours (779 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 30 88 95 0d  Error: UNC at LBA = 0x0d958830 = 227903536

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 48 10 88 95 4d 00  5d+20:17:08.998  READ FPDMA QUEUED
  60 00 08 78 b1 f6 42 00  5d+20:17:08.998  READ FPDMA QUEUED
  60 00 18 c0 be ed 42 00  5d+20:17:08.983  READ FPDMA QUEUED
  60 00 18 68 98 43 40 00  5d+20:17:08.982  READ FPDMA QUEUED
  60 00 10 88 96 43 40 00  5d+20:17:08.960  READ FPDMA QUEUED

Error 2 occurred at disk power-on lifetime: 16247 hours (676 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 ff ff ff 4f 00  2d+18:36:12.545  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  2d+18:36:12.544  READ FPDMA QUEUED
  60 00 80 ff ff ff 4f 00  2d+18:36:12.543  READ FPDMA QUEUED
  60 00 80 ff ff ff 4f 00  2d+18:36:12.543  READ FPDMA QUEUED
  60 00 80 ff ff ff 4f 00  2d+18:36:12.540  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 16247 hours (676 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 ff ff ff 4f 00  2d+18:24:46.887  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  2d+18:24:46.885  READ FPDMA QUEUED
  60 00 80 ff ff ff 4f 00  2d+18:24:46.885  READ FPDMA QUEUED
  60 00 80 ff ff ff 4f 00  2d+18:24:46.885  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  2d+18:24:46.883  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline  Completed without error  00%  18628  -
# 2  Short offline  Completed without error  00%  18533  -
# 3  Short offline  Completed without error  00%  18461  -
# 4  Short offline  Completed without error  00%  18366  -
# 5  Short offline  Completed without error  00%  18296  -
# 6  Short offline  Completed without error  00%  18196  -
# 7  Short offline  Completed without error  00%  18126  -
# 8  Short offline  Completed without error  00%  18031  -
# 9  Short offline  Completed without error  00%  17956  -
#10  Short offline  Completed without error  00%  17861  -
#11  Short offline  Completed without error  00%  17791  -
#12  Short offline  Completed without error  00%  17693  -
#13  Short offline  Completed without error  00%  17623  -
#14  Short offline  Completed without error  00%  17528  -
#15  Short offline  Completed without error  00%  17453  -
#16  Short offline  Completed without error  00%  17360  -
#17  Short offline  Completed without error  00%  17285  -
#18  Short offline  Completed without error  00%  17190  -
#19  Short offline  Completed without error  00%  17120  -
#20  Short offline  Completed without error  00%  17025  -
#21  Short offline  Completed without error  00%  16950  -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
  1  0  0  Not_testing
  2  0  0  Not_testing
  3  0  0  Not_testing
  4  0  0  Not_testing
  5  0  0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

joeschmuck · Apr 10, 2015

On your ada1 drive, IDs 197 and 198 are key indicators that your hard drive is failing fast. Replace it. Not sure about the GUI issue but the hard drive should be your first priority.

EDIT: Since this is a RAIDZ1 pool, I'd also back up any important data ASAP!
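
For anyone who wants to check these attributes without reading the whole table, here is a minimal sketch that scans the attribute rows of `smartctl -a` output and reports watched IDs whose raw value is non-zero. The sample rows are copied from the ada1 output above. Watching ID 5 alongside 197 and 198 is an assumption of this sketch, and the names `WATCHED_IDS` and `flag_attributes` are made up for illustration.

```python
import re

# IDs to watch. 197 and 198 come from the post above; adding 5
# (Reallocated_Sector_Ct) is an assumption of this sketch.
WATCHED_IDS = {5, 197, 198}

# Rows copied from the ada1 smartctl output earlier in the thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct  0x0033  097  097  036  Pre-fail  Always  -  4280
  9 Power_On_Hours  0x0032  079  079  000  Old_age  Always  -  18724
197 Current_Pending_Sector  0x0012  090  090  000  Old_age  Always  -  1672
198 Offline_Uncorrectable  0x0010  090  090  000  Old_age  Offline  -  1672
199 UDMA_CRC_Error_Count  0x003e  200  200  000  Old_age  Always  -  0
"""

# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def flag_attributes(text, watched=WATCHED_IDS):
    """Return (id, name, raw) for watched attributes with a non-zero raw value."""
    flagged = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # not an attribute row
        attr_id = int(match.group(1))
        raw = int(match.group(3))
        if attr_id in watched and raw > 0:
            flagged.append((attr_id, match.group(2), raw))
    return flagged


if __name__ == "__main__":
    for attr_id, name, raw in flag_attributes(SAMPLE):
        print(f"{attr_id:>3} {name}: raw value {raw}")
```

In practice the text would come from the saved output of `smartctl -a /dev/adaN` rather than from an embedded sample.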

danb35 · Apr 10, 2015

...of course, if the GUI is dead, it's hard to replace the disk properly...

@Osiris, you're right, a failing disk should not kill the web GUI, and there are plenty of folks here (including me) who have had one or more disks fail, and not had it kill the web GUI. A failing boot device might cause the problems you're seeing, though. Try doing a scrub of the boot pool from the command line (zpool scrub freenas-boot), wait a couple of minutes, and then post the output of 'zpool status' in code tags.

Edit: Of course, if nginx is giving an error, it might be good to see what it is. The nginx error log is in /var/log/nginx-error.log.

Osiris · Apr 10, 2015

/var/log/nginx-error.log
contains the following. Note that I might have replaced some fqdn's.

Code:

2015/04/09 12:42:30 [error] 63712#0: *97577 kevent() reported that connect() failed (61: Connection refused) while connecting to upstream, client: 10.10.1.11, server: localhost, request: "GET /favicon.ico HTTP/1.1", upstream: "fastcgi://127.0.0.1:9042", host: "ra"
2015/04/09 12:42:30 [error] 63712#0: *97577 kevent() reported that connect() failed (61: Connection refused) while connecting to upstream, client: 10.10.1.11, server: localhost, request: "GET /favicon.ico HTTP/1.1", upstream: "fastcgi://127.0.0.1:9042", host: "ra"
2015/04/10 09:36:01 [error] 99897#0: *31680 upstream prematurely closed connection while reading response header from upstream, client: 134.54.0.9, server: localhost, request: "GET /admin/alert/status/?1428651360560 HTTP/1.1", upstream: "fastcgi://127.0.0.1:9042", host: "banzai.ninja:58080", referrer: "http://banzai.ninja:58080/"

The last message keeps getting repeated sporadically.
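
To see how often each message repeats, a rough sketch like the following groups nginx error lines by their message text. The three sample lines are the ones posted above; the regular expression assumes the line layout seen in those lines, and `count_messages` is a made-up helper name.

```python
import re
from collections import Counter

# The three lines from /var/log/nginx-error.log posted above.
SAMPLE = (
    '2015/04/09 12:42:30 [error] 63712#0: *97577 kevent() reported that connect() failed (61: Connection refused) while connecting to upstream, client: 10.10.1.11, server: localhost, request: "GET /favicon.ico HTTP/1.1", upstream: "fastcgi://127.0.0.1:9042", host: "ra"\n'
    '2015/04/09 12:42:30 [error] 63712#0: *97577 kevent() reported that connect() failed (61: Connection refused) while connecting to upstream, client: 10.10.1.11, server: localhost, request: "GET /favicon.ico HTTP/1.1", upstream: "fastcgi://127.0.0.1:9042", host: "ra"\n'
    '2015/04/10 09:36:01 [error] 99897#0: *31680 upstream prematurely closed connection while reading response header from upstream, client: 134.54.0.9, server: localhost, request: "GET /admin/alert/status/?1428651360560 HTTP/1.1", upstream: "fastcgi://127.0.0.1:9042", host: "banzai.ninja:58080", referrer: "http://banzai.ninja:58080/"\n'
)

# date, time, [level], pid#tid:, *connection, then the message up to ", client:"
LINE = re.compile(r"^\S+ \S+ \[\w+\] \d+#\d+: \*\d+ (.*?)(?:, client:|$)")


def count_messages(log_text):
    """Count log lines per distinct message text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for message, count in count_messages(SAMPLE).most_common():
        print(f"{count:>4}  {message}")
```

Both messages point at the FastCGI upstream on 127.0.0.1:9042 rather than at nginx itself.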

zpool status :

Code:

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Fri Apr 10 15:29:05 2015
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          ada0p2      ONLINE       0     0     0

errors: No known data errors

  pool: z3x3TB
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 1.84M in 0h20m with 0 errors on Thu Apr  9 18:21:48 2015
config:

        NAME                                            STATE     READ WRITE CKSUM
        z3x3TB                                          DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/294b48c8-eb0c-11e3-bbde-d05099192cef  DEGRADED     0     0   722  too many errors
            ada2                                        ONLINE       0     0     0
            gptid/2a3bb34e-eb0c-11e3-bbde-d05099192cef  ONLINE       0     0     0

errors: No known data errors

I cannot offline the degraded disk in the gui. Does this mean it already is?
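
The state column in the output above can also be read programmatically. This is only a sketch: it assumes the whitespace-separated layout of the `zpool status` config table shown above, and `device_states` is a made-up helper name. The sample is trimmed from the z3x3TB output in this post.

```python
# Trimmed from the z3x3TB section of the zpool status output above.
SAMPLE = """\
  pool: z3x3TB
 state: DEGRADED
config:

  NAME  STATE  READ WRITE CKSUM
  z3x3TB  DEGRADED  0  0  0
  raidz1-0  DEGRADED  0  0  0
  gptid/294b48c8-eb0c-11e3-bbde-d05099192cef  DEGRADED  0  0  722  too many errors
  ada2  ONLINE  0  0  0
  gptid/2a3bb34e-eb0c-11e3-bbde-d05099192cef  ONLINE  0  0  0

errors: No known data errors
"""


def device_states(status_text):
    """Map each name in the config table to (state, read, write, cksum).

    Counters are kept as strings because zpool may abbreviate large values.
    """
    states = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True  # header row of the config table
            continue
        if in_config:
            if not fields:
                break  # a blank line ends the table
            if len(fields) >= 5:
                states[fields[0]] = tuple(fields[1:5])
    return states


if __name__ == "__main__":
    for name, (state, read, write, cksum) in device_states(SAMPLE).items():
        print(f"{name}: {state} (read={read} write={write} cksum={cksum})")
```

For the output in this post, the failing member is reported as DEGRADED with 722 checksum errors, and no device is reported as OFFLINE.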

btw, ada2 was previously replaced (about a year ago).
FreeNAS was definitely at an older version when I assembled this NAS. I'm pretty sure it was still FreeNAS 8.x.
Hence the difference in name. The older ones still have the gptid/serial gibberish.

danb35 · Apr 10, 2015

Clearly one of your disks is failing, but you already know that. I suspect you also did something at the command line to add ada2 to the pool, which could cause trouble down the road, but isn't likely to be the current problem with the web GUI.

Which pool houses your .system dataset? I ask only because it's the only way I can think of that pool integrity might affect the GUI.

What version of FreeNAS are you running? Can you try doing a clean install to another USB stick and importing your pool? I wouldn't try importing your config file at this point, just in case that's borked and that's the cause of your GUI problems. That should bring you to a point where you can replace the failing disk, following the manual's instructions.

Edit: I also note that this disk hasn't had a single SMART self-test run on it in the over two years it's been running. Once you get it replaced, make sure you have scheduled SMART tests for all your disks, and that you've set up email reporting of SMART errors.

Osiris · Apr 10, 2015

btw, thanks for your answers.

Indeed. The pool in question is the system dataset pool.
I'm at FreeNas 9.3 (as the dumped 'uname -a' results show in my first post).

I've replaced and resilvered disks before using CLI, so no worries there.
I was just looking for a cause to the GUI issue, which you probably indicated (although I still think the GUI shouldn't become unavailable, even when the degraded pool is housing the system dataset).

I'll let you know (here) how the rebuild turns out.

danb35 · Apr 10, 2015

Sure, it's easy enough to replace a disk at the command line, but then you end up with things done in a way that FreeNAS doesn't expect. That's why gpart doesn't show anything for ada2--you replaced it at the command line, didn't partition it the way FreeNAS does, and thus there's no partition table for gpart to show.

I agree that the GUI shouldn't die in this case. The pool is still available, and no data has been lost. I'm not at all convinced the two things are related, actually. That's why I also suggested trying a clean installation on a fresh USB stick, and doing the replacement there.

Osiris · Apr 10, 2015

I'm too scared to try that at once. Reimporting volumes, is that proven safe?
I was thinking of replacing mainboard, cpu & memory in my succubus nas, so I can add much more memory.
I'm a bit scared of losing my 35TB data collection however.

What I'll do (with this nas) is
1. replace the disk.
2. resilver the volume (should be quick)
3. move the jails to another volume.
4. destroy the original volume.
5. add another 3TB disk to the system.
6. create a RAIDZ2 pool out of the 4 x 3TB disks.
7. move the jails to this new volume.

Worst case scenario: I recreate all my jails ... again.

Bidule0hm · Apr 10, 2015

You don't have any backups of the data?

cyberjock · Apr 10, 2015

So yeah, the webgui isn't supposed to die. But, it can die if the .system dataset becomes unresponsive (or at least sufficiently slow that files that are used on the .system dataset are timing out or unavailable).

I'm betting if you physically remove the failed disk from the system and reboot you'll find the webgui is fully functional again.

Osiris · Apr 12, 2015

Spot on, cyberjock.
Now I just have to do the jail-moving, which is kind of a drag.
Setting your jails to another location in the gui, should simply move them, don'tchathink?

SweetAndLow · Apr 12, 2015

No, setting the jail location will not move them. You will have to copy them and then change the jail location.
