Volume degraded? Online again after server restart, for some time ...

Status
Not open for further replies.

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
Hi
I have a server with 2 pools. Recently I discovered warnings saying that my two pools are degraded because a hard drive has been removed. Of course I haven't removed anything, and my first thought was that the hard drive had died, so I ran to the shop and bought a new hard drive for one of the pools.
But after a server restart I noticed that the hard drive was there again and the pools were marked as healthy and working perfectly (scrub done after that).
But later this happened again, and the story repeats itself.
The drive disappears, either from both pools or only from one, then I restart the server and suddenly the drive is there and everything is working fine. Except there is a serious problem of course, and I am not sure how to tackle it, especially because I am new to FreeNAS and don't know how to use commands except in the shell window while my connected PC is on.
Right now I am trying to start a long test on both problematic drives through the GUI, scheduled one hour from now (after reading some posts here it seems like step 1 to do, especially because I haven't done any long tests yet :oops: )
So what is the next step to do?
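In case it helps anyone later, the same long test can also be started from the shell; /dev/ada0 below is only an example, device names differ per system:

```shell
# Start an extended (long) SMART self-test on one drive
smartctl -t long /dev/ada0

# The test runs in the background; check the self-test log afterwards
smartctl -l selftest /dev/ada0
```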
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Please post your hardware setup and the version of FreeNAS you are running. Thanks!
 

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
FreeNAS-9.3-STABLE
X10SL7-F supermicro
Intel(R) Core(TM) i3-4150 CPU @ 3.50GHz
32 GB ECC RAM
ST Lab A-520 4-channel RAID card, SATA 6 Gbit/s (set in JBOD mode, connecting the last drive of the raidz2 pool)
750 W gold-certified Be Quiet PSU

One raidz3 pool containing 5x Seagate (3 TB, 7200 rpm), all connected to the motherboard's standard SATA ports.
One raidz2 pool containing 10x 4 TB, 5400 rpm drives (7 Seagate NAS, 2 WD Red and 1 WD Green; the WD Green was updated after all the posts I read here about WD Green and its power-saving mode).
The disks in this pool are connected as follows:
1 disk connected to a regular motherboard SATA port, 8 disks connected to the motherboard's LSI 2308 flashed to IT mode P16, while the last disk, the 10th, is connected to the ST Lab A-520 SATA port that is set to JBOD mode and acting only as an additional SATA port.
 
Last edited:

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Can you get SMART data for your drives? smartctl -a /dev/adaX

I also suspect your raid card is what is causing the problems.
 

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
The RAID card is acting only as a SATA card (JBOD mode) because I needed one more SATA connection than the motherboard offers. So only one hard drive out of all 15 drives is connected to the card; the rest of the drives are connected to the motherboard. Not both problematic drives are connected to the RAID/JBOD card.

First pool, the raidz3 with 5x Seagate 3 TB, is connected to the motherboard's standard SATA ports.
Second pool, the raidz2 with 10x 4 TB drives, has 1 disk connected to a regular motherboard SATA port, 8 disks connected to the motherboard's LSI 2308 flashed to IT mode P16, while the last disk, the 10th, is connected to the ST Lab A-520 SATA port that is set to JBOD mode and acting only as an additional SATA port.


=== START OF INFORMATION SECTION ===
Model Family: Seagate NAS HDD
Device Model: ST4000VN000-1H4168
Serial Number: S300ZDL1
LU WWN Device Id: 5 000c50 0758a8f11
Firmware Version: SC46
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5900 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Mar 27 23:23:02 2015 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 107) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 522) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x10bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 183804520
3 Spin_Up_Time 0x0003 092 092 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 18
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 071 060 030 Pre-fail Always - 14610215
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 675
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 18
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 071 055 045 Old_age Always - 29 (Min/Max 21/30)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 18
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 18
194 Temperature_Celsius 0x0022 029 045 000 Old_age Always - 29 (0 21 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 659 -
# 2 Extended offline Interrupted (host reset) 90% 649 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p8 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD40EFRX-68WT0N0
Serial Number: WD-WCC4E0052665
LU WWN Device Id: 5 0014ee 25e5386f4
Firmware Version: 80.00A80
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Mar 27 23:12:27 2015 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (56880) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 569) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x703d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 176 175 021 Pre-fail Always - 8175
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 38
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 2220
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 38
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 14
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 1384
194 Temperature_Celsius 0x0022 121 108 000 Old_age Always - 31
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 2206 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
Last edited:

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
Now the server is once again showing critical warnings:
CRITICAL: The volume Second...pool (ZFS) state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.

Of course nothing was removed by me. SMART data is provided in the previous post.
What could the problem be?
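For reference, the device that dropped out can be identified from the shell (a sketch; the pool names and layout in the output will differ per system):

```shell
# Show the health of every pool and flag the removed/unavailable member
zpool status -v
```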
 

Sir.Robin

Guru
Joined
Apr 14, 2012
Messages
554
Which controller/SATA port is the problem drive connected to?

Is it always the same drive dropping out?


Sent from my mobile using Tapatalk
 

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
It's 2 ports, 2 drives. I will open the case and check exactly which of the 3 kinds of ports it is. But the question for now is:
Does the SMART data in the previous post tell anything about the 2 disks' issue, or about what the problem could be?
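One way to check which controller each drive hangs off without opening the case (a sketch; the exact output depends on the hardware, and /dev/ada0 is only an example name):

```shell
# List every detected disk together with the bus/controller it sits on
camcontrol devlist

# Match a device node to a physical drive via its serial number
smartctl -i /dev/ada0 | grep -i serial
```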
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
SMART data is fine, but you really should configure regular SMART tests.
 

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
Are you saying that the SMART data shows nothing strange for both drives?
1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 183804520
7 Seek_Error_Rate 0x000f 071 060 030 Pre-fail Always - 14610215

I don't read or understand much of SMART data, but those values sound very high for a hard drive to my ears. Or are those normal values and the drives are OK, so that something else is the problem?


Yes, I realize that I should set up regular tests, even though I heard long tests are demanding for a drive?!
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Are you saying that the SMART data shows nothing strange for both drives?
1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 183804520
7 Seek_Error_Rate 0x000f 071 060 030 Pre-fail Always - 14610215

I don't read or understand much of SMART data, but those values sound very high for a hard drive to my ears. Or are those normal values and the drives are OK, so that something else is the problem?


Yes, I realize that I should set up regular tests, even though I heard long tests are demanding for a drive?!

The drives are Seagates, so that's normal. The numbers are close to meaningless unless you decode them.

As for the long tests, yes, they are demanding - that doesn't mean they shouldn't be run, though.
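A commonly cited (unofficial) decoding of Seagate's raw values for attributes 1 and 7: the 48-bit raw field packs the actual error count into the upper 16 bits and the operation count into the lower 32 bits. Applying that to the values above:

```shell
# Unofficial Seagate decoding for attributes 1 and 7:
# upper 16 bits of the 48-bit raw value = error count,
# lower 32 bits = number of operations performed
raw_read=183804520
seek=14610215

echo "read errors: $(( raw_read >> 32 ))"   # prints 0 - no actual read errors
echo "seek errors: $(( seek >> 32 ))"       # prints 0 - no actual seek errors
```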
 

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
So the drives are OK according to the SMART results? Then what could the problem be in our case?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
So the drives are OK according to the SMART results? Then what could the problem be in our case?

The ultra-crummy SATA card. Get an SAS expander or an additional LSI SAS 2008/2308 controller.
 

Sir.Robin

Guru
Joined
Apr 14, 2012
Messages
554
Cheap, sucky cables? Tried others?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
Cable for the problematic drive replaced; after the restart the pool got degraded again. So the cable seems OK.

The problematic drive is connected to the SATA card. The same card is set to JBOD.
So either something is wrong with that specific drive, or something is wrong because of the presence of this card's SATA connection to the pool.
Except for the fact that the problematic drive is connected to this card, is there anything else saying it is the card causing the problems?
I mean, it is a $50 card and not the cheapest one out there.
I would like to be more sure before considering throwing away a newly bought card and buying a new one, just because it is not an LSI card.


A SAS expander, doesn't that cost more than a new LSI card?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
A SAS expander, doesn't that cost more than a new LSI card?
Generally, yes. That's why many people here just get a second SAS controller.

I'm afraid your card (like the vast majority of SATA PCI-e cards) uses a Marvell controller. Support for those ranges from "nonexistent" to "could use some improvement".
 

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
I understand what you're saying, but does that matter if the card is used in non-RAID mode, JBOD?

Actually, I came up with an idea to find out if the card is the problem or not :cool:
I am using only one SATA port on this SATA card. If I can remove the need for it, so that no hard drive is connected to it, then after a few days I can easily be sure whether that is the problem or not.
I can do that by recreating the first pool as raidz2 instead of raidz3, which will free one SATA port on my motherboard. Then I only have to disconnect the problematic hard drive in the second pool from the SATA card and connect it to the motherboard instead.
That means nothing will be connected to the SATA card; it will stay physically in the case, attached to the motherboard, but with no hard drives connected to it.
- The bad: I cannot have my first pool as raidz3 for some period, until I know whether the problem is a hard drive or the SATA card.
- The good: I will be 100% sure whether the problem is the SATA card or something else, like the drive maybe?!
- Condition: the second pool must keep functioning until the first pool is recreated as a raidz2 pool, because it will hold the first pool's data in the meantime.

I will disconnect the hard drive from the SATA port on the SATA card and connect it to a SATA port on the motherboard.
Question is: how do I do it in the GUI, with respect to FreeNAS maybe getting confused because the drive that was connected to a certain SATA port is suddenly connected to another SATA port :rolleyes: ?
I am not replacing a hard drive, just changing which SATA port a certain hard drive is connected to.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I understand what you're saying, but does that matter if the card is used in non-RAID mode, JBOD?

Yes, it does.

Question is: how do I do it in the GUI, with respect to FreeNAS maybe getting confused because the drive that was connected to a certain SATA port is suddenly connected to another SATA port :rolleyes: ?
I am not replacing a hard drive, just changing which SATA port a certain hard drive is connected to.

This is not an issue.
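The reason it's not an issue: ZFS references pool members by their GPT labels (gptid), not by which port they sit on. Roughly verifiable like this (output differs per system):

```shell
# Pool members show up as gptid/... regardless of controller/port
zpool status

# Map each gptid label back to its current adaX/daX device node
glabel status
```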
 

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
OK, that SATA card is not inside my case anymore. My first pool is degraded because I disconnected one drive to free one SATA port and thereby get rid of the SATA card, while the second pool is healthy now.
After that, I created a snapshot of the first pool and sent it to the second pool. And after that I chose, through the GUI, to roll back that snapshot of the first pool inside the second pool. Apparently that deleted all content on the second pool and replaced it with the content of the first pool, instead of recreating the first pool's data beside the existing data on the second pool. Now 14 TB of data on the second pool is not there anymore :eek: There is only an exact copy of the first pool on the second pool :mad:
Did I just delete the entire pool's data on the second pool by rolling back a snapshot of the first pool? It's a little hard for me to accept that this operation wiped out all the data and that it is not recoverable :confused: ?!
Is there no way to revert the rollback operation that deleted the existing content? (No, I don't have any snapshots of the second pool) :(
Or get back my 14 TB of data that is/was on the second pool?
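For the record, the non-destructive way to copy one pool onto another is to receive the snapshot into a new child dataset, leaving the destination's existing data untouched (a sketch; the pool and dataset names are examples):

```shell
# Snapshot the source pool recursively
zfs snapshot -r firstpool@backup

# Receive it as a NEW dataset under the destination pool;
# nothing else on "secondpool" is overwritten
zfs send -R firstpool@backup | zfs recv -u secondpool/firstpool-copy
```

Rolling back a snapshot, by contrast, discards everything written to the target dataset after that snapshot was taken.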
 
Status
Not open for further replies.
Top