Degraded Drive help

Newtofreenas

Dabbler
Joined
Apr 17, 2019
Messages
22
Looking for a little bit of help with this. Obviously stressing :/

Clicking on my pool on the dashboard shows that:
da2: UNAVAIL
RAIDZ2

CRITICAL
Pool tank6x6 state is DEGRADED: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state..
Sat, 29 Apr 2020 05:47:42 PM (America/Los_Angeles)

I read

drive msg1.jpg


All I see is old_age everywhere. Is this just a case of a drive going bad (bought 1/17/19)? Or did this drive just not boot properly, and it might actually be OK? What is my next step? I will probably order a 6TB tonight, but hopefully I can get a little bit of help before then in case it isn't a "bad" drive.


Thanks for any help, let me know if there is something else I can post.

EDIT:
I just ran this test on all 6 of my hard drives... they all say old_age. Am I screwed on all of them and need to replace all 6?
 
Joined
Oct 18, 2018
Messages
969
The "Type" isn't what matters so much as the raw value. You have TONS of reallocated sectors on this drive; unless there is something I am missing, you should replace it. Are you running regular SMART tests, and do you have email alerts set up? Those will help notify you that a drive is on its way out before it fully bites the dust.
 

Newtofreenas

Dabbler
Joined
Apr 17, 2019
Messages
22
I have never run a SMART test, nor do I have email alerts set up. Sounds like tomorrow's project.
I rebooted the system and everything is healthy again. I am wondering if this drive just didn't boot properly when I reset. I read a post from a guy who had a bad drive after a power outage; he was good after a reboot, and someone just said he had checksum errors.

I had been resetting my FreeNAS several times over the past 5 days, trying hard, unsuccessfully, to get my new Chelsio T580-CR card to work. Maybe I just got the same checksum error on the last boot?

EDIT:
I just started the
smartctl -t long /dev/da2
test as in the guide I read. In 773 minutes I will see what it has to say.
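
Once a long test finishes, its result is recorded in the drive's self-test log, which `smartctl -l selftest /dev/da2` prints. The snippet below is only a rough sketch for spotting log entries that did not finish cleanly. It assumes the usual ATA log layout, where each entry starts with `# 1`, `# 2`, ... `#10`, and the function name `selftest_failures` is made up for illustration.

```shell
#!/bin/sh
# selftest_failures: read `smartctl -l selftest` output on stdin and print
# every log entry whose status is anything other than "Completed without error".
# Hypothetical helper, not part of smartmontools.
# Assumes entries begin with "#", optional spaces, the entry number, then a space.
selftest_failures() {
    awk '/^# *[0-9]+ / && !/Completed without error/'
}

# Example usage (the device name is an assumption; adjust to your system):
#   smartctl -l selftest /dev/da2 | selftest_failures
```

If nothing is printed, every logged test completed without error. That says nothing about tests that were never run.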
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Hey Newtofreenas.

Don't worry about the Old_age label. It is only the type of the attribute, not a verdict on the drive. For example, Start_Stop_Count tracks how many stop/start cycles the disk has endured so far; after too many, the risk of failure increases. That is what an Old_age type is about.

Code:
root@Atlas[~]# smartctl -a /dev/da3
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST4000VN008-2DR166
Serial Number:    ZDH54ZY3
LU WWN Device Id: 5 000c50 0b3193e02
Firmware Version: SC60
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5980 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Apr 30 23:16:45 2020 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  591) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 619) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   079   064   044    Pre-fail  Always       -       82524752
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       29
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   088   060   045    Pre-fail  Always       -       682502483
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       13074 (249 199 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       28
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   073   054   040    Old_age   Always       -       27 (Min/Max 7/41)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       36
193 Load_Cycle_Count        0x0032   099   099   000    Old_age   Always       -       2410
194 Temperature_Celsius     0x0022   027   046   000    Old_age   Always       -       27 (0 7 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       13063 (162 154 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       44181975484
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       21468900644

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     13009         -
# 2  Short offline       Completed without error       00%     12961         -
# 3  Short offline       Completed without error       00%     12913         -
# 4  Short offline       Completed without error       00%     12865         -
# 5  Short offline       Completed without error       00%     12817         -
# 6  Short offline       Completed without error       00%     12769         -
# 7  Short offline       Completed without error       00%     12721         -
# 8  Short offline       Completed without error       00%     12673         -
# 9  Short offline       Completed without error       00%     12625         -
#10  Short offline       Completed without error       00%     12577         -
#11  Extended offline    Completed without error       00%     12552         -
#12  Short offline       Completed without error       00%     12529         -
#13  Short offline       Completed without error       00%     12481         -
#14  Short offline       Completed without error       00%     12433         -
#15  Short offline       Completed without error       00%     12385         -
#16  Short offline       Completed without error       00%     12265         -
#17  Short offline       Completed without error       00%     12217         -
#18  Short offline       Completed without error       00%     12169         -
#19  Short offline       Completed without error       00%     12121         -
#20  Short offline       Completed without error       00%     12073         -
#21  Short offline       Completed without error       00%     12025         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


See? Most of these attributes are of the Old_age type on my own drives as well.

Look at attribute ID 5, Reallocated_Sector_Ct, as pointed out by @PhiloEpisteme: mine is at 0 and yours is over 10K. That is what you have to worry about.

So yes, I would replace that drive. I would also look at all the other drives to see how they look. Thanks to RAIDZ2, you can not only lose a drive, but even with a single drive missing, you still have redundancy.
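
To compare that one number across every drive, something like the sketch below may help. It assumes the `smartctl -A` attribute table layout shown above, where the ID is the first column and RAW_VALUE starts at the tenth. The helper name `smart_raw` and the device names in the usage comment are illustrations only.

```shell
#!/bin/sh
# smart_raw ID: read `smartctl -A` output on stdin and print the RAW_VALUE
# of the attribute with the given ID.
# Hypothetical helper, not part of smartmontools.
# Assumes the table layout shown above: ID in column 1, RAW_VALUE in column 10.
# The NF check skips shorter lines from other sections that start with the same number.
smart_raw() {
    awk -v id="$1" '$1 == id && NF >= 10 { print $10 }'
}

# Example usage (device names are assumptions; adjust to your system):
#   for d in da0 da1 da2 da3 da4 da5; do
#       printf '%s: ' "$d"
#       smartctl -A "/dev/$d" | smart_raw 5
#   done
```

The same helper can read attributes 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable) from the table, which are worth a look alongside attribute 5.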
 

Newtofreenas

Dabbler
Joined
Apr 17, 2019
Messages
22
I reran all the smartctl -a tests and that is the only drive with any reallocated sectors.
 

Newtofreenas

Dabbler
Joined
Apr 17, 2019
Messages
22
I sure like how you did your CODE section. You did a lot better posting your data than I did. I couldn't figure out how to copy and paste from the shell window, but I bet you got your info in seconds as opposed to my screenshot paste and edit that I did!
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Hi again,

To insert something like I did, you need to put it between code tags. That function is available in the toolbar at the top of the edit box in which you write your messages. On the left side, there is bold, italic, underlined, ... and after the smilies, you have the 3 dots. Click on them and select Insert Code. That will put the two markers in your text box. All you need to do is copy and paste your code between these 2 tags. This is what is done by default in the popup window that opens after clicking on Code.
 
Joined
Oct 18, 2018
Messages
969
@Heracles is spot on, replace that drive. Checksum errors are bad too, usually caused by a misbehaving drive.

If you have issues setting up automatic SMART tests and email alerts, let us know. Also, consider doing regular pool scrubs. The User Guide and forums have tons of info.

ALSO, consider buying 2 replacement drives and "burning in" both; that way you have a drive on deck when you need it. Again, the forums have info about burning in.

Sorry I didn't provide more detail, I am on mobile.

Did I mention yet how important backups are?
 

Newtofreenas

Dabbler
Joined
Apr 17, 2019
Messages
22

Copying from the shell window was actually my problem, but I read the pop-up window and noticed Ctrl+Insert. Learned a lot from this thread already!
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Copying from the shell window was actually my problem

An easy way to do it is from the shell FreeNAS offers you in its WebUI:
Log in to the WebUI
In the left column, down at the very bottom, there is SHELL
That will open a shell in FreeNAS where you can run your command
Select and copy whatever you need to paste here from that window

Good luck with your drive replacement,
 

Newtofreenas

Dabbler
Joined
Apr 17, 2019
Messages
22
Did I mention yet how important backups are?
I will do more research on this. I was watching a Lawrence Systems YouTube video on backing up to the cloud. I set up my box.com account but could never get it to work. I think I have some backups, if they are called snapshots. I have a list of files that looks like it does an auto snapshot every day. I don't know where or what these are, but if that's what a backup is, I have them somewhere. I have the system backup, which I was lucky enough to save after an update. I accidentally reset to defaults from the local screen and had an hour of panic trying to get my pool back.
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
but if that's what a backup is,


Unfortunately, no, they are not backups. They reduce some of the risk from user mistakes, but they offer nothing against IT or system failures.

See my signature for a complete backup strategy and how it can be done.
 

Newtofreenas

Dabbler
Joined
Apr 17, 2019
Messages
22
I tried to find a PM option, but maybe that is not possible on this forum. While you are reading this (and this is the most responses I have ever gotten): I am trying to get my new Chelsio T580-CR card to work. Would you happen to have a good guide or an idea where to start? I cannot get it to work. It works peer to peer if I connect a DAC to my computer and force IP addresses, but I cannot get it to work on the network by itself. It only works if the onboard port is plugged in as well, and then it splits the traffic: download is on the onboard port and upload is on the Chelsio card.
 
Joined
Oct 18, 2018
Messages
969
@Newtofreenas you'd have to provide us with more info such as what network configuration you currently have set on your NAS, router, etc.
 
Joined
Oct 18, 2018
Messages
969
See my signature for a complete backup strategy and how it can be done.
To add a bit, in the event @Heracles ever changes signatures ;) a good backup strategy depends on the sensitivity of the data. A good starting place is the 3-2-1 rule: keep 3 copies of your data on at least two different types of media, and have at least one copy in a second location. Think of this as an introduction to some of the main variables you can play with when considering your backup strategy. Budget, importance of data, risk tolerance, etc. will all play a role in how you choose to do it. For example, I keep 5 copies of all of my data (my backups are all on ZFS pools composed of mirrored vdevs), but everything is on HDDs, so only 1 type of media. I do keep 1 copy off-site in the event of fire etc.
 

Newtofreenas

Dabbler
Joined
Apr 17, 2019
Messages
22
I am actually set pretty well as far as backups go, just not in making sure my FreeNAS box works or in knowing how to recover from problems.
I have my wife's parents' house and my parents' house on a VPN tunnel link. They each have a 10TB USB drive attached, to which I back up the important things: pictures and documents. I have all the rest of my files duplicated to an old server locally.
 

Newtofreenas

Dabbler
Joined
Apr 17, 2019
Messages
22
I am so glad this is a new thread and that others are having this problem too. Not that you guys are having problems, but that someone has seen my problems. Been fighting this for 4 days now.

I am able to work 100% with both the new NIC and the on board 1G connected. But if I transfer a file and watch the dashboard, I notice that the Download part is 100% from the 1G and the upload is 100% on the 40G.

System is perfect if both cables are plugged but won't do anything if the 1G is disconnected, however the 1G works fine if the 40G is disconnected.

If I go computer to FreeNAS with a DAC directly connected, then it works amazingly. But FreeNAS will not work when connected to the pfSense router with only the 40G connected.

If I turn off start-on-boot for my jail/plugin, the txcsum errors do go away:
cxgbe0: tso4 disabled due to -txcsum.
cxgbe0: tso6 disabled due to -txcsum6.

However I still haven't been successful with getting a working freenas box with only the 40G connection plugged in.

I look forward to further follow ups on this problem
pfSense router: T580-CR card
FreeNAS: T580-CR card
Connected with DAC

I will try and figure out what network configuration is on the freenas box.
 