After replacing a disk, all other disks are shown as degraded except the replacement. Is this normal?

Ganzir

Explorer
Joined
May 15, 2016
Messages
57
Hello,

I just had to replace a disk that showed up as "faulted".

I followed the guide that can be found here:

In the last picture of that guide, all drives appear as "online"; that is not the case on my system.

The only drive that appears as "online" is the one I just replaced; everything else appears as "degraded".

The GUI also shows that the resilvering process is in progress.

Is this normal, or is something else wrong?

I am using FreeNAS-11.0-U2.

Thanks in advance.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Sounds normal, and like a specific UI behavior. What does zpool status from the CLI show you? Use code tags when you paste that, please.
 

Ganzir

Explorer
Joined
May 15, 2016
Messages
57
Here you go

Code:
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Mon Jul 27 03:45:03 2020
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          nvd0p2    ONLINE       0     0     0

errors: No known data errors

  pool: mypool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Aug 22 20:53:43 2020
        810G scanned out of 4.23T at 308M/s, 3h14m to go
        203G resilvered, 18.72% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        mypool                                          DEGRADED     0     0 1.78K
          raidz1-0                                      DEGRADED     0     0 6.52K
            gptid/d81e36b2-1a9a-11e6-b0c5-10c37b4dfcfa  DEGRADED     0     0     0  too many errors
            gptid/d8d85c83-1a9a-11e6-b0c5-10c37b4dfcfa  DEGRADED     0     0     0  too many errors
            gptid/c20e79a0-e4a8-11ea-9166-10c37b4dfcfa  ONLINE       0     0     0  (resilvering)
            gptid/da72f949-1a9a-11e6-b0c5-10c37b4dfcfa  DEGRADED     0     0     0  too many errors

errors: 316 data errors, use '-v' for a list
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
You will want to dig into those data errors. That doesn't look good to my untrained eyes.
 

Ganzir

Explorer
Joined
May 15, 2016
Messages
57
How would I do that? What would you recommend?

I was able to connect to the share from Windows, and everything was shown as I remembered it, but accessing a folder led to a timeout. That may be due to the resilvering being in progress, since it said that performance may be reduced.

EDIT: I connected to the network share from the Windows client again and was able to copy a file from the share to Windows. So it seems all is not lost.

Thanks for the help so far; any ideas on how to tackle those errors would be greatly appreciated.

EDIT2: The resilvering is now complete and I can access the share; however, the pool remains in a degraded state, and only the newly added disk appears as "online".

It tells me there are 495 errors. How can I address those?

EDIT3: When I type "zpool status -v" it produces a list that is so long I cannot see the beginning, and I have no idea how to scroll upward in the shell. Any hint is appreciated.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Enable the SSH service and allow root access for it in the service settings; then use ssh to connect, either from PowerShell (Windows) or Terminal (macOS/Linux), or PuTTY if you are on an old version of Windows.

Once connected: zpool status -v | less. Inside less, you can scroll down with SPACE, up with b, quit with q, and search with /.
You can also redirect those errors into a file: zpool status -v > ~/my-zfs-errors.txt. Then use WinSCP, or a similar tool on macOS/Linux, to copy the file to your local desktop.
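
For example (a minimal sketch; the output filename is arbitrary):

Code:
# page through the full error list interactively
zpool status -v | less

# or capture it to a file you can copy off the box
zpool status -v > ~/my-zfs-errors.txt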

You'll also want smartctl output for each of the drives that are showing errors. That's assuming you do run regular SMART tests on those. Use smartctl -x /dev/adaX if SATA; you'll see the drive devices under Storage->Pool, then the cog and "Status" in FreeNAS. Just as with any command, you can pipe the output into less or redirect it into a file.
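
For instance, assuming the disks show up as ada0 through ada3 (verify the device names first as described above):

Code:
# full SMART report for one disk, paged
smartctl -x /dev/ada0 | less

# or saved to a file, one per disk
smartctl -x /dev/ada0 > ~/smart-ada0.txt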

I have not dealt with failing pools on ZFS; that is why I say "untrained eyes". The purpose of smartctl is to see whether the errors are caused by more failing disks, though that doesn't sound likely since the errors are so spread out. The other possibilities would be bad memory, bad cables, or a bad controller.

You may want to list your hardware so people who come into this thread and know more can assist.
 

Ganzir

Explorer
Joined
May 15, 2016
Messages
57
Hello,

thanks for your reply. In the meantime I managed to connect via PuTTY, which allows me to scroll up and down, so this problem is solved.

I started to follow this guide here
to diagnose whether there is something wrong with the disks.

I ran

Code:
smartctl -t long /dev/ada0


overnight, which yielded the following (those are the results for the first disk only; I am just running it on the second disk as I type this):

Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   218   021    Pre-fail  Always       -       7266
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       144
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10232
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       144
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       23
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       120
194 Temperature_Celsius     0x0022   119   100   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       23

The guide tells me:

ID 200 MultiZone Error Rate can be the cause of a drive failure although a value in this location doesn't always mean it's the fault. It is notable if there are no other failing indications.

I'd say this is the case here, because the values for IDs 5, 197, 198 and 199 are zero. Or would you come to a different conclusion?

So would I have to replace all remaining drives that show a Multi_Zone_Error_Rate counter above 0 if the other indicators are 0?

And how do I go about diagnosing the rest of the system? It's not like I have spare parts lying around to swap out components. Any advice is most welcome.

Regarding the hardware:

HDDs: 4 x Western Digital WD RE4-GP 2 TB (one was faulty and has been replaced by a WD WD80EFAX Red 8 TB)
RAM: 8 GB (I can't remember which vendor)
CPU: Intel i5-4690 @ 3.50 GHz
Mainboard: Asus H97I-Plus
Case: Fractal Node 304

The system was built in May 2016.

Any further input is appreciated.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
I don't see anything in SMART that would indicate the drive is failing. The normalized multi-zone error rate is 200, and you have no reallocated sectors, read errors, or anything else that would indicate the drive failing.

What are the errors zpool status -v shows you? Unrecoverable files?

You can test memory with memtest86; you'd want to run that for a few days to be sure. FreeNAS will be down during that time.
 

Ganzir

Explorer
Joined
May 15, 2016
Messages
57
Yes, there are some files which are damaged permanently. Mostly PDF documents, which can be recovered from elsewhere. However, some files seem to be FreeNAS-related. See below:

Code:
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200501.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200502.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200503.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200504.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200505.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200506.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200507.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200508.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200509.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200510.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200511.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200512.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200513.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200514.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200515.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200516.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200517.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200518.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200519.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200520.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200521.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200522.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200523.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200524.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200525.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200526.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200527.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200528.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200529.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200530.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200531.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200601.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200602.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200603.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200604.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200605.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200606.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200607.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200608.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200609.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200610.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200611.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200612.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200613.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200614.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200615.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200616.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200617.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200618.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200619.db
        /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200620.db


OK, regarding memtest, I could do something else:

It just so happens that I have 16 GB of memory lying around, so I could put that into the NAS and test the 8 GB elsewhere, maybe with Karhu, of which I happen to own a copy; from what I've heard, Karhu tests more extensively than memtest.

Is it problematic to swap out RAM in the FreeNAS system? I am mainly a Windows user, so I know that swapping RAM under Windows is no problem, but I do not know how FreeBSD behaves in this regard.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Those are old config backups and can be safely deleted.

You can safely swap RAM, no problem at all. As long as your board supports it, you are golden - and if it doesn't it won't even POST.

Do you run a regular scrub? If not, I'd run one now, and see what it comes back with. Then get rid of any files that it labels as permanently damaged, and replace from backup where possible.
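
With your pool name, that would be, from the SSH shell:

Code:
# kick off a scrub of the data pool
zpool scrub mypool

# check progress and results (the "scan:" line)
zpool status -v mypool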

Good maintenance for a FreeNAS system, for AFTER you have this resolved, is:

- Short and long SMART tests, scheduled. I have a short weekly and a long monthly.
- Scrub, scheduled. I have threshold days at 35, and the task set to Sunday. That way it runs every 5 weeks.
- Email alerts set up. You need to know when something goes wrong.
- If you have IPMI: Email alerts for that set up, as well.
- Optional: Config backups daily synced to cloud. You can rely on the automatic ones, but you're relying on the pool being healthy to recover them.
- Recommended: Scheduled snapshots, kept for 1 week, 2 week, whatever you have in mind, so you can get back from accidental deletion, ransomware, and such. I have a separate scheduled snapshot run for each top-level dataset, recursive. That way I can set duration individually.
- Some form of backup. A lot of folk swear by 3-2-1: Three copies of data, two local on different devices, one offsite. See https://www.backblaze.com/blog/the-3-2-1-backup-strategy/. Your mileage will vary. For example, for all my documents, pictures, music, it's 3-2-1: Master copy on PC, backup copy on TrueNAS, synced copy on OneDrive. But for my Bluray/DVD collection, it's 2-2: One physical copy on Bluray/DVD, one copy on TrueNAS. It's 12 GiB of data and growing, I don't have the bandwidth to sync it to Backblaze. If my house ever burns down, I figure I have bigger worries than whether I need to buy movies again.
- Arguably, a patch cycle, as with any software. Just to make sure known vulnerabilities are patched. Being conservative with that is fine, but I'd avoid going into EOL. 11.2-U8 is the most conservative stable version; 11.3-U4.1 can be trusted; and 12.0 won't be ready for consideration before 2021. Note that 11.2 changes the jail manager. If you are using jails, they'd need to be re-created.

iXsystems has now documented a software life cycle at https://www.truenas.com/docs/hub/intro/lifecycle/. The two most current major versions remain supported, which means you could, if so desired, move up one version roughly every year, or make larger jumps of two versions roughly every two years.
 

Ganzir

Explorer
Joined
May 15, 2016
Messages
57
Hello,

thanks for your continued and extensive support. Truth be told, I built the system and let it run, not caring about it one bit; that is the long and short of it. So until now, none of the things you suggested was in place. I will, however, take your advice to heart and set up the tests you suggested.

Since I do not work that often with Linux, and even more seldom with FreeBSD, I have some additional questions.

Some of the corrupt files I deleted from the Windows share, simply by right-click -> delete in Windows.

Those backups are not on the Windows share, so would I do something like

Code:
rm /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200501.db


from the shell or via SSH, and then do so for each and every file appearing on the list that cannot be deleted from Windows? Or is something else required?

I am kind of worried of breaking more than I am fixing.

Regarding those scrubs: can I run one while the long SMART test is running on ada1, or is that inadvisable, so that I should wait until the SMART test is finished?

Thanks again and kind regards.

EDIT:

Some additional files that show up when doing "zpool status -v" are the following:

Code:
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x16e>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x16f>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x170>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x171>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x172>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x173>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x174>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x175>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x176>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x177>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x178>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x179>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x17a>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x17b>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x17c>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x17d>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x17e>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x17f>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x180>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x181>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x182>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x183>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x184>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x185>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x186>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x187>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x188>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x189>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x18a>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x18b>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x18c>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x18d>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x18e>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x18f>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x190>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x191>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x192>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x193>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x194>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x195>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x196>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x197>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x198>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x199>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x19a>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x19b>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x19c>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x19d>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x19e>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x19f>
        mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x1a0>


What are those?
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Yes, you can remove files from the SSH shell. You'll need quotes because the file path has spaces:

Code:
rm "/var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/FreeNAS-11.0-U2 (e417d8aa5)/20200501.db"


Keep in mind you can tab-complete long paths with the TAB key.

You can blow away the entire config backup directory as well, but you'll want to take care to copy-paste the command in its entirety and not hit ENTER until you have verified it. This can remove critical system files if the path is wrong.

Code:
rm -rf /var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/


If you make a mistake and run that on /, you'll destroy everything including the contents of your pool, so take extra care.
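
A cautious way to go about it: list the directory first, so you can see exactly what the path matches, and only then remove it.

Code:
# verify the path contains only the old config backups
ls "/var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/"

# only after verifying, remove the whole directory
rm -rf "/var/db/system/configs-411517dcdf2b43f3a96114fb41a9292c/"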

I believe the additional files you are showing are metadata. Corrupt metadata can get you in a state where the only way to recover is to destroy and re-create the pool. Let's hope that's not the case here. You may need to delete that entire directory to resolve the error. It should be re-created automatically when the system next takes an automated config backup. Speaking of, it's probably a good idea to take a manual config backup just in case: System->General and save config.

Regarding those Scrubs, can I run those, while the long SMART test is running

I'd think you can run a scrub of your data pool in parallel, it might just take a little longer.

I build the system an let it run, not caring about it one bit

I assumed that. Otherwise, you'd likely have had warnings before you got into this state. It's fine, most people don't want to be sys admins. All these "best practice" steps are a form of "smart lazy": Some work up front, so there is less work trying to repair damage, later.
 

Ganzir

Explorer
Joined
May 15, 2016
Messages
57
Thank you.

A

Code:
rm 'mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x19d>'


does not work; it yields "No such file or directory".

I tried setting up a scrub; however, the GUI is somewhat confusing (at least to me):

In my version there are the fields:

Threshold days

Minute (two tabs, "every N minutes" or "each selected minute"; the same applies for hour and day of month. I chose "each selected" for each of the three fields)

Hour

Day of month

Month (checkboxes for every month)

Day of week (again checkboxes)

Now, do I understand this correctly? I chose:

Threshold days: 20
Minute: 15
Hour: 20
Day of month: 23rd
Months: all checkboxes activated
Day of week: Sunday

and there is a checkbox that says "enabled", which I checked as well.

So if 20 days have passed since the last scrub, it is 20:15 on the 23rd of a given month, and that day is a Sunday (like today), then the system will perform a scrub?

Can I monitor somewhere what the system is currently doing? Something like a list of current tasks:

a) SMART long test in progress, X%
b) Scrub in progress, Y%

Or do I have to develop the sysadmin skill of simply knowing what my system is doing at any given moment?

Regarding system updates: the system tells me that there are updates available, and I certainly do not consider myself anywhere near proficient enough to cope with issues that might arise from running a beta release; 11.3 seems to be the newest stable version available.

Can I update directly to this version, or do I have to patch the system to the most current stable version of 11.0, then 11.1, 11.2, and finally 11.3?

Or should I go with another version which, from your point of view, may be more suitable?

Again many thanks for your support so far.

EDIT:

Here are the results from ada1

Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0024   100   253   000    Old_age   Offline      -       0
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       142
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002f   200   200   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0024   100   253   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       11868
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       142
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   062   041   000    Old_age   Always       -       38
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       21
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       120
194 Temperature_Celsius     0x0022   114   093   000    Old_age   Always       -       38
195 Hardware_ECC_Recovered  0x0036   200   200   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       2
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
rm 'mypool/.system/configs-411517dcdf2b43f3a96114fb41a9292c:<0x19d>'

That's metadata. It's data about where a file in the directory keeps its bits. You can't rm it. Only files can be deleted.

ada1 is looking good.

So if 20 days have passed since the last scrub. it 20:15 on the 23rd of a given month and that day is a Sunday (like today) than the system will perform a scrub?

Yes, that's exactly it. You can also kick off a scrub from the command line with zpool scrub poolname
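
With your pool, that's:

Code:
# start a scrub manually
zpool scrub mypool

# and, should you ever need to, stop a running one
zpool scrub -s mypool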

Can I monitor somewhere what the system is currently doing?

You can, though I admit I'm not doing that. I'm just relying on the email messages that tell me a scrub finished. smartctl -x /dev/DISKDEVICE will show you the status of a running SMART test, and zpool status should show the status of an ongoing scrub. I think. As I said, I'm not actually doing any of that; I rely on email alerts.
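
Something like this should work (ada0 as an example device):

Code:
# scrub progress appears in the "scan:" line
zpool status mypool

# self-test progress appears under "Self-test execution status"
smartctl -a /dev/ada0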

Can I update directly to this version, or do I have to patch the system to the most current stable version of 11.0, than 11.1 11.2 and finally 11.3?

People have updated directly and been successful, and also not been successful. I'm conservative when it comes to tech, so I'd go to 11-STABLE first (that'll be some value of 11.1), then 11.2, then move over / re-create any and all warden jails as iocage jails because warden jails will not work in 11.3, and THEN move to 11.3. Caveat: there is a known issue with GELI-encrypted volumes and 11.3; if that affects you, maybe halt at 11.2.
 

Ganzir

Explorer
Joined
May 15, 2016
Messages
57
OK, I am not using jails or GELI-encrypted volumes (as far as I know); however, I will do it step by step until I reach 11.2.

Here again the question: is updating while I know that a scrub is in progress completely nuts, or is it just fine to do?
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
It's fine, ZFS will resume the scrub, but upgrading doesn't help you resolve your issue, and I question why you'd want to troubleshoot more than one thing at a time. I'd hold off until your pool is in good shape.
 

Ganzir

Explorer
Joined
May 15, 2016
Messages
57
OK thanks, the scrub is underway

Code:
root@freenas:~ # zpool status
  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Mon Jul 27 03:45:03 2020
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          nvd0p2    ONLINE       0     0     0

errors: No known data errors

  pool: mypool
state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Sun Aug 23 20:15:00 2020
        1.09T scanned out of 3.41T at 298M/s, 2h16m to go
        344K repaired, 31.88% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        mypool                                          DEGRADED     0     0 10.0K
          raidz1-0                                      DEGRADED     0     0 36.9K
            gptid/d81e36b2-1a9a-11e6-b0c5-10c37b4dfcfa  DEGRADED     0     0     9  too many errors  (repairing)
            gptid/d8d85c83-1a9a-11e6-b0c5-10c37b4dfcfa  DEGRADED     0     0     0  too many errors
            gptid/c20e79a0-e4a8-11ea-9166-10c37b4dfcfa  ONLINE       0     0     0
            gptid/da72f949-1a9a-11e6-b0c5-10c37b4dfcfa  DEGRADED     0     0     0  too many errors

errors: 501 data errors, use '-v' for a list


However, now there are 501 data errors, and I did not write anything onto the share in the meantime. Is the pool now deteriorating by itself?

I started the long SMART test on ada3. ada2 was the one I replaced, so I see no point in testing it, since it could not have contributed to the state of degradation.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
344K repaired, 31.88% done

Something's afoot. You have something in your system that's actively contributing to %&%&ing up your data. If it's not a disk - and it shouldn't be, three of four are proven fine - it's memory or whatever the disks plug into.

I'd replace memory sooner rather than later. I'm not sure what else to suspect heavily. There's always "board/CPU", which is the worst-case scenario.

I really hope you get this figured out.
 

Ganzir

Explorer
Joined
May 15, 2016
Messages
57
OK, thanks for the advice. I just ordered new RAM after noticing that the RAM I have lying around is DDR4 and the system uses DDR3 ...

Here is the final result of the scrub that had been running since 20:15:


Code:
root@freenas:~ # zpool status
  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Mon Jul 27 03:45:03 2020
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          nvd0p2    ONLINE       0     0     0

errors: No known data errors

  pool: mypool
state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 740K in 3h39m with 323 errors on Sun Aug 23 23:54:02 2020
config:

        NAME                                            STATE     READ WRITE CKSUM
        mypool                                          DEGRADED     0     0 15.4K
          raidz1-0                                      DEGRADED     0     0 57.8K
            gptid/d81e36b2-1a9a-11e6-b0c5-10c37b4dfcfa  DEGRADED     0     0    18  too many errors
            gptid/d8d85c83-1a9a-11e6-b0c5-10c37b4dfcfa  DEGRADED     0     0     0  too many errors
            gptid/c20e79a0-e4a8-11ea-9166-10c37b4dfcfa  ONLINE       0     0     0
            gptid/da72f949-1a9a-11e6-b0c5-10c37b4dfcfa  DEGRADED     0     0     0  too many errors


EDIT:
Furthermore, the long SMART test of ada3 completed; here are the results:

Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       5
  3 Spin_Up_Time            0x0027   253   225   021    Pre-fail  Always       -       8233
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       143
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       11865
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       143
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       22
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       120
194 Temperature_Celsius     0x0022   116   095   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0


So the 3 older remaining drives seem to be OK.

I also noticed that after the scrub, the corrupted metadata (those entries that could not be removed with rm) are now gone.

And it now shows 'only' 323 data errors, so the scrub has done some good, I guess.

Once I have replaced the RAM, how will the system be brought back to a non-degraded state?

Kind regards
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
I also noticed that after the scrub, the corrupted metadata (those entries that could not be removed with rm) are now gone.

That's great, that'll really help.

And it now shows 'only' 323 data errors, so the scrub has done some good I guess.

All in files that can't be recovered? You'll delete all of them, then run another scrub. It should come back clean. Then restore files from backup where you can.
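
One more thing, and do double-check this against your version's docs: once the scrub comes back clean, the pool may still show DEGRADED until the per-device error counters are reset. Something like:

Code:
# clear the accumulated error counters; disks that were degraded
# only because of "too many errors" should return to ONLINE
zpool clear mypool

# then verify
zpool status mypool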

"RAM" was just a wild guess, it's entirely possible that's not related to your issue at all. You'll find out when you test the stick that's in there now.
 