SOLVED NAS Random rebooting, random and sometimes no errors

Joined
Jul 19, 2016
Messages
72
It first started with the NAS suddenly rebooting. The NAS is now several years old: a Supermicro with 16 of its 24 hard drive bays populated, with disks ranging from old 320GB drives to newer 4TB ones. They all run as mirrors, so that if I need more space I just add 2 more drives, and if one fails I just swap the disk in that mirror.

Zpool Status:
Code:
root@freenas:~ # zpool status
  pool: FreeNAS
state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Mar 27 15:04:42 2020
        4.10T scanned at 2.37G/s, 2.56T issued at 3.98G/s, 13.6T total
        0 resilvered, 18.82% done, 0 days 00:47:22 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        FreeNAS                                           DEGRADED     0     0     0
          mirror-0                                        DEGRADED     0     0     0
            gptid/b6ff17e7-5d7b-11e6-9dea-0025900943c8    ONLINE       0     0     0
            replacing-1                                   DEGRADED     0     0     0
              6299447651800029231                         OFFLINE      0     0     0  was /dev/gptid/b7b83d80-5d7b-11e6-9dea-0025900943c8
              gptid/e05a2fe7-7033-11ea-84e3-0025900943c8  ONLINE       0     0     0
          mirror-1                                        ONLINE       0     0     0
            gptid/682e9cdf-5d7d-11e6-9dea-0025900943c8    ONLINE       0     0     0
            gptid/690ee98c-5d7d-11e6-9dea-0025900943c8    ONLINE       0     0     0
          mirror-2                                        ONLINE       0     0     0
            gptid/62810637-5d8f-11e6-a561-0025900943c8    ONLINE       0     0     0
            gptid/646e6ae8-5d8f-11e6-a561-0025900943c8    ONLINE       0     0     0
          mirror-3                                        ONLINE       0     0     0
            gptid/27a61ff8-5d93-11e6-a561-0025900943c8    ONLINE       0     0     0
            gptid/2886473a-5d93-11e6-a561-0025900943c8    ONLINE       0     0     0
          mirror-4                                        ONLINE       0     0     0
            gptid/1722191b-5f8d-11e6-9f9d-0025900943c8    ONLINE       0     0     0
            gptid/5ef74856-1e00-11e8-b701-0025900943c8    ONLINE       0     0     0
          mirror-5                                        ONLINE       0     0     0
            gptid/cf05621c-de46-11e6-aa30-0025900943c8    ONLINE       0     0     0
            gptid/af074c71-605e-11e6-9f9d-0025900943c8    ONLINE       0     0     0
          mirror-6                                        ONLINE       0     0     0
            gptid/ade86ee6-647d-11e6-b38d-0025900943c8    ONLINE       0     0     0
            gptid/aeab2d2c-647d-11e6-b38d-0025900943c8    ONLINE       0     0     0
          mirror-7                                        ONLINE       0     0     0
            gptid/033fc681-6486-11e6-b38d-0025900943c8    ONLINE       0     0     0
            gptid/058ab140-6486-11e6-b38d-0025900943c8    ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:24 with 0 errors on Fri Mar 27 03:45:24 2020
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          ada0p2    ONLINE       0     0     0

errors: No known data errors


So the problem is this sudden reboot every 15-45 min. To start with, it looked like a 3TB disk in the first mirror was failing. "less /var/log/messages" did not show any errors, but I was standing by the NAS one time it rebooted, and the display showed errors on the disk in bay2 just before the reboot. I tried running some SMART tests on it; at first they gave no errors, then some errors, and then no errors again, which was kind of weird.

Without spending too much time on it, I went out and got 2 new 4TB disks to just replace it. Then I ran into the second problem: I was unable to replace the disk. I tried gpart destroy on the disk, and only after a couple of reboots was I able to start the replacement, as one can see from the zpool status above.
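For anyone who wants to pick the non-healthy rows out of a long `zpool status` listing like the one above, here is a minimal Python sketch. It only parses text that has already been captured; the function name `unhealthy_devices` and the `SAMPLE` excerpt are made up for illustration, and the list of states in the regular expression is an assumption that does not cover spares (`AVAIL`/`INUSE`).

```python
import re

# One device row of `zpool status`: NAME STATE READ WRITE CKSUM [note]
# Assumption: only these states are recognised; spare rows are ignored.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|OFFLINE|FAULTED|UNAVAIL|REMOVED)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)"
)

def unhealthy_devices(status_text):
    """Return (name, state, read, write, cksum) for every row that is not
    ONLINE or that has a non-zero error counter. Counters are kept as
    strings because zpool may abbreviate large values (e.g. "1.2K")."""
    flagged = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, blank line, or "state:"/"scan:" text
        name, state, read, write, cksum = m.groups()
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            flagged.append((name, state, read, write, cksum))
    return flagged

# Excerpt of the output posted above, used only as sample input.
SAMPLE = """
        NAME                                              STATE     READ WRITE CKSUM
        FreeNAS                                           DEGRADED     0     0     0
          mirror-0                                        DEGRADED     0     0     0
            gptid/b6ff17e7-5d7b-11e6-9dea-0025900943c8    ONLINE       0     0     0
            replacing-1                                   DEGRADED     0     0     0
              6299447651800029231                         OFFLINE      0     0     0  was /dev/gptid/b7b83d80-5d7b-11e6-9dea-0025900943c8
              gptid/e05a2fe7-7033-11ea-84e3-0025900943c8  ONLINE       0     0     0
"""

for row in unhealthy_devices(SAMPLE):
    print(*row)
```

On a live system the text could come from running `zpool status` and capturing its output. The sketch deliberately stays away from that so it can be tried on a saved paste.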

And this led to the third problem: with the replacement still going, the system keeps randomly rebooting, and the same error appeared on bay2 with the brand new drive. By now I had dusted off an old PC that was lying around, hooked up the "defective" 3TB disk, and started running some extended testing (so far the short and long tests show no errors, and it's doing a zero-fill test now).
It's been years since I cleaned the NAS, so I shut it off and started vacuuming it and blowing it out with compressed air, taking out each bay and getting all the dust out.
After booting up there were no errors on bay2, but still random reboots. This time I saw an error just before the reboot (nothing in the log) that bay4 was acting weird. Again I shut down, took the disk out, checked the SATA cable and put it back in. The error seems to be gone, but there are still some weird reboots.

I am now thinking the fault is either a SATA cable that has suddenly gone bad (not sure if that's possible) or the SATA controller starting to die on me. I will have to check whether bay2 and bay4 are on the same SATA controller to be sure. For now I am just leaving it on to do its resilver, to see if that can finish before the next reboot; it's been on for 35 min without problems so far.

But is there a way to "stress test" more than just SMART on each drive, to see which bay triggers a reboot?
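One rough way to approach this is to read every disk end to end, one at a time, and write a log line to stable storage before each disk starts. If the box reboots mid-read, the last `START` line without a matching `DONE` points at the disk that was busy. Below is a minimal Python sketch of that idea. It is not the solnet-array-test script used later in this thread; the names `read_sweep` and `sweep_all` are invented for illustration, and it only reads, never writes to the disks.

```python
import os
import time

def read_sweep(path, chunk_size=1024 * 1024, max_bytes=None):
    """Read `path` sequentially and return (bytes_read, seconds, error).

    `error` is None on success or the OSError that stopped the read.
    Assumption: when reading a raw disk, chunk_size and max_bytes should
    be multiples of the sector size.
    """
    total = 0
    error = None
    start = time.monotonic()
    try:
        with open(path, "rb", buffering=0) as dev:
            while max_bytes is None or total < max_bytes:
                want = chunk_size
                if max_bytes is not None:
                    want = min(chunk_size, max_bytes - total)
                data = dev.read(want)
                if not data:
                    break  # end of device or file
                total += len(data)
    except OSError as exc:
        error = exc
    return total, time.monotonic() - start, error

def _log(log, text):
    """Write one line and force it to disk so it survives a sudden reboot."""
    log.write(text + "\n")
    log.flush()
    os.fsync(log.fileno())

def sweep_all(paths, log_path):
    """Read each path in turn, logging START before and DONE after."""
    results = {}
    with open(log_path, "a") as log:
        for path in paths:
            _log(log, f"START {path}")
            total, secs, error = read_sweep(path)
            status = "OK" if error is None else f"ERROR {error}"
            _log(log, f"DONE {path} {total} bytes {secs:.1f}s {status}")
            results[path] = (total, error)
    return results
```

A call might look like `sweep_all(["/dev/da0", "/dev/da1"], "/root/sweep.log")`, where the device names and log path are placeholders for whatever the system actually uses. The log should sit on a disk that is not part of the pool under test. Reading serially is slow on large disks but keeps the blame unambiguous; reading several disks in parallel would load the controller harder but makes it less clear which disk was responsible.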
 

Attachments

  • freenas_system11.png
Joined
Jul 19, 2016
Messages
72
It hasn't rebooted yet, but now bay4 is showing this. Anyone know what this means?
 

Attachments

  • IMG_20200328_124651.jpg

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Joined
Jul 19, 2016
Messages
72
That implies a disk read error.

So maybe there are 2 drives failing on me, one in each of 2 different mirrors?

It has gone hours without a reboot now and the resilvering is at 86%, so I will let it run until it finishes, then I will try that test in your link.

EDIT: This error on bay4 is just the same as it was on the "faulty" drive in bay2. And now, after full short/long SMART tests and a zero-fill of that disk in another machine, there is no error on that disk. Might it be something else, like the SATA controller?
 
Joined
Jul 19, 2016
Messages
72
Little update today. The disk in bay2 was put back in, but I replaced the other one, the disk showing errors in bay4. The bay4 disk has now run a WD test using Ultimate Boot CD on a different machine, and this unit might be the main culprit. It seems to "crash" on even a minor SMART test. I did a full zero-fill on it and that was successful, but the SMART test does not seem to work even now. So I think that unit is bad.

I am now running this solnet-array-test on the rest of the system to see if there are other drives giving errors. This will probably take some time.
 
Joined
Jul 19, 2016
Messages
72
SOLVED

Here is the culprit for it all, it seems: the disk in bay4. Though the disk in bay2 reported errors, a full disk test in a different PC showed no errors. But the disk from bay4 failed all SMART tests in a different machine. And the solnet-array-test showed no errors on the rest of the system.

What I do find a bit weird is that this relatively new 3TB disk failed, but the older 320GB and 500GB disks keep going strong...
 

Attachments

  • IMG_20200329_181029.jpg