System rebooted during disk replace operation

Status
Not open for further replies.

REDPOOL

Cadet
Joined
Sep 28, 2017
Messages
6
Hello forum,

Please help me diagnose a reboot that occurred during a disk replace operation.

Symptom:
The system rebooted during a disk replace operation. After the reboot it turned out the replace had completed successfully. The pool is healthy and is resilvering now. No errors are reported.

System:
x510 / 2GB RAM / E7400 CPU with 4x1TB disks configured as RAIDZ1 (REDPOOL - ada4/ada5/ada6/ada7)
IcyDock JBOD connected to the x510 via eSATA, with 4x1TB disks configured as RAIDZ1 (ICEPOOL - ada0/ada1/ada2/ada3)
FreeNAS 11.0-U3 + emby jail (powered off) + plex jail (running)
Periodic snapshots scheduled to run 09:00-18:00 every day. Replication from the x510 to the IcyDock on all datasets with periodic snapshots.

What led up to the reboot:
1) In ICEPOOL I set disk ada3 to OFFLINE
2) In REDPOOL I set disk ada6 to OFFLINE
3) Removed disks ada3 and ada6
4) Installed disk ada3 in the ada6 bay
5) Installed disk ada6 in the ada3 bay
6) Initiated REPLACE to member disk ada6 in REDPOOL with the force option
7) Waited for the REPLACE to finish. The web GUI was not responsive, Samba was not responsive, and PING failed.
8) It turned out the system had restarted, because all services later loaded normally
9) After the restart, I initiated REPLACE to member disk ada3 in ICEPOOL. It passed successfully without a restart and is resilvering now.
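For the record, the same workflow from the command line would look roughly like this. This is a sketch only: the pool and disk names are the ones from this thread, the gptids are placeholders for whatever `glabel status` reports on your system, and `DRYRUN=echo` makes the script print the commands instead of running them.

```shell
# Sketch of the GUI steps above as CLI commands (FreeNAS 11 / FreeBSD).
# gptid/... values are placeholders; remove DRYRUN=echo to actually run.
DRYRUN=echo
$DRYRUN zpool offline ICEPOOL gptid/OLD-ADA3-GPTID
$DRYRUN zpool offline REDPOOL gptid/OLD-ADA6-GPTID
# ...power down, physically swap the two disks, power up...
$DRYRUN zpool replace -f REDPOOL gptid/OLD-ADA6-GPTID gptid/NEW-GPTID
$DRYRUN zpool status REDPOOL   # watch the resilver progress
```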

zpool status on both pools reports that drives are ONLINE and resilvering.

No reboot/halt messages. I cannot determine at what point the system actually rebooted. A sector reallocation event should not cause a system restart.
The last records in the messages log before the restart:
Sep 28 09:12:24 VAULT (aprobe3:siisch0:0:3:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Sep 28 09:12:24 VAULT (aprobe3:siisch0:0:3:0): CAM status: Uncorrectable parity/CRC error
Sep 28 09:12:24 VAULT (aprobe3:siisch0:0:3:0): Error 5, Retries exhausted
Sep 28 09:12:24 VAULT (aprobe3:siisch0:0:3:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
Sep 28 09:12:24 VAULT (aprobe3:siisch0:0:3:0): CAM status: Uncorrectable parity/CRC error
Sep 28 09:12:24 VAULT (aprobe3:siisch0:0:3:0): Error 5, Retries exhausted
Sep 28 09:13:48 VAULT smartd[73500]: Device: /dev/ada6, Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
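One way to bracket the restart time from the attached messages.txt (a sketch; on the live system the path would be /var/log/messages): FreeBSD prints its copyright banner on every boot, so the first banner after 09:12 shows when the box came back up.

```shell
# Each FreeBSD boot writes a copyright banner into the messages log;
# the line number of the first banner after 09:12 brackets the restart.
grep -n 'Copyright (c) 1992-' messages.txt | tail -n 5
```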
 

REDPOOL

This is what FreeNAS sent by email after I clicked REPLACE on ada6 and before the unexpected restart.
Code:
The volume ICEPOOL state is DEGRADED: One or more devices has been taken offline by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The volume REDPOOL state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.


It looks like the REPLACE completed and the resilvering process was ongoing when FreeNAS suddenly rebooted. What could cause it to reboot if the replace was successful?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
What could cause it to reboot if replace was successful?
Any number of things, and you haven't given us anything to go on. If there's nothing in the system log indicating a problem, my first guess would be power, but that's only a guess. It's also possible that the crappy SATA port expander in your "JBOD" unit caused the problem. But why were you doing that replace operation?
 


REDPOOL

Thank you for your reply.

This was not a power issue; the devices are on a UPS. I have a copy of all of /var/log with me. I've reviewed the logs and found nothing that could lead to a restart. I know the system restarted after 09:12, and the fragment of the messages log in the first post covers that period.

If it is due to a flaky SATA port expander, perhaps dmesg would give some information. I've reviewed it already; please take a look as well, or ask for any other log you think we should look into.

I was doing the REPLACE operations in order to place the 5400 RPM WD disks in one pool and the 7200 RPM Seagate disks in the other. They were mixed before, and that could reduce performance during parallel reads/writes.

Thanks in advance.

Below is the part of debug.log that covers the actions run on the backend when I clicked REPLACE on the offline drive.
Code:
Sep 28 09:10:45 VAULT uwsgi: [middleware.notifier:151] Executing: /sbin/swapoff /dev/ada6p1.eli
Sep 28 09:10:45 VAULT uwsgi: [middleware.notifier:172] Executed: /sbin/swapoff /dev/ada6p1.eli -> 1
Sep 28 09:10:45 VAULT uwsgi: [middleware.notifier:151] Executing: /sbin/geli detach /dev/ada6p1
Sep 28 09:10:45 VAULT uwsgi: [middleware.notifier:172] Executed: /sbin/geli detach /dev/ada6p1 -> 1
Sep 28 09:10:45 VAULT uwsgi: [middleware.notifier:151] Executing: dd if=/dev/zero of=/dev/ada6 bs=1m count=32
Sep 28 09:10:46 VAULT uwsgi: [middleware.notifier:172] Executed: dd if=/dev/zero of=/dev/ada6 bs=1m count=32 -> 0
Sep 28 09:10:46 VAULT uwsgi: [middleware.notifier:198] Popen()ing: diskinfo ada6
Sep 28 09:10:46 VAULT uwsgi: [middleware.notifier:151] Executing: dd if=/dev/zero of=/dev/ada6 bs=1m oseek=953837
Sep 28 09:10:46 VAULT uwsgi: [middleware.notifier:172] Executed: dd if=/dev/zero of=/dev/ada6 bs=1m oseek=953837 -> 1
Sep 28 09:10:46 VAULT uwsgi: [middleware.notifier:198] Popen()ing: gpart create -s gpt /dev/ada6
Sep 28 09:10:46 VAULT uwsgi: [middleware.notifier:198] Popen()ing: gpart add -a 4k -b 128 -t freebsd-swap -s 4194304 ada6
Sep 28 09:10:46 VAULT uwsgi: [middleware.notifier:198] Popen()ing: gpart add -a 4k -t freebsd-zfs ada6
Sep 28 09:10:46 VAULT uwsgi: [middleware.notifier:198] Popen()ing: gpart bootcode -b /boot/pmbr-datadisk /dev/ada6
Sep 28 09:10:46 VAULT uwsgi: [middleware.notifier:198] Popen()ing: zpool status freenas-boot
Sep 28 09:10:47 VAULT uwsgi: [middleware.notifier:151] Executing: gpart destroy -F /dev/gptid/62d3330a-a40b-11e7-a9cd-00262d0143d3
Sep 28 09:10:47 VAULT uwsgi: [middleware.notifier:172] Executed: gpart destroy -F /dev/gptid/62d3330a-a40b-11e7-a9cd-00262d0143d3 -> 1
Sep 28 09:10:47 VAULT uwsgi: [middleware.notifier:151] Executing: gpart create -s gpt /dev/gptid/62d3330a-a40b-11e7-a9cd-00262d0143d3
Sep 28 09:10:47 VAULT uwsgi: [middleware.notifier:172] Executed: gpart create -s gpt /dev/gptid/62d3330a-a40b-11e7-a9cd-00262d0143d3 -> 1
Sep 28 09:10:47 VAULT uwsgi: [middleware.notifier:151] Executing: gpart destroy -F /dev/gptid/62d3330a-a40b-11e7-a9cd-00262d0143d3
Sep 28 09:10:47 VAULT uwsgi: [middleware.notifier:172] Executed: gpart destroy -F /dev/gptid/62d3330a-a40b-11e7-a9cd-00262d0143d3 -> 1
Sep 28 09:10:47 VAULT uwsgi: [middleware.notifier:198] Popen()ing: dd if=/dev/zero of=/dev/gptid/62d3330a-a40b-11e7-a9cd-00262d0143d3 bs=1m count=32
Sep 28 09:10:47 VAULT uwsgi: [middleware.notifier:198] Popen()ing: diskinfo gptid/62d3330a-a40b-11e7-a9cd-00262d0143d3
Sep 28 09:10:47 VAULT uwsgi: [middleware.notifier:198] Popen()ing: dd if=/dev/zero of=/dev/gptid/62d3330a-a40b-11e7-a9cd-00262d0143d3 bs=1m oseek=951789
Sep 28 09:10:48 VAULT uwsgi: [middleware.notifier:198] Popen()ing: /sbin/zpool replace -f REDPOOL 3317370949338195174 gptid/62d3330a-a40b-11e7-a9cd-00262d0143d3
Sep 28 09:10:52 VAULT uwsgi: [middleware.notifier:151] Executing: /sbin/zpool detach REDPOOL 3317370949338195174
Sep 28 09:11:02 VAULT uwsgi: [middleware.notifier:172] Executed: /sbin/zpool detach REDPOOL 3317370949338195174 -> 0
Sep 28 09:11:02 VAULT uwsgi: [middleware.notifier:151] Executing: /sbin/geli onetime /dev/ada6p1
Sep 28 09:11:03 VAULT uwsgi: [middleware.notifier:172] Executed: /sbin/geli onetime /dev/ada6p1 -> 0
Sep 28 09:11:03 VAULT uwsgi: [middleware.notifier:151] Executing: /sbin/swapon /dev/ada6p1.eli
Sep 28 09:11:03 VAULT uwsgi: [middleware.notifier:172] Executed: /sbin/swapon /dev/ada6p1.eli -> 0
Sep 28 09:11:05 VAULT /autorepl.py: [tools.autorepl:221] Autosnap replication started
Sep 28 09:11:05 VAULT /autorepl.py: [tools.autorepl:222] temp log file: /tmp/repl-43760
Sep 28 09:11:05 VAULT /autorepl.py: [tools.autorepl:291] Checking dataset REDPOOL/archive
Sep 28 09:11:17 VAULT /autorepl.py: [tools.autorepl:291] Checking dataset REDPOOL/backup
Sep 28 09:11:29 VAULT /autorepl.py: [tools.autorepl:291] Checking dataset REDPOOL/media/family
Sep 28 09:11:30 VAULT /autorepl.py: [tools.autorepl:291] Checking dataset REDPOOL/users
Sep 28 09:11:31 VAULT /autorepl.py: [tools.autorepl:629] Autosnap replication finished
Sep 28 09:12:03 VAULT uwsgi: [ws4py:360] Closing message received (1000) 'b'''
Sep 28 09:12:07 VAULT /autorepl.py: [tools.autorepl:221] Autosnap replication started
Sep 28 09:12:07 VAULT /autorepl.py: [tools.autorepl:222] temp log file: /tmp/repl-44394
Sep 28 09:12:07 VAULT /autorepl.py: [tools.autorepl:291] Checking dataset REDPOOL/archive
Sep 28 09:12:19 VAULT /autorepl.py: [tools.autorepl:291] Checking dataset REDPOOL/backup
Sep 28 09:12:20 VAULT /autorepl.py: [tools.autorepl:291] Checking dataset REDPOOL/media/family
Sep 28 09:12:26 VAULT /autorepl.py: [tools.autorepl:291] Checking dataset REDPOOL/users
Sep 28 09:12:38 VAULT /autorepl.py: [tools.autorepl:629] Autosnap replication finished
Sep 28 09:13:09 VAULT /autorepl.py: [tools.autorepl:221] Autosnap replication started
Sep 28 09:13:09 VAULT /autorepl.py: [tools.autorepl:222] temp log file: /tmp/repl-44578
Sep 28 09:13:09 VAULT /autorepl.py: [tools.autorepl:291] Checking dataset REDPOOL/archive
Sep 28 09:15:04 VAULT /autosnap.py: [tools.autosnap:136] Checking if process 44853 is still alive
Sep 28 09:15:04 VAULT /autosnap.py: [tools.autosnap:140] Process 44853 still working, quitting
Sep 28 09:16:05 VAULT /autosnap.py: [tools.autosnap:136] Checking if process 44853 is still alive
Sep 28 09:16:05 VAULT /autosnap.py: [tools.autosnap:140] Process 44853 still working, quitting
Sep 28 09:17:04 VAULT /autosnap.py: [tools.autosnap:136] Checking if process 44853 is still alive
Sep 28 09:17:04 VAULT /autosnap.py: [tools.autosnap:140] Process 44853 still working, quitting
 

Attachments

  • dmesg.today.txt (18.1 KB)
  • messages.txt (1.2 MB)

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
Are you seriously trying to run FreeNAS on 2GB of RAM?
Maybe you missed it the first time.

There is no way you are going to get anywhere trying to run FreeNAS on 2GB of RAM. And to add insult to injury, you're trying to run jails as well.

Minimum Hardware Requirements:
These specifications will suffice to get a small FreeNAS install running reliably with moderate performance for a few users.

  • Multicore 64-bit* processor (Intel strongly recommended)
  • 8GB* Boot Drive (USB Flash Drive suffices)
  • 8GB* RAM
  • At least 1 direct attached disk (Hardware RAID strongly discouraged)
  • One physical network port
http://www.freenas.org/hardware-requirements/
 

REDPOOL

Dear Jailer and danb35,

before you start cursing me for using ZFS with 2GB of RAM, please see below :)

I read the manuals; I know they say 8GB is the minimum. I realize that 2GB is too low for optimal ZFS performance, for ARC requests, and for prefetch. But if I'm not mistaken, those are related to the performance of the system, not its stability. Why should it fail now, on the disk replace function, when it worked fine before while serving Samba, transcoding, taking daily snapshots, and replicating from an uncompressed dataset in one pool to a gzip dataset in another pool, and 2GB was enough all that time? Then suddenly, when I do a disk replace, 2GB is not enough?
 

Jailer

But if I'm not mistaken those are related to performance of the system not to stability.
You are mistaken. As you have discovered, running less than the minimum spec doesn't matter... until it does.
 

REDPOOL

And it didn't utilize swap all this time. 1GB is used by the system/user and 1GB is allocated to the ARC during normal operation. When it does intensive reads/writes and needs more RAM, the ARC size is reduced toward zero, which should be OK: it doesn't cache in that case, but it still works with the disks directly. And it utilized only a few MB of swap.
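For reference, this is roughly how those numbers can be checked. The sysctl OID is the standard FreeBSD ARC statistic as far as I know; the fallbacks are only there so the snippet degrades gracefully on systems where the OID or tool is absent.

```shell
# Current ARC size in bytes (FreeBSD OID; prints 0 where the OID is absent).
arc_size=$(sysctl -n kstat.zfs.misc.arcstats.size 2>/dev/null || echo 0)
echo "ARC size (bytes): $arc_size"
# Swap utilization per device (FreeBSD; falls back gracefully elsewhere).
swapinfo -k 2>/dev/null || echo "swapinfo not available on this system"
```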
 

REDPOOL

I understand that running the system with less than 8GB is out of spec and may lead to performance issues, but I'm looking for some explanation of why it would cause stability issues, especially since swap was not utilized, which suggests that RAM was sufficient.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
It is not supported. End of discussion. If you can't be bothered to accept that you're running one quarter the minimum amount of RAM, don't expect the forum to waste time with your issues.
 