Thibaut
Dabbler
- Joined: Jun 21, 2014
- Messages: 33
Hello,
We are using FreeNAS (11.2-U4.1) as our main company storage system which has been active for 5 years (starting with FreeNAS 9.3) without any major flaws.
The main storage pool contains all our "active jobs" data, all based on the same dataset structure. The structure is quite simple: one main "parent" dataset containing three "child" datasets, one of which is a zvol, as follows:
work-pool/job-dataset
work-pool/job-dataset/dataset-1
work-pool/job-dataset/dataset-2
work-pool/job-dataset/zvol
Once a job has been completed, we archive the whole job to a remote system, currently running Debian 9 with the OpenZFS filesystem kernel modules, using the
zfs send | zfs receive
commands over ssh, as follows:
Code:
# zfs set readonly=on work-pool/job-dataset
# zfs snapshot -r work-pool/job-dataset@DATE
# zfs send -R work-pool/job-dataset@DATE | ssh root@arc.hiv.serv.ip "zfs receive -F archive-pool/job-dataset"
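When a transfer stalls like this, it can help to make the pipe's throughput visible. A minimal sketch, assuming `pv` is installed on the sending side (the snapshot, destination and host arguments are placeholders matching the commands above):

```shell
#!/bin/sh
# Sketch: wrap the send stream in pv so a stall in the pipeline becomes
# visible immediately. Assumes pv is installed; all names are placeholders.
send_with_progress() {
    snap="$1"   # e.g. work-pool/job-dataset@DATE
    dest="$2"   # e.g. archive-pool/job-dataset
    host="$3"   # e.g. root@arc.hiv.serv.ip
    # zfs send -v prints per-second progress on stderr, and pv shows the
    # pipe rate, so you can see which side of the ssh link stops moving.
    zfs send -R -v "$snap" | pv | ssh "$host" "zfs receive -F $dest"
}

if command -v zfs >/dev/null 2>&1; then
    echo "ready: send_with_progress <snapshot> <destination> <user@host>"
else
    echo "zfs not found; run this on the sending system"
fi
```

If pv still reports a healthy rate while the receiver sits at 100% CPU, that points at the receive side rather than the network.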
We hadn't encountered any problems with this technique until recently.
For an unidentified reason, some datasets no longer transfer correctly, while others behave just as expected and transfer without a problem.
What can be observed with the problematic datasets is as follows:
- the recursive snapshots are created as expected
- the
zfs send -R ... | ssh root@...ip... "zfs receive -F ..."
command gets executed
- two tasks appear on the sending (FreeNAS) system:
zfs send -R ...
ssh root@...ip... zfs receive -F ...
- one task appears on the receiving (Debian 9) system:
zfs receive -F ...
- transient network activity is observed and the "parent" dataset is created on the receiving system, but then the transfer hangs and the child datasets never get transferred
- network activity stops but the zfs tasks remain active, with inconsistent CPU usage: 0% on the sending (FreeNAS) system while the receiving (Debian 9) system shows 100%:
- SENDING SYSTEM:
cpu 0%  zfs send -R work-pool/job-dataset@DATE
cpu 0%  ssh root@...ip... zfs receive -F archive-pool/job-dataset
- RECEIVING SYSTEM:
cpu 100%  zfs receive -F archive-pool/job-dataset
- killing the process using its process ID on the sending system is possible:
kill -9 ...id...
warning: cannot send 'work-pool/job-dataset@DATE': signal received
Killed
- killing the process on the receiving system is impossible (whatever kill signal is sent); the process keeps running, using 100% of one processor thread, until the system is rebooted
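A process that ignores even kill -9 while burning 100% CPU is almost always stuck inside the kernel. On the Debian side this can be confirmed from /proc; a small sketch (the PID is a placeholder, found e.g. with pgrep -f "zfs receive"):

```shell
#!/bin/sh
# Sketch: show where a hung "zfs receive" is stuck. The process state
# line distinguishes a kernel spin (R) from uninterruptible sleep (D),
# which is why signals have no effect.
inspect_hung_pid() {
    pid="$1"
    grep '^State' "/proc/$pid/status"
    # The kernel stack (root only, may be unavailable on some kernels)
    # usually names the ZFS function the thread is blocked in -- useful
    # information for a bug report against the OpenZFS module.
    cat "/proc/$pid/stack" 2>/dev/null
}

# Demo against the current shell; on the real system pass the receive PID:
inspect_hung_pid $$
```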
Sending / receiving the problematic datasets on the same pool (duplicating) works as expected.
Both pools are healthy and have been scrubbed without error.
We were unable to establish any difference between datasets that transfer without problem and those that stall the transfer; they are all created by duplicating a "master template" dataset using:
Code:
# zfs send -R work-pool/parent/dataset-1@source | zfs receive -F work-pool/job-dataset/dataset-1
# zfs send -R work-pool/parent/dataset-2@source | zfs receive -F work-pool/job-dataset/dataset-2
# zfs send -R work-pool/parent/zvol@source | zfs receive -F work-pool/job-dataset/zvol
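Since every job is cloned from the same template, diffing the full property lists of a dataset that transfers cleanly against one that stalls might surface a difference that zfs list does not show. A sketch with placeholder dataset names:

```shell
#!/bin/sh
# Sketch: diff all properties (including their source: local vs inherited)
# of a working dataset against a failing one. Dataset names are placeholders.
diff_dataset_props() {
    good="$1"   # e.g. a job dataset that transfers fine
    bad="$2"    # e.g. a job dataset that stalls the transfer
    tmp_good=$(mktemp); tmp_bad=$(mktemp)
    zfs get -H -o property,value,source all "$good" > "$tmp_good"
    zfs get -H -o property,value,source all "$bad"  > "$tmp_bad"
    diff "$tmp_good" "$tmp_bad"
    rm -f "$tmp_good" "$tmp_bad"
}

if command -v zfs >/dev/null 2>&1; then
    echo "ready: diff_dataset_props <working-dataset> <failing-dataset>"
else
    echo "zfs not found; run this on the FreeNAS system"
fi
```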
Any suggestion, idea or further direction to investigate this problem would be more than welcome!
Thank you.