Timekeeping in bhyve

rvassar · Oct 24, 2018

So I have one VM I run on my NAS, that provides some network services. It works well enough, but it can't seem to keep time to save itself. I have NTP configured, and find it consistently falls behind. After a week of not paying attention, it was 24 seconds behind. I've tried fiddling with the NTP config, but it appears to have no ability to slew at all, and only seems to correct via step. To be honest I'm not really even sure about that. It does seem to allow ntpdate to set time, but I'm not really certain ntpd itself has any ability to effect change.

FreeNAS itself keeps time via NTP quite nicely. If there's a config option to simply pass thru to the base OS, I'll take it and forget it.

rvassar · Oct 24, 2018

So I've been playing with ntp.conf, and I have the beginning of a fix. It's far from perfect, but it seems to keep time within about 0.350 seconds, which is good enough for my purposes, but I suspect I can improve on it.

I inserted two lines at the beginning of ntp.conf in my VM (NOT on FreeNAS!):

tinker step 0.256
tinker panic 0

The first enables and doubles the size of the "jump" ntpd will make if it decides it needs to step the clock. The default is 0.128 seconds, but depending on the distro, stepping may be disabled. Since I suspect the clock slew call is returning false success, I didn't think this could be any worse.

The second line turns off the panic threshold. Normally, if the offset exceeds that threshold (1000 seconds by default), ntpd exits rather than correcting the clock; setting it to 0 makes it accept any offset. I suspect the default configuration was tossing the entire collected pool of time sources, and just running free.
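For anyone following along, here is a rough Python sketch of how I understand the two tinker values to interact. It is a simplified model based on the ntpd documentation, not ntpd's actual code: the function name `clock_action` is made up for illustration, and the real daemon also waits out a stepout interval before it steps.

```python
from typing import Literal

Action = Literal["slew", "step", "exit"]


def clock_action(offset_s: float,
                 step_s: float = 0.128,
                 panic_s: float = 1000.0) -> Action:
    """Simplified model of ntpd's reaction to a measured clock offset.

    Assumptions (hypothetical helper, not ntpd source):
      * panic_s == 0 disables the panic sanity check entirely
      * step_s == 0 disables stepping, so every correction is a slew
      * the stepout timer and startup special cases are ignored
    """
    magnitude = abs(offset_s)

    # Offsets beyond the panic threshold make ntpd give up and exit.
    if panic_s > 0 and magnitude > panic_s:
        return "exit"

    # Offsets beyond the step threshold are corrected by jumping the clock.
    if step_s > 0 and magnitude > step_s:
        return "step"

    # Everything else is corrected gradually by adjusting the clock rate.
    return "slew"


if __name__ == "__main__":
    print(clock_action(0.200))                 # default step threshold
    print(clock_action(0.200, step_s=0.256))   # "tinker step 0.256"
    print(clock_action(5000.0, panic_s=0))     # "tinker panic 0"
```

In this model, raising the step threshold only changes how much offset is tolerated before a jump; it does nothing about a slew that never takes effect.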

The downside... My clock is stepping. I'm logging lines in syslog roughly every 30 minutes that warn:
Oct 24 20:21:53 (vm host) systemd[1]: Time has been changed

I am repeating my experiment with "step 0.170" to see if I can tighten up the wobble.

jgreco · Oct 24, 2018

tinker panic 0 is the normal remediation for virtual machines. At least on VMware ESXi, to the best of my knowledge, slew works fine. Obviously if slew is unimplemented or doesn't work correctly in bhyve, ntpd will lose its $#!+. VMware's strategy for this includes allowing VMware Tools to manually synchronize the guest with the hypervisor's time (vmware-toolbox-cmd timesync) or to run its own sync tools. But there's a bunch of edge cases: doing things like live migrations can also cause ntpd to lose it, hosts with securelevel set have a different set of issues, etc. bhyve is pretty new and may not have worked all this stuff out.

I haven't played with this in bhyve but I have discussed the general issue with NTP developers, and there may be some hope for fixes on the ntpd side of things in the future. If this isn't just magically working, then I'd also consider it an issue in bhyve because time is increasingly important for things like SSL to operate correctly.

rvassar · Oct 24, 2018

Well... I wouldn't say ntpd was losing its !?!!#%!, but it was certainly not doing its job. You bring up a good point, all my ESXi VMs have the VMware tools installed... I'm currently wondering if the ntp "pool <host> iburst" lines are making things worse or not...

The sad thing is ntpd is some of the oldest code in critical service on the Internet. I was in high school when most of it was written, and I'm pushing 50... Harlan Stenn was Dr. Mills's research assistant for years at udel.edu, but... Dr. Mills went blind, and as Prof. emeritus, he no longer has any budget for research staff... Harlan is a maintenance staff of one, living in Oregon last I heard. There were a couple people I knew at Sun that helped as best they could thru the '00s, but... Sun is no more... It got to the point it made the news rags, and sorta got fixed by throwing money at Harlan (which was good!...), but it wasn't a long term fix. In 5-7 years Harlan is going to retire, and it's going to be a problem...

jgreco · Oct 24, 2018

rvassar said:
Well... I wouldn't say ntpd was losing its !?!!#%!, but it was certainly not doing its job. You bring up a good point, all my ESXi VMs have the VMware tools installed... I'm currently wondering if the ntp "pool <host> iburst" lines are making things worse or not...

Not.

The sad thing is ntpd is some of the oldest code in critical service on the Internet. I was in high school when most of it was written, and I'm pushing 50... Harlan Stenn was Dr. Mills's research assistant for years at udel.edu, but... Dr. Mills went blind, and as Prof. emeritus, he no longer has any budget for research staff... Harlan is a maintenance staff of one, living in Oregon last I heard. There were a couple people I knew at Sun that helped as best they could thru the '00s, but... Sun is no more... It got to the point it made the news rags, and sorta got fixed by throwing money at Harlan (which was good!...), but it wasn't a long term fix. In 5-7 years Harlan is going to retire, and it's going to be a problem...

Yes, it is, and it's more complicated than that. Back in the day, Paul Vixie got sorta wealthy and decided he was going to found the Internet {Software,Systems} Consortium, in part to help sponsor development of things like NTP. That didn't really work out in the long run, and the rack of gear that ran NTP.ORG at ISC was no longer welcome there (~2014-2015). And the rack was basically a catastrophe, a pile of old gear with hacks upon hacks, significant security design issues, etc. As a critical infrastructure project, it deserved more than having mixed hosting/development systems with one leg on the live Internet. Further, as NTP is dependent on the kind generosity of others to provide bandwidth and colocation space, it was clear a new design was needed. So a number of hosting companies, such as Markley, ServerCentral, and Sonic Internet, have contributed colocation space and bandwidth, and NTP now has a new multi-homed design that is not dependent on any single provider, with a network designed to provide facilities such as a DMZ for hosting operations, lower security areas for developers to collaborate, etc.

Unfortunately, the NTP project is still very short on volunteers to do sysadmin stuff and all the busywork you'd expect, and there aren't as many developers as there ought to be.

rvassar · Oct 25, 2018

Besides SSL, I was actually thinking Kerberos would check out completely... I forget what the time constraint there is, 4 minutes or so?

Thanks for the info on ISC... I remember wondering what was happening behind the scenes back then. One of the real problems with a rack of old gear is that it eventually begins to present an electrical fault risk to adjacent racks, and consumes far more air conditioning than modern hardware. Another significant event in the decay of NTP is the rise of competitors. OpenNTPD may not be as accurate, but it's usually good enough.

Almost forgot... So far, and rather unsurprisingly, reducing the step value simply shortens the length of time between step events.

rvassar · Oct 26, 2018

Just to close this out... Adjusting the step value just moves the event timing around. At "step 0.170", I get a time step every 18 minutes, vs a little less than 30 minutes at 0.256. So I'm going to get a sawtooth, no matter what, and it's really a matter of picking how far I'm willing to let it drift. Since the cadence is so consistent, I'm pondering making a manual adjustment to the ntp.drift file, as it doesn't seem to be recalculating it.
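As a back-of-the-envelope check on that sawtooth, here is a small Python sketch that turns a step threshold and a step cadence into an implied residual drift rate. It assumes the clock drifts linearly from zero up to the threshold and that ntpd steps as soon as the threshold is reached, which is a simplification; `implied_drift_ppm` is just a name picked for this example.

```python
def implied_drift_ppm(step_threshold_s: float, interval_min: float) -> float:
    """Residual drift rate (ppm) implied by a steady sawtooth.

    Assumes the offset grows linearly from 0 to step_threshold_s over
    interval_min minutes, then is reset by a step (a simplification).
    """
    interval_s = interval_min * 60.0
    return step_threshold_s / interval_s * 1e6


if __name__ == "__main__":
    # Figures quoted in the post: 0.170 s every 18 min, 0.256 s every ~30 min.
    for threshold, minutes in ((0.170, 18.0), (0.256, 30.0)):
        ppm = implied_drift_ppm(threshold, minutes)
        print(f"step {threshold:.3f} s every {minutes:.0f} min -> {ppm:.1f} ppm")
```

Both settings work out to roughly 140 to 160 ppm of uncorrected drift, which fits the idea that the threshold only changes how often the step happens, not the underlying rate.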

Code:

58417 52127.670 0.000000000 -500.000 0.000000060 1.124600 6
58417 52257.662 -0.019599000 -500.000 0.006929293 1.053336 6
58417 52435.673 -0.046058668 -500.000 0.011381013 1.000338 6
58417 52572.666 -0.073312851 -500.000 0.014359155 0.959370 6
58417 52696.669 -0.082051350 -500.000 0.013782502 0.922667 6
58417 52702.672 -0.119618756 -500.000 0.018510171 0.863208 6
58417 52769.668 -0.139762365 -500.000 0.018722146 0.831220 6
58417 52852.622 -0.142219821 -500.000 0.017534503 0.816358 6
58417 53120.663 -0.156414643 -500.000 0.017152640 1.167688 6
58417 53475.384 0.000000000 -500.000 0.000000060 1.092272 6
58417 53603.388 -0.086244366 -500.000 0.030491988 1.047877 6
58417 53720.418 -0.119757724 -500.000 0.030885827 1.023707 6
58417 53737.418 -0.143312984 -500.000 0.030067404 0.958966 6
58417 54020.386 -0.151944153 -500.000 0.028290543 1.275065 6
58417 54129.426 -0.167715931 -500.000 0.027044484 1.253388 6
58417 54563.130 0.000000000 -500.000 0.000000060 1.172437 6
58417 54729.164 -0.077733487 -500.000 0.027482938 1.129923 6
58417 54886.182 -0.123917194 -500.000 0.030455127 1.133676 6
58417 55111.178 -0.130241854 -500.000 0.028575786 1.227163 6
58417 55516.922 0.000000000 -500.000 0.000000060 1.147906 6
58417 55658.917 -0.049638244 -500.000 0.017549769 1.083993 6
58417 55783.943 -0.057199048 -500.000 0.016632523 1.025116 6
58417 55914.932 -0.069469099 -500.000 0.016151780 0.977898 6
58417 56180.929 -0.076707245 -500.000 0.015323800 1.010760 6
58417 56182.930 -0.079467028 -500.000 0.014367273 0.945485 6
58417 56184.935 -0.096508084 -500.000 0.014728066 0.884430 6
58417 56257.927 -0.168100680 -500.000 0.028818204 0.866783 6
58417 56652.668 0.000000000 -500.000 0.000000060 0.810801 6
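For reading data like the above, a minimal Python sketch of a loopstats parser follows. The column meanings (MJD day, seconds past midnight, offset in seconds, frequency in ppm, jitter, wander, time constant) are my reading of the ntpd loopstats format, and treating a row with an offset of exactly zero as the marker of a clock step is an assumption drawn from this data, not documented behaviour. The names `LoopSample`, `parse_loopstats` and `reset_intervals` are illustrative only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoopSample:
    mjd: int              # Modified Julian Day
    seconds: float        # seconds past UTC midnight
    offset_s: float       # clock offset in seconds
    freq_ppm: float       # frequency correction in ppm
    jitter_s: float       # RMS jitter in seconds
    wander_ppm: float     # frequency wander in ppm
    time_constant: int    # loop time constant


def parse_loopstats(text: str) -> List[LoopSample]:
    """Parse loopstats lines, skipping anything that is not 7 numeric fields."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue
        try:
            samples.append(LoopSample(
                int(fields[0]), float(fields[1]), float(fields[2]),
                float(fields[3]), float(fields[4]), float(fields[5]),
                int(fields[6])))
        except ValueError:
            continue  # header or label line, not data
    return samples


def reset_intervals(samples: List[LoopSample]) -> List[float]:
    """Seconds between rows whose offset is exactly zero (assumed steps)."""
    resets = [s.mjd * 86400 + s.seconds for s in samples if s.offset_s == 0.0]
    return [later - earlier for earlier, later in zip(resets, resets[1:])]


# A few rows copied from the data above.
SAMPLE = """\
58417 52127.670 0.000000000 -500.000 0.000000060 1.124600 6
58417 53120.663 -0.156414643 -500.000 0.017152640 1.167688 6
58417 53475.384 0.000000000 -500.000 0.000000060 1.092272 6
58417 54563.130 0.000000000 -500.000 0.000000060 1.172437 6
"""

if __name__ == "__main__":
    rows = parse_loopstats(SAMPLE)
    for gap in reset_intervals(rows):
        print(f"{gap / 60:.1f} minutes between resets")
    print("frequency pegged:", all(abs(r.freq_ppm) >= 500.0 for r in rows))
```

Feeding the whole file through the same functions gives the step cadence directly, and the last check flags the case where the frequency column never moves off the limit.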

jgreco · Oct 26, 2018

rvassar said:
Just to close this out... Adjusting the step value just moves the event timing around. At "step 0.170", I get a time step every 18 minutes, vs a little less than 30 minutes at 0.256. So I'm going to get a sawtooth, no matter what, and it's really a matter of picking how far I'm willing to let it drift. Since the cadence is so consistent, I'm pondering making a manual adjustment to the ntp.drift file, as it doesn't seem to be recalculating it.

Follow up if you figure that out. I have occasionally seen similar behaviour, where it seems not to be recalculating.

rvassar · Oct 26, 2018

jgreco said:
Follow up if you figure that out. I have occasionally seen similar behaviour, where it seems not to be recalculating.

Will do. The man page says it can take up to two days to recalculate it, so I'm thinking I wait till Monday.

rvassar · Oct 26, 2018

Decided not to wait till Monday... Note the 4th column is from the ntp.drift file, which hasn't varied for two days. I edited the drift file and set it to "1.00", and restarted... It's now trying to calculate a value for ntp drift.

Code:

58418 10979.686 0.000000000 -500.000 0.000000060 1.029942 6
58418 11097.700 -0.072724796 -500.000 0.025712098 0.980249 6
58418 11268.685 -0.079078113 -500.000 0.024156129 0.960198 6
58418 11433.053 -0.151389651 1.000 0.053524325 0.000000 6
58418 11437.048 -0.152927941 1.000 0.050070375 0.000000 6
58418 11439.045 -0.152814552 1.000 0.046836564 0.000000 6
58418 11441.008 -0.151990705 1.000 0.043812562 0.000000 6
58418 11442.048 -0.154337612 1.000 0.040991298 0.000000 6
58418 11495.047 -0.163541184 1.000 0.038481670 0.000000 6
58418 11820.672 0.000000000 1.000 0.000000060 0.000000 6
58418 11846.718 -0.054562769 1.000 0.019290852 0.000000 6
58418 11889.780 -0.055737328 1.000 0.018049717 0.000000 6
58418 11955.797 -0.069739177 1.000 0.017594736 0.000000 6
58418 11972.737 -0.077010001 1.000 0.016657911 0.000000 6
58418 11974.714 -0.082734622 1.000 0.015712946 0.000000 6
58418 11975.724 -0.085669652 1.000 0.014734700 0.000000 6
58418 12305.480 0.000000000 1.000 0.000000060 0.000000 6
58418 12470.440 -0.091713983 0.098 0.032425790 0.318900 6
58418 12510.433 -0.086722083 -0.109 0.030382853 0.307130 6
58418 12638.435 -0.108640441 -0.938 0.029458103 0.410383 6
58418 12793.476 -0.162522032 -2.439 0.033499412 0.655113 6
58418 13123.056 0.000000000 -2.439 0.000000060 0.612802 6
58418 13456.902 0.000000000 -2.439 0.000000060 0.573224 6
58418 13967.314 0.000000000 -2.439 0.000000060 0.536202 6
58418 14019.309 -0.048668225 -2.590 0.017206816 0.504398 6
58418 14050.280 -0.094799926 -2.765 0.022914666 0.475869 6

Still waiting to see if the clock settles down at some point. So far it's wobbling around to the tune of 1/2 a second... But the jitter is also in the 300 - 500ms range...

rvassar · Nov 1, 2018

Just to follow up... My ntp.drift file keeps pegging at -500 ppm. I turned off NTP, took a timestamp using "ntpdate -q", and let the clock drift for a period of time (~45 minutes), then took another timestamp. Using 86400 seconds per day, I calculated what my observed "drift" was, in ppm. My calculations show a drift of -678 ppm, which I believe exceeds the ability of the API call ntpd uses to adjust time. I can keep it bouncing around within a half a second or so, but in tolerating the jumps it never lifts the query cycle above 64 seconds. Since I don't want to impose that kind of query load on the public NTP pools, I'm going to constrain it to my local NTP servers, and reconsider the workloads on this VM.
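The drift arithmetic can be sketched in Python as below. The offsets in the example are illustrative values chosen to reproduce the -678 ppm figure over 45 minutes; they are not the actual ntpdate readings. The 500 ppm constant reflects ntpd's frequency correction limit as I understand it, and the function names are made up for this example.

```python
NTPD_MAX_FREQ_PPM = 500.0  # assumed limit of ntpd's frequency correction


def drift_ppm(error_start_s: float, error_end_s: float, elapsed_s: float) -> float:
    """Free-running drift in ppm from two clock-error readings.

    A negative result means the local clock is running slow.
    """
    return (error_end_s - error_start_s) / elapsed_s * 1e6


def seconds_per_day(ppm: float) -> float:
    """Convert a drift rate in ppm to seconds gained or lost per day."""
    return ppm * 86400 / 1e6


def uncorrectable_ppm(ppm: float, limit: float = NTPD_MAX_FREQ_PPM) -> float:
    """Magnitude of drift left over once the frequency limit is used up."""
    return max(abs(ppm) - limit, 0.0)


if __name__ == "__main__":
    # Illustrative readings only: clock error went from 0 s to -1.8306 s
    # over 45 minutes (2700 s) with ntpd stopped.
    rate = drift_ppm(0.0, -1.8306, 2700.0)
    print(f"drift: {rate:.0f} ppm ({seconds_per_day(rate):.1f} s/day)")
    print(f"beyond the limit: {uncorrectable_ppm(rate):.0f} ppm")
```

With a drift of that size, the part above 500 ppm can't be absorbed by frequency correction, which would leave stepping as the only way to keep up.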
