Server time drift

mpfusion · May 31, 2018

We have two servers running the same FreeNAS version. On one of them the clock drifts over time, ending up as much as several days off. NTP is used on both servers (as is the default on FreeNAS).

Server 1 (the one with the time offset):

freenas1# ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 185.28.157.12   240.67.35.84     3 u    -   64    1   18.225  3886545 5130.24
 88.198.24.72    195.66.241.2     2 u    1   64    1   20.026  3886536 5128.82
 131.188.3.220   .GPS.            1 u    1   64    1   21.999  3886536 5128.80
 176.9.1.211     235.106.237.243  3 u    -   64    1   19.744  3886545 6207.75

Server 2 (the working one):

freenas2# ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*185.207.104.70  124.216.164.14   2 u  299  512  377   16.172    0.723   0.329
+212.83.32.100   131.188.3.222    2 u  126  512  377   13.670    2.364   0.335
+217.79.179.106  130.149.17.8     2 u  294  512  377   12.011   -2.249   2.043

What can cause these multi-day time shifts, and how can I fix them?

Build: FreeNAS-11.1-U4
Platform: Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz
Memory: 130925MB

jgreco · May 31, 2018

WOW!! Look at that jitter and offset, and poor reach?

Did you just reset ntpd on freenas1? If not, I'd have to say that it seems like there's some communications problem and it is never settling on a time source. Note the first character in the data from freenas2, which shows a series of characters that indicate sync status ("tally code"). A working configuration should result in something like this:

Code:

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+ntp1.sol.net    128.252.19.1     2 u    7 1024  377   32.782    0.101   1.087
*ntp2.sol.net    209.51.161.238   2 u  169 1024  377   31.902    1.360   1.484
-ntp3.sol.net    129.6.15.29      2 u  465 1024  377    0.390    2.716   1.376
+ntp4.sol.net    198.60.22.240    2 u  730 1024  377    0.355    0.249   0.438

I'm guessing the forumware ate the initial space that is likely at the beginning of each line for freenas1's ntpq. I would be looking at issues like packet loss or whether or not maybe the system is just overloaded. Also, PC hardware is notoriously fickle and you might want to review other posts here that discuss changing kern.timecounter.hardware, such as

https://forums.freenas.org/index.php?threads/time-out.38828/page-2#post-238229

because if you're maybe not using a decent server board, or old hardware, you might have some crappy issue going on that usually isn't seen on modern gear.
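For anyone who wants to check the tally codes mentioned above without eyeballing the listing, here is a minimal Python sketch. It assumes the usual ten-column `ntpq -p` layout with the tally code in the first character of each peer line; `parse_peers` and `system_peer` are hypothetical helper names, not part of ntpq or FreeNAS.

```python
# Tally codes as printed by ntpq in the first column of each peer line.
TALLY_MEANINGS = {
    " ": "discarded / not selected",
    "x": "falseticker",
    ".": "excess",
    "-": "outlier",
    "+": "candidate",
    "#": "backup",
    "*": "system peer",
    "o": "PPS peer",
}

COLUMNS = ["remote", "refid", "st", "t", "when", "poll",
           "reach", "delay", "offset", "jitter"]


def parse_peers(ntpq_output):
    """Return one dict per peer line of `ntpq -p` output (hypothetical helper)."""
    peers = []
    for line in ntpq_output.splitlines():
        stripped = line.strip()
        # Skip blank lines, the column header and the ==== separator.
        if not stripped or stripped.startswith(("remote", "=")):
            continue
        # Assumption: column 0 holds the tally code. If the leading blank was
        # eaten (as by the forum here), the line starts with the address, so
        # treat it as a blank tally. This is only safe for numeric (-n) output.
        if line[0] in TALLY_MEANINGS:
            tally, rest = line[0], line[1:]
        else:
            tally, rest = " ", line
        fields = rest.split()
        if len(fields) != len(COLUMNS):
            continue  # e.g. the shell prompt line; not a peer entry
        peer = dict(zip(COLUMNS, fields))
        peer["tally"] = tally
        peers.append(peer)
    return peers


def system_peer(peers):
    """Return the remote marked '*', or None if no peer has been selected."""
    for peer in peers:
        if peer["tally"] == "*":
            return peer["remote"]
    return None
```

Fed the freenas1 listing from the first post, `system_peer` returns `None` because no line carries a `*`; fed the working example, it returns the remote on the `*` line.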

mpfusion · May 31, 2018

jgreco said:
WOW!! Look at that jitter and offset, and poor reach?

Did you just reset ntpd on freenas1?

I restarted it a few minutes ago (with service ntpd restart), to no avail. The offset and jitter are still present.

jgreco said:
If not, I'd have to say that it seems like there's some communications problem and it is never settling on a time source. Note the first character in the data from freenas2, which shows a series of characters that indicate sync status ("tally code"). A working configuration should result in something like this:

Code:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+ntp1.sol.net    128.252.19.1     2 u    7 1024  377   32.782    0.101   1.087
*ntp2.sol.net    209.51.161.238   2 u  169 1024  377   31.902    1.360   1.484
-ntp3.sol.net    129.6.15.29      2 u  465 1024  377    0.390    2.716   1.376
+ntp4.sol.net    198.60.22.240    2 u  730 1024  377    0.355    0.249   0.438

That looks similar to what the 2nd server outputs (see my OP).

jgreco said:
I'm guessing the forumware ate the initial space that is likely at the beginning of each line for freenas1's ntpq. I would be looking at issues like packet loss or whether or not maybe the system is just overloaded.

Indeed, the forum ate the initial spaces in front of the IPs. No general packet loss. Network is fine and lightly loaded. System load is low as well. Can it be a firewall issue? If yes, what do I need to look out for?

jgreco said:
Also, PC hardware is notoriously fickle and you might want to review other posts here that discuss changing kern.timecounter.hardware, such as

https://forums.freenas.org/index.php?threads/time-out.38828/page-2#post-238229

because if you're maybe not using a decent server board, or old hardware, you might have some crappy issue going on that usually isn't seen on modern gear.

Can I just change kern.timecounter.hardware on a production system to test, or could that do harm? The board is an X10SRH-CLN4F, which is supported by FreeNAS and is one of the recommended setups, so I don't think it's incompatible hardware.

Thanks for the quick response. Any help or further debugging assistance appreciated.

jgreco · May 31, 2018

Okay, well, then that should explain the poor reach. I would try letting it run for awhile and see what develops for jitter. Your reach bits should slowly work thru a pattern (it's an octal representation of a binary shift register) towards 0377. If it is recording reachability it is not likely to be a firewall issue, and as you note your second server looks healthy, so use both that and my example to judge health.
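To make the shift-register remark concrete, here is a small Python sketch that expands the octal reach value into its eight bits. The function names are made up for illustration; the only assumption is that the value is copied from the reach column exactly as ntpq prints it.

```python
def reach_history(reach):
    """Decode ntpq's octal reach value into the last eight poll results.

    Returns eight characters, oldest poll first and newest last:
    '1' means a reply was received, '0' means the poll went unanswered.
    """
    value = int(str(reach), 8)  # ntpq prints the register in octal
    if not 0 <= value <= 0o377:
        raise ValueError("reach is an 8-bit register (octal 0 to 377)")
    return format(value, "08b")


def replies_received(reach):
    """Count how many of the last eight polls were answered."""
    return reach_history(reach).count("1")
```

A reach of 1 decodes to 00000001 (one answered poll so far) and 377 decodes to 11111111. After a restart with no lost packets the column should step through 1, 3, 7, 17, 37, 77, 177, 377.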

While you let it run for awhile, ideally it ought to decide on a favored clock source and mark it as the system peer ("*") and the others probably as candidate ("+"). While it is doing this, and once it starts skewing, it would be interesting to see if there are any messages in the syslog about problems. These things might give clues as to what is awry, and this would be helpful to a successful correction.

My recollection is that you can probably change between ACPI-fast, HPET, and TSC-low without too much drama while running, but it is still a fundamental and unexpected change to the OS, so you're best off waiting outside any important FreeNAS activities to try it.
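Before switching, it may help to see which timecounters the kernel actually offers. FreeBSD lists them, with a quality value each, in the kern.timecounter.choice sysctl. The Python sketch below parses that line; the sample string in the usage note is illustrative and not taken from the server in this thread, and both function names are hypothetical.

```python
import re


def parse_timecounter_choice(sysctl_line):
    """Parse `sysctl kern.timecounter.choice` output into {name: quality}."""
    # Drop the "kern.timecounter.choice:" prefix if it is present.
    _, _, values = sysctl_line.rpartition(":")
    pairs = re.findall(r"([^\s(]+)\((-?\d+)\)", values)
    return {name: int(quality) for name, quality in pairs}


def alternatives(choices, current):
    """List the other counters with non-negative quality, best first.

    Assumption: counters with negative quality are placeholders that should
    not be picked, so they are left out.
    """
    usable = [name for name, quality in choices.items()
              if name != current and quality >= 0]
    return sorted(usable, key=lambda name: choices[name], reverse=True)
```

With an illustrative line such as `kern.timecounter.choice: TSC-low(1000) ACPI-fast(900) HPET(950) i8254(0) dummy(-1000000)` and TSC-low as the current counter, `alternatives` suggests HPET, ACPI-fast, and i8254, in that order. The switch itself is still made with the kern.timecounter.hardware sysctl.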

mpfusion · May 31, 2018

jgreco said:
Okay, well, then that should explain the poor reach. I would try letting it run for awhile and see what develops for jitter. Your reach bits should slowly work thru a pattern (it's an octal representation of a binary shift register) towards 0377. If it is recording reachability it is not likely to be a firewall issue, and as you note your second server looks healthy, so use both that and my example to judge health.

While you let it run for awhile, ideally it ought to decide on a favored clock source and mark it as the system peer ("*") and the others probably as candidate ("+"). While it is doing this, and once it starts skewing, it would be interesting to see if there are any messages in the syslog about problems. These things might give clues as to what is awry, and this would be helpful to a successful correction.

Thanks for explaining how to interpret the output. I'm a noob when it comes to ntpd, it has always just worked, so I never bothered to learn the details.

jgreco said:
My recollection is that you can probably change between ACPI-fast, HPET, and TSC-low without too much drama while running, but it is still a fundamental and unexpected change to the OS, so you're best off waiting outside any important FreeNAS activities to try it.

I turned it off for the time being and set the clock manually as a hotfix, because users were complaining. I have now turned it back on and will report any findings or changes over the following days.

mpfusion · May 31, 2018

These are the values after approx. 15min of runtime:

Code:

freenas1# ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 78.46.204.247   242.71.143.169   2 u    8   64  377   19.129  6761029 282874.
 148.251.68.100  131.188.3.222    2 u   21   64  377   15.606  6748923 278664.
 85.214.38.116   192.53.103.108   2 u    6   64  377   14.070  6762918 283192.
 129.250.35.250  249.224.99.213   2 u   20   64  377   19.674  6749837 278999.

TIL: the CMD tags eat spaces, CODE tags do not.

No messages in syslog so far, except that ntpd started.
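To put the offset column in perspective: ntpq reports delay, offset, and jitter in milliseconds, so the large numbers above are easier to judge once converted. A rough Python sketch follows; `describe_offset` is a hypothetical helper name.

```python
def describe_offset(offset_ms):
    """Format an ntpq offset (milliseconds) as signed hours, minutes, seconds."""
    ms = float(offset_ms)
    sign = "-" if ms < 0 else "+"
    # Work in whole microseconds to avoid floating-point drift in divmod.
    microseconds = round(abs(ms) * 1000)
    hours, rest = divmod(microseconds, 3_600_000_000)
    minutes, rest = divmod(rest, 60_000_000)
    seconds = rest / 1_000_000
    return f"{sign}{hours}h {minutes:02d}m {seconds:06.3f}s"
```

For example, `describe_offset("3886545")` gives `+1h 04m 46.545s` and `describe_offset("6761029")` gives `+1h 52m 41.029s`, whereas the offsets on the healthy server are a few milliseconds.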

jgreco · May 31, 2018

mpfusion said:
Thanks for explaining how to interpret the output. I'm a noob when it comes to ntpd, it has always just worked, so I never bothered to learn the details.

Don't feel bad. I do infrastructure engineering for ntp.org but that doesn't really mean I know much more than you; I think the output is a bit arcane and unhelpful.

I think there's probably some problem with the host platform's timecounter and historically the "fix" is to try a different one, and if a different one works, wire that in as a boot-time sysctl. Absent any further interesting developments, that's about all I can suggest.

mpfusion · May 31, 2018

I checked the man page and the interwebz and now have a grasp of what the columns mean. It all looks fine to me, except for the missing designator (+, *), the offset, and the high jitter, which seem to be the key issues here.

I went and changed

kern.timecounter.hardware=ACPI-fast

which basically did nothing. Then I tried

kern.timecounter.hardware=HPET

and now got this (after 10min runtime):

Code:

freenas1# ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*195.50.171.101  145.253.3.52     2 u   30   64  377   26.328  -11.973   4.100
+82.100.248.10   213.172.96.14    2 u   15   64  377   25.180  -14.257   4.307
+78.46.53.2      130.149.17.8     2 u   23   64  377   29.573  -10.382   2.883

(I deleted one time source which I added just for testing, that's why there's one address less compared to the other listings.)

This looks much better, I think. I still have no clue what happened here and how I fixed it. I'll read up about HPET, and stuff. Never heard of that before. It's still strange that it's required on a supported standard platform (and only on one server).

Thanks a lot for your assistance. How do I make this change permanent? Is System → Tunables → Add Tunable → Sysctl the right spot?

jgreco · Jun 2, 2018

mpfusion said:
This looks much better, I think. I still have no clue what happened here and how I fixed it. I'll read up about HPET, and stuff. Never heard of that before. It's still strange that it's required on a supported standard platform (and only on one server).

Hard to know. The only remediation I can easily think of would be to update to the latest BIOS, then reset the BIOS to defaults and then make normal changes like "power on after power fail" and avoid touching any chipset options. It's possible that some change to the ACPI configuration has resulted in odd behaviour. I'm not really an expert in the arcana of the modern PC's various subsystems, but I can tell you that it's very difficult to write code that works reliably across a slew of platforms especially when there's a whole bunch of configuration changes that users can make in the BIOS, many of which make substantial changes in how the platform operates.

Everything that gets built or reworked here in the shop gets reset back to defaults and then carefully tweaked for necessary changes. I've seen really insane behaviour teased out of systems through misadventures in random BIOS setting changes.

If that doesn't "fix" it, then I would say there's something fundamentally broken on the host platform's mainboard, as the board you're using should be fine with FreeBSD. If that's the case, it's probably best to just use HPET and find some more interesting way to waste time.

mpfusion said:
Thanks a lot for your assistance. How do I make this change permanent? Is System → Tunables → Add Tunable → Sysctl the right spot?

Yes, that should work. Verify after rebooting that the change was applied, just to be certain, by running "sysctl kern.timecounter.hardware" and it should report whatever you set.
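If you would rather script that post-reboot check, something along these lines might do. It is only a sketch: it assumes Python 3 is available on the box and relies on `sysctl -n`, which prints just the value; `read_sysctl` and `timecounter_is` are made-up names.

```python
import subprocess


def read_sysctl(name):
    """Return the current value of a sysctl as a string (FreeBSD `sysctl -n`)."""
    output = subprocess.check_output(["sysctl", "-n", name],
                                     universal_newlines=True)
    return output.strip()


def timecounter_is(expected, read=read_sysctl):
    """True if kern.timecounter.hardware currently reports `expected`.

    `read` is injectable so the check can be exercised without a FreeBSD host.
    """
    return read("kern.timecounter.hardware").strip() == expected
```

On the server itself, `timecounter_is("HPET")` should return True after the reboot if the tunable was applied.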

mpfusion · Jun 4, 2018

Thanks for your help. The next time the server is down anyway I'll upgrade the BIOS, but for the time being I'll just leave it on HPET. It seems to work so far.
