Server time drift

Status
Not open for further replies.

mpfusion

Contributor
Joined
Jan 6, 2014
Messages
198
We have two servers running the same FreeNAS version. On one of them the clock drifts, eventually ending up several days off. Both servers use NTP (the FreeNAS default).

Server 1 (the one with the time offset):

freenas1# ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 185.28.157.12   240.67.35.84     3 u    -   64    1   18.225  3886545 5130.24
 88.198.24.72    195.66.241.2     2 u    1   64    1   20.026  3886536 5128.82
 131.188.3.220   .GPS.            1 u    1   64    1   21.999  3886536 5128.80
 176.9.1.211     235.106.237.243  3 u    -   64    1   19.744  3886545 6207.75


Server 2 (the working one):

freenas2# ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*185.207.104.70  124.216.164.14   2 u  299  512  377   16.172    0.723   0.329
+212.83.32.100   131.188.3.222    2 u  126  512  377   13.670    2.364   0.335
+217.79.179.106  130.149.17.8     2 u  294  512  377   12.011   -2.249   2.043


What can cause these multi-day time shifts, and how can I fix them?

Build: FreeNAS-11.1-U4
Platform: Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz
Memory: 130925MB
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
WOW!! Look at that jitter and offset, and poor reach?

Did you just reset ntpd on freenas1? If not, I'd have to say that it seems like there's some communications problem and it is never settling on a time source. Note the first character in the data from freenas2, which shows a series of characters that indicate sync status ("tally code"). A working configuration should result in something like this:

Code:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+ntp1.sol.net    128.252.19.1     2 u    7 1024  377   32.782    0.101   1.087
*ntp2.sol.net    209.51.161.238   2 u  169 1024  377   31.902    1.360   1.484
-ntp3.sol.net    129.6.15.29      2 u  465 1024  377    0.390    2.716   1.376
+ntp4.sol.net    198.60.22.240    2 u  730 1024  377    0.355    0.249   0.438


I'm guessing the forumware ate the initial space that is likely at the beginning of each line for freenas1's ntpq. I would be looking at issues like packet loss or whether or not maybe the system is just overloaded. Also, PC hardware is notoriously fickle and you might want to review other posts here that discuss changing kern.timecounter.hardware, such as

https://forums.freenas.org/index.php?threads/time-out.38828/page-2#post-238229

because if you're maybe not using a decent server board, or old hardware, you might have some crappy issue going on that usually isn't seen on modern gear.
 

mpfusion

Contributor
Joined
Jan 6, 2014
Messages
198
WOW!! Look at that jitter and offset, and poor reach?

Did you just reset ntpd on freenas1?

I reset it some minutes ago (I issued service ntpd restart), to no avail. Offset and jitter still present.

If not, I'd have to say that it seems like there's some communications problem and it is never settling on a time source. Note the first character in the data from freenas2, which shows a series of characters that indicate sync status ("tally code"). A working configuration should result in something like this:

Code:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+ntp1.sol.net    128.252.19.1     2 u    7 1024  377   32.782    0.101   1.087
*ntp2.sol.net    209.51.161.238   2 u  169 1024  377   31.902    1.360   1.484
-ntp3.sol.net    129.6.15.29      2 u  465 1024  377    0.390    2.716   1.376
+ntp4.sol.net    198.60.22.240    2 u  730 1024  377    0.355    0.249   0.438

That looks similar to what the 2nd server outputs (see my OP).

I'm guessing the forumware ate the initial space that is likely at the beginning of each line for freenas1's ntpq. I would be looking at issues like packet loss or whether or not maybe the system is just overloaded.

Indeed, the forum ate the initial spaces in front of the IPs. No general packet loss. Network is fine and lightly loaded. System load is low as well. Can it be a firewall issue? If yes, what do I need to look out for?

Also, PC hardware is notoriously fickle and you might want to review other posts here that discuss changing kern.timecounter.hardware, such as

https://forums.freenas.org/index.php?threads/time-out.38828/page-2#post-238229

because if you're maybe not using a decent server board, or old hardware, you might have some crappy issue going on that usually isn't seen on modern gear.

Can I just change kern.timecounter.hardware on a production system to test it, or could that do harm? The board is an X10SRH-CLN4F, which is supported by FreeNAS and is one of the recommended setups, so I don't think the hardware is incompatible.

Thanks for the quick response. Any help or further debugging assistance appreciated.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Okay, well, then that should explain the poor reach. I would try letting it run for a while and see what develops for jitter. Your reach bits should slowly work through a pattern (it's an octal representation of a binary shift register) towards 0377. If it is recording reachability it is not likely to be a firewall issue, and as you note your second server looks healthy, so use both that and my example to judge health.
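
To make the reach column less arcane, here is a minimal Python sketch of how the octal value maps onto the 8-bit shift register mentioned above. The function names `decode_reach` and `polls_answered` are made up for this illustration; the only facts assumed are that ntpq prints the register in octal and that the newest poll is shifted in at the right-hand end.

```python
def decode_reach(reach_octal: str) -> str:
    """Return the 8-bit reachability register as a bit string.

    The leftmost bit is the oldest of the last eight polls,
    the rightmost bit is the most recent one (1 = reply received).
    """
    value = int(reach_octal, 8)         # ntpq prints the register in octal
    return format(value & 0xFF, "08b")  # keep 8 bits, zero-padded


def polls_answered(reach_octal: str) -> int:
    """Count how many of the last eight polls got a reply."""
    return decode_reach(reach_octal).count("1")


# A freshly restarted ntpd that receives every reply walks through this pattern.
for reach in ("1", "3", "7", "17", "377"):
    print(reach.rjust(3), decode_reach(reach), polls_answered(reach))
```

A reach of 1 right after a restart therefore only means that a single poll has been answered so far, while 377 means that all of the last eight polls got a reply.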

While you let it run for a while, ideally it ought to decide on a favored clock source and mark it as the system peer ("*") and the others probably as candidates ("+"). While it is doing this, and once it starts skewing, it would be interesting to see if there are any messages in the syslog about problems. These things might give clues as to what is awry, and this would be helpful to a successful correction.

My recollection is that you can probably change between ACPI-fast, HPET, and TSC-low without too much drama while running, but it is still a fundamental and unexpected change to the OS, so you're best off trying it at a time when no important FreeNAS activity is going on.
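
Before switching, it helps to see which timecounters the kernel actually offers. FreeBSD lists them in the `kern.timecounter.choice` sysctl as names with a quality value in parentheses. The Python sketch below parses such a line; the `SAMPLE` string is an illustrative value, not output from either server in this thread, and `parse_timecounter_choice` is a made-up helper name.

```python
import re
from typing import Dict


def parse_timecounter_choice(text: str) -> Dict[str, int]:
    """Map each timecounter name to its quality.

    Expects the output of `sysctl -n kern.timecounter.choice`,
    e.g. "TSC-low(1000) ACPI-fast(900) ...".
    """
    return {name: int(quality)
            for name, quality in re.findall(r"(\S+)\((-?\d+)\)", text)}


# Illustrative sample only; run the sysctl on the real box to get its list.
SAMPLE = "TSC-low(1000) ACPI-fast(900) HPET(950) i8254(0) dummy(-1000000)"

choices = parse_timecounter_choice(SAMPLE)
# Counters with negative quality are placeholders, so leave them out.
usable = sorted((n for n, q in choices.items() if q >= 0),
                key=choices.get, reverse=True)
print(usable)
```

The counter currently in use is reported by `sysctl kern.timecounter.hardware`, and assigning one of the listed names to that sysctl switches it at runtime.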
 

mpfusion

Contributor
Joined
Jan 6, 2014
Messages
198
Okay, well, then that should explain the poor reach. I would try letting it run for awhile and see what develops for jitter. Your reach bits should slowly work thru a pattern (it's an octal representation of a binary shift register) towards 0377. If it is recording reachability it is not likely to be a firewall issue, and as you note your second server looks healthy, so use both that and my example to judge health.

While you let it run for awhile, ideally it ought to decide on a favored clock source and mark it as the system peer ("*") and the others probably as candidate ("+"). While it is doing this, and once it starts skewing, it would be interesting to see if there are any messages in the syslog about problems. These things might give clues as to what is awry, and this would be helpful to a successful correction.

Thanks for explaining how to interpret the output. I'm a noob when it comes to ntpd, it has always just worked, so I never bothered to learn the details.

My recollection is that you can probably change between ACPI-fast, HPET, and TSC-low without too much drama while running, but it is still a fundamental and unexpected change to the OS, so you're best off trying it at a time when no important FreeNAS activity is going on.

I had turned ntpd off for the time being and set the clock manually as a hotfix, because users were complaining. I have now turned it back on and will report any findings or changes over the following days.
 

mpfusion

Contributor
Joined
Jan 6, 2014
Messages
198
These are the values after approx. 15min of runtime:

Code:
freenas1# ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 78.46.204.247   242.71.143.169   2 u    8   64  377   19.129  6761029 282874.
 148.251.68.100  131.188.3.222    2 u   21   64  377   15.606  6748923 278664.
 85.214.38.116   192.53.103.108   2 u    6   64  377   14.070  6762918 283192.
 129.250.35.250  249.224.99.213   2 u   20   64  377   19.674  6749837 278999.


TIL: the CMD tags eat spaces, CODE tags do not.

No messages in syslog so far, except that ntpd started.
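
Since ntpq reports delay, offset, and jitter in milliseconds, the raw numbers are hard to judge at a glance. This small Python sketch converts two of the offsets posted in this thread into hours, minutes, and seconds; `offset_to_timedelta` is a helper name invented for the example.

```python
from datetime import timedelta


def offset_to_timedelta(offset_ms: float) -> timedelta:
    """Convert an ntpq offset (milliseconds) into a timedelta."""
    return timedelta(milliseconds=offset_ms)


# Offsets taken from the freenas1 listings in this thread.
for label, offset_ms in (("first listing", 3886545),
                         ("after approx. 15 min", 6761029)):
    print(f"{label}: {offset_to_timedelta(offset_ms)}")
```

Those two readings work out to roughly 1 h 5 min and 1 h 53 min respectively.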
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Thanks for explaining how to interpret the output. I'm a noob when it comes to ntpd, it has always just worked, so I never bothered to learn the details.

Don't feel bad. I do infrastructure engineering for ntp.org but that doesn't really mean I know much more than you; I think the output is a bit arcane and unhelpful.

I think there's probably some problem with the host platform's timecounter and historically the "fix" is to try a different one, and if a different one works, wire that in as a boot-time sysctl. Absent any further interesting developments, that's about all I can suggest.
 

mpfusion

Contributor
Joined
Jan 6, 2014
Messages
198
I checked the man page and the interwebz and now have a grasp of what the columns mean. It all looks fine to me, except for the missing tally code (+, *), the large offset, and the high jitter, which seem to be the key issue here.
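
The same checks (no tally code, large offset) can be automated. Below is a rough Python sketch that parses `ntpq -pn` output and flags those symptoms. It assumes numeric addresses as printed by `-n`, it tolerates a missing leading space like the forum-mangled listings above, and it uses 128 ms as a cut-off only because that is ntpd's default step threshold; the names `parse_peers` and `diagnose` are made up for this example.

```python
TALLY_CODES = "*+-x#o."  # status characters ntpq may print before the remote


def parse_peers(ntpq_output):
    """Parse the peer lines of `ntpq -pn` output into a list of dicts."""
    peers = []
    for line in ntpq_output.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(("remote", "=")):
            continue  # blank line, header, or separator
        tally = stripped[0] if stripped[0] in TALLY_CODES else " "
        body = stripped[1:] if tally != " " else stripped
        fields = body.split()
        if len(fields) != 10:
            continue  # not a peer line (e.g. the shell prompt)
        peers.append({
            "tally": tally,
            "remote": fields[0],
            "reach": fields[6],
            "offset_ms": float(fields[8]),
            "jitter_ms": float(fields[9]),
        })
    return peers


def diagnose(peers, max_offset_ms=128.0):
    """Return a list of human-readable problems; empty means it looks sane."""
    problems = []
    if not any(p["tally"] == "*" for p in peers):
        problems.append("no system peer selected")
    for p in peers:
        if abs(p["offset_ms"]) > max_offset_ms:
            problems.append(f"{p['remote']}: offset {p['offset_ms']} ms")
    return problems


# Listings copied from this thread: freenas1 before and after the fix.
BEFORE = """\
 78.46.204.247   242.71.143.169   2 u    8   64  377   19.129  6761029 282874.
 148.251.68.100  131.188.3.222    2 u   21   64  377   15.606  6748923 278664.
"""
AFTER = """\
*195.50.171.101  145.253.3.52     2 u   30   64  377   26.328  -11.973   4.100
+82.100.248.10   213.172.96.14    2 u   15   64  377   25.180  -14.257   4.307
"""

print(diagnose(parse_peers(BEFORE)))
print(diagnose(parse_peers(AFTER)))
```

The first call reports the missing system peer plus one line per oversized offset, and the second returns an empty list.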

I went and changed

kern.timecounter.hardware=ACPI-fast

which basically did nothing. Then I tried

kern.timecounter.hardware=HPET

and now got this (after 10min runtime):

Code:
freenas1# ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*195.50.171.101  145.253.3.52     2 u   30   64  377   26.328  -11.973   4.100
+82.100.248.10   213.172.96.14    2 u   15   64  377   25.180  -14.257   4.307
+78.46.53.2      130.149.17.8     2 u   23   64  377   29.573  -10.382   2.883


(I deleted one time source which I added just for testing, that's why there's one address less compared to the other listings.)

This looks much better, I think. I still have no clue what happened here or how I fixed it. I'll read up on HPET and related topics; I had never heard of it before. It's still strange that it's required on a supported standard platform (and only on one server).

Thanks a lot for your assistance. How do I make this change permanent? Is System → Tunables → Add Tunable → Sysctl the right spot?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
This looks much better, I think. I still have no clue what happened here or how I fixed it. I'll read up on HPET and related topics; I had never heard of it before. It's still strange that it's required on a supported standard platform (and only on one server).

Hard to know. The only remediation I can easily think of would be to update to the latest BIOS, then reset the BIOS to defaults and then make normal changes like "power on after power fail" and avoid touching any chipset options. It's possible that some change to the ACPI configuration has resulted in odd behaviour. I'm not really an expert in the arcana of the modern PC's various subsystems, but I can tell you that it's very difficult to write code that works reliably across a slew of platforms especially when there's a whole bunch of configuration changes that users can make in the BIOS, many of which make substantial changes in how the platform operates.

Everything that gets built or reworked here in the shop gets reset back to defaults and then carefully tweaked for necessary changes. I've seen really insane behaviour teased out of systems through misadventures in random BIOS setting changes.

If that doesn't "fix" it, then I would say there's something fundamentally broken on the host platform's mainboard, as the board you're using should be fine with FreeBSD. If that's the case, it's probably best to just use HPET and find some more interesting way to waste time. :smile:

Thanks a lot for your assistance. How do I make this change permanent? Is System → Tunables → Add Tunable → Sysctl the right spot?

Yes, that should work. Verify after rebooting that the change was applied, just to be certain, by running "sysctl kern.timecounter.hardware" and it should report whatever you set.
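
If that post-reboot check is worth scripting, a minimal Python sketch could look like the following. It shells out to `sysctl -n kern.timecounter.hardware`, which prints only the value; `read_sysctl` and `timecounter_is` are names invented here, and the `reader` parameter exists only so the logic can be tried on a machine that is not the FreeNAS box.

```python
import subprocess


def read_sysctl(name):
    """Return the value of a sysctl using `sysctl -n NAME` (FreeBSD)."""
    result = subprocess.run(["sysctl", "-n", name],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


def timecounter_is(expected, reader=read_sysctl):
    """True if kern.timecounter.hardware currently equals `expected`."""
    return reader("kern.timecounter.hardware") == expected


# On the server itself, call timecounter_is("HPET") with the default reader.
# Here a stand-in reader is used so the example runs anywhere.
print(timecounter_is("HPET", reader=lambda name: "HPET"))
```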
 

mpfusion

Contributor
Joined
Jan 6, 2014
Messages
198
Thanks for your help. The next time the server is down anyway I'll upgrade the BIOS, but for the time being I'll just leave it on HPET. It seems to work so far.
 