False Leap-Second - 2006  July 01

On 2006 July 01 I noticed that the timekeeping on two of my PCs had gone wild.  The offset graphs (shown below) were similar to ones I had seen before, with the value in the drift file set to a large positive value (more than 400 ppm) and offset values which were grossly incorrect.  Stopping NTP, restoring a correct drift value to the file ntp.drift, and restarting NTP cured the problem.  I did this at about 07:30 clock time.  Noticing that the transient had started at 01:00 clock time (00:00 UTC), I wondered if it had any connection with leap seconds.  Sure enough, the Windows Event Log on both problem PCs showed a positive leap second being inserted at 01:00.  Arrgh!
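A quick check of the drift file can flag this condition before the graphs do.  The following is only a sketch, assuming the drift file holds a single frequency value in ppm as its first field; the 400 ppm threshold is taken from the figure quoted above, and the function names are made up for illustration.

```python
from typing import Optional

SUSPECT_PPM = 400.0  # illustrative threshold, from the ">400 ppm" seen above


def parse_drift(text: str) -> Optional[float]:
    """Return the first numeric field of the drift file contents, or None."""
    fields = text.split()
    if not fields:
        return None
    try:
        return float(fields[0])
    except ValueError:
        return None


def drift_is_suspect(ppm: Optional[float], threshold: float = SUSPECT_PPM) -> bool:
    """Treat a missing, unreadable or near-limit drift value as suspect."""
    return ppm is None or abs(ppm) >= threshold


def check_file(path: str) -> bool:
    """Read a drift file (path supplied by the caller) and report if it looks wrong."""
    with open(path, "r", encoding="ascii", errors="replace") as handle:
        return drift_is_suspect(parse_drift(handle.read()))
```

A value such as -95.345 would pass this check, while anything of 400 ppm or more in either direction would be flagged for manual inspection.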

The announcement wasn't coming from my own stratum 1 server (at least I hope it wasn't, as not all client PCs were affected).  On PC Bacchus, a positive leap second was detected by NTP at 09:10 (clock) on June 06, but the event log does not show which server provided this duff information.  NTP on that PC was restarted on March 04 and was using NTP UK pool servers, plus ntp2c.mcc.ac.uk.  On PC Stamsund, a positive leap second was detected by NTP at 09:17 (clock) on June 06.  Its servers included UK pool servers, and 130.88.200.6 (utserv.mcc.ac.uk).  (Interestingly, those are the two PCs which I didn't touch after the leap-second problems at the start of the year.  Coincidence?)

On checking with the command ntpq -c rv <host-name>, neither the utserv.mcc.ac.uk server nor my own simple stratum-1 server was showing a leap-second flag (both reported leap=00).  (Thanks to David Malone for the syntax of this command - he checked a number of servers after the 2006 leap-second issue).  Karel Sandler reported: "According to the www.pool.ntp.org there are 57 UK servers today.  All these servers (3 S1, 32 S2, 20 S3 and 2 S4) have 'leap 00' (at Jul 1 23:36 UTC) and all those three S1 were OK according to the pool scores.  But - one more server (S1) has been there until Jul 1 03 AM UTC.  Maybe, don't know."
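The leap value shown by ntpq corresponds to the two-bit leap indicator carried in the top two bits of the first byte of every NTP packet.  As a small sketch (the function names here are illustrative, not part of any NTP tool), the field can be decoded like this:

```python
LEAP_MEANINGS = {
    0: "no warning",
    1: "last minute of the day has 61 seconds (positive leap second)",
    2: "last minute of the day has 59 seconds (negative leap second)",
    3: "alarm condition (clock not synchronised)",
}


def leap_indicator(packet: bytes) -> int:
    """Return the 2-bit leap indicator from the first byte of an NTP packet."""
    if not packet:
        raise ValueError("empty packet")
    return packet[0] >> 6


def describe_leap(packet: bytes) -> str:
    """Format the indicator as two binary digits plus its meaning."""
    value = leap_indicator(packet)
    return f"leap={value:02b} ({LEAP_MEANINGS[value]})"
```

A server wrongly announcing a positive leap second would show up as leap=01 rather than leap=00.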

So I am not 100% sure if the July transient was a hangover from the January problems, or if some servers were incorrectly sending out a positive-leap-second-is-due announcement.  I will make the following suggestions:

- to the NTP Pool managers: that the servers in the pool should be monitored for spurious leap-second announcements

- to the NTP Developers: that NTP be more robust before acting on the leap-second announcement from a single server.
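One way the second suggestion might look in practice is a simple quorum rule.  This is purely a hypothetical sketch and not a description of how NTP itself decides; the minimum server count and the majority fraction are arbitrary assumptions chosen for illustration.

```python
from typing import Iterable


def leap_announcement_accepted(
    flags: Iterable[int],
    min_servers: int = 3,       # assumed minimum number of usable servers
    min_fraction: float = 0.5,  # assumed: more than half must agree
) -> bool:
    """Accept a positive-leap announcement only if most usable servers agree.

    flags holds one leap indicator per server: 0 = no warning,
    1 = positive leap pending, 2 = negative leap pending, 3 = unsynchronised.
    """
    usable = [f for f in flags if f in (0, 1, 2)]  # ignore unsynchronised servers
    if len(usable) < min_servers:
        return False
    announcing = sum(1 for f in usable if f == 1)
    return announcing / len(usable) > min_fraction
```

With a rule of this kind, a single pool server sending leap=01 while the others send leap=00 would not be enough to schedule a leap second on the client.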

My thanks to those who helped diagnose the problem.

Screen-shot of the July 2006 transient on my PCs



NTP Leap-Second Behaviour - 2005-2006

The end of December 2005 was the first occasion for several years where a leap second was inserted in the UTC time scale to bring it back into line with the Earth's rotation.  The software I use to keep my PCs' time correct, NTP, has full provision for handling the leap seconds, and can also control the computer either via the Unix kernel or by the Windows SetSystemTimeAdjustment routine, so that the leap second is handled smoothly.  However, it seems that not all external systems were running current software, or knew about the leap second correctly.  I was running two versions of the NTP software for Windows - both NTP 4.2.0b.  On Bacchus and Hermes I had Meinberg version 1431, and on Odin and Stamsund Meinberg version 1436 - a beta test version which was able to correctly insert the leap-second on Windows, a function which the basic OS lacks.

What happened

Rather than being out on the streets of Edinburgh celebrating the New Year, I watched how the different systems handled the leap-second!  It seemed that about half the Internet servers I sync to inserted the leap-second, but about half did not.  This confused NTP, as it did not know which of the two clusters of servers was telling the correct time.  Like many NTP users, I have no reference time source other than the Internet.  This appeared to result in NTP assuming the worst - that the computer's clock rate was way off - and it thought that the clock drift was nearly 500 parts per million, the maximum it could be.  As a result, the timekeeping was all over the place (although mostly within the 128ms range NTP allows before it steps the computer clock).  I left the computer in this state around 01:30 UTC.
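To put those two figures in proportion, here is a small worked calculation.  It is a sketch only, and it assumes that the rate at which the clock can be gradually corrected is limited to the same 500 ppm quoted above as the maximum drift.

```python
STEP_THRESHOLD_S = 0.128   # offset beyond which NTP steps the clock (from the text)
MAX_RATE_PPM = 500.0       # assumed limit on the correction rate


def seconds_to_slew(offset_s: float, rate_ppm: float = MAX_RATE_PPM) -> float:
    """Time needed to remove an offset by running the clock fast or slow."""
    return abs(offset_s) / (rate_ppm * 1e-6)


def exceeds_step_threshold(offset_s: float) -> bool:
    """True if the offset is outside the range NTP corrects gradually."""
    return abs(offset_s) > STEP_THRESHOLD_S


# A one-second disagreement (one missed leap second) is far outside the
# 128 ms range, and would need about 2000 s (over half an hour) to slew away.
one_second = seconds_to_slew(1.0)
# The 128 ms limit itself corresponds to about 256 s of slewing at 500 ppm.
at_threshold = seconds_to_slew(STEP_THRESHOLD_S)
```

So servers that disagree by a whole second cannot all be followed within NTP's normal gradual-correction range, which fits the erratic behaviour described above.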

At 07:40 UTC I returned to see how things were going.  Two of the PCs (Bacchus and Stamsund), whilst still having errors greatly in excess of their normal levels, at least had more reasonable (non-limiting) drift values, so I left them to try and sort themselves out.  The other two PCs (Hermes and Odin) were still showing grossly incorrect drift values (as seen in the file ntp.drift), and were having gross time errors.  Consequently, I decided to stop the NTP client, delete the drift file, and restart the client on those PCs.  This resulted in Hermes settling down correctly within a couple of hours, but Odin took somewhat longer (10 hours) to determine a sensible drift value.

Conclusions

It's difficult to know what conclusions to draw, as I am not an NTP expert!

  • If I had a local reference clock, would the system have behaved differently?  Perhaps.  Some people did report, though, that NTP saw many external clocks without the leap second and discarded their accurate local reference clock, even though that clock had made the leap-second transition correctly.
  • Did having the newer software which transitioned the OS cleanly on Odin and Stamsund make any difference?  Difficult to say, but note that the two systems without the code (Bacchus and Hermes) showed an initial positive offset (they lacked the leap second and were therefore ahead of the new correct time), whereas the two systems with the software (Odin and Stamsund) showed an initial negative offset.  So I think that the software did make a difference.
  • How would the system have behaved if all external reference sources had inserted the leap second correctly?  I can't say, but one could only hope it would have been better!
  • Work is needed by the owners of many NTP servers to ensure that their servers do insert the leap-second notice correctly, and that the servers themselves perform that transition smoothly.
  • I might have been better off with the beta leap-second compliant software installed on my Internet facing servers rather than the local clients.  I hadn't done this because it was beta software!  If I had, the Internet-facing servers might have rejected the servers which were in error, and the clients might have been less confused.
  • PC Bacchus shows an interesting recovery curve - you can see a high rate error from about 04:00-07:00 followed by a reset, then a smaller rate error from 08:00-14:00 and another reset, an even smaller rate error from 16:00-21:00 and another reset.  This would appear to be NTP iterating to determine the correct drift value.

My thanks to Martin Burnicki of Meinberg, Germany, for providing the update to NTP for Windows which inserted the leap second.  Martin also compared his code against a precision source, and the most interesting results appear here.

Screen-shot of the January 2006 transient on my four PCs


NTP Glitch - December 2005

What happened

Briefly, something happened on 2005 December 02 (a software change probably), which caused NTP on my Windows XP Pro PC to show much more instability than I had seen since the installation of XP Service Pack 2.  After much useful discussion on the comp.protocols.time.ntp newsgroup, I determined that running the MultiMedia timer continuously at 1ms resolution cured the instability, which appears to come from the OS making time steps both when changing from regular to MM timer mode and when changing back again.  I modified one of my own programs to provide a function to enable the MM Timer, and Martin Burnicki of Meinberg, Germany, provided a version of the Windows NTP software where that function could be enabled while the NTP software was running.  I subsequently installed that version of the NTP software on Odin, where it completely cured the problem, and also on a Windows 2000 PC (Stamsund) which had always exhibited instability.  It seems that running certain software (something like JavaScript or Flash under Internet Explorer) could set and reset the MM Timer mode, and thereby cause a timekeeping instability.

The start of the problem - 2005, Friday 2nd December

Curing NTP instability on a Windows 2000 system

The instability of this system was completely cured by having the MM Timer running continuously, avoiding the glitches when it was enabled or disabled.  The same software restored the Windows XP system to stable timekeeping.  The glitch at the end of the graph is the leap-second issue described above.  The Windows version of NTP has now been modified to include the -M parameter at startup which enables the MM timer, and thus provides much better timekeeping (on systems where this is a problem).
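For illustration, holding the multimedia timer at 1 ms can be sketched as below.  This is not the program mentioned above nor the Meinberg implementation; it simply assumes the standard Windows winmm calls timeBeginPeriod and timeEndPeriod, and does nothing on other platforms.

```python
import sys
from contextlib import contextmanager


@contextmanager
def multimedia_timer(period_ms: int = 1):
    """Keep the Windows multimedia timer at period_ms while the block runs.

    Yields True if the request was accepted, False otherwise
    (including on non-Windows systems, where this is a no-op).
    """
    active = False
    winmm = None
    if sys.platform == "win32":
        import ctypes
        winmm = ctypes.WinDLL("winmm")
        # timeBeginPeriod returns 0 (TIMERR_NOERROR) on success.
        active = winmm.timeBeginPeriod(period_ms) == 0
    try:
        yield active
    finally:
        if active:
            # Each successful timeBeginPeriod must be matched by timeEndPeriod.
            winmm.timeEndPeriod(period_ms)
```

The idea is that a long-running process keeps the block open for its whole lifetime, so that other applications switching the timer mode on and off no longer change the resolution the system is running at.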


Typical Drift Values

This is just for my own reference, but I might as well record it here as drift came into the leap-second discussion (values in ppm):

  • Bacchus (AMD 266): -74.237
  • Bacchus (Pentium III): -5.629
  • Gemini: -15.240, 3.113
  • Hermes: -95.345, -92.276
  • Hydra: 13.395, 12.958
  • Narvik: 11.077
  • Odin: -8.022
  • Stamsund: -11.822, -11.354
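To give a feel for what these figures mean, a drift value can be converted to the time error an uncorrected clock would accumulate in a day.  This is a small sketch, assuming the values are in parts per million; the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def ppm_to_seconds_per_day(ppm: float) -> float:
    """Clock error accumulated over one day at a constant frequency offset."""
    return ppm * 1e-6 * SECONDS_PER_DAY


# Hermes at -95.345 ppm corresponds to roughly 8.2 seconds per day,
# whereas the near-limit 500 ppm seen during the transients is 43.2 s/day.
hermes = ppm_to_seconds_per_day(-95.345)
at_limit = ppm_to_seconds_per_day(500.0)
```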

Copyright © David Taylor, Edinburgh Last modified: 2007 Dec 23 at 15:16