NTP WG J. Burbank, Ed. Internet-Draft W. Kasch, Ed. Obsoletes: RFC 4330, RFC 1305 JHU/APL (if approved) J. Martin, Ed. Intended status: Standards Track Daedelus Expires: November 2, 2007 D. Mills U. Delaware May 2007 Network Time Protocol Version 4 Protocol And Algorithms Specification draft-ietf-ntp-ntpv4-proto-07 Status of this Memo By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This Internet-Draft will expire on November 2, 2007. Copyright Notice Copyright (C) The IETF Trust (2007). Abstract The Network Time Protocol (NTP) is widely used to synchronize computer clocks in the Internet. This document describes NTP Version 4 (NTPv4), which is backwards compatible with NTP Version 3 (NTPv3) described in RFC 1305, as well as previous versions of the protocol. Burbank, et al. Expires November 2, 2007 [Page 1] Internet-Draft NTPv4 Specification May 2007 NTPv4 includes a modified protocol header to accommodate the Internet Protocol Version 6 address family. 
NTPv4 includes fundamental improvements in the mitigation and discipline algorithms which extend the potential accuracy to the tens of microseconds with modern workstations and fast LANs. It includes a dynamic server discovery scheme, so that in many cases specific server configuration is not required. It corrects certain errors in the NTPv3 design and implementation and includes an optional extension mechanism.

Table of Contents

   1.  Introduction
       1.1.  Requirements Notation
   2.  Modes of Operation
   3.  Protocol Modes
       3.1.  Dynamic Server Discovery
   4.  Definitions
   5.  Implementation Model
   6.  Data Types
   7.  Data Structures
       7.1.  Structure Conventions
       7.2.  Global Parameters
       7.3.  Packet Header Variables
       7.4.  The Kiss-o'-Death Packet
       7.5.  NTP Extension Field Format
   8.  On Wire Protocol
   9.  Peer Process
       9.1.  Peer Process Variables
       9.2.  Peer Process Operations
   10. Clock Filter Algorithm
   11. System Process
       11.1.  System Process Variables
       11.2.  System Process Operations
           11.2.1.  Selection Algorithm
           11.2.2.  Cluster Algorithm
           11.2.3.  Combine Algorithm
       11.3.  Clock Discipline Algorithm
   12. Clock Adjust Process
   13. Poll Process
       13.1.  Poll Process Variables
       13.2.  Poll Process Operations
   14. Simple Network Time Protocol (SNTP)
   15. Security Considerations
   16. IANA Considerations
   17. Acknowledgements
   18. References
       18.1.  Informative References
       18.2.  Normative References
   Appendix A.  Code Skeleton
       A.1.  Global Definitions
           A.1.1.  Definitions, Constants, Parameters
           A.1.2.  Packet Data Structures
           A.1.3.  Association Data Structures
           A.1.4.  System Data Structures
           A.1.5.  Local Clock Data Structures
           A.1.6.  Function Prototypes
       A.2.  Main Program and Utility Routines
       A.3.  Kernel Input/Output Interface
       A.4.  Kernel System Clock Interface
       A.5.  Peer Process
           A.5.1.  receive()
           A.5.2.  clock_filter()
           A.5.3.  fast_xmit()
           A.5.4.  access()
           A.5.5.  System Process
           A.5.6.  Clock Adjust Process
           A.5.7.  Poll Process
   Authors' Addresses
   Intellectual Property and Copyright Statements

1. Introduction

This document defines the Network Time Protocol Version 4 (NTPv4), which is widely used to synchronize system clocks among a set of distributed time servers and clients. It describes the core architecture, protocol, state machines, data structures and algorithms. NTPv4 introduces new functionality to NTPv3, as described in [1], and functionality expanded from SNTPv4 as described in [2] (SNTPv4 is a subset of NTPv4). This document obsoletes [1] and [2]. While certain minor changes have been made in some protocol header fields, these do not affect the interoperability between NTPv4 and previous versions of NTP and SNTP.

The NTP subnet model includes a number of widely accessible primary time servers synchronized by wire or radio to national standards. The purpose of the NTP protocol is to convey timekeeping information from these primary servers to secondary time servers and clients via both private networks and the public Internet. Precisely tuned algorithms mitigate errors that may result from network disruptions, server failures and possible hostile actions. Servers and clients are configured such that values flow towards clients from the primary servers at the root via branching secondary servers.

The NTPv4 design overcomes significant shortcomings in the NTPv3 design, corrects certain bugs and incorporates new features. In particular, expanded NTP timestamp definitions encourage the use of the floating double data type throughout the implementation.
As a result, the time resolution is better than one nanosecond and frequency resolution is less than one nanosecond per second. Additional improvements include a new clock discipline algorithm which is more responsive to system clock hardware frequency fluctuations. Typical primary servers using modern machines are precise within a few tens of microseconds. Typical secondary servers and clients on fast LANs are within a few hundred microseconds with poll intervals up to 1024 seconds, which was the maximum with NTPv3. With NTPv4, servers and clients are precise within a few tens of milliseconds with poll intervals up to 36 hours.

The main body of this document describes the core protocol and data structures necessary to interoperate between conforming implementations. Appendix A contains additional detail in the form of a skeleton program, including data structures and code segments for the core algorithms as well as the mitigation algorithms used to enhance reliability and accuracy. While the skeleton program and other descriptions in this document apply to a particular implementation, they are not intended as the only way the required functions can be implemented. While the NTPv3 symmetric key authentication scheme described in this document has been carried over from NTPv3, the Autokey public key authentication scheme new to NTPv4 is described in [3].

The NTP protocol includes modes of operation described in Section 2 using data types described in Section 6 and data structures described in Section 7. The implementation model described in Section 5 is based on a threaded, multi-process architecture, although other architectures could be used as well. The on-wire protocol described in Section 8 is based on a returnable-time design which depends only on measured clock offsets, but does not require reliable message delivery.
Reliable message delivery such as TCP [11] can actually make the delivered NTP packet less reliable since retries would increase the delay value and other errors.

The synchronization subnet is a self-organizing, hierarchical, master-slave network with synchronization paths determined by a shortest-path spanning tree and defined metric. While multiple masters (primary servers) may exist, there is no requirement for an election protocol.

This document includes material from [4], which contains flow charts and equations unsuited for RFC format. There is much additional information in [5], including an extensive technical analysis and performance assessment of the protocol and algorithms in this document. The reference implementation is available at www.ntp.org.

The remainder of this document contains numerous variables and mathematical expressions. Some variables take the form of Greek characters, which are spelled out by their full case-sensitive name. For example DELTA refers to the uppercase Greek character, while delta refers to the lowercase character. Furthermore, subscripts are denoted with '_', for example theta_i refers to the lowercase Greek character theta with subscript i, or phonetically theta sub i. In this document all time values are in seconds (s), and all frequencies will be specified as fractional frequency offsets (FFO) (pure number). It is often convenient to express these FFOs in parts per million (ppm).

1.1. Requirements Notation

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [12].

2. Modes of Operation

An NTP implementation operates as a primary server, secondary server or client. A primary server is synchronized to a reference clock directly traceable to UTC (e.g., GPS, Galileo, etc.). A client synchronizes to one or more upstream servers, but does not provide synchronization to dependent clients. A secondary server has one or more upstream servers and one or more downstream servers or clients. All servers and clients that are fully NTPv4 compliant MUST implement the entire suite of algorithms described in this document. In order to maintain stability in large NTP subnets, secondary servers SHOULD be fully NTPv4 compliant. Alternative algorithms MAY be used, but their output MUST be identical to that of the algorithms described in this specification, or they MUST be used exclusively in a private network.

3. Protocol Modes

There are three NTP protocol variants: symmetric, client/server and broadcast. Each is associated with an association mode as shown in Figure 1. In addition, persistent associations are mobilized upon startup and are never demobilized. Ephemeral associations are mobilized upon the arrival of a packet and are demobilized upon error or timeout.

+-------------------+--------------+-------------+
| Association Mode  | Assoc. Mode  | Packet Mode |
+-------------------+--------------+-------------+
| Symmetric Active  |      1       | 1 or 2      |
| Symmetric Passive |      2       | 1           |
| Client            |      3       | 4           |
| Server            |      4       | 3           |
| Broadcast Server  |      5       | 5           |
| Broadcast Client  |      6       | N/A         |
+-------------------+--------------+-------------+

Figure 1: Association and Packet Modes

In the client/server variant a persistent client sends client (mode 3) packets to a server, which returns server (mode 4) packets. Servers provide synchronization to one or more clients, but do not accept synchronization from them. A server can also be a reference clock driver which obtains time directly from a standard source such as a GPS receiver or telephone modem service. In this variant, clients pull synchronization from servers.
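The mode numbers of Figure 1 can be captured in a small C sketch. The enum and function names below are illustrative assumptions, not identifiers from this document, and reply_mode() is a simplification: it covers only the case where a packet arrives with no matching association, as described for the client/server variant above and the symmetric variant in the following paragraph.

```c
#include <assert.h>

/* Association and packet mode numbers as listed in Figure 1.
 * The identifier names are hypothetical. */
enum ntp_mode {
    MODE_SYM_ACTIVE   = 1,
    MODE_SYM_PASSIVE  = 2,
    MODE_CLIENT       = 3,
    MODE_SERVER       = 4,
    MODE_BROADCAST    = 5,
    MODE_BCAST_CLIENT = 6   /* association mode only, never sent in a packet */
};

/*
 * Hypothetical helper: the packet mode used to answer a received packet
 * that has no matching association, or 0 when the text describes no
 * direct reply for that packet mode.
 */
int reply_mode(int pkt_mode)
{
    assert(pkt_mode >= MODE_SYM_ACTIVE && pkt_mode <= MODE_BROADCAST);

    switch (pkt_mode) {
    case MODE_CLIENT:
        return MODE_SERVER;        /* client/server variant */
    case MODE_SYM_ACTIVE:
        return MODE_SYM_PASSIVE;   /* ephemeral symmetric passive association */
    default:
        return 0;                  /* no direct reply in this sketch */
    }
}
```

For example, reply_mode(MODE_CLIENT) yields MODE_SERVER, while a broadcast (mode 5) packet yields 0 because broadcast clients do not answer each broadcast directly.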
In the symmetric variant a peer operates as both a server and client using either a symmetric active or symmetric passive association. A persistent symmetric active association sends symmetric active (mode 1) packets to a symmetric active peer association. Alternatively, an ephemeral symmetric passive association can be mobilized upon arrival of a symmetric active packet with no matching association. That association sends symmetric passive (mode 2) packets and persists until error or timeout. Peers both push and pull synchronization to and from each other. For the purposes of this document, a peer operates like a client, so references to client imply peer as well.

In the broadcast variant a persistent broadcast server association sends periodic broadcast server (mode 5) packets which can be received by multiple clients. Upon reception of a broadcast server packet without a matching association, an ephemeral broadcast client (mode 6) association is mobilized and persists until error or timeout. It is useful to provide an initial volley where the client operating in client mode exchanges several packets with the server, so as to calibrate the propagation delay and to run the Autokey security protocol, after which the client reverts to broadcast client mode. A broadcast server pushes synchronization to clients and other servers.

Following loosely the conventions established by the telephone industry, the level of each server in the hierarchy is defined by a stratum number. Primary servers are assigned stratum one; secondary servers at each lower level are assigned stratum numbers one greater than the preceding level. As the stratum number increases, accuracy degrades depending on the particular network path and system clock stability. Mean errors, measured by synchronization distances, increase approximately in proportion to stratum numbers and measured roundtrip delay.
As a standard practice, timing network topology should be organized to avoid timing loops and minimize the synchronization distance. In NTP the subnet topology is determined using a variant of the Bellman-Ford distributed routing algorithm, which computes the shortest-path spanning tree rooted on the primary servers. As a result of this design, the algorithm automatically reorganizes the subnet, so as to produce the most accurate and reliable time, even when there are failures in the timing network.

3.1. Dynamic Server Discovery

There are two special associations, manycast client and manycast server, which provide a dynamic server discovery function. There are two types of manycast client associations, persistent and ephemeral. The persistent manycast client sends client (mode 3) packets to a designated IPv4 or IPv6 broadcast or multicast group address. Designated manycast servers within range of the time-to-live (TTL) field in the packet header listen for packets with that address. If a server is suitable for synchronization, it returns an ordinary server (mode 4) packet using the client's unicast address. Upon receiving this packet, the client mobilizes an ephemeral client (mode 3) association. The ephemeral client association persists until error or timeout.

A manycast client continues sending packets to search for a minimum number of associations. It starts with a TTL equal to one and keeps adding one to it until the minimum number of associations is made or the TTL reaches a maximum value. If the TTL reaches its maximum value and yet not enough associations are mobilized, the client stops transmission for a time-out period to clear all associations, and then repeats the search cycle. If a minimum number of associations has been mobilized, then the client starts transmitting one packet per time-out period to maintain the associations.
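The manycast search cycle described above can be sketched as a small state step in C. This is a minimal illustration under stated assumptions: the structure, the enum and the function name are hypothetical, the TTL used while maintaining associations is simply left at its last value (the text does not specify it), and packet transmission and timers are left to the caller.

```c
#include <assert.h>

#define TTL_MIN 1     /* the search starts with a TTL of one */
#define TTL_MAX 255   /* largest value an 8-bit TTL field can carry */

/* What the caller should do next (names are hypothetical). */
enum mc_action {
    MC_SEND,      /* send a client packet with the updated TTL */
    MC_BACKOFF,   /* stay silent for one time-out period, then search again */
    MC_MAINTAIN   /* enough associations: one packet per time-out period */
};

struct manycast {
    int ttl;        /* TTL of the last search packet, 0 before the first one */
    int min_assoc;  /* minimum number of associations wanted */
    int max_ttl;    /* configured upper bound of the search */
};

/* Advance the search by one step given the current association count. */
enum mc_action manycast_step(struct manycast *m, int n_assoc)
{
    assert(m->max_ttl >= TTL_MIN && m->max_ttl <= TTL_MAX);

    if (n_assoc >= m->min_assoc)
        return MC_MAINTAIN;

    if (m->ttl >= m->max_ttl) {   /* search exhausted without success */
        m->ttl = 0;               /* the next cycle restarts at TTL_MIN */
        return MC_BACKOFF;
    }

    m->ttl++;                     /* widen the search by one hop */
    return MC_SEND;
}
```

With max_ttl set to 2 and no servers answering, successive calls return MC_SEND (TTL 1), MC_SEND (TTL 2), MC_BACKOFF, and then the cycle repeats from TTL 1.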
Field constraints limit the minimum TTL value to 1 and the maximum to 255. These limits may be tuned for individual application needs.

The ephemeral associations compete among themselves. As new ephemeral associations are mobilized, the client runs the mitigation algorithms described in Section 10 and Section 11.2 for the best candidates out of the population; the remaining ephemeral associations are timed out and demobilized. In this way the population includes only the best candidates that have most recently responded with an NTP packet to discipline the system clock.

4. Definitions

A number of technical terms are defined in this section. A timescale is a frame of reference where time is expressed as the value of a monotonically increasing binary counter with an indefinite number of bits. It counts in seconds and fractions of a second, when a decimal point is employed. The Coordinated Universal Time (UTC) timescale is defined by ITU-R TF.460 [6]. Under the auspices of the Metre Convention of 1875, in 1975 the CGPM [7] strongly endorsed the use of UTC as the basis for civil time. The Coordinated Universal Time (UTC) timescale represents mean solar time as disseminated by national standards laboratories. The system time is represented by the system clock maintained by the hardware and operating system. The goal of the NTP algorithms is to minimize both the time difference and frequency difference between UTC and the system clock. When these differences have been reduced below nominal tolerances, the system clock is said to be synchronized to UTC.

The date of an event is the UTC time at which the event takes place. Dates are ephemeral values designated with upper case T. Running time is another timescale that is coincident to the synchronization function of the NTP program.

A timestamp T(t) represents either the UTC date or time offset from UTC at running time t.
Which meaning is intended should be clear from the context. Let T(t) be the time offset, R(t) the frequency offset, D(t) the aging rate (first derivative of R(t) with respect to t). Then, if T(t_0) is the UTC time offset determined at t = t_0, the UTC time offset at time t is

   T(t) = T(t_0) + R(t_0)(t-t_0) + 1/2 * D(t_0)(t-t_0)^2 + e,

where e is a stochastic error term discussed later in this document. While the D(t) term is important when characterizing precision oscillators, it is ordinarily neglected for computer oscillators. In this document all time values are in seconds (s) and all frequency values are in seconds-per-second (s/s). It is sometimes convenient to express frequency offsets in parts-per-million (PPM), where 1 PPM is equal to 10^(-6) seconds/second.

It is important in computer timekeeping applications to assess the performance of the timekeeping function. The NTP performance model includes four statistics which are updated each time a client makes a measurement with a server. The offset (theta) represents the maximum-likelihood time offset of the server clock relative to the system clock. The delay (delta) represents the round trip delay between the client and server. The dispersion (epsilon) represents the maximum error inherent in the measurement. It increases at a rate equal to the maximum disciplined system clock frequency tolerance (PHI), typically 15 PPM. The jitter (psi), defined as the root-mean-square (RMS) average of the most recent offset differences, represents the nominal error in estimating the offset.

While the theta, delta, epsilon, and psi statistics represent measurements of the system clock relative to each server clock separately, the NTP protocol includes mechanisms to combine the statistics of several servers to more accurately discipline and calibrate the system clock. The system offset (THETA) represents the maximum-likelihood offset estimate for the server population.
The system jitter (PSI) represents the nominal error in estimating the system offset. The delta and epsilon statistics are accumulated at each stratum level from the reference clock to produce the root delay (DELTA) and root dispersion (EPSILON) statistics. The synchronization distance (LAMBDA) equal to EPSILON + DELTA / 2 represents the maximum error due to all causes. The detailed formulations of these statistics are given in Section 11.2. They are available to the dependent applications in order to assess the performance of the synchronization function.

5. Implementation Model

Figure 2 shows the architecture of a typical, multi-threaded implementation. It includes two processes dedicated to each server, a peer process to receive messages from the server or reference clock and a poll process to transmit messages to the server or reference clock.

.....................................................................
. Remote   .  Peer/Poll   .            System            .  Clock   .
. Servers  .  Processes   .           Process            .Discipline.
.          .              .                              . Process  .
.+--------+. +-----------+. +------------+               .          .
.|        |->|           |. |            |               .          .
.|Server 1|  |Peer/Poll 1|->|            |               .          .
.|        |<-|           |. |            |               .          .
.+--------+. +-----------+. |            |               .          .
.          .       ^      . |            |               .          .
.          .       |      . |            |               .          .
.+--------+. +-----------+. |            |  +-----------+.          .
.|        |->|           |. | Selection  |->|           |. +------+ .
.|Server 2|  |Peer/Poll 2|->|    and     |  | Combine   |->| Loop | .
.|        |<-|           |. | Cluster    |  | Algorithm |. |Filter| .
.+--------+. +-----------+. | Algorithms |->|           |. +------+ .
.          .       ^      . |            |  +-----------+.    |     .
.          .       |      . |            |               .    |     .
.+--------+. +-----------+. |            |               .    |     .
.|        |->|           |. |            |               .    |     .
.|Server 3|  |Peer/Poll 3|->|            |               .    |     .
.|        |<-|           |. |            |               .    |     .
.+--------+. +-----------+. +------------+               .    |     .
...................^..........................................|......
                   |                                     .    V     .
                   |                                     . +-----+  .
                   +---------------------------------------| VFO |  .
                                                         . +-----+  .
                                                         .  Clock   .
                                                         .  Adjust  .
                                                         . Process  .
                                                         ............

Figure 2: Implementation Model

These processes operate on a common data structure, called an association, which contains the statistics described above along with various other data described in Section 9. A client sends packets to one or more servers and then processes returned packets when they are received. The server interchanges source and destination addresses and ports, overwrites certain fields in the packet and returns it immediately (in the client/server mode) or at some time later (in the symmetric modes). As each NTP message is received, the offset theta between the peer clock and the system clock is computed along with the associated statistics delta, epsilon and psi.

The system process includes the selection, cluster and combine algorithms that mitigate among the various servers and reference clocks to determine the most accurate and reliable candidates to synchronize the system clock. The selection algorithm uses Byzantine fault detection principles to discard the presumably incorrect candidates called "falsetickers" from the incident population, leaving only good candidates called "truechimers". A truechimer is a clock that maintains timekeeping accuracy to a previously published and trusted standard, while a falseticker is a clock that shows misleading or inconsistent time. The cluster algorithm uses statistical principles to find the most accurate set of truechimers. The combine algorithm computes the final clock offset by statistically averaging the surviving truechimers.

The clock discipline process is a system process that controls the time and frequency of the system clock, here represented as a variable frequency oscillator (VFO). Timestamps struck from the VFO close the feedback loop which maintains the system clock time.
Associated with the clock discipline process is the clock adjust process, which runs once each second to inject a computed time offset and maintain constant frequency. The RMS average of past time offset differences represents the nominal error or system clock jitter. The RMS average of past frequency offset differences represents the oscillator frequency stability or frequency wander. These terms are given precise interpretation in Section 11.3.

A client sends messages to each server with a poll interval of 2^tau seconds, as determined by the poll exponent tau. In NTPv4, tau ranges from 4 (16 s) through 17 (36 h). The value of tau is determined by the clock discipline algorithm to match the loop time constant T_c = 2^tau. In client/server mode the server responds immediately; however, in symmetric modes each of two peers manages tau as a function of current system offset and system jitter, so the two may not agree on the same value. It is important that the dynamic behavior of the clock discipline algorithm be carefully controlled in order to maintain stability in the NTP subnet at large. This requires that the peers agree on a common tau equal to the minimum poll exponent of both peers. The NTP protocol includes provisions to properly negotiate this value.

The implementation model includes some means to set and adjust the system clock. The operating system is assumed to provide two functions, one to set the time directly, for example the Unix settimeofday() function, and another to adjust the time in small increments advancing or retarding the time by a designated amount, for example the Unix adjtime() function. In this and following references, parentheses following a name indicate reference to a function rather than a simple variable.
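The poll interval rules given earlier in this section (an interval of 2^tau seconds, tau between 4 and 17, and peers settling on the smaller of their two exponents) can be illustrated with a short C sketch. The names TAU_MIN, TAU_MAX, poll_interval() and common_poll_exponent() are assumptions made for this example only, and clamping an out-of-range result is a choice of the sketch rather than something stated in the text.

```c
#include <assert.h>

#define TAU_MIN 4    /* 2^4  = 16 s */
#define TAU_MAX 17   /* 2^17 = 131072 s, roughly 36 h */

/* Poll interval in seconds for a given poll exponent. */
long poll_interval(int tau)
{
    assert(tau >= TAU_MIN && tau <= TAU_MAX);
    return 1L << tau;
}

/*
 * Common poll exponent for two symmetric peers: the minimum of both
 * exponents, clamped to the permitted range.
 */
int common_poll_exponent(int tau_local, int tau_peer)
{
    int tau = (tau_local < tau_peer) ? tau_local : tau_peer;

    if (tau < TAU_MIN)
        tau = TAU_MIN;
    if (tau > TAU_MAX)
        tau = TAU_MAX;
    return tau;
}
```

For instance, two peers running at exponents 6 and 10 would both poll every poll_interval(6) = 64 seconds.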
In the intended design the clock discipline process uses the adjtime() function if the adjustment is less than a designated threshold, and the settimeofday() function if above the threshold. The manner in which this is done and the value of the threshold are described in Section 10.

6. Data Types

All NTP time values are represented in twos-complement format, with bits numbered in big-endian (as described in Appendix A of [13]) fashion from zero starting at the left, or high-order, position. There are three NTP time formats, a 128-bit date format, a 64-bit timestamp format and a 32-bit short format, as shown in Figure 3. The 128-bit date format is used where sufficient storage and word size are available. It includes a 64-bit signed seconds field spanning 584 billion years and a 64-bit fraction field resolving .05 attosecond (i.e., 0.05e-18 s). For convenience in mapping between formats, the seconds field is divided into a 32-bit Era Number field and a 32-bit Era Offset field. Eras cannot be produced by NTP directly, nor is there need to do so. When necessary, they can be derived from external means, such as the filesystem or dedicated hardware.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Seconds            |           Fraction            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                        NTP Short Format

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Seconds                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Fraction                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                      NTP Timestamp Format

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Era Number                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Era Offset                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                           Fraction                            |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                        NTP Date Format

Figure 3: NTP Time Formats

The 64-bit timestamp format is used in packet headers and other places with limited word size. It includes a 32-bit unsigned seconds field spanning 136 years and a 32-bit fraction field resolving 232 picoseconds. The 32-bit short format is used in delay and dispersion header fields where the full resolution and range of the other formats are not justified. It includes a 16-bit unsigned seconds field and a 16-bit fraction field.

In the date and timestamp formats the prime epoch, or base date of era 0, is 0 h 1 January 1900 UTC, when all bits are zero. It should be noted that strictly speaking, UTC did not exist prior to 1 January 1972, but it is convenient to assume it has existed for all eternity, even if all knowledge of historic leap seconds has been lost. Dates are relative to the prime epoch; values greater than zero represent times after that date; values less than zero represent times before it. Note that the Era Offset field of the date format and the Seconds field of the timestamp format have the same interpretation.

Timestamps are unsigned values and operations on them produce a result in the same or adjacent eras. Era 0 includes dates from the prime epoch to some time in 2036, when the timestamp field wraps around and the base date for era 1 is established. In either format a value of zero is a special case representing unknown or unsynchronized time. Figure 4 shows a number of historic NTP dates together with their corresponding Modified Julian Day (MJD), NTP era and NTP timestamp.

+-------------+------------+---------+---------------+-----------------------+
| Date        | MJD        | NTP Era | NTP Timestamp | Epoch                 |
|             |            |         | Era Offset    |                       |
+-------------+------------+---------+---------------+-----------------------+
| 1 Jan -4712 | -2,400,001 | -49     | 1,795,583,104 | 1st day Julian        |
| 1 Jan -1    | -679,306   | -14     | 139,775,744   | 2 BCE                 |
| 1 Jan 0     | -678,491   | -14     | 171,311,744   | 1 BCE                 |
| 1 Jan 1     | -678,575   | -14     | 202,939,144   | 1 CE                  |
| 4 Oct 1582  | -100,851   | -3      | 2,873,647,488 | Last day Julian       |
| 15 Oct 1582 | -100,840   | -3      | 2,874,597,888 | First day Gregorian   |
| 31 Dec 1899 | 15,019     | -1      | 4,294,880,896 | Last day NTP Era -1   |
| 1 Jan 1900  | 15,020     | 0       | 0             | First day NTP Era 0   |
| 1 Jan 1970  | 40,587     | 0       | 2,208,988,800 | First day UNIX        |
| 1 Jan 1972  | 41,317     | 0       | 2,272,060,800 | First day UTC         |
| 31 Dec 1999 | 51,543     | 0       | 3,155,587,200 | Last day 20th Century |
| 8 Feb 2036  | 64,731     | 1       | 63,104        | First day NTP Era 1   |
+-------------+------------+---------+---------------+-----------------------+

Figure 4: Interesting Historic NTP Dates

Let p be the number of significant bits in the second fraction. The clock resolution is defined as 2^(-p), in seconds.
In order to minimize bias and help make timestamps unpredictable to an
intruder, the nonsignificant bits should be set to an unbiased random
bit string.  The clock precision is defined as the running time to
read the system clock, in seconds.  Note that the precision defined in
this way can be larger or smaller than the resolution.  The term rho,
representing the precision used in the protocol, is the larger of the
two.

The only arithmetic operation permitted on dates and timestamps is
twos-complement subtraction, yielding a 127-bit or 63-bit signed
result.  It is critical that the first-order differences between two
dates preserve the full 128-bit precision and the first-order
differences between two timestamps preserve the full 64-bit precision.
However, the differences are ordinarily small compared to the seconds
span, so they can be converted to floating double format for further
processing without compromising the precision.

It is important to note that twos-complement arithmetic does not
distinguish between signed and unsigned values (although comparisons
can take sign into account); only the conditional branch instructions
do.  Thus, although the distinction is made between signed dates and
unsigned timestamps, they are processed the same way.  A perceived
hazard is that 64-bit timestamp calculations spanning an era boundary,
such as is possible in 2036, might result in overrun.  In point of
fact, if the client is set within 68 years of the server before the
protocol is started, correct values are obtained even if the client
and server are in adjacent eras.

Some time values are represented in exponent format, including the
precision, time constant and poll interval.  These are in 8-bit signed
integer format in log2 (log base 2) seconds.  The only arithmetic
operations permitted on them are increment and decrement.
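
As a minimal sketch of the timestamp subtraction rule described above,
the following Python fragment computes the signed first-order
difference of two 64-bit timestamps and shows that the result remains
correct across an era boundary.  The names MASK64, ts_diff and
ts_diff_seconds are illustrative assumptions and are not taken from
the skeleton; Python integers are unbounded, so they are masked here
to emulate 64-bit twos-complement arithmetic.

```python
MASK64 = (1 << 64) - 1  # emulate 64-bit wraparound with Python integers


def ts_diff(a: int, b: int) -> int:
    """Return a - b for two raw 64-bit NTP timestamps as a signed value.

    Each argument holds 32-bit seconds in the high word and a 32-bit
    fraction in the low word.  The result is in units of 2^-32 s and is
    meaningful when the true difference is within about 68 years.
    """
    d = (a - b) & MASK64      # twos-complement subtraction modulo 2^64
    if d >= 1 << 63:          # top bit set: interpret as negative
        d -= 1 << 64
    return d


def ts_diff_seconds(a: int, b: int) -> float:
    """Convert the first-order difference to floating double seconds."""
    return ts_diff(a, b) / 2.0**32


# Hypothetical values: a server 5 s into era 1 and a client 5 s before
# the end of era 0.  The raw seconds fields differ by almost 2^32, yet
# the signed difference is the expected 10 s.
server = 5 << 32
client = ((1 << 32) - 5) << 32
print(ts_diff_seconds(server, client))
```

The conversion to floating double happens only after the integer
subtraction, which mirrors the requirement that first-order
differences keep the full 64-bit precision.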
For the purpose of this document and to simplify the presentation, a
reference to one of these variables by name means the exponentiated
value, e.g., the poll interval is 1024 s, while reference by name and
exponent means the actual value, e.g., the poll exponent is 10.

To convert system time in any format to NTP date and timestamp formats
requires that the number of seconds s from the prime epoch to the
system time be determined.  To determine the integer era and timestamp
given s,

   era = s / 2^(32) and timestamp = s - era * 2^(32),

where the division rounds toward negative infinity, which works for
positive and negative dates.  To determine s given the era and
timestamp,

   s = era * 2^(32) + timestamp.

Converting between NTP and system time can be a little messy, and is
beyond the scope of this document.  Note that the number of days in
era 0 is one more than the number of days in most other eras and this
won't happen again until the year 2400 in era 3.

In the description of state variables to follow, explicit reference to
integer type implies a 32-bit unsigned integer.  This simplifies
bounds checks, since only the upper limit needs to be defined.
Without explicit reference, the default type is 64-bit floating
double.  Exceptions will be noted as necessary.

7. Data Structures

The NTP protocol state machines described in the following sections
are defined using state variables and code fragments defined in
Appendix A.  State variables are separated into classes according to
their function in packet headers, peer and poll processes, the system
process and the clock discipline process.  Packet variables represent
the NTP header values in transmitted and received packets.  Peer and
poll variables represent the contents of the association for each
server separately.  System variables represent the state of the server
as seen by its dependent clients.
Clock discipline variables represent the internal workings of the
clock discipline algorithm.  Additional parameters and variable
classes are defined in Appendix A.

7.1. Structure Conventions

In order to distinguish between different variables of the same name
but used in different processes, the naming convention summarized in
Figure 5 is adopted.  A receive packet variable v is a member of the
packet structure r with fully qualified name r.v.  In a similar manner
x.v is a transmit packet variable, p.v is a peer variable, s.v is a
system variable and c.v is a clock discipline variable.  There is a
set of peer variables for each association; there is only one set of
system and clock variables.

+------+---------------------------------+
| Name | Description                     |
+------+---------------------------------+
| r.   | receive packet header variable  |
| x.   | transmit packet header variable |
| p.   | peer/poll variable              |
| s.   | system variable                 |
| c.   | clock discipline variable       |
+------+---------------------------------+

       Figure 5: Prefix Conventions

7.2. Global Parameters

In addition to the variable classes a number of global parameters are
defined in this document, including those shown with values in
Figure 6.

+-----------+-------+----------------------------------+
| Name      | Value | Description                      |
+-----------+-------+----------------------------------+
| PORT      | 123   | NTP port number                  |
| VERSION   | 4     | version number                   |
| TOLERANCE | 15e-6 | frequency tolerance PHI (s/s)    |
| MINPOLL   | 4     | minimum poll exponent (16 s)     |
| MAXPOLL   | 17    | maximum poll exponent (36 h)     |
| MAXDISP   | 16    | maximum dispersion (16 s)        |
| MINDISP   | .005  | minimum dispersion increment (s) |
| MAXDIST   | 1     | distance threshold (1 s)         |
| MAXSTRAT  | 16    | maximum stratum number           |
+-----------+-------+----------------------------------+

              Figure 6: Global Parameters

While these are the only global parameters needed in this document, a
larger collection is necessary in the skeleton and larger still for
any implementation.  Appendix A.1.1 contains those used by the
skeleton for the mitigation algorithms, clock discipline algorithm and
related implementation-dependent functions.  Some of these parameter
values are cast in stone, like the NTP port number assigned by the
IANA and the version number assigned to NTPv4 itself.  Others, like
the frequency tolerance (also called PHI), involve an assumption about
the worst case behavior of a system clock once synchronized and then
allowed to drift when its sources have become unreachable.  The
minimum and maximum parameters define the limits of state variables as
described in later sections of this document.

While shown with fixed values in this document, some implementations
may make them variables adjustable by configuration commands.  For
instance, the reference implementation computes the value of PRECISION
as log2 of the minimum time in several iterations to read the system
clock.

7.3. Packet Header Variables

The most important state variables from an external point of view are
the packet header variables described in Figure 7 and below.
The NTP packet header consists of an integral number of 32-bit
(4 octet) words in network byte order.  The packet format consists of
three components: the header itself, one or more optional extension
fields and an optional message authentication code (MAC).  The header
component is identical to the NTPv3 header and previous versions.  The
optional extension fields are used by the Autokey public key
cryptographic algorithms described in [3].  The optional MAC is used
by both Autokey and the symmetric key cryptographic algorithm
described in this document.

+-----------+------------+-----------------------+
| Name      | Formula    | Description           |
+-----------+------------+-----------------------+
| leap      | leap       | leap indicator (LI)   |
| version   | version    | version number (VN)   |
| mode      | mode       | mode                  |
| stratum   | stratum    | stratum               |
| poll      | poll       | poll exponent         |
| precision | rho        | precision exponent    |
| rootdelay | delta_r    | root delay            |
| rootdisp  | epsilon_r  | root dispersion       |
| refid     | refid      | reference ID          |
| reftime   | reftime    | reference timestamp   |
| org       | T1         | origin timestamp      |
| rec       | T2         | receive timestamp     |
| xmt       | T3         | transmit timestamp    |
| dst       | T4         | destination timestamp |
| keyid     | keyid      | key ID                |
| MAC       | MAC        | message digest        |
+-----------+------------+-----------------------+

         Figure 7: Packet Header Variables

The NTP packet is a UDP datagram [14].  Some fields use multiple words
and others are packed in smaller fields within a word.  The NTP packet
header shown in Figure 8 has 12 words followed by optional extension
fields and finally an optional message authentication code (MAC)
consisting of the key identifier field and message digest field.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|LI | VN  |Mode |    Stratum    |     Poll      |   Precision   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          Root Delay                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Root Dispersion                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Reference ID                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                   Reference Timestamp (64)                    +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                     Origin Timestamp (64)                     +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                    Receive Timestamp (64)                     +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                    Transmit Timestamp (64)                    +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
.                                                               .
.                 Extension Field 1 (variable)                  .
.                                                               .
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
.                                                               .
.                 Extension Field 2 (variable)                  .
.                                                               .
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Key Identifier                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                           MAC (128)                           |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                 Figure 8: Packet Header Format

The extension fields are used to add optional capabilities, for
example, the Autokey security protocol [3].  The extension field
format is presented in order that the packet can be parsed without
knowledge of the extension field functions.  The MAC is used by both
Autokey and the symmetric key authentication scheme described in
Appendix A.

A list of the packet header variables is shown in Figure 7 and
described in detail below.
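
To make the layout of Figure 8 concrete, the following Python sketch
unpacks the 48-octet basic header into the packet variables of
Figure 7.  It is only an illustration under stated assumptions: the
names NTP_HEADER and parse_header are hypothetical, extension fields
and the MAC are ignored, and the short format and timestamp fields are
returned as raw integers without further interpretation.

```python
import struct

# Basic header: one flags octet, stratum, poll, precision, two NTP short
# values, the reference ID and four 64-bit timestamps (48 octets), all
# in network byte order.
NTP_HEADER = struct.Struct("!BBbbII4sQQQQ")


def parse_header(packet: bytes) -> dict:
    """Split the basic NTP header into the packet variables of Figure 7."""
    if len(packet) < NTP_HEADER.size:
        raise ValueError("packet shorter than the 48-octet basic header")
    (flags, stratum, poll, precision, rootdelay, rootdisp, refid,
     reftime, org, rec, xmt) = NTP_HEADER.unpack_from(packet)
    return {
        "leap": flags >> 6,             # LI: top 2 bits
        "version": (flags >> 3) & 0x7,  # VN: next 3 bits
        "mode": flags & 0x7,            # Mode: low 3 bits
        "stratum": stratum,
        "poll": poll,                   # signed, log2 seconds
        "precision": precision,         # signed, log2 seconds
        "rootdelay": rootdelay,         # raw NTP short format
        "rootdisp": rootdisp,           # raw NTP short format
        "refid": refid,                 # 4 raw octets
        "reftime": reftime,             # raw NTP timestamp format
        "org": org,
        "rec": rec,
        "xmt": xmt,
    }


# Hypothetical client request: LI 0, VN 4, mode 3, with an arbitrary
# transmit timestamp and all other timestamps zero.
request = NTP_HEADER.pack((0 << 6) | (4 << 3) | 3, 0, 6, -18,
                          0, 0, b"\x00" * 4, 0, 0, 0, 0x0123456789ABCDEF)
fields = parse_header(request)
print(fields["version"], fields["mode"], fields["precision"])
```

Any octets beyond the first 48 would have to be parsed separately as
extension fields or a MAC.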
Except for a minor variation when using the IPv6 address family, these
fields are backwards compatible with NTPv3.  The packet header fields
apply to both transmitted packets (x prefix) and received packets
(r prefix).  In Figure 8 the size of some multiple-word fields is
shown in bits if not the default 32 bits.  The basic header extends
from the beginning of the packet to the end of the Transmit Timestamp
field.

The fields and associated packet variables (in parentheses) are
interpreted as follows:

LI Leap Indicator (leap): 2-bit integer warning of an impending leap
second to be inserted or deleted in the last minute of the current
month with values defined in Figure 9.

+-------+-------------------------------------------------+
| Value | Meaning                                         |
+-------+-------------------------------------------------+
| 0     | no warning                                      |
| 1     | last minute of the day has 61 seconds           |
| 2     | last minute of the day has 59 seconds           |
| 3     | alarm condition (the clock is not synchronized) |
+-------+-------------------------------------------------+

                 Figure 9: Leap Indicator

VN Version Number (version): 3-bit integer representing the NTP
version number, currently 4.

Mode (mode): 3-bit integer representing the mode, with values defined
in Figure 10.

+-------+--------------------------+
| Value | Meaning                  |
+-------+--------------------------+
| 0     | reserved                 |
| 1     | symmetric active         |
| 2     | symmetric passive        |
| 3     | client                   |
| 4     | server                   |
| 5     | broadcast                |
| 6     | NTP control message      |
| 7     | reserved for private use |
+-------+--------------------------+

    Figure 10: Association Modes

Stratum (stratum): 8-bit integer representing the stratum, with values
defined in Figure 11.
+--------+-----------------------------------------------------+
| Value  | Meaning                                             |
+--------+-----------------------------------------------------+
| 0      | unspecified or invalid                              |
| 1      | primary server (e.g., equipped with a GPS receiver) |
| 2-15   | secondary server (via NTP)                          |
| 16     | unsynchronized                                      |
| 17-255 | reserved                                            |
+--------+-----------------------------------------------------+

                   Figure 11: Packet Stratum

It is customary to map the stratum value 0 in received packets to
MAXSTRAT (16) in the peer variable p.stratum and to map p.stratum
values of MAXSTRAT or greater to 0 in transmitted packets.  This
allows reference clocks, which normally appear at stratum 0, to be
conveniently mitigated using the same algorithms used for external
sources (see Appendix A.5.5.1).

Poll (poll): 8-bit signed integer representing the maximum interval
between successive messages, in log2 seconds.  Suggested default
limits for minimum and maximum poll intervals are 6 and 10,
respectively.

Precision (precision): 8-bit signed integer representing the precision
of the system clock, in log2 seconds.  For instance, a value of -18
corresponds to a precision of about four microseconds (2^(-18) s).
The precision can be determined when the service first starts up as
the minimum time of several iterations to read the system clock.

Root Delay (rootdelay): Total round trip delay to the reference clock,
in NTP short format.

Root Dispersion (rootdisp): Total dispersion to the reference clock,
in NTP short format.

Reference ID (refid): 32-bit code identifying the particular server or
reference clock.  The interpretation depends on the value in the
stratum field.  For packet stratum 0 (unspecified or invalid) this is
a four-character ASCII [15] string, called the kiss code, used for
debugging and monitoring purposes.  For stratum 1 (reference clock)
this is a four-octet, left-justified, zero-padded ASCII string
assigned to the reference clock.
While not specifically enumerated in this document, the identifiers in
Figure 12 have been used as ASCII identifiers:

+------+----------------------------------------------------------+
| ID   | Clock Source                                             |
+------+----------------------------------------------------------+
| GOES | Geosynchronous Orbit Environment Satellite               |
| GPS  | Global Positioning System                                |
| GAL  | Galileo Positioning System                               |
| PPS  | Generic pulse-per-second                                 |
| IRIG | Inter-Range Instrumentation Group                        |
| WWVB | LF Radio WWVB Ft. Collins, CO 60 kHz                     |
| DCF  | LF Radio DCF77 Mainflingen, DE 77.5 kHz                  |
| HBG  | LF Radio HBG Prangins, HB 75 kHz                         |
| MSF  | LF Radio MSF Anthorn, UK 60 kHz                          |
| JJY  | LF Radio JJY Fukushima, JP 40 kHz, Saga, JP 60 kHz       |
| LORC | MF Radio LORAN C station, 100 kHz                        |
| TDF  | MF Radio Allouis, FR 162 kHz                             |
| CHU  | HF Radio CHU Ottawa, Ontario                             |
| WWV  | HF Radio WWV Ft. Collins, CO                             |
| WWVH | HF Radio WWVH Kauai, HI                                  |
| NIST | NIST telephone modem                                     |
| ACTS | NIST telephone modem                                     |
| USNO | USNO telephone modem                                     |
| PTB  | European telephone modem                                 |
+------+----------------------------------------------------------+

                  Figure 12: Reference Identifiers

Above stratum 1 (secondary servers and clients) this is the reference
identifier of the server and can be used to detect timing loops.  If
using the IPv4 address family, the identifier is the four-octet IPv4
address.  If using the IPv6 address family, it is the first four
octets of the MD5 hash of the IPv6 address.  Note that, when using the
IPv6 address family on an NTPv4 server with an NTPv3 client, the
Reference Identifier field appears to be a random value and a timing
loop might not be detected.

Reference Timestamp (reftime): Time when the system clock was last set
or corrected, in NTP timestamp format.

Origin Timestamp (org): Time at the client when the request departed
for the server, in NTP timestamp format.
Receive Timestamp (rec): Time at the server when the request arrived
from the client, in NTP timestamp format.

Transmit Timestamp (xmt): Time at the server when the response left
for the client, in NTP timestamp format.

Destination Timestamp (dst): Time at the client when the reply arrived
from the server, in NTP timestamp format.

Note: The Destination Timestamp field is not included as a header
field; it is determined upon arrival of the packet and made available
in the packet buffer data structure.

The MAC consists of the Key Identifier followed by the Message Digest.
The message digest, or cryptosum, is calculated as in [16] over all
NTP header and optional extension fields, but not the MAC itself.

Extension Field n: See Section 7.5 for a description of the format of
this field.

Key Identifier (keyid): 32-bit unsigned integer used by the client and
server to designate a secret 128-bit MD5 key.

Message Digest (digest): 128-bit bitstring computed by the keyed MD5
message digest algorithm over all the words in the header and
extension fields, but not the MAC itself.

7.4. The Kiss-o'-Death Packet

If the Stratum field is 0, which implies unspecified or invalid, the
Reference Identifier field can be used to convey messages useful for
status reporting and access control.  These are called Kiss-o'-Death
(KoD) packets and the ASCII messages they convey are called kiss
codes.  The KoD packets got their name because an early use was to
tell clients to stop sending packets that violate server access
controls.  The kiss codes can provide useful information for an
intelligent client, either NTPv4 or SNTPv4.  Kiss codes are encoded in
four-character ASCII strings left justified and zero filled.  The
strings are designed for character displays and log files.  A list of
the currently-defined kiss codes is given in Figure 13.
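
As a small illustration of the encoding just described, the Python
sketch below recovers the kiss code from the Stratum and Reference
Identifier fields of a received packet.  The function name kiss_code
is a hypothetical choice for this example, and treating undecodable
octets with a replacement character is an assumption rather than
something this document specifies.

```python
from typing import Optional


def kiss_code(stratum: int, refid: bytes) -> Optional[str]:
    """Return the kiss code carried by a packet, or None if not a KoD.

    stratum is the 8-bit packet stratum and refid holds the 4 raw
    octets of the Reference Identifier field.  An empty string means
    stratum 0 with an all-zero Reference Identifier.
    """
    if stratum != 0 or len(refid) != 4:
        return None
    # Codes are left justified and zero filled, so strip trailing zeros.
    return refid.rstrip(b"\x00").decode("ascii", errors="replace")


print(kiss_code(0, b"RATE"))
print(kiss_code(1, b"GPS\x00"))
```

For a stratum 1 packet the same four octets identify a reference
clock, so the function returns None rather than a code.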
Recipients of kiss codes MUST inspect them and in the following cases
take these actions:

a. For kiss codes DENY and RSTR, the client MUST demobilize any
   associations to that server and stop sending packets to that
   server;

b. For kiss code RATE, the client MUST immediately reduce its polling
   rate to that server, that is, increase its poll interval, and
   continue to do so each time it receives a RATE kiss code.

c. Other than the above conditions, KoD packets have no protocol
   significance and are discarded after inspection.

+------+------------------------------------------------------------+
| Code | Meaning                                                    |
+------+------------------------------------------------------------+
| ACST | The association belongs to a unicast server                |
| AUTH | Server authentication failed                               |
| AUTO | Autokey sequence failed                                    |
| BCST | The association belongs to a broadcast server              |
| CRYP | Cryptographic authentication or identification failed      |
| DENY | Access denied by remote server                             |
| DROP | Lost peer in symmetric mode                                |
| RSTR | Access denied due to local policy                          |
| INIT | The association has not yet synchronized for the first     |
|      | time                                                       |
| MCST | The association belongs to a dynamically discovered server |
| NKEY | No key found. Either the key was never installed or is     |
|      | not trusted                                                |
| RATE | Rate exceeded. The server has temporarily denied access    |
|      | because the client exceeded the rate threshold             |
| RMOT | Alteration of association from a remote host running       |
|      | ntpdc.                                                     |
| STEP | A step change in system time has occurred, but the         |
|      | association has not yet resynchronized                     |
+------+------------------------------------------------------------+

                       Figure 13: Kiss Codes

7.5. NTP Extension Field Format

In NTPv4 one or more extension fields can be inserted after the header
and before the MAC, which is always present when an extension field is
present.  Other than defining the field format, this document makes no
use of the field contents.  An extension field contains a request or
response message in the format shown in Figure 14.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Field Type           |            Length             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
.                             Value                             .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Padding (as needed)                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

               Figure 14: Extension Field Format

All extension fields are zero-padded to a word (4 octets) boundary.
The Field Type field is specific to the defined function and is not
elaborated here.  While the minimum field length containing required
fields is 4 words (16 octets), a maximum field length remains to be
established.

The Length field is a 16-bit unsigned integer which indicates the
length of the entire extension field in octets, including the Padding
field.

8. On Wire Protocol

The heart of the NTP on-wire protocol is the core mechanism which
exchanges time values between servers, peers and clients.  It is
inherently resistant to lost or duplicate packets.  Data integrity is
provided by the IP and UDP checksums.  No flow control or
retransmission facilities are provided or necessary.  The protocol
uses timestamps, either extracted from packet headers or struck from
the system clock upon the arrival or departure of a packet.
Timestamps are precision data and should be restruck in case of link
level retransmission and corrected for the time to compute a MAC on
transmit.

NTP messages make use of two different communication modes, one-to-one
and one-to-many, commonly referred to as unicast and broadcast.  For
the purposes of this document, the term broadcast is interpreted as
any available one-to-many mechanism.  For IPv4 this equates to either
IPv4 broadcast or IPv4 multicast.  For IPv6 this equates to IPv6
multicast.
For this purpose, IANA has allocated the IPv4 multicast address
224.0.1.1 and the IPv6 multicast address ending :101, with prefix
determined by scoping rules.

The on-wire protocol uses four timestamps numbered t1 through t4 and
three state variables org, rec and xmt, as shown in Figure 15.  This
figure shows the most general case where each of two peers, A and B,
independently measures the offset and delay relative to the other.
For purposes of illustration the packet timestamps are shown in lower
case, while the state variables are shown in upper case.  The state
variables are copied from the packet timestamps upon arrival or
departure of a packet.

          t2      t3                 t6          t7
     +---------+ +---------+     +---------+ +---------+
     |    0    | |    t1   |     |   t3    | |    t5   |
     +---------+ +---------+     +---------+ +---------+
     |    0    | |    t2   |     |   t4    | |    t6   | Packet
     +---------+ +---------+     +---------+ +---------+ Timestamps
     |   t1    | |t3=clock |     |   t5    | |t7=clock |
     +---------+ +---------+     +---------+ +---------+
     |t2=clock |                 |t6=clock |
     +---------+                 +---------+
                                                           Peer B
     +---------+ +---------+     +---------+ +---------+
org  |   T1    | |    T1   |     | t5<>T1? | |    T5   |
     +---------+ +---------+     +---------+ +---------+   State
rec  |   T2    | |    T2   |     |   T6    | |    T6   | Variables
     +---------+ +---------+     +---------+ +---------+
xmt  |    0    | |    T3   |     |  t3=T3? | |    T7   |
     +---------+ +---------+     +---------+ +---------+

              t2      t3                 t6          t7
     ---------------------------------------------------------
               /\         \                 /\            \
               /           \                /              \
              /             \              /                \
             /               \/           /                  \/
     ---------------------------------------------------------
          t1                t4         t5                  t8

         t1            t4            t5             t8
     +---------+ +---------+     +---------+ +---------+
     |    0    | |    t1   |     |   t3    | |    t5   |
     +---------+ +---------+     +---------+ +---------+
     |    0    | |    t2   |     |   t4    | |    t6   | Packet
     +---------+ +---------+     +---------+ +---------+ Timestamps
     |t1=clock | |    t3   |     |t5=clock | |    t7   |
     +---------+ +---------+     +---------+ +---------+
                 |t4=clock |                 |t8=clock |
                 +---------+                 +---------+
                                                           Peer A
     +---------+ +---------+     +---------+ +---------+
org  |    0    | |  t3<>0? |     |   T3    | | t7<>T3? |
     +---------+ +---------+     +---------+ +---------+   State
rec  |    0    | |    T4   |     |   T4    | |    T8   | Variables
     +---------+ +---------+     +---------+ +---------+
xmt  |   T1    | |  t1=T1? |     |   T5    | |  t5=T5? |
     +---------+ +---------+     +---------+ +---------+

                   Figure 15: On-Wire Protocol

In the figure the first packet transmitted by A contains only the
origin timestamp t1, which is then copied to T1.  B receives the
packet at t2 and copies t1 to T1 and the receive timestamp t2 to T2.
At this time or some time later at t3, B sends a packet to A
containing t1 and t2 and in addition the transmit timestamp t3.  All
three timestamps are copied to the corresponding state variables.  A
receives the packet at t4 containing the three timestamps t1, t2 and
t3 and in addition the destination timestamp t4.  These four
timestamps are used to compute the offset and delay of B relative to
A, as described below.

Before the xmt and org state variables are updated, two sanity checks
are performed in order to protect against duplicate, bogus or replayed
packets.  In the exchange above, a packet is duplicate or replay if
the transmit timestamp t3 in the packet matches the org state variable
T3.
A packet is bogus if the origin timestamp t1 in the packet does not
match the xmt state variable T1.  In either of these cases the state
variables are updated, then the packet is discarded.  To protect
against replay of the last transmitted packet, the xmt state variable
is set to zero immediately after a successful bogus check.

The four most recent timestamps, T1 through T4, are used to compute
the offset of B relative to A

   theta = T(B) - T(A) = 1/2 * [(T2-T1) + (T3-T4)]

and the round trip delay

   delta = T(ABA) = (T4-T1) - (T3-T2).

Note that the quantities within parentheses are computed from 64-bit
unsigned timestamps and result in signed values with 63 significant
bits plus sign.  These values can represent dates from 68 years in the
past to 68 years in the future.  However, the offset and delay are
computed as sums and differences of these values, which contain 62
significant bits and two sign bits, so can represent unambiguous
values from 34 years in the past to 34 years in the future.  In other
words, the time of the client must be set within 34 years of the
server before the service is started.  This is a fundamental
limitation with 64-bit integer arithmetic.

In implementations where floating double arithmetic is available, the
first-order differences can be converted to floating double and the
second-order sums and differences computed in that arithmetic.  Since
the second-order terms are typically very small relative to the
timestamp magnitudes, there is no loss in significance, yet the
unambiguous range is restored from 34 years to 68 years.

In some scenarios where the initial frequency offset of the client is
relatively large and the actual propagation time small, it is possible
for the delay computation to become negative.  For instance, if the
frequency difference is 100 PPM and the interval T4-T1 is 64 s, the
apparent delay is -6.4 ms.
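
The offset and delay equations above can be sketched in Python as
follows, taking the first-order differences in 64-bit twos-complement
arithmetic and the second-order sums in floating double, as the text
suggests.  The names first_order, offset_delay and to_ts are
assumptions made for this example, and the timestamp values are
invented purely to exercise the arithmetic.

```python
MASK64 = (1 << 64) - 1  # emulate 64-bit wraparound with Python integers


def first_order(a: int, b: int) -> float:
    """Signed 64-bit twos-complement difference a - b, in float seconds."""
    d = (a - b) & MASK64
    if d >= 1 << 63:
        d -= 1 << 64
    return d / 2.0**32


def offset_delay(t1: int, t2: int, t3: int, t4: int):
    """Return (theta, delta) in seconds from four raw NTP timestamps."""
    theta = 0.5 * (first_order(t2, t1) + first_order(t3, t4))
    delta = first_order(t4, t1) - first_order(t3, t2)
    return theta, delta


def to_ts(seconds: float) -> int:
    """Hypothetical helper: seconds within an era to a raw timestamp."""
    return int(seconds * 2**32) & MASK64


# Invented exchange: B runs 1 s ahead of A, each path takes 1/16 s and
# B holds the packet for 1/64 s before replying.
T1, T2 = to_ts(100.0), to_ts(101.0625)
T3, T4 = to_ts(101.078125), to_ts(100.140625)
theta, delta = offset_delay(T1, T2, T3, T4)
print(theta, delta)
```

With these inputs the sketch recovers the 1 s offset and a round trip
delay of 0.125 s, the sum of the two path delays, because the time B
holds the packet is subtracted out.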
Since negative values are misleading in subsequent computations, the
value of delta should be clamped to be not less than s.rho, where
s.rho is the system precision described in Section 11.1, expressed in
seconds.

The discussion above assumes the most general case where two symmetric
peers independently measure the offsets and delays between them.  In
the case of a stateless server, the protocol can be simplified.  A
stateless server copies T3 and T4 from the client packet to T1 and T2
of the server packet and tacks on the transmit timestamp T3 before
sending it to the client.  Additional details for filling in the
remaining protocol fields are given in Section 9 and following
sections and in the appendix.

Note that the on-wire protocol as described resists replay of a server
response packet.  However, it does not resist replay of the client
request packet, which would result in a server reply packet with new
values of T2 and T3 and result in incorrect offset and delay.  This
vulnerability can be avoided by setting the xmt state variable to zero
after computing the offset and delay.

9. Peer Process

The process descriptions to follow include a listing of the important
state variables followed by an overview of the process operations
implemented as routines.  Frequent reference is made to the skeleton
in the appendix.  The skeleton includes C-language fragments that
describe the functions in more detail.  It includes the parameters,
variables and declarations necessary for a conforming NTPv4
implementation.  However, many additional variables and routines may
be necessary in a working implementation.

The peer process is called upon arrival of a server or peer packet.
It runs the on-wire protocol to determine the clock offset and round
trip delay and in addition computes statistics used by the system and
poll processes.  Peer variables are instantiated in the association
data structure when the structure is initialized and updated by
arriving packets.
There is a peer process, poll process and association for each server.

9.1. Peer Process Variables

Figure 16, Figure 17, Figure 18 and Figure 19 summarize the common
names, formula names and a short description of the peer variables.
The common names and formula names are interchangeable; formula names
are intended to increase readability of equations in this
specification.  Unless noted otherwise, all peer variables have
assumed prefix p.

+---------+----------+-------------------------+
| Name    | Formula  | Description             |
+---------+----------+-------------------------+
| srcaddr | srcaddr  | source address          |
| srcport | srcport  | source port             |
| dstaddr | dstaddr  | destination address     |
| dstport | dstport  | destination port        |
| keyid   | keyid    | key identifier (key ID) |
+---------+----------+-------------------------+

   Figure 16: Peer Process Configuration Variables

+-----------+------------+---------------------+
| Name      | Formula    | Description         |
+-----------+------------+---------------------+
| leap      | leap       | leap indicator      |
| version   | version    | version number      |
| mode      | mode       | mode                |
| stratum   | stratum    | stratum             |
| ppoll     | ppoll      | peer poll exponent  |
| rootdelay | delta_r    | root delay          |
| rootdisp  | epsilon_r  | root dispersion     |
| refid     | refid      | reference ID        |
| reftime   | reftime    | reference timestamp |
+-----------+------------+---------------------+

      Figure 17: Peer Process Packet Variables

+------+---------+--------------------+
| Name | Formula | Description        |
+------+---------+--------------------+
| org  | T1      | origin timestamp   |
| rec  | T2      | receive timestamp  |
| xmt  | T3      | transmit timestamp |
| t    | t       | packet time        |
+------+---------+--------------------+

 Figure 18: Peer Process Timestamp Variables

+--------+---------+-----------------+
| Name   | Formula | Description     |
+--------+---------+-----------------+
| offset | theta   | clock offset    |
| delay  | delta   | roundtrip delay |
| disp   | epsilon | dispersion      |
| jitter | psi     | jitter          |
| filter | filter  | clock filter    |
| tp     | t_p     | filter time     |
+--------+---------+-----------------+

 Figure 19: Peer Process Statistics Variables

The following configuration variables are normally initialized when
the association is mobilized, either from a configuration file or upon
the arrival of the first packet for an unknown association.

srcaddr: IP address of the remote server or reference clock.  This
becomes the destination IP address in packets sent from this
association.

srcport: UDP port number of the server or reference clock.  This
becomes the destination port number in packets sent from this
association.  When operating in symmetric modes (1 and 2) this field
must contain the NTP port number PORT (123) assigned by the IANA.  In
other modes it can contain any number consistent with local policy.

dstaddr: IP address of the client.  This becomes the source IP address
in packets sent from this association.

dstport: UDP port number of the client, ordinarily the NTP port number
PORT (123) assigned by the IANA.  This becomes the source port number
in packets sent from this association.

keyid: Symmetric key ID for the 128-bit MD5 key used to generate and
verify the MAC.  The client and server or peer can use different
values, but they must map to the same key.

The variables defined in Figure 17 are updated from the packet header
as each packet arrives.  They are interpreted in the same way as the
packet variables of the same names.  It is convenient for later
processing to convert the NTP short format packet values r.rootdelay
and r.rootdisp to floating doubles as peer variables.
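
The conversion mentioned above can be sketched in Python as follows,
together with the reverse conversion needed when such a value is
written into a packet.  The function names short_to_double and
double_to_short are hypothetical, and the clamping of out-of-range
values in the reverse direction is an assumption of this sketch, not a
requirement stated in this document.

```python
def short_to_double(value: int) -> float:
    """Convert a raw 32-bit NTP short format value to seconds.

    The high 16 bits are unsigned seconds and the low 16 bits are the
    fraction, so the value is scaled by 2^-16.
    """
    return (value & 0xFFFFFFFF) / 65536.0


def double_to_short(seconds: float) -> int:
    """Convert seconds to NTP short format, rounded to the nearest 2^-16 s.

    Assumption: values outside the representable range are clamped.
    """
    raw = int(round(seconds * 65536.0))
    return max(0, min(raw, 0xFFFFFFFF))


# A root delay of 1.5 s is 1 in the seconds half and 0x8000 in the
# fraction half.
print(short_to_double(0x00018000))
print(hex(double_to_short(1.5)))
```

Values that are exact multiples of 2^-16 s survive the round trip
unchanged; other values are rounded to the nearest representable step.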
The variables defined in Figure 18 include the timestamps exchanged by the on-wire protocol in Section 8. The t variable is the seconds counter c.t associated with these values. The c.t variable is maintained by the clock adjust process described in Section 12. It counts the seconds since the service was started. The variables defined in Figure 19 include the statistics computed by the clock_filter() routine described in Section 10. The tp variable is the seconds counter associated with these values.

9.2. Peer Process Operations

The receive() routine in Appendix A.5.1 shows the peer process code flow upon the arrival of a packet. The access() routine in Appendix A.5.4 implements access restrictions using an access control list (ACL). There is no specific method required for access control, although it is recommended that implementations include such a scheme, which is similar to many others now in widespread use. Format checks require correct field length and alignment, acceptable version number (1-4) and correct extension field syntax, if present.

There is no specific requirement for authentication; however, if authentication is implemented, the symmetric key scheme described in Appendix A.2 must be among the supported schemes. This scheme uses the MD5 keyed hash algorithm described in [16].

Next, the association table is searched for matching source address and source port using the find_assoc() routine in Appendix A.5.1. Figure 20 is a dispatch table where the columns correspond to the packet mode and rows correspond to the association mode. The intersection of the association and packet modes dispatches processing to one of the following steps.
   +------------------+---------------------------------------+
   |                  |              Packet Mode              |
   +------------------+-------+-------+-------+-------+-------+
   | Association Mode |   1   |   2   |   3   |   4   |   5   |
   +------------------+-------+-------+-------+-------+-------+
   | No Association 0 | NEWPS | DSCRD | FXMIT | MANY  | NEWBC |
   | Symm. Active   1 | PROC  | PROC  | DSCRD | DSCRD | DSCRD |
   | Symm. Passive  2 | PROC  | ERR   | DSCRD | DSCRD | DSCRD |
   | Client         3 | DSCRD | DSCRD | DSCRD | PROC  | DSCRD |
   | Server         4 | DSCRD | DSCRD | DSCRD | DSCRD | DSCRD |
   | Broadcast      5 | DSCRD | DSCRD | DSCRD | DSCRD | DSCRD |
   | Bcast Client   6 | DSCRD | DSCRD | DSCRD | DSCRD | PROC  |
   +------------------+-------+-------+-------+-------+-------+

   Figure 20: Peer Dispatch Table

DSCRD. This indicates a nonfatal violation of protocol as the result of a programming error, long delayed packet or replayed packet. The peer process discards the packet and exits.

ERR. This indicates a fatal violation of protocol as the result of a programming error, long delayed packet or replayed packet. The peer process discards the packet, demobilizes the symmetric passive association and exits.

FXMIT. This indicates a client (mode 3) packet matching no association (mode 0). If the destination address is not a broadcast address, the server constructs a server (mode 4) packet and returns it to the client without retaining state. The server packet header is constructed by the fast_xmit() routine in Appendix A.5.3. The packet header is assembled from the receive packet and system variables as shown in Figure 21. If the s.rootdelay and s.rootdisp system variables are stored in floating double, they must be converted to NTP short format first.
   +-----------------------------------+
   | Packet Variable -->   Variable    |
   +-----------------------------------+
   | r.leap          -->   p.leap      |
   | r.mode          -->   p.mode      |
   | r.stratum       -->   p.stratum   |
   | r.poll          -->   p.ppoll     |
   | r.rootdelay     -->   p.rootdelay |
   | r.rootdisp      -->   p.rootdisp  |
   | r.refid         -->   p.refid     |
   | r.reftime       -->   p.reftime   |
   | r.keyid         -->   p.keyid     |
   +-----------------------------------+

   Figure 21: Receive Packet Header

Note that, if authentication fails, the server returns a special message called a crypto-NAK. This message includes the normal NTP header data shown in Figure 8, but with a MAC consisting of four octets of zeros. The client MAY accept or reject the data in the message. After these actions the peer process exits.

If the destination address is a multicast address, the sender is operating in manycast client mode. If the packet is valid and the server stratum is less than the client stratum, the server sends an ordinary server (mode 4) packet, but using its unicast destination address. A crypto-NAK is not sent if authentication fails. After these actions the peer process exits.

MANY. This indicates a server (mode 4) packet matching no association. Ordinarily, this can happen only as the result of a manycast server reply to a previously sent multicast client packet. If the packet is valid, an ordinary client (mode 3) association is mobilized and operation continues as if the association was mobilized by the configuration file.

NEWBC. This indicates a broadcast (mode 5) packet matching no association. The client mobilizes either a client (mode 3) or broadcast client (mode 6) association as shown in the mobilize() and clear() routines in Appendix A.2. Then the packet() routine in Appendix A.5.1.1 validates the packet and initializes the peer variables.
If the implementation supports no additional security or calibration functions, the association mode is set to broadcast client (mode 6) and the peer process exits. Implementations supporting public key authentication MAY run the Autokey or equivalent security protocol. Implementations SHOULD set the association mode to 3 and run a short client/server exchange to determine the propagation delay. Following the exchange the association mode is set to 6 and the peer process continues in listen-only mode. Note the distinction between a mode-6 packet, which is reserved for the NTP monitor and control functions, and a mode-6 association.

NEWPS. This indicates a symmetric active (mode 1) packet matching no association. The client mobilizes a symmetric passive (mode 2) association as shown in the mobilize() and clear() routines in Appendix A.2. Processing continues in the PROC section below.

PROC. This indicates a packet matching an existing association. The packet timestamps are carefully checked to avoid invalid, duplicate or bogus packets. Additional checks are summarized in Figure 22. Note that all packets, including a crypto-NAK, are considered valid only if they survive these tests.

   +--------------------------+----------------------------------------+
   | Packet Type              | Description                            |
   +--------------------------+----------------------------------------+
   | 1 duplicate packet       | The packet is at best an old duplicate |
   |                          | or at worst a replay by a hacker.      |
   |                          | This can happen in symmetric modes if  |
   |                          | the poll intervals are uneven.         |
   | 2 bogus packet           |                                        |
   | 3 invalid                | One or more timestamp fields are       |
   |                          | invalid. This normally happens in      |
   |                          | symmetric modes when one peer sends    |
   |                          | the first packet to the other and      |
   |                          | before the other has received its      |
   |                          | first reply.                           |
   | 4 access denied          | The access controls have blacklisted   |
   |                          | the source.                            |
   | 5 authentication failure | The cryptographic message digest does  |
   |                          | not match the MAC.                     |
   | 6 unsynchronized         | The server is not synchronized to a    |
   |                          | valid source.                          |
   | 7 bad header data        | One or more header fields are invalid. |
   +--------------------------+----------------------------------------+

   Figure 22: Packet Error Checks

Processing continues in the packet() routine in Appendix A.5.1.1. It copies the packet variables to the peer variables as shown in Figure 21 and the packet() routine in Appendix A.5.2. The receive() routine implements tests 1-5 in Figure 22; the packet() routine implements tests 6-7. If errors are found the packet is discarded and the peer process exits.

The on-wire protocol calculates the clock offset theta and round trip delay delta from the four most recent timestamps as described in Section 8. While it is in principle possible to do all calculations except the first-order timestamp differences in fixed-point arithmetic, it is much easier to convert the first-order differences to floating doubles and do the remaining calculations in that arithmetic, and this will be assumed in the following description.

Next, the 8-bit p.reach shift register in the poll process described in Section 13 is used to determine whether the server is reachable and the data are fresh. The register is shifted left by one bit when a packet is sent and the rightmost bit is set to zero. As valid packets arrive, the packet() routine sets the rightmost bit to one. If the register contains any nonzero bits, the server is considered reachable; otherwise, it is unreachable. Since the peer poll interval might have changed since the last packet, the poll_update() routine in Appendix A.5.7.2 is called to redetermine the host poll interval.
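
The behavior of the p.reach register described above can be sketched in a few lines. This is an illustration only; the function names are hypothetical and the authoritative logic is in the poll and peer process routines of the appendix.

```python
# Sketch only: models the 8-bit reach shift register described in the text.
# Function names are illustrative, not taken from the code skeleton.

def reach_on_poll(reach):
    """Shift left one bit when a packet is sent; the rightmost bit becomes 0."""
    return (reach << 1) & 0xFF


def reach_on_valid_packet(reach):
    """Set the rightmost bit when a valid packet arrives."""
    return reach | 0x01


def is_reachable(reach):
    """The server is reachable if any bit in the register is nonzero."""
    return reach != 0


reach = 0
reach = reach_on_poll(reach)          # packet sent
reach = reach_on_valid_packet(reach)  # valid reply received
```

Under this model a server that stops answering remains reachable for seven further polls and becomes unreachable on the eighth, when the last set bit is shifted out of the register.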
The dispersion statistic epsilon(t) represents the maximum error due to the frequency tolerance and time since the last packet was sent. It is initialized

   epsilon(t_0) = r.rho + s.rho + PHI * (T4-T1)

when the measurement is made at t_0 according to the seconds counter. Here r.rho is the packet precision described in Section 7.3 and s.rho is the system precision described in Section 11.1, both expressed in seconds. These terms are necessary to account for the uncertainty in reading the system clock in both the server and the client.

The dispersion then grows at constant rate PHI; in other words, at time t,

   epsilon(t) = epsilon(t_0) + PHI * (t-t_0).

With the default value PHI = 15 PPM, this amounts to about 1.3 s per day. With this understanding, the argument t will be dropped and the dispersion represented simply as epsilon. The remaining statistics are computed by the clock filter algorithm described in the next section.

10. Clock Filter Algorithm

The clock filter algorithm, part of the peer process, is implemented in the clock_filter() routine of Appendix A.5.2. It grooms the stream of on-wire data to select the samples most likely to represent accurate time. The algorithm produces the variables shown in Figure 19, including the offset (theta), delay (delta), dispersion (epsilon), jitter (psi) and time of arrival (t). These data are used by the mitigation algorithms to determine the best and final offset used to discipline the system clock. They are also used to determine the server health and whether it is suitable for synchronization.

The clock filter algorithm saves the most recent sample tuples (theta, delta, epsilon, t) in the filter structure, which functions as an 8-stage shift register. The tuples are saved in the order that packets arrive. Here t is the packet time of arrival according to the seconds counter and should not be confused with the peer variable tp.
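
To make the epsilon component of a sample tuple concrete, the following sketch evaluates the two dispersion formulas given at the end of Section 9.2. It assumes PHI = 15 PPM as stated there and that precisions are supplied as log2 exponents and converted to seconds; the function names are hypothetical.

```python
# Sketch only: evaluates the dispersion formulas from Section 9.2.
# PHI is the default frequency tolerance quoted in the text (15 PPM).
PHI = 15e-6


def precision_seconds(exponent):
    """Convert a log2 precision value (e.g. -18) to seconds."""
    return 2.0 ** exponent


def initial_dispersion(r_rho, s_rho, t1, t4):
    """epsilon(t_0) = r.rho + s.rho + PHI * (T4 - T1), all in seconds."""
    return r_rho + s_rho + PHI * (t4 - t1)


def dispersion_at(eps0, t0, t):
    """epsilon(t) = epsilon(t_0) + PHI * (t - t_0)."""
    return eps0 + PHI * (t - t0)


# Growth over one day starting from zero dispersion: 15e-6 * 86400 s.
one_day = dispersion_at(0.0, 0, 86400)
```

The one-day figure computed this way is 1.296 s, which agrees with the "about 1.3 s per day" quoted in the text.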
The following scheme is used to ensure sufficient samples are in the filter and that old stale data are discarded. Initially, the tuples of all stages are set to the dummy tuple (0, MAXDISP, MAXDISP, 0). As valid packets arrive, tuples are shifted into the filter causing old tuples to be discarded, so eventually only valid tuples remain. If the three low order bits of the reach register are zero, indicating three poll intervals have expired with no valid packets received, the poll process calls the clock filter algorithm with a dummy tuple just as if the tuple had arrived from the network. If this persists for eight poll intervals, the register returns to the initial condition.

In the next step the shift register stages are copied to a temporary list and the list sorted by increasing delta. Let i index the stages starting with the lowest delta. If the first tuple epoch t_0 is not later than the last valid sample epoch tp, the routine exits without affecting the current peer variables. Otherwise, let epsilon_i be the dispersion of the ith entry, then

                i=n-1
                ---     epsilon_i
   epsilon =    \      ----------
                /         (i+1)
                ---      2
                i=0

is the peer dispersion p.disp. Note the overload of epsilon: whether it is the input to the clock filter or the output, the meaning should be clear from context.

The observer should note (a) if all stages contain the dummy tuple with dispersion MAXDISP, the computed dispersion is a little less than 16 s, (b) each time a valid tuple is shifted into the register, the dispersion drops by a little less than half, depending on the valid tuple's dispersion, (c) after the fourth valid packet the dispersion is usually a little less than 1 s, which is the assumed value of the MAXDIST parameter used by the selection algorithm to determine whether the peer variables are acceptable or not.
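
Observations (a) through (c) can be checked numerically with a small sketch of the weighted sum. It assumes MAXDISP = 16 s and an 8-stage filter as described above, and it deliberately omits the aging of stored dispersions at rate PHI and the epoch test; new_filter(), shift_in() and peer_dispersion() are hypothetical names.

```python
# Sketch only: the peer dispersion sum over an 8-stage filter.
# Tuples are (theta, delta, epsilon, t); aging by PHI is omitted.
MAXDISP = 16.0
NSTAGE = 8


def new_filter():
    """All stages start as the dummy tuple (0, MAXDISP, MAXDISP, 0)."""
    return [(0.0, MAXDISP, MAXDISP, 0)] * NSTAGE


def shift_in(filt, sample):
    """Shift a new tuple into the register, discarding the oldest."""
    return [sample] + filt[:-1]


def peer_dispersion(filt):
    """Sort by increasing delta, then sum epsilon_i / 2^(i+1)."""
    ordered = sorted(filt, key=lambda s: s[1])
    return sum(s[2] / 2 ** (i + 1) for i, s in enumerate(ordered))


filt = new_filter()
empty = peer_dispersion(filt)                 # observation (a)
filt = shift_in(filt, (0.002, 0.010, 0.001, 64))
after_one = peer_dispersion(filt)             # observation (b)
for k in range(2, 5):
    filt = shift_in(filt, (0.002, 0.010, 0.001, 64 * k))
after_four = peer_dispersion(filt)            # observation (c)
```

With these illustrative sample values the dispersion is 15.9375 s for an empty filter, a little under 8 s after one valid sample and a little under 1 s after four.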
Let the first stage offset in the sorted list be theta_0; then, for the other stages in any order, the jitter is the RMS average

                         +-----                 -----+^1/2
                         |  n-1                      |
                         |  ---                      |
                 1       |  \                     2  |
     psi   =  -------- * |  /    (theta_0-theta_j)   |
               (n-1)     |  ---                      |
                         |  j=1                      |
                         +-----                 -----+

where n is the number of valid tuples in the filter (n > 1). In order to ensure consistency and avoid divide exceptions in other computations, psi is bounded from below by the system precision s.rho expressed in seconds. While not in general considered a major factor in ranking server quality, jitter is a valuable indicator of fundamental timekeeping performance and network congestion state.

Of particular importance to the mitigation algorithms is the peer synchronization distance, which is computed from the delay and dispersion:

   lambda = (delta / 2) + epsilon.

Note that epsilon and therefore lambda increase at rate PHI. The lambda is not a state variable, since lambda is recalculated at each use. It is a component of the root synchronization distance used by the mitigation algorithms as a metric to evaluate the quality of time available from each server.

It is important to note that, unlike NTPv3, NTPv4 associations do not show a timeout condition by setting the stratum to 16 and leap indicator to 3. The association variables retain the values determined upon arrival of the last packet. In NTPv4 lambda increases with time, so eventually the synchronization distance exceeds the distance threshold MAXDIST, in which case the association is considered unfit for synchronization.

11. System Process

As each new sample (theta, delta, epsilon, jitter, t) is produced by the clock filter algorithm, all peer processes are scanned by the mitigation algorithms consisting of the selection, cluster, combine and clock discipline algorithms in the system process.
The selection algorithm scans all associations and casts off the falsetickers, which have demonstrably incorrect time, leaving the truechimers as result. In a series of rounds the cluster algorithm discards the association statistically furthest from the centroid until a specified minimum number of survivors remain. The combine algorithm produces the best and final statistics on a weighted average basis. The final offset is passed to the clock discipline algorithm to steer the system clock to the correct time.

The cluster algorithm selects one of the survivors as the system peer. The associated statistics (theta, delta, epsilon, jitter, t) are used to construct the system variables inherited by dependent servers and clients and made available to other applications running on the same machine.

11.1. System Process Variables

Figure 25 summarizes the common names, formula names and a short description of each system variable. Unless noted otherwise, all variables have assumed prefix s.

   +-----------+------------+------------------------+
   | Name      | Formula    | Description            |
   +-----------+------------+------------------------+
   | t         | t          | update time            |
   | p         | p          | system peer identifier |
   | leap      | leap       | leap indicator         |
   | stratum   | stratum    | stratum                |
   | precision | rho        | precision              |
   | offset    | THETA      | combined offset        |
   | jitter    | PSI        | combined jitter        |
   | rootdelay | DELTA      | root delay             |
   | rootdisp  | EPSILON    | root dispersion        |
   | v         | v          | survivor list          |
   | refid     | refid      | reference ID           |
   | reftime   | reftime    | reference time         |
   | NMIN      | 3          | minimum survivors      |
   | CMIN      | 1          | minimum candidates     |
   +-----------+------------+------------------------+

   Figure 25: System Process Variables

Except for the t, p, offset and jitter variables and the NMIN and CMIN constants, the variables have the same format and interpretation as the peer variables of the same name.
The NMIN and CMIN parameters are used by the selection and cluster algorithms described in the next section.

The t variable is the seconds counter at the last update determined by the clock_update() routine in Appendix A.5.5.4. The p variable is the system peer identifier determined by the cluster() routine in Section 11.2.2. The precision variable has the same format as the packet variable of the same name. The precision is defined as the larger of the resolution and time to read the clock, in log2 units. For instance, the precision of a mains-frequency clock incrementing at 60 Hz is 16 ms, even when the system clock hardware representation is to the nanosecond.

The offset and jitter variables are determined by the combine() routine in Section 11.2.3. These values represent the best and final offset and jitter used to discipline the system clock. Initially, all variables are cleared to zero, then the leap is set to 3 (unsynchronized) and stratum is set to MAXSTRAT (16). Remember that MAXSTRAT is mapped to zero in the transmitted packet.

11.2. System Process Operations

Figure 26 summarizes the system process operations performed by the clock_select() routine. The selection algorithm described in Section 11.2.1 produces a majority clique of presumed correct candidates (truechimers) based on agreement principles. The cluster algorithm described in Section 11.2.2 discards outliers to produce the most accurate survivors. The combine algorithm described in Section 11.2.3 provides the best and final offset for the clock discipline algorithm described in Appendix A.5.5.6. If the selection algorithm cannot produce a majority clique, or if it cannot produce at least CMIN survivors, the system process exits without disciplining the system clock.
If successful, the cluster algorithm selects the statistically best candidate as the system peer and its variables are inherited as the system variables.

                            +-----------------+
                            | clock_select()  |
                            +-----------------+
     ................................|...........
     .                               V          .
     .      yes +---------+ +-----------------+ .
     .       +--| accept? | | scan candidates | .
     .       |  +---------+ |                 | .
     .       V        no |  |                 | .
     .  +---------+      |  |                 | .
     .  | add peer|      |  |                 | .
     .  +----------      |  |                 | .
     .       |           V  |                 | .
     .       +---------->-->|                 | .
     .                      |                 | .
     . Selection Algorithm  +-----------------+ .
     .................................|..........
                                      V
                     no +-------------------+
          +-------------|    survivors?     |
          |             +-------------------+
          |                       | yes
          |                       V
          |             +-------------------+
          |             | Cluster Algorithm |
          |             +-------------------+
          |                       |
          |                       V
          V         yes +-------------------+
          |<------------|     n < CMIN?     |
          |             +-------------------+
          V                       |
   +-----------------+            V no
   |   s.p = NULL    |  +-------------------+
   +-----------------+  |    s.p = v_0.p    |
          |             +-------------------+
          V                       |
   +-----------------+            V
   | return (UNSYNC) |  +-------------------+
   +-----------------+  |   return (SYNC)   |
                        +-------------------+

   Figure 26: clock_select() Routine

11.2.1. Selection Algorithm

Note that the selection and cluster algorithms are described separately, but combined in the code skeleton. The selection algorithm operates to find an intersection interval containing a majority clique of truechimers using Byzantine agreement principles originally proposed by Marzullo [8], but modified to improve accuracy. An overview of the algorithm is given below and in the first half of the clock_select() routine in Appendix A.5.5.1.

First, those servers which are unusable according to the rules of the protocol are detected and discarded by the accept() routine in Appendix A.5.5.3.
Next, a set of tuples (p, type, edge) is generated for the remaining candidates. Here, p is the association identifier and type identifies the upper (+1), middle (0) and lower (-1) endpoints of a correctness interval centered on theta for that candidate. This results in three tuples, lowpoint (p, -1, theta - lambda), midpoint (p, 0, theta) and highpoint (p, +1, theta + lambda), where lambda is the root synchronization distance calculated on each use by the rootdist() routine in Appendix A.5.1.1. The steps of the algorithm are:

   1.  For each of m associations, place three tuples as defined above on the candidate list.

   2.  Sort the tuples on the list by the edge component. Order the lowpoint, midpoint and highpoint of these intervals from lowest to highest. Set the number of falsetickers f = 0.

   3.  Set the number of midpoints d = 0. Set c = 0. Scan from lowest endpoint to highest. Add one to c for every lowpoint, subtract one for every highpoint, add one to d for every midpoint. If c >= m - f, stop; set l = current lowpoint.

   4.  Set c = 0. Scan from highest endpoint to lowest. Add one to c for every highpoint, subtract one for every lowpoint, add one to d for every midpoint. If c >= m - f, stop; set u = current highpoint.

   5.  Is d = f and l < u? If yes, then follow step 5A; else, follow step 5B.

   5A. Success: the intersection interval is [l, u].

   5B. Add one to f. Is f < (m / 2)? If yes, then go to step 3 again. If no, then go to step 6.

   6.  Failure; a majority clique could not be found. There are no suitable candidates to discipline the system clock.

The algorithm is described in detail in Appendix A.5.5.1. Note that it starts with the assumption that there are no falsetickers (f = 0) and attempts to find a nonempty intersection interval containing the midpoints of all correct servers, i.e., truechimers.
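
The numbered steps above can be traced with the following sketch. It follows the steps literally, including the d = f test in step 5, and is not the clock_select() code of the appendix. Two details are assumptions made for illustration: ties between equal edges are broken by sorting lowpoints before midpoints before highpoints, and a scan that never reaches c >= m - f is treated as a failed round. The name select_interval() is hypothetical.

```python
# Sketch only: a literal reading of steps 1-6, not the appendix code.
# candidates is a list of (theta, lambda) pairs, one per association.

def select_interval(candidates):
    """Return the intersection interval (l, u), or None on failure."""
    m = len(candidates)
    edges = []
    for p, (theta, lam) in enumerate(candidates):      # step 1
        edges.append((theta - lam, -1, p))             # lowpoint
        edges.append((theta, 0, p))                    # midpoint
        edges.append((theta + lam, +1, p))             # highpoint
    edges.sort()                                       # step 2
    f = 0
    while True:
        d = 0
        c = 0
        low = None
        for edge, kind, _ in edges:                    # step 3
            if kind == -1:
                c += 1
            elif kind == +1:
                c -= 1
            else:
                d += 1
            if c >= m - f:
                low = edge
                break
        c = 0
        high = None
        for edge, kind, _ in reversed(edges):          # step 4
            if kind == +1:
                c += 1
            elif kind == -1:
                c -= 1
            else:
                d += 1
            if c >= m - f:
                high = edge
                break
        # Step 5; an unfinished scan counts as failure (assumption).
        if low is not None and high is not None and d == f and low < high:
            return (low, high)                         # step 5A
        f += 1                                         # step 5B
        if not f < m / 2:
            return None                                # step 6


# Two servers agree near zero; a third is off by about 10 s.
interval = select_interval([(0.0, 1.0), (0.5, 1.0), (10.0, 1.0)])
```

In this example the first round (f = 0) finds no point covered by all three intervals; the second round (f = 1) yields the interval [-0.5, 1.0], which contains the midpoints of the two agreeing servers and excludes the third.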
If a nonempty interval cannot be found, it increases the number of assumed falsetickers by one and tries again. If a nonempty interval is found and the number of falsetickers is less than the number of truechimers, a majority clique has been found and the midpoint of each truechimer (theta) represents the candidates available to the cluster algorithm.

If a majority clique is not found, or if the number of truechimers is less than CMIN, there are insufficient candidates to discipline the system clock. CMIN defines the minimum number of servers consistent with the correctness requirements. Suspicious operators would set CMIN to ensure multiple redundant servers are available for the algorithms to mitigate properly. However, for historic reasons the default value for CMIN is one.

11.2.2. Cluster Algorithm

The candidates of the majority clique are placed on the survivor list v in the form of tuples (p, theta_p, psi_p, lambda_p), where p is an association identifier, theta_p, psi_p and stratum_p are the current offset, jitter and stratum of association p, respectively, and lambda_p is a merit factor equal to stratum_p * MAXDIST + lambda, where lambda is the root synchronization distance for association p. The list is processed by the cluster algorithm below and the second half of the clock_select() algorithm in Appendix A.5.5.1.

   1.  Let (p, theta_p, psi_p, lambda_p) represent a survivor candidate.

   2.  Sort the candidates by increasing lambda_p. Let n be the number of candidates and NMIN the minimum required number of survivors.

   3.  For each candidate compute the selection jitter psi_s:

                  +-----                       -----+^1/2
                  |        n-1                      |
                  |        ---                      |
                  |   1    \                     2  |
        psi_s  =  | ---- * /  (theta_s - theta_j)   |
                  |  n-1   ---                      |
                  |        j=1                      |
                  +-----                       -----+

   4.  Select psi_max as the candidate with maximum psi_s.

   5.  Select psi_min as the candidate with minimum psi_p.

   6.  Is psi_max < psi_min or n <= NMIN? If yes, follow step 6A; otherwise, follow step 6B.

   6A. Done. The remaining candidates on the survivor list are ranked in the order of preference. The first entry on the list represents the system peer; its variables are used later to update the system variables.

   6B. Delete the outlier candidate with psi_max; reduce n by one and go back to step 3.

The algorithm operates in a series of rounds where each round discards the statistical outlier with maximum selection jitter psi_s. However, if psi_s is less than the minimum peer jitter psi_p, no improvement is possible by discarding outliers. This and the minimum number of survivors represent the terminating conditions of the algorithm. Upon termination, the final value of psi_max is saved as the system selection jitter PSI_s for use later.

11.2.3. Combine Algorithm

The remaining survivors are processed by the clock_combine() routine in Appendix A.5.5.5 to produce the best and final data for the clock discipline algorithm. The clock_combine() routine processes peer offset and jitter statistics to produce the combined system offset THETA and system peer jitter PSI_p, where each server statistic is weighted by the reciprocal of the root synchronization distance and the result normalized.

The combined THETA is passed to the clock_update() routine in Appendix A.5.5.4. The first candidate on the survivor list is nominated as the system peer with identifier p. The system peer jitter PSI_p is a component of the system jitter PSI. It is used along with the selection jitter PSI_s to produce the system jitter:

   PSI = [(PSI_s)^2 + (PSI_p)^2]^1/2

Each time an update is received from the system peer, the clock_update() routine in Appendix A.5.5.4 is called. By rule, an update is discarded if its time of arrival p.t is not strictly later than the last update used s.t. The labels IGNORE, PANIC, ADJ and STEP refer to return codes from the local_clock() routine described in the next section.
IGNORE means the update has been ignored as an outlier. PANIC means the offset is greater than the panic threshold PANICT (1000 s) and SHOULD cause the program to exit with a diagnostic message to the system log. STEP means the offset is less than the panic threshold, but greater than the step threshold STEPT (125 ms). In this case the clock is stepped to the correct offset, but since this means all peer data have been invalidated, all associations MUST be reset and the client begins as at initial start. ADJ means the offset is less than the step threshold and thus a valid update. In this case the system variables are updated from the peer variables as shown in Figure 28.

   +-------------------------------------------+
   | System Variable <-- System Peer Variable  |
   +-------------------------------------------+
   | s.leap      <-- p.leap                    |
   | s.stratum   <-- p.stratum + 1             |
   | s.offset    <-- THETA                     |
   | s.jitter    <-- PSI                       |
   | s.rootdelay <-- p.delta_r + delta         |
   | s.rootdisp  <-- p.epsilon_r + p.epsilon + |
   |                 p.psi + PHI * (s.t - p.t) |
   |                 + |THETA|                 |
   | s.refid     <-- p.refid                   |
   | s.reftime   <-- p.reftime                 |
   | s.t         <-- p.t                       |
   +-------------------------------------------+

   Figure 28: System Variables Update

There is an important detail not shown. The dispersion increment (p.epsilon + p.psi + PHI * (s.t - p.t) + |THETA|) is bounded from below by MINDISP. In subnets with very fast processors and networks and very small delay and dispersion this forces a monotone-definite increase in s.rootdisp (EPSILON), which avoids loops between peers operating at the same stratum.

The system variables are available to dependent application programs as nominal performance statistics. The system offset THETA is the clock offset relative to the available synchronization sources. The system jitter PSI is an estimate of the error in determining this value, elsewhere called the expected error.
The root delay DELTA is the total round trip delay relative to the primary server. The root dispersion EPSILON is the dispersion accumulated over the network from the primary server. Finally, the root synchronization distance is defined as

   LAMBDA = EPSILON + DELTA / 2,

which represents the maximum error due to all causes.

11.3. Clock Discipline Algorithm

The NTPv4 clock discipline algorithm, shortened to discipline in the following, functions as a combination of two philosophically quite different feedback control systems. In a phase-locked loop (PLL) design, periodic phase updates at update intervals mu seconds are used directly to minimize the time error and indirectly the frequency error. In a frequency-locked loop (FLL) design, periodic frequency updates at intervals mu are used directly to minimize the frequency error and indirectly the time error. As shown in [5], a PLL usually works better when network jitter dominates, while a FLL works better when oscillator wander dominates. This section contains an outline of how the NTPv4 design works. An in-depth discussion of the design principles is provided in [5], which also includes a performance analysis.

The discipline is implemented as the feedback control system shown in Figure 29. The variable theta_r represents the combine algorithm offset (reference phase) and theta_c the VFO offset (control phase). Each update produces a signal V_d representing the instantaneous phase difference theta_r - theta_c. The clock filter for each server functions as a tapped delay line, with the output taken at the tap selected by the clock filter algorithm. The selection, cluster and combine algorithms combine the data from multiple filters to produce the signal V_s.
The loop filter, with impulse response F(t), produces the signal V_c which controls the VFO frequency omega_c and thus the integral of the phase theta_c which closes the loop. The V_c signal is generated by the clock adjust process in Section 12. The detailed equations that implement these functions are best presented in the routines of Appendix A.5.5.6 and Appendix A.5.6.1.

      theta_r + +---------\        +----------------+
    NTP --------->|  Phase    \  V_d  |                | V_s
      theta_c - | Detector  ------>|  Clock Filter  |----+
      +-------->|          /       |                |    |
      |         +---------/        +----------------+    |
      |                                                  |
    -----                                                |
   /     \                                               |
   | VFO |                                               |
   \     /                                               |
    -----    .......................................     |
      ^      .            Loop Filter              .     |
      |      . +---------+   x  +-------------+    .     |
      | V_c  . |         |<-----|             |    .     |
      +------.-|  Clock  |   y  | Phase/Freq  |<---------+
             . | Adjust  |<-----| Prediction  |    .
             . |         |      |             |    .
             . +---------+      +-------------+    .
             .......................................

   Figure 29: Clock Discipline Feedback Loop

Ordinarily, the pseudo-linear feedback loop described above operates to discipline the system clock. However, there are cases where a nonlinear algorithm offers considerable improvement. One case is when the discipline starts without knowledge of the intrinsic clock frequency. The pseudo-linear loop takes several hours to develop an accurate measurement and during most of that time the poll interval cannot be increased. The nonlinear loop described below does this in 15 minutes. Another case is when occasional bursts of large jitter are present due to congested network links. The state machine described below resists error bursts lasting less than 15 minutes.

Figure 30 summarizes the variables and parameters, listing the name of each variable (lower case) or parameter (upper case), its formula name and a short description. Unless noted otherwise, all variables have assumed prefix c.
The variables t, tc, state, hyster and count are integers; the remaining variables are double-precision floating point. The function of each will be explained in the algorithm descriptions below.

+--------+------------+--------------------------+
| Name   | Formula    | Description              |
+--------+------------+--------------------------+
| t      | timer      | seconds counter          |
| offset | theta      | combined offset          |
| resid  | theta_r    | residual offset          |
| freq   | phi        | clock frequency          |
| jitter | psi        | clock offset jitter      |
| wander | omega      | clock frequency wander   |
| tc     | tau        | time constant (log2)     |
| state  | state      | state                    |
| adj    | adj        | frequency adjustment     |
| hyster | hyster     | hysteresis counter       |
| STEPT  | 125        | step threshold (.125 s)  |
| WATCH  | 900        | stepout threshold (s)    |
| PANICT | 1000       | panic threshold (1000 s) |
| LIMIT  | 30         | hysteresis limit         |
| PGATE  | 4          | hysteresis gate          |
| TC     | 16         | time constant scale      |
| AVG    | 8          | averaging constant       |
+--------+------------+--------------------------+

   Figure 30: Clock Discipline Variables and Parameters

The discipline is implemented by the local_clock() routine, which is called from the clock_update() routine. The local_clock() routine in Appendix A.5.5.6 has two parts; the first implements the clock state machine and the second determines the time constant and thus the poll interval.

The local_clock() routine exits immediately if the offset is greater than the panic threshold PANICT (1000 s). The state transition function is implemented by the rstclock() function in Appendix A.5.5.7. Figure 31 shows the state transition function used by this routine. It has four columns showing, respectively, the state name, the predicate and actions if the offset theta is less than the step threshold, the predicate and actions otherwise, and finally some comments.
+-------+---------------------+-------------------+--------------+
| State | theta < STEPT       | theta > STEPT     | Comments     |
+-------+---------------------+-------------------+--------------+
| NSET  | ->FREQ              | ->FREQ            | no frequency |
|       | adjust time         | step time         | file         |
+-------+---------------------+-------------------+--------------+
| FSET  | ->SYNC              | ->SYNC            | frequency    |
|       | adjust time         | step time         | file         |
+-------+---------------------+-------------------+--------------+
| SPIK  | ->SYNC              | if < 900 s ->SPIK | outlier      |
|       | adjust freq         | else ->SYNC       | detected     |
|       | adjust time         | step freq         |              |
|       |                     | step time         |              |
+-------+---------------------+-------------------+--------------+
| FREQ  | if < 900 s ->FREQ   | if < 900 s ->FREQ | initial      |
|       | else ->SYNC         | else ->SYNC       | frequency    |
|       | step freq           | step freq         |              |
|       | adjust time         | adjust time       |              |
+-------+---------------------+-------------------+--------------+
| SYNC  | ->SYNC              | if < 900 s ->SPIK | normal       |
|       | adjust freq         | else ->SYNC       | operation    |
|       | adjust time         | step freq         |              |
|       |                     | step time         |              |
+-------+---------------------+-------------------+--------------+

               Figure 31: State Transition Function

In the table entries, the next state is identified by the arrow ->, with the actions listed below it. Actions such as adjust time and adjust frequency are implemented by the PLL/FLL feedback loop in the local_clock() routine. A step clock action is implemented by setting the clock directly, but this is done only after the stepout threshold WATCH (900 s) has elapsed with the offset greater than the step threshold STEPT (.125 s). This resists clock steps under conditions of extreme network congestion.

The jitter (psi) and wander (omega) statistics are computed using an exponential average with weight factor AVG. The time constant exponent (tau) is determined by comparing psi with the magnitude of the current offset theta.
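
As an informal illustration only, the transition function of Figure 31 can be sketched in C as follows. This is not the rstclock() or local_clock() code of Appendix A.5.5.6 and Appendix A.5.5.7. The names next_transition, struct transition and the action flags are hypothetical. The sketch assumes that the magnitude of theta is compared with the step threshold, that elapsed is the interval in seconds that the table compares with 900 s (how it is measured is left to the caller), and that the actions listed in a conditional entry apply only on the else branch.

```c
#include <assert.h>

#define STEPT 0.125  /* step threshold (s), from Figure 30 */
#define WATCH 900    /* stepout threshold (s), from Figure 30 */

enum state { NSET, FSET, SPIK, FREQ, SYNC };

/* Hypothetical action flags; a transition may carry several. */
enum action {
    ACT_NONE    = 0,
    ADJUST_TIME = 1 << 0,
    STEP_TIME   = 1 << 1,
    ADJUST_FREQ = 1 << 2,
    STEP_FREQ   = 1 << 3
};

struct transition {
    enum state next;   /* next state (the "->" entry in the table) */
    int actions;       /* bitwise OR of enum action values */
};

/*
 * Look up one cell of the table in Figure 31.
 * theta is the offset in seconds; elapsed is the interval in seconds
 * that the table compares with 900 s (an assumption of this sketch).
 */
struct transition next_transition(enum state cur, double theta, long elapsed)
{
    double mag = theta < 0 ? -theta : theta;
    int big = mag > STEPT;          /* right-hand column of the table */
    int waited = elapsed >= WATCH;  /* the "else" branch of "if < 900 s" */
    struct transition t = { cur, ACT_NONE };

    assert(elapsed >= 0);

    switch (cur) {
    case NSET:                      /* no frequency file */
        t.next = FREQ;
        t.actions = big ? STEP_TIME : ADJUST_TIME;
        break;
    case FSET:                      /* frequency file present */
        t.next = SYNC;
        t.actions = big ? STEP_TIME : ADJUST_TIME;
        break;
    case FREQ:                      /* initial frequency measurement */
        if (waited) {
            t.next = SYNC;
            t.actions = STEP_FREQ | ADJUST_TIME;
        }
        break;
    case SPIK:                      /* outlier detected */
    case SYNC:                      /* normal operation */
        if (!big) {
            t.next = SYNC;
            t.actions = ADJUST_FREQ | ADJUST_TIME;
        } else if (waited) {
            t.next = SYNC;
            t.actions = STEP_FREQ | STEP_TIME;
        } else {
            t.next = SPIK;          /* hold off; no action yet */
        }
        break;
    }
    return t;
}
```

Under these assumptions, next_transition(SYNC, 0.5, 100) yields the next state SPIK with no actions, while the same offset with elapsed equal to 900 yields SYNC with the actions STEP_FREQ and STEP_TIME.
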
If the offset is greater than PGATE (4) times the clock jitter, the hysteresis counter hyster is reduced by two; otherwise, it is increased by one. If hyster increases to the upper limit LIMIT (30), tau is increased by one; if it decreases to the lower limit -LIMIT (-30), tau is decreased by one. Normally, tau hovers near MAXPOLL, but it quickly decreases if a temperature spike causes a frequency surge.

12. Clock Adjust Process

The actual clock adjustment is performed by the clock_adjust() routine in Appendix A.5.6.1. It runs at one-second intervals to add the frequency correction and a fixed percentage of the residual offset theta_r. The residual theta_r is in effect the exponential decay of the theta value produced by the loop filter at each update. The TC parameter scales the time constant to match the poll interval for convenience. Note that the dispersion EPSILON increases by PHI at each second.

The clock adjust process includes a timer interrupt facility driving the seconds counter c.t. The counter begins at zero when the service starts and increments once each second. At each interrupt, the clock_adjust() routine is called to incorporate the clock discipline time and frequency adjustments; the associations are then scanned to determine whether the seconds counter equals or exceeds the p.next state variable defined in the next section. If so, the poll process is called to send a packet and compute the next p.next value.

13. Poll Process

Each association supports a poll process that runs at regular intervals to construct and send packets in symmetric, client and broadcast server associations. It runs continuously, whether or not servers are reachable, in order to manage the clock filter and reach register.

13.1. Poll Process Variables

Figure 32 summarizes the common names, formula names and a short description of the poll process variables (lower case) and parameters (upper case).
Unless noted otherwise, all variables have assumed prefix p.

+---------+---------+--------------------+
| Name    | Formula | Description        |
+---------+---------+--------------------+
| hpoll   | hpoll   | host poll exponent |
| last    | last    | last poll time     |
| next    | next    | next poll time     |
| reach   | reach   | reach register     |
| unreach | unreach | unreach counter    |
| UNREACH | 24      | unreach limit      |
| BCOUNT  | 8       | burst count        |
| BURST   | flag    | burst enable       |
| IBURST  | flag    | iburst enable      |
+---------+---------+--------------------+

   Figure 32: Poll Process Variables and Parameters

The poll process variables are allocated in the association data structure along with the peer process variables. Following is a detailed description of the variables. The parameters will be called out in the following text.

hpoll: signed integer representing the poll exponent, in log2 seconds

last: integer representing the seconds counter when the m