NTP Interleaved On-Wire Protocol

gif

from Pogo, Walt Kelly

If it walks like a duck...

Introduction

The NTP on-wire protocol is the core mechanism to exchange time values between servers, peers and clients. It is inherently resistant to lost, duplicate or bogus packets. Data integrity is provided by the IP and UDP checksums. No flow-control or retransmission facilities are provided or necessary. The protocol uses timestamps, either extracted from packet headers or captured from the system clock upon the arrival or departure of a packet. This document describes the current basic on-wire protocol and proposes a new interleaved protocol capable of higher accuracy with hardware or driver assist.

The protocol operates in three modes, basic, interleaved symmetric and interleaved broadcast. Basic mode supports all current NTP modes, including client/server, symmetric and broadcast. Interleaved symmetric mode is an extension of basic symmetric mode, while interleaved broadcast mode is an extension of basic broadcast mode. The interleaved modes require servers and peers to retain state, so these modes do not support NTP client/server mode. Mode selection can be determined by configuration option or automatically by the protocol. In all modes the protocol can detect lost, duplicate and bogus packets and, in addition, lost, duplicate or bogus packets that violate the interleaving rules.

The interleaved protocol described in this document has been implemented and tested in the NTP reference implementation. In its present form only software timestamps (softstamps) are used, with the advantage that the actual transmit timestamp is captured upon return from the I/O routine and more accurately represents the on-wire hardware timestamp (hardstamp). Future plans include using driver timestamps or NIC timestamps as the opportunity arises.

The plan for this document is to first describe the protocol state machine, including the header variables and state variables. Pseudo-code is contained in a companion briefing document NTP Interleaved Protcocol for LANs and Space Data Links. Then follows a description of each mode along with an example of operation in a typical scenario. Finally is an example showing how the protocol responds to mode conflicts and automatically reconfigures.

Protocol State Machine

The protocol state machine consists of a transmit process, which is called for every transmitted packet, and a receive process, which is called for every received packet. Detailed example code segments are shown in the NTP Interleaved Protcocol for LANs and Space Data Links briefing. There are three header variables torg, trec and txmt, which are included in the NTP packet, and five state variables rec, dst, aorg, borg and xmt for each peer along with two variables b and x described in the code segments. The timestamps T1, T2, T3 and T4 are the products of the protocol used to compute the offset and delay, as described in the protocol specification. The state machine operates in three substates corresponding to basic, interleaved symmetric and interleaved broadcast modes.

Offset and Delay Calculations

The four timestamps, T1 through T4, are used to compute the offset of B relative to A

q = T(B) - T(A) = 1/2 [(T2 - T1) + (T3 - T4)]

and the roundtrip delay

d = T(ABA) = (T4 - T1) - (T3 - T2).

Note that the quantities within parentheses are computed from 64-bit unsigned timestamps and result in signed values with 63 significant bits plus sign. These values can represent dates from 68 years in the past to 68 years in the future. However, the offset and delay are computed as the sum and difference of these values, which contain 62 significant bits and two sign bits, so can represent unambiguous values from 34 years in the past to 34 years in the future. In other words, the time of the client must be set within 34 years of the server before the service is started. This is a fundamental limitation with 64-bit integer arithmetic.

In implementations where floating double arithmetic is available, the first-order differences can be converted to floating double and the second-order sums and differences computed in that arithmetic. Since the second-order terms are typically very small relative to the timestamps themselves, there is no loss in significance, yet the unambiguous range is increased from 34 years to 68 years.

Basic Mode

Basic mode represents the current NTP specification and implementation. The following example illustrates typical operation in symmetric modes starting from an unsynchronized condition.

gif

The above figure illustrates the most general case where symmetric peers A and B independently measure the offset and delay relative to the other. Each timestamp corresponds to the label at the head or tail of the arrows along the direction of transmission. The blue boxes hold timestamps captured from the system clock. The other boxes hold values copied from them, other state variables or packet headers.

The values shown are after arrival or departure of a packet. When a packet arrives, txmt is copied to rec and the receive timestamp to dst. When a packet departs, rec is copied to torg, dst to trec and the transmit timestamp to org and txmt. After the state variables have been updated, the timestamps are T1 = torg T2 = trec, T3 = txmt and T4 = dst.

In the following descriptions the packet contents are given in the form (torg, trec, txmt). To begin the round at t1, peer A sends (0, 0, t1) to B. Later, B sends (t1, t2, t3) to A. At t4 the header timestamps t1, t2, t3 and dst variable t4 are available to compute the offset and delay of B relative to A. In a similar fashion at t6, t3, t4, t5 and t6 are available to compute the offset and delay of A relative to B. The protocol is said to be synchronized when both A and B have computed the offset and delay; that is, at t6.

In symmetric modes each peer independently polls the other peer, but not necessarily at identical intervals. Thus, one or the other peer might receive none, one or more than one packet between polls. This can result in lost or duplicate packets and even cases where packets overlap each other in flight resulting in out of sequence or bogus packets. In addition, provisions must be made to reset and restart the protocol, if necessary. These details are in the imlementation, but beyond the scope of this discussion.

There are three sanity checks designed to deflect unsynchronized, duplicate or bogus packets. Before the state variables are updated, the packet is discarded as a duplicate if txmt matches rec from the last packet received. After the state variables have been updated, the packet is discarded as unsynchronized if either T1 or T2 are zero. The packet is discarded as bogus if T1 does not match org. Finally, although not shown in the figure, org is set to zero in order to deflect a replay of the first packet to the other peer. Thus, the round is protected against replays of both the first and second packets.

The protocol is inherently resistant to lost packets and overlapped packets. For instance, if packet t3-t4 is lost, the next set of timestamps t5, t6, t7 and t8, are available to compute offset and delay. If packets t3-t4 and t5-t6 overlap, peer A will discard the latter as bogus and use the next set of timestamps as before.

Interleaved Symmetric Mode

In the protocol described so far the transmit softstamp is captured before the MD5 digest is computed and the packet is sent, while the receive softstamp is captured after the packet is received. This is the one-step protocol currently specified and implemented. For enhanced accuracy it is desirable to capture the corresponding hardstamps as close to the wire as possible; i.e., with hardware assist or with a modified driver.

The problem is, while the receive hardstamp can be piggybacked in the receive buffer, the transmit hardstamp cannot ordinarily be transmitted in the same packet. In the designs to follow the transmit softstamp is captured before the packet is sent and overridden later by the corresponding hardstamp.

A solution for this is the two-step or interleaved protocol described in this section. In this variant the transmit hardstamp for one packet is actually transmitted in the immediately following packet. The trick, however, is to implement the interleaved protocol without changing the NTP packet header format, without compromising backwards compatibility and without compromising the error recovery properties. Following is a typical example of operation starting from an unsynchronized condition.

gif

The above figure illustrates a typical scenario involving peers A and B. Note that the receive (even-numbered) hardstamps are available immediately after the packet has been received, but the transmit (odd-numbered) hardstamps are available only after the packet has been sent, so are sent in the next following packet.

In contrast to the basic protocol, which requires one complete round to calculate offset and delay, the interleaved round requires two basic rounds. The round that begins at t1 is not complete until t8 and the round that begins at t5 is not complete until t12. However, the rate of offset/delay calculations is the same as the basic protocol. The packet header fields are the same in the interleaved protocol as in the basic protocol, but carry different values.

Each peer requires the five state variables, rec, dst, aorg, borg and xmt. In addition, there is an x bit which alternates between +1 and -1 for each packet sent. It is convenient to interpret the value x = 0 as an instance of the basic mode protocol when both basic and interleaved modes are available.

When a packet is received, trec is copied to rec and the receive hardstamp to dst. When a packet is transmitted, rec is copied to torg and dst to trec. If x = +1, the transmit hardstamp is copied to aorg and borg is copied to txmt. If x = -1, the transmit hardstamp is copied to borg and aorg is copied to txmt. Upon receipt and before the state variables have been updated, the timestamps are T2 = rec, T3 = txmt, and T4 = dst. If x = +1, T1 = aorg; if x = -1, T1 = borg. After this, trec is copied to rec and the receive hardstamp to dst.

There are three sanity checks designed to deflect duplicate, unsynchronized or bogus packets. While not shown in the figure, the xmt state variable is used only to detect duplicate packets. Before the state variables are updated, the packet is discarded as a duplicate if txmt matches xmt from the last packet received. Otherwise, txmt is copied to xmt. After the state variables and timestamps have been updated, the packet is discarded as unsynchronized if either T1, T2 or T3 are zero. The packet is discarded as bogus if torg is nonzero and does not match T4.

Note that the difference between the transmit hardstamp T4 and corresponding softstamp aorg represents the output MD5 digest and queueing delay. If this quantity is less than zero or greater than the maximum roundtrip delay, the packet is misordered and should be discarded. This is called the delay test; in practice, it has proved a very reliable means to detect lost or crossed packets.

In the above figure t8 is the end of the interleaved round that begins at t1. Before the state variables have been updated, the four timestamps t1, t2, t3 and t4 are available to determine the offset and delay of B relative to A. The reader can verify the four timestamps t3, t4, t5 and t6 are available at t10 to determine the offset and delay of A relative to B.

Interleaved Broadcast Mode

The protocols described so far are limited to symmetric modes where state can be preserved between arriving and departing packets. The client/server mode is inappropriate for the interleaved modes, since server state cannot be preserved. However, it is possible to provide hardware or driver timestamping in broadcast mode, much as in IEEE 1588 Precision Time Protocol (PTP) in multicast mode.

A PTP master broadcasts a Sync message to the clients (aka slaves), which capture the receive timestamp T2. Immediately thereafter the master broadcasts a Follow_up message including the transmit timestamp T1. Some time later each client separately sends a Delay_req message to the master and captures the transmit timestamp T3. The master returns a Delay_resp message containing the receive timestamp T4 for the Delay_req message. The client collects these timestamps to calculate the offset and delay as in the NTP protocol with the offset sign inverted.

The principal difference between the interleaved broadcast protocol and PTP is that the broadcast transmit timestamp T1 is actually sent in the following broadcast packet and the remaining timestamps are obtained by the same stateless protocol used in NTP client/server modes, although slightly modified.

gif

The figure above shows an instance where the clock is updated as each broadcast packet arrives. The protocol uses the same packet header format as the basic protocol, but includes the transmit softstamp sent in the packet followed by the corresponding hardstamp sent in the following packet. To emphasize this, the softstamps in the figure are starred. Softstamps are not used in the interleaved protocol; they are included to support both basic and interleaved modes with the same packet stream.

In the current NTP specification and implementation the torg and trec header fields are unused in broadcast mode and ordinarily set to zero. Clients that support interleaved broadcast mode will note that torg is nonzero and in that case set state variable b = 1. The broadcast client now executes a stateless client/server mode exchange ending at t8. Note that, if b = 1 trec is copied to rec for use when the next broadcast packet arrives.

Upon the arrival of the next broadcast packet at t12, the timestamps T1 = torg, T2 = borg, T3 = aorg, and T4 = dst can be used to calculated the offset and delay in the usual way. However, note that the sign of the offset is reversed. If the stateless exchange is not repeated, the client keeps the delay to use as later broadcast packets arrive.

Note that the difference between the transmit hardstamp T4 and corresponding softstamp rec (before rec is updated) represents the output MD5 digest and queueing delay. As in the interleaved symmetric delay test, if this quantity is less than zero or greater than the maximum roundtrip delay, the packet is misordered and should be discarded.

Error Detection and Recovery

Recovery from a lost, duplicate or misordered packet can be tricky in the interleaved modes. In interleaved symmetric mode a single lost or duplicate packet is rejected by the various sanity tests, while in either interleaved mode misordered packets are rejected by the delay test.

In the scheme of things, error packets are rare, especially in fast LANs where propagation delays are very small and poll intervals are in the order of many seconds. Tests using the current reference implementation with simulated packet drop probability 0.1 in either interleaved mode show occasional bogus and delay rejections, but no offset or delay errors.

However, there is one instance where this can be useful. It occurs when a peer is initially configured to operate in interleaved mode and is presented with a client capable only of basic mode. This is illustrated in the figure below.

gif

In this figure A is the client and B the server. The client begins operation in basic mode as usual. The server stumbles along until failing the bogus test at t10. As per protocol, it resets and begins operating in basic mode. After synchronizing again, both client and server assume normal operation. A similar case occurs if the server is started in basic mode and the client starts in interleaved mode.

Reference Implementation and Scenarios

As previously mentioned, the interleaved modes have been implemented and tested in the reference implementation. However, a code surveyor might find it difficult to follow the code flow, as it is intertwined with authentication and rate management systems. There is, however, a simple test program called lev.c which can be used to simulate all modes described in this document, as well as common error scenarios.

Currently, the reference implementation uses only software timestamps (softstamps). The receive softstamp is captured at software interrupt time and before the buffer is queued for later processing. The transmit softstamp is captured upon return from the send-packet routine. Both softstamps are probably as accurate as possible without driver or hardware intervention.

The reference implementation captures a softstamp before the message digest routine and another after the send-packet routine. The difference, called the interleaved or output delay, varies from 16 ms for a dual-core, 2.8 GHz Pentium 4 running FreeBSD 6.1 to 1100 ms for a Sun Blade 1500 running Solaris 10. In the following scenarios the network is a switched Ethernet operating at 100 Mb and the NTP packet is about 1000 bits or 10 ms.

On two identical Pentium 4 machines in symmetric mode, the measured output delay is 16 ms and remaining one-way delay components 45-150 ms. Two LAN segments account for 20 ms, which leaves 25-130 ms for input delay. The RMS jitter is 30-50 ms measured by the clock filter.

On two identical UltraSPARC machines running Solaris 10 in symmetric mode, the measured output delay is 160 ms and remaining one-way delay components 195 ms. Two LAN segments account for 20 ms, which leaves 175 ms for input delay. The RMS jitter is 40-60 ms measured by the clock filter.

A natural conclusion is that the output delay is in general quite stable and most of the jitter is contributed by the network and input delay. Therefore, a preliminary design might include provisions only for the receive drivestamp or hardstamp.

Performance Improvements

There are two ways to improve accuracy - driver timestamping and hardware timestamping. In driver timestamping the operating system device driver is augmented to capture driver timestamps (drivestamps) as close to the hardware as possible. If activated by a socket option, receive drivestamps can be saved in the buffer structure for later retrieval. Transmit drivestamps can be returned to the daemon in a socket option or as a pseudo-input packet as if received from the remote server or peer. There is a lot more that needs to said about retrieving transmit drivestamps, but these remain to be resolved.

The following example is from FreeBSD, but other systems may have a similar structure. Perhaps the most convenient way to implement drivestamps is using the ifnet or if_data structure. A logical place to capture drivestamps may be in the ether_input() and ether_output() routines in if_ethersubr() source code file. While possibly not as accurate as in the driver itself, it has the advantage that the scheme is driver independent.

In the simplest design, the only use for drivestamps or hardstamps is to calculate the amount of time to add to the softstamp to determine the hardstamp, so it doesn't matter whether the timestamp is represented in Unix, NTP or IEEE 1588 format. The simplest way might be to read the TSC/PCC and let the NTP deamon figure it out. What the daemon can do is read the TSC/PCC at softstamp time and use the drivestamp to calculate the actual hardstamp. This requires adjusting for the TSC/PCC rate, which is availble somewhere in the kernel. A similar thing can be done on input.

Hardware timestamping will most likely require intervention in the driver itself. IEEE 1588 NIC designs known to me have provisions to arm an input or output timestamp capture and hold the catured value in a register retrieved in a separate operation. In addition, there are provisions to adjust the frequence of the NIC clock, which is usually a TCXO and adjustable overflow counter. In driver designs known to me the captured values can be retrieved and the frequency adjusted using ioctls. One design uses the AMD PCNET chip and modified PHY using a FPGA.

The FPGA appears as a separate device from the Ethernet device. The user hangs a read on the FPGA device and waits for an interrupt before reading the hardstamp. The Ethernet driver interrupt routine is modified to test for a IEEE 1588 packet and when found triggers the FPGA interrupt routine. In priciple, the FreeBSD pcn driver could be modified to do the same thing.