David L. Mills and Stuart Venters
Introduction
Over the 29 years NTP and its predecessors have evolved, accuracy expectations have improved from 100 ms to less than 1 ms on fast LANs with multiple segments interconnected by switches and less than a few milliseconds on most campus and corporate networks with multiple LANs interconnected by routers. However, the ultimate accuracy limit for typical workstations and PCs of today is a few microseconds with a GPS receiver and PPS signal.
Improving the expected NTP accuracy to this level and beyond on LANs will require improved hardware and software technology. In principle the ultimate accuracy is limited only by the 232-ps resolution of the NTP timestamp format, or about the time light takes to travel three inches.
One of the main strengths of NTP is that it has been, and will continue to be, implemented in a wide variety of scenarios with significantly different available resources and accuracy requirements. Furthermore, timekeeping is not an exact science. No implementation provides perfect time. As such, this white paper proposes a timestamp scheme which is not an implementation requirement, but rather a specification for a model in which specific schemes can interoperate without significant loss of accuracy.
Requirements and Principles
In general, and even with faster hardware and network links, improved accuracy can only be achieved with some kind of hardware assist or special provisions in the network interface card (NIC) or device driver. To better understand the issues, consider the ultimate case where the server and client implement clocks that can be read with exquisite accuracy. The object is to measure the time offset of a server relative to the client.
Figure 1
As shown in Figure 1, the NTP on-wire specification uses what are called here the reference timestamps T1, T2, T3 and T4, respectively called the originate, receive, transmit and destination timestamps. T1 and T4 are struck by peer A from its clock, while T2 and T3 are struck by peer B from its clock. The object of the protocol is to determine the time offset of B relative to A and the roundtrip delay:
offset θ = [(T2 - T1) + (T3 - T4)] / 2, (1)
delay δ = (T4 - T1) - (T3 - T2). (2)
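As a minimal sketch of equations (1) and (2), the following computes the offset and roundtrip delay from the four timestamps. The function name and the sample values (in microseconds) are hypothetical and chosen only for illustration: B's clock is assumed to be 5000 units ahead of A's, each direction takes 1000 units, and B holds the packet for 2000 units.

```python
def offset_and_delay(t1, t2, t3, t4):
    """Return (offset, delay) per equations (1) and (2).

    t1 and t4 are read from A's clock; t2 and t3 are read from B's clock.
    All four values must use the same time unit.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


# Hypothetical exchange, all values in microseconds.
t1 = 0                    # A sends
t2 = t1 + 1000 + 5000     # B receives: path time plus B's clock lead
t3 = t2 + 2000            # B replies after holding the packet
t4 = t3 - 5000 + 1000     # A receives: back on A's clock, plus path time

offset, delay = offset_and_delay(t1, t2, t3, t4)
print(offset, delay)
```

Note that the time B holds the packet drops out of the delay, which is why (2) subtracts T3 - T2.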
The precision to which the offset and delay can be calculated depends on the precision to which the timestamps can be struck. In general, it is best to strike the timestamps as close to the physical media as possible, in order to avoid various queuing and buffering latencies. The software timestamping scheme used in the current NTP reference implementation attempts to approximate the reference timestamps as follows:
- T1 and T3 are struck by the output packet routine just before handing the packet to the operating system, driver and NIC. Applicable latencies include output queuing, kernel buffering, NIC buffering and possible retransmissions.
- T2 and T4 are struck by the driver at interrupt time. Applicable latencies include NIC buffering, interrupt response time and operating system scheduling.
If these latencies can be avoided, the remaining errors are due only to network delays for bit propagation time, packet transmission time and queuing latencies. Inspection of (1) shows that the best accuracy is obtained when the delays on the outbound path T1 → T2 and inbound path T3 → T4 are statistically identical; in this case we say the delays are reciprocal. Further refinement shows that, if the reciprocal delays differ by x seconds, the resulting offset error is x/2 seconds.
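The x/2 result can be checked numerically. The sketch below builds the four timestamps from an assumed true offset and assumed one-way delays, then applies (1); the function name and all values are hypothetical.

```python
def measured_offset(true_offset, outbound, inbound, hold=0):
    """Offset computed by (1) when B leads A by true_offset.

    outbound is the one-way delay from A to B, inbound the delay from B to A,
    and hold is the time B keeps the packet before replying.
    """
    t1 = 0
    t2 = t1 + outbound + true_offset   # read from B's clock
    t3 = t2 + hold
    t4 = t3 - true_offset + inbound    # read from A's clock
    return ((t2 - t1) + (t3 - t4)) / 2


# Reciprocal paths: the measured offset equals the true offset (zero here).
balanced = measured_offset(0, 100, 100)

# Outbound path 50 units slower than inbound: the error is 50 / 2 = 25.
skewed = measured_offset(0, 150, 100)

print(balanced, skewed)
```

In this sketch the error takes the sign of the outbound delay minus the inbound delay, and is independent of both the true offset and the hold time.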
Lack of reciprocity has two causes: different data rates on the outbound and inbound paths, and different points in the packet at which the timestamps are struck. If the overall data rate is not the same in both directions, errors can result. If the transmitter and receiver choose different points in the packet to strike a timestamp, errors can result. What happens depends on the hardware and software design, as described below.
There are many workable schemes to implement timestamp capture. Using a different scheme at each end of the link is likely to result in a delay mismatch. The following provisions apply:
- The capture location should be relative to the physical media at some point in the frame that can be recorded by current hardware, firmware and software designs.
- A preamble timestamp is struck as near to the start of the frame as possible. For Ethernet systems the preferred point follows the last bit of the preamble and SOF octet and before the first bit of the data.
- A trailer timestamp is struck as near to the end of the frame as possible. On transmit this follows the last bit of the data and before the checksum; on receive this follows the last bit of the checksum.
- In addition to the timestamps, the NIC and/or driver must provide both the nominal transmission rate and length of the frame between the preamble and trailer timestamps. These can be used by the driver or application to transpose between the preamble and trailer timestamps without significant loss of accuracy. The transposition error with an acceptable frequency tolerance of 300 PPM for 100-Mb Ethernets and a nominal 1000-bit NTP packet is no more than 3 ns.
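The transposition described in the last provision can be sketched as follows. The function names are hypothetical; the sketch assumes the driver reports the frame length in bits between the two capture points and the nominal rate in bits per second.

```python
def frame_time_ns(length_bits, rate_bps):
    """Nominal time between the preamble and trailer capture points, in ns."""
    return length_bits * 1e9 / rate_bps


def trailer_to_preamble(trailer_ns, length_bits, rate_bps):
    """Transpose a trailer timestamp back to the preamble capture point."""
    return trailer_ns - frame_time_ns(length_bits, rate_bps)


def preamble_to_trailer(preamble_ns, length_bits, rate_bps):
    """Transpose a preamble timestamp forward to the trailer capture point."""
    return preamble_ns + frame_time_ns(length_bits, rate_bps)


def transposition_error_ns(length_bits, rate_bps, tolerance_ppm):
    """Worst-case error from using the nominal rate instead of the true rate."""
    return frame_time_ns(length_bits, rate_bps) * tolerance_ppm * 1e-6


# Nominal 1000-bit NTP packet on a 100-Mb Ethernet with 300 PPM tolerance.
nominal = frame_time_ns(1000, 100_000_000)               # 10 microseconds
worst = transposition_error_ns(1000, 100_000_000, 300)   # about 3 ns
print(nominal, worst)
```

The worst-case figure is simply the frame time scaled by the rate tolerance, so it grows with longer frames and slower links.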
As shown later in this document, the best way to preserve accuracy when single or multiple network segments are involved, some possibly operating at different speeds, is the following:
- The propagation delay measured from the first bit sent in a packet to the first bit received on each direction of transmission must be the same. This is called the reciprocity rule.
- T1 and T3 must be struck from the preamble timestamp.
- T2 and T4 must be struck from the trailer timestamp.
Whatever timestamping scheme is developed, it should allow interworking between schemes so that every combination of schemes used by the server and client results in the highest accuracy possible. As shown later, this can be achieved only using the above rules.
Timestamp Transposition
With the above requirements in mind, it is possible to select either the preamble or trailer timestamp at either the transmitter or receiver and to transpose so that both select the same reference point. The natural choice is the preamble timestamp, as this is consistent with IEEE 1588 and likely to be supported by available hardware. With and without transparent bridges or boundary clocks, accuracies with NTP will be of the same order as 1588. However, as shown below, this does not work if there is a bridge or router between the server and client and both operate at different rates.
With driver timestamps the transmitter must transpose the trailer timestamp to the preamble timestamp according to the respective data rate and packet length. With hardware timestamps the receiver must transpose the preamble timestamp to the trailer timestamp.
Example
An NTP packet (about 1000 bits) is 1 μs on a 1000-Mb LAN, 10 μs on a 100-Mb LAN and 650 μs on a T1 line. As shown later, in order to drive the residual NTP offsets down to PPS levels, the reciprocal delays must match within less than 10 μs. If the reciprocal transmission rates and packet lengths are the same to within 10 μs, or one packet time on a 100-Mb LAN, the accuracy goal can be met.
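The packet times quoted above follow directly from the packet length and the nominal line rate. A short sketch, assuming a 1000-bit packet and a nominal T1 rate of 1.544 Mb/s (the function name is hypothetical):

```python
def packet_time_us(length_bits, rate_bps):
    """Time to clock a packet onto the wire, in microseconds."""
    return length_bits * 1e6 / rate_bps


NTP_PACKET_BITS = 1000  # nominal packet length used in the text

gigabit = packet_time_us(NTP_PACKET_BITS, 1_000_000_000)   # 1000-Mb LAN
fast_ethernet = packet_time_us(NTP_PACKET_BITS, 100_000_000)  # 100-Mb LAN
t1_line = packet_time_us(NTP_PACKET_BITS, 1_544_000)       # T1 line

print(gigabit, fast_ethernet, round(t1_line))
```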
As an example of current practice, NTP servers macabre and mort are identical Intel Pentium clones running FreeBSD and synchronized to a GPS receiver. The servers share a lightly loaded 100-Mb switch; the GPS receiver is connected by fiber repeaters and another 100-Mb switch. Each server shows offsets of about 25 μs relative to the other and the GPS receiver. Each server shows roundtrip delays of about 140 μs with the other. Note that only 40 μs (four LAN hops) is due to frame transmission times, with the remaining 100 μs due to buffering in the operating systems and NICs.
One-Step and Two-Step Protocols
With hardware and driver timestamps the frame itself does not always contain the actual preamble or trailer timestamps, since those timestamps are available only after the frame has been sent. This requires a modification of the protocol to include the timestamps in a later packet.
Figure 2
The method used by IEEE 1588 is shown in Figure 2. The master starts by broadcasting a Sync packet, which the slave receives at T2. Later, the master broadcasts a Follow_Up packet containing T1. At T3 the slave sends a Delay_Req packet to the master, which sends a Delay_Resp packet containing T4. The slave calculates the offset and delay as in (1) and (2), but notes the delay has opposite sign. Note that the Sync and Follow_Up packets are sent only once during each round; the Delay_Req/Delay_Resp exchange must be completed for each slave separately.
Figure 3
Figure 3 shows the current NTP one-step on-wire protocol. Each message sent includes two timestamps, as shown in parentheses. A completes two rounds T1 - T4 and T5 - T8, while B completes two rounds T3 - T6 and T7 - T10.
Figure 4
Figure 4 shows the proposed NTP two-step on-wire protocol. Note that the transmit timestamp is sent in the following packet, so that A completes one round T1 - T8, while B completes one round T3 - T10. Note that, during the T1 - T8 round A has initiated a new round at T7 and during the T3 - T10 round B has initiated a new round at T7, so the rounds run concurrently.
While the details remain to be worked out, it is anticipated that the existing one-step protocol and proposed two-step protocol can be automatically detected during regular protocol operations.
Error Analysis
The analysis so far accounts neither for various statistical latencies nor for systematic errors resulting from nonreciprocal paths. The NTP code path delay and the delay to read the system clock are substantially the same in either direction, so they cancel out. As previously noted, the Ethernet transceivers themselves contribute about 100 μs per pair. While these delays are generally constant, various preemptive latencies can occur due to device interrupts, timeslice-end and so forth. In addition, NTP traffic typically competes with other network traffic, causing additional latencies.
The principal remaining terms in the error budget are nonreciprocal delays due to different data rates and nonuniform transposition between the preamble and trailer timestamps. The errors due to such causes are summarized below.
Figure 5
In the following, upper case variables represent the reference timestamps used in (1) and (2). Lower case variables shown in Figure 5 represent the actual timestamps struck by the hardware or software. The on-wire protocol uses the actual timestamps in the same fashion as (1) and (2). The object is to explore the possible errors that might result from different timestamp strategies.
Software Timestamp Errors
In the NTP reference implementation the actual t1 and t3 timestamps are struck just before launching the packet via the operating system, which results in various queuing latencies and buffering delays. In Unix, for example, the packet buffer is copied to internal kernel buffers which then are passed to the NIC. Modern NICs have 16-K FIFOs shared between the transmit and receive sides and separate PHY buffers for each side. The NIC driver manages the FIFOs and buffers to achieve maximum throughput, but might not be sensitive to latencies. We assume t1 = T1 and t3 = T3, but note in passing that the queuing latencies and buffering delays have become components of the delay budget.
On the receive side the t2 and t4 timestamps are struck upon arrival of a frame delayed by the NIC FIFOs and interrupt response time. The NIC driver receives an interrupt to capture the trailer timestamp for the preceding frame, but this might not always be the case. In anticipation of the arrival, the NTP program has allocated a buffer in user space. When a complete frame arrives, the driver copies it to the buffer, strikes a timestamp and saves it in the buffer structure. We assume t2 = T2 + d and t4 = T4 + d, where d represents the frame transmission time.
Figure 6
As shown in Figure 6, the NTP on-wire protocol performs the same calculations as (1) and (2) but using the actual timestamps. After substitution we have
θ = {[(T2 + d) - T1] + [T3 - (T4 + d)]} / 2,
which after simplification is the same as (1). On the other hand,
δ = [(T4 + d) - T1] - [T3 - (T2 + d)],
which results in an additional delay of 2d. We conclude that, if we neglect the latencies summarized above, and as long as the transmission rates on the reciprocal paths are the same and the frame lengths are the same, the offset is as in (1) without dilution of accuracy.
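The substitution can be checked with a few lines of arithmetic. In the sketch below the reference timestamps and the frame time (in microseconds) are hypothetical; the only modelling assumption, taken from the text, is that software timestamps lag the reference timestamps by one frame time on receive and not at all on transmit.

```python
def offset_and_delay(t1, t2, t3, t4):
    """Equations (1) and (2) applied to whatever timestamps are supplied."""
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


# Hypothetical reference timestamps, in microseconds.
T1, T2, T3, T4 = 0, 6000, 8000, 4000
frame_time = 10  # 1000 bits at 100 Mb/s

reference = offset_and_delay(T1, T2, T3, T4)

# Software timestamps: receive timestamps are struck one frame time late.
software = offset_and_delay(T1, T2 + frame_time, T3, T4 + frame_time)

offset_error = software[0] - reference[0]   # expected to vanish
delay_error = software[1] - reference[1]    # expected to be two frame times
print(offset_error, delay_error)
```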
Hardware Timestamp Errors
There are two ways to reduce the queuing and buffering latencies: hardware timestamps and driver timestamps. Hardware timestamps are assumed to be something like those of IEEE 1588, in which an event, such as the passage of the Ethernet SOF octet, latches a counter. The latch is read by the NIC firmware or software driver at a later time, converted to NTP format and passed up the protocol stack. This avoids substantially all queuing and buffering latencies in the software design.
One problem with this approach is the conversion between counter time and NTP time. The counter is very likely running at a rate incommensurate with NTP time and so must be disciplined in time and perhaps frequency.
With hardware timestamps the actual preamble timestamps are struck from hardware and are thus identical to the reference timestamps. However, t1 and t3 are valid only if the frame is the final successful transmission. We conclude that, as long as the transmission rates on the reciprocal paths are the same, the offset and delay can be computed as in (1) and (2) without dilution of accuracy.
Driver Timestamp Errors
With driver timestamps a trailer timestamp is struck for each frame sent or received. However, the timestamp is available only at driver interrupt time; that is, at the end of the frame and before the checksum on transmit and after the checksum on receive. Assuming the timestamps can be converted to NTP format and passed up the protocol stack, this avoids most of the output queue and kernel buffer latencies of the software design.
With driver timestamps there is a delay d between each upper case variable and the corresponding actual lower case variable, so t1 = T1 + d, t2 = T2 + d, t3 = T3 + d and t4 = T4 + d. In this case we neglect the time to transmit the checksum, which is 32 ns for 1000-Mb LANs and 320 ns for 100-Mb LANs. Then,
θ = {[(T2 + d) - (T1 + d)] + [(T3 + d) - (T4 + d)]} / 2,
which after simplification is the same as (1). On the other hand,
δ = [(T4 + d) - (T1 + d)] - [(T3 + d) - (T2 + d)],
which after simplification is the same as (2). We conclude that, as long as the transmission rates on the reciprocal paths are the same and the frame lengths are the same, the offset and delay can be computed as in (1) and (2) without dilution of accuracy.
Interworking Errors
Recall that reference timestamps are struck at the beginning of the frame on both transmit and receive and so are invariant to the transmission rates and frame length. However, trailer timestamps are struck at the end of the frame, which is later than the reference timestamps depending on the transmission rate and frame length. So, what happens when interworking between various combinations of software, hardware and driver timestamps without proper transposition?
Let A use driver timestamps and B hardware timestamps, so that d is added to T1 and T4 only. Then,
θ = {[T2 - (T1 + d)] + [T3 - (T4 + d)]} / 2
results in an offset error of -d, while
δ = [(T4 + d) - (T1 + d)] - (T3 - T2),
results in no error.
Let A use driver timestamps and B software timestamps. Then,
θ = {[(T2 + d) - (T1 + d)] + [T3 - (T4 + d)]} / 2
results in an offset error of -d/2, while
δ = [(T4 + d) - (T1 + d)] - [T3 - (T2 + d)]
results in a delay increase of d. Other combinations are possible.
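The remaining combinations can be enumerated mechanically. The sketch below encodes each scheme as a pair of lags, in units of the frame time d, on transmit and on receive: software lags on receive only, hardware does not lag, and driver lags on both, as described in the preceding sections. The table, the function names and the sample values are hypothetical, and the errors are obtained simply by substituting into (1) and (2).

```python
# (transmit lag, receive lag) in units of the frame time d.
LAGS = {"software": (0, 1), "hardware": (0, 0), "driver": (1, 1)}


def offset_and_delay(t1, t2, t3, t4):
    """Equations (1) and (2) applied to whatever timestamps are supplied."""
    return ((t2 - t1) + (t3 - t4)) / 2, (t4 - t1) - (t3 - t2)


def interworking_error(scheme_a, scheme_b, d, ref=(0, 6000, 8000, 4000)):
    """Return (offset error, delay error) when A and B use the given schemes.

    A strikes t1 (transmit) and t4 (receive); B strikes t2 (receive) and
    t3 (transmit). No transposition is applied by either side.
    """
    T1, T2, T3, T4 = ref
    tx_a, rx_a = LAGS[scheme_a]
    tx_b, rx_b = LAGS[scheme_b]
    actual = offset_and_delay(T1 + tx_a * d, T2 + rx_b * d,
                              T3 + tx_b * d, T4 + rx_a * d)
    expected = offset_and_delay(T1, T2, T3, T4)
    return actual[0] - expected[0], actual[1] - expected[1]


d = 10  # hypothetical frame time in microseconds
for a in LAGS:
    for b in LAGS:
        print(a, b, interworking_error(a, b, d))
```

Matched schemes give no offset error in this model; mismatched pairs give an offset error of magnitude d/2 or d, with the sign depending on which side lags.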
Store and Forward Delay Errors
Consider a scenario with two LAN segments connected by a switch or router, where one segment operates at 10 Mb and the other at 100 Mb. Even with hardware timestamps errors can occur due to the different packet transmission times.
Figure 7
In Figure 7 let dA be the packet time for A and dB be the packet time for B. The router sends the packet to B only after the packet has been received from A; that is, there is no cut-through. With hardware timestamps,
θ = {[(T2 + dA) - T1] + [T3 - (T4 + dB)]} / 2
results in an offset error (dA - dB) / 2. On the other hand,
δ = [(T4 + dB) - T1] - [T3 - (T2 + dA)]
results in a delay increase of dA + dB.
If a software timestamping scheme is chosen for the above example, then,
θ = {[(T2 + dA + dB) - T1] + [T3 - (T4 + dA + dB)]} / 2
results in a zero offset error. On the other hand,
δ = [(T4 + dA + dB) - T1] - [T3 - (T2 + dA + dB)]
results in a delay increase of 2(dB + dA).
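To put numbers on the store-and-forward case, the sketch below takes a nominal 1000-bit packet, so that dA = 100 μs on the 10-Mb segment and dB = 10 μs on the 100-Mb segment, and substitutes into (1) and (2) exactly as in the expressions above. The reference timestamps and variable names are hypothetical.

```python
def offset_and_delay(t1, t2, t3, t4):
    """Equations (1) and (2) applied to whatever timestamps are supplied."""
    return ((t2 - t1) + (t3 - t4)) / 2, (t4 - t1) - (t3 - t2)


# Hypothetical reference timestamps, in microseconds.
T1, T2, T3, T4 = 0, 6000, 8000, 4000
dA = 100  # 1000 bits at 10 Mb/s
dB = 10   # 1000 bits at 100 Mb/s

reference = offset_and_delay(T1, T2, T3, T4)

# Hardware timestamps: t2 is late by dA and t4 is late by dB.
hardware = offset_and_delay(T1, T2 + dA, T3, T4 + dB)

# Software timestamps: both receive timestamps are late by dA + dB.
software = offset_and_delay(T1, T2 + dA + dB, T3, T4 + dA + dB)

hw_offset_error = hardware[0] - reference[0]   # (dA - dB) / 2
hw_delay_error = hardware[1] - reference[1]    # dA + dB
sw_offset_error = software[0] - reference[0]   # zero
sw_delay_error = software[1] - reference[1]    # 2 (dA + dB)
print(hw_offset_error, hw_delay_error, sw_offset_error, sw_delay_error)
```

With these rates the hardware scheme, although free of queuing latencies, leaves a 45 μs offset error, while the software scheme trades that for a larger apparent delay.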
From this we can conclude that it is not only the timestamping schemes at A and B which must match; some consideration must also be given to the forwarding behavior of the switches connecting A and B when the link speeds differ.
Nonreciprocal Delay Errors
As a practical software timestamp example, consider two servers Sun Ultra pogo (A) and Intel Pentium rackety (B), both synchronized to PPS sources showing typical offset and jitter less than 5 μs. Both are connected to bridged 100-Mb LAN segments with d = 20 μs for two hops. The roundtrip delay measured by either machine is about 400 μs and the jitter about 25 μs. The measured offset of pogo relative to rackety is 89 μs, while the measured offset of rackety relative to pogo is -97 μs.
The fact that the measured offsets are almost equal and with opposite sign suggests that the two servers agree closely on the PPS time. Of the measured roundtrip delay, 40 μs is frame transmission times; the remaining 360 μs must be due to buffering in the operating system and NICs. From the above analysis, the offset error is consistent with one path having 200 μs more overall delay than the other. If rackety is similar to the case with the macabre and mort example above, this would suggest rackety accounts for 50 μs and leaves 250 μs for pogo.
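The arithmetic behind these estimates can be laid out explicitly. The sketch below uses only the figures quoted in the text, taken in microseconds (consistent with a 10 μs frame time per 100-Mb hop), and the earlier result that a path asymmetry of x produces an offset error of x/2; the variable names are hypothetical.

```python
# Measured values quoted in the text, in microseconds.
offset_pogo_rel_rackety = 89
offset_rackety_rel_pogo = -97
roundtrip_delay = 400
frame_transmission = 40   # 20 per direction for two hops

# Both clocks track PPS closely, so the offsets are treated as pure error.
mean_offset_error = (abs(offset_pogo_rel_rackety)
                     + abs(offset_rackety_rel_pogo)) / 2

# An offset error of x/2 implies a one-way delay difference of x.
implied_asymmetry = 2 * mean_offset_error

# What is left of the roundtrip delay after the frames themselves.
buffering_delay = roundtrip_delay - frame_transmission

print(mean_offset_error, implied_asymmetry, buffering_delay)
```

The implied asymmetry of roughly 190 μs is what the text rounds to 200 μs.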
The main difficulty with this analysis is that over 100 μs is unaccounted for. The resulting time offset illustrates the importance of considering the measurement latencies on different machines and operating systems, even when the same software timestamping scheme is chosen. However, it would in principle be possible to calibrate the delay contributions of each machine relative to the SOF and include this in the driver information along with the transmission rate and frame length.
Nonreciprocal Rate Errors
A transmission path will typically be two or more concatenated network segments that might operate at different rates. Assume the packet time d = p + L/r, where p is the propagation time, L the packet length in bits and r the transmission rate in bits per second. Now consider the concatenated path
D = (p1 + L/r1) + (p2 + L/r2) + ... + (pn + L/rn),
where n is the number of segments. If we assume the outbound and return paths traverse the same segments (in any order), the total transmission time will be the same. If the timestamps are taken as described previously, the delays are reciprocal and accuracy is not diluted.
The above can be written
D = sum(pi) + L sum(1/ri) (i = 1...n),
where sum(1/ri) is the reciprocal of the overall transmission rate, which in the example is considered the same in both directions. Now consider the case where the overall transmission rates are not the same in both directions, as is often the case with spacecraft links. Let r12 be the overall outbound transmission rate and r34 the overall inbound transmission rate. Calculate the offset and delay as above and add the correction
t = D[r34/(r12 + r34) - 1/2]
to the offset.
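A short sketch of the two formulas in this section follows. The first function evaluates the concatenated path time and shows that it does not depend on segment order; the second evaluates the correction term exactly as written above, with the sign convention of the text. The function names and sample values are hypothetical, and treating D in the correction as the measured delay is an assumption.

```python
import math


def path_time(propagation, rates, length_bits):
    """D = sum(pi) + L * sum(1/ri) over the concatenated segments."""
    return sum(propagation) + length_bits * sum(1 / r for r in rates)


def rate_correction(D, r12, r34):
    """Correction D * [r34 / (r12 + r34) - 1/2], as given in the text.

    r12 is the overall outbound rate and r34 the overall inbound rate.
    """
    return D * (r34 / (r12 + r34) - 0.5)


# Two hypothetical segments: 10 Mb/s and 100 Mb/s, 5 us propagation each.
outbound = path_time([5e-6, 5e-6], [10e6, 100e6], 1000)
inbound = path_time([5e-6, 5e-6], [100e6, 10e6], 1000)   # reversed order

# Equal overall rates need no correction; unequal rates do.
same_rates = rate_correction(400, 1e6, 1e6)
asymmetric = rate_correction(400, 1e6, 3e6)

print(outbound, inbound, same_rates, asymmetric)
```

When the overall rates are equal the bracketed factor vanishes, which recovers the reciprocal case discussed earlier.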