Timestamping Schemes and Accuracy Expectations

David L. Mills and Stuart Venters
21 February 2008

Introduction

Over the 29 years NTP and its predecessors have evolved, accuracy expectations have improved from 100 ms to less than 1 ms on fast LANs with multiple segments interconnected by switches and less than a few milliseconds on most campus and corporate networks with multiple LANs interconnected by routers. However, the ultimate accuracy limit for typical workstations and PCs of today is a few microseconds with the precision kernel and PPS signal. Improving the expected accuracy to this level and beyond with NTP will require improved hardware and software technology. In principle the ultimate accuracy is limited only by the 232-ps resolution of the NTP timestamp format, or about the time light travels three inches.

One of the main strengths of NTP is that it has been, and will continue to be, implemented in a wide variety of scenarios with significantly different available resources and accuracy requirements. Furthermore, timekeeping is not an exact science. No implementation provides perfect time. As such, this appendix proposes a timestamping scheme which is not an implementation requirement, but rather a specification for a model in which specific schemes can interoperate without significant loss of accuracy.

Requirements

In general, and even with faster hardware and network links, improved accuracy can only be achieved with some kind of hardware assist or special provisions in the network interface card (NIC) or device driver. To better understand the issues, consider the ultimate case where the server and client implement clocks that can be read with exquisite accuracy. The object is to measure the time offset of a server relative to the client.

The NTP on-wire specification uses what is called here the reference timestamps T1, T2, T3 and T4, respectively called the originate, receive, transmit and destination timestamps.
T1 and T4 are struck by peer A from its clock, while T2 and T3 are struck by peer B from its clock. The object of the protocol is to determine the time offset of B relative to A and the roundtrip delay:

    offset = [(T2 - T1) + (T3 - T4)] / 2    (1)
    delay = (T4 - T1) - (T3 - T2)    (2)

The precision to which the offset and delay can be calculated depends on the precision to which the timestamps can be struck. In general, it is best to strike the timestamps as close to the physical media as possible, in order to avoid various queuing and buffering latencies. The software timestamping scheme used in the current NTP reference implementation attempts to approximate the reference timestamps as follows:

1. T1 and T3 are struck by the output packet routine just before handing the packet to the operating system, driver and NIC. Applicable latencies include output queuing, kernel buffering, NIC buffering and possible retransmissions.
2. T2 and T4 are struck by the driver at interrupt time. Applicable latencies include NIC buffering, interrupt response time and operating system scheduling.

If these latencies can be avoided, the remaining errors are due only to network delays for bit propagation time, packet transmission time and queuing latencies.

Inspection of (1) shows that the best accuracy is obtained when the delays on the outbound path T1 -> T2 and inbound path T3 -> T4 are statistically identical; in this case we say the delays are reciprocal. Further refinement shows that, if the reciprocal delays differ by x seconds, the resulting offset error is x/2 seconds.

Lack of reciprocity is due to two causes: different data rates on the outbound and inbound paths, and where in the packet the timestamps are struck. If the overall data rate is not the same in both directions, errors can result. If the transmitter and receiver choose different points in the packet to strike a timestamp, errors can result. What happens depends on the hardware and software design, as described below.
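As a rough illustration of (1) and (2) and of the x/2 rule, the following Python sketch simulates one exchange and computes the offset and delay. The function names, the 50-us hold time at B and the delay values are illustrative assumptions, not part of the NTP specification or reference implementation.

```python
def offset_delay(t1, t2, t3, t4):
    """Apply equations (1) and (2) to four timestamps."""
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


def exchange(theta, out_delay, in_delay, t1=0.0, hold=50.0):
    """Simulate one exchange (hypothetical helper, times in microseconds).

    theta is the true offset of B's clock relative to A's clock,
    out_delay and in_delay are the one-way delays, and hold is the
    time B keeps the packet before replying.
    """
    t2 = t1 + out_delay + theta    # arrival at B, read on B's clock
    t3 = t2 + hold                 # departure from B, read on B's clock
    t4 = (t3 - theta) + in_delay   # arrival at A, read on A's clock
    return t1, t2, t3, t4


# Reciprocal delays: the true offset is recovered exactly.
off, dly = offset_delay(*exchange(theta=25.0, out_delay=70.0, in_delay=70.0))

# Outbound path 20 us slower than inbound: the offset error is 20 / 2 = 10 us.
off2, dly2 = offset_delay(*exchange(theta=25.0, out_delay=90.0, in_delay=70.0))

print(off, dly)    # 25.0 140.0
print(off2, dly2)  # 35.0 160.0
```

In the second case the roundtrip delay is still measured correctly (90 + 70 us), but the offset absorbs half of the 20-us path difference.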

There are many workable schemes to implement timestamp capture. Using a different scheme at each end of the link is likely to result in a delay mismatch. The following provisions apply:

1. The capture location should be relative to the physical media at some point in the frame that can be recorded by current hardware, firmware and software designs.
2. A preamble timestamp is struck as near to the start of the frame as possible. For Ethernet systems the preferred point follows the last bit of the preamble and SOF octet and before the first bit of the data.
3. A trailer timestamp is struck as near to the end of the frame as possible. On transmit this follows the last bit of the data and before the checksum; on receive this follows the last bit of the checksum.
4. In addition to the timestamps, the NIC and/or driver must provide both the nominal transmission rate and length of the frame between the preamble and trailer timestamps. This can be used by the driver or application to transpose between the preamble and trailer timestamps without significant loss of accuracy. The transposition error with acceptable frequency tolerance of 300 PPM for 100-Mb Ethernets and nominal 1000-bit NTP packet is less than 3 ns.

Whatever timestamping scheme is developed, it should allow interworking between schemes so that every combination of schemes used by the server and client results in the highest accuracy possible. This can be achieved if both the transmitter and receiver select the same reference point from among the preamble and trailer timestamps. For instance, hardware timestamping (IEEE 1588) schemes use preamble timestamps for reference, while driver timestamping schemes use trailer timestamps for reference. On the other hand, in software timestamping schemes (current NTP), the transmitter uses preamble timestamps while the receiver uses trailer timestamps.
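The transposition described in provision 4 amounts to subtracting the frame time, computed from the nominal rate and the length between the two capture points. A minimal Python sketch follows, using hypothetical function names and assuming the driver supplies the length in bits and the rate in bits per second:

```python
def frame_time(length_bits, rate_bps):
    """Time in seconds to send length_bits at the nominal rate."""
    return length_bits / rate_bps


def trailer_to_preamble(trailer_ts, length_bits, rate_bps):
    """Back-time a trailer timestamp (seconds) to the preamble point."""
    return trailer_ts - frame_time(length_bits, rate_bps)


def transposition_error(length_bits, rate_bps, tolerance_ppm):
    """Worst-case transposition error when the actual rate differs
    from the nominal rate by tolerance_ppm parts per million."""
    return frame_time(length_bits, rate_bps) * tolerance_ppm * 1e-6


# Nominal 1000-bit NTP packet on a 100-Mb Ethernet, 300 PPM tolerance.
ft = frame_time(1000, 100e6)                 # about 10 us
err = transposition_error(1000, 100e6, 300)  # about 3 ns
preamble = trailer_to_preamble(1.000010, 1000, 100e6)
print(ft, err, preamble)
```

The error figure is simply the 10-us frame time scaled by the 300-PPM tolerance, which is where the 3-ns figure in the text comes from.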

Proposed Timestamping Scheme

With the above requirements in mind, it is possible to select either the preamble or trailer timestamp at either the transmitter or receiver and to transpose so that both select the same reference point. The natural choice is the preamble timestamp, as this is consistent with IEEE 1588 and likely to be supported by available hardware. With and without transparent bridges or boundary clocks, accuracies with NTP will be of the same order as 1588.

With driver timestamping both the transmitter and receiver must transpose the trailer timestamp to the preamble timestamp according to the respective data rate and packet length. With software timestamping the transmitter already uses the preamble timestamp; the receiver must transpose the trailer timestamp to the preamble timestamp.

Example

An NTP packet (about 1000 bits) takes 1 us on a 1000-Mb LAN, 10 us on a 100-Mb LAN and 650 us on a T1 line. As shown later, in order to drive the residual NTP offsets down to PPS levels, the reciprocal delays must match within less than 10 us. If the reciprocal transmission rates and packet lengths are the same to within 10 us, or one packet time on a 100-Mb LAN, the accuracy goal can be met.

As an example of current practice, NTP servers macabre and mort are identical Intel Pentium clones running FreeBSD and synchronized to a GPS receiver. The servers share a lightly loaded 100-Mb switch; the GPS receiver is connected by fiber repeaters and another 100-Mb switch. Each server shows offsets of about 25 us relative to the other and the GPS receiver. Each server shows roundtrip delays of about 140 us with the other. Note that only 40 us (four LAN hops) is due to frame transmission times, with the remaining 100 us due to buffering in the operating systems and NICs.

Error Analysis of Timestamp Schemes

The analysis so far does not account for various statistical latencies, nor does it account for systematic errors resulting from nonreciprocal paths.
The NTP code path delay and the delay to read the system clock are substantially the same in both directions, and so cancel out. As previously noted, the Ethernet transceivers themselves contribute about 100 us per pair. While these delays are generally constant, various preemptive latencies can occur due to device interrupts, timeslice-end and so forth. In addition, NTP traffic typically competes with other network traffic, causing additional latencies. The principal remaining terms in the error budget are nonreciprocal delays due to different data rates and nonuniform transposition between the preamble and trailer timestamps. The errors due to such causes are summarized in the following sections.

In the following, upper case variables represent the reference timestamps used in (1) and (2), where they are assumed struck at the beginning of the frame. Lower case variables represent the actual timestamps struck by the hardware or software. The on-wire protocol uses the actual timestamps in the same fashion as (1) and (2). The object is to explore the possible errors that might result from different timestamping strategies.

Software Timestamp Scheme

In the NTP reference implementation the actual t1 and t3 timestamps are struck just before launching the packet via the operating system, which results in various queuing latencies and buffering delays. In Unix, for example, the packet buffer is copied to internal kernel buffers which then are passed to the NIC. Modern NICs have 16-K FIFOs shared between the transmit and receive sides and separate PHY buffers for each side. The NIC driver manages the FIFOs and buffers to achieve maximum throughput, but might not be sensitive to latencies. We assume t1 = T1 and t3 = T3, but note in passing that the queuing latencies and buffering delays have become components of the delay budget.

On the receive side the t2 and t4 timestamps are struck upon arrival of a frame delayed by the NIC FIFOs and interrupt latency.
The NIC driver receives an interrupt to capture the trailer timestamp for the preceding frame, but this might not always be the case. In anticipation of the arrival, the NTP program has allocated a buffer in user space. When a complete frame arrives, the driver copies it to the buffer, strikes a timestamp and saves it in the buffer structure. We assume t2 = T2 + d and t4 = T4 + d, where d represents the frame transmission time.

The NTP on-wire protocol performs the same calculations as (1) and (2) but using the actual timestamps. After substitution we have

    offset = {[(T2 + d) - T1] + [T3 - (T4 + d)]} / 2,

which after simplification is the same as (1). On the other hand,

    delay = [(T4 + d) - T1] - [T3 - (T2 + d)],

which results in a delay error of 2d. We conclude that, if we neglect the latencies summarized above, and as long as the transmission rates on the reciprocal paths are the same and the frame lengths are the same, the offset is as in (1) without dilution of accuracy. On a point-to-point link it is acceptable for the receive and transmit timestamps to be different as long as each end uses the same choices.

Hardware Timestamping Scheme

There are two ways to reduce the queuing and buffering latencies: hardware timestamping and driver timestamping. Hardware timestamping is assumed to be something like IEEE 1588, in which an event, such as the passage of the Ethernet SOF octet, latches a counter. The latch is read by the NIC firmware or software driver at a later time, converted to NTP format and passed up the protocol stack for use by the NTP program. This avoids substantially all queuing and buffering latencies in the software design.

One problem with this approach is the conversion between counter time and NTP time. The counter is very likely running at a rate incommensurate with NTP time and so must be disciplined in time and perhaps frequency.
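To make the conversion problem concrete, the following Python sketch maps a latched counter value to a 64-bit NTP timestamp (32 bits of seconds, 32 bits of fraction). It is only an illustration under stated assumptions: the function name, the 125-MHz counter rate and the idea of a known (counter, NTP seconds) reference pair are hypothetical, and the discipline loop that would keep the rate and reference pair accurate is not shown.

```python
NTP_FRACTION_BITS = 32  # NTP timestamps carry a 32-bit fraction of a second

# Resolution of the NTP timestamp format in picoseconds (about 232.8 ps).
NTP_RESOLUTION_PS = 1e12 / (1 << NTP_FRACTION_BITS)


def counter_to_ntp(count, epoch_count, epoch_ntp_seconds, rate_hz):
    """Convert a latched counter value to a 64-bit NTP timestamp.

    epoch_count and epoch_ntp_seconds form an assumed reference pair:
    the counter read epoch_count when NTP time was exactly
    epoch_ntp_seconds. rate_hz is the assumed (disciplined) counter rate.
    Integer arithmetic avoids floating-point rounding.
    """
    elapsed_ticks = count - epoch_count
    elapsed_units = (elapsed_ticks << NTP_FRACTION_BITS) // rate_hz
    return (epoch_ntp_seconds << NTP_FRACTION_BITS) + elapsed_units


# Hypothetical 125-MHz counter, latched half a second after the reference.
stamp = counter_to_ntp(62_500_000, 0, 100, 125_000_000)
seconds = stamp >> NTP_FRACTION_BITS
fraction = stamp & ((1 << NTP_FRACTION_BITS) - 1)
print(seconds, fraction)  # 100 2147483648
```

Because 125 MHz does not divide 2^32 evenly, most counter values map to a truncated fraction, which is one sense in which the counter rate is incommensurate with NTP time.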

With hardware timestamping the actual preamble timestamps are struck from hardware and are thus identical to the reference timestamps. However, t1 and t3 are valid only if the frame is the final successful transmission. We conclude that, as long as the transmission rates on the reciprocal paths are the same, the offset and delay can be computed as in (1) and (2) without dilution of accuracy. On a point-to-point link, both transmitter and receiver using preamble timestamps works.

Driver Timestamping Scheme

In driver timestamping a trailer timestamp is struck for each frame sent or received. However, the timestamp is available only at driver interrupt time; that is, at the end of the frame and before the checksum on transmit and after the checksum on receive. Assuming the timestamps can be converted to NTP format and passed up the protocol stack for use by the NTP program, this avoids most of the output queue and kernel buffer latencies of the software design.

With driver timestamping there is a delay d between each upper case variable and the corresponding actual lower case variable, so t1 = T1 + d, t2 = T2 + d, t3 = T3 + d and t4 = T4 + d. In this case we neglect the time to transmit the checksum, which is 32 ns for 1000-Mb LANs and 320 ns for 100-Mb LANs. Then,

    offset = {[(T2 + d) - (T1 + d)] + [(T3 + d) - (T4 + d)]} / 2,

which after simplification is the same as (1). On the other hand,

    delay = [(T4 + d) - (T1 + d)] - [(T3 + d) - (T2 + d)],

which after simplification is the same as (2). We conclude that, as long as the transmission rates on the reciprocal paths are the same and the frame lengths are the same, the offset and delay can be computed as in (1) and (2) without dilution of accuracy. On a point-to-point link, both transmitter and receiver using trailer timestamps works.

One and Two Step Protocols

In hardware and driver timestamping the frame itself does not contain the actual timestamp; that timestamp is available only after the frame has been sent.
This requires a modification of the protocol to include the timestamp in a later packet. The result is the two-step protocol used by IEEE 1588 and proposed for NTP in the architecture briefing on the NTP project page. While the details remain to be worked out, it is anticipated that the existing one-step protocol and proposed two-step protocol can be automatically detected during regular protocol operations.

Errors Due to Interworking Between Software, Hardware and Driver Schemes

Recall that reference timestamps are struck at the beginning of the frame on both transmit and receive and so are invariant to the transmission rates and frame length. However, trailer timestamps are struck at the end of the frame, which is later than the reference timestamps by an amount depending on the transmission rate and frame length. So, what happens when interworking between various combinations of software, hardware and driver schemes without proper transposition?

Let A be driver and B hardware. Then,

    offset = {[T2 - (T1 + d)] + [T3 - (T4 + d)]} / 2,

which results in an offset error of -d. And

    delay = [(T4 + d) - (T1 + d)] - (T3 - T2),

which results in no error.

Let A be driver and B software. Then,

    offset = {[(T2 + d) - (T1 + d)] + [T3 - (T4 + d)]} / 2,

which results in an offset error of -d/2. And

    delay = [(T4 + d) - (T1 + d)] - [T3 - (T2 + d)],

which results in a delay error of d. Other combinations are possible.

Errors Due to Nonreciprocal Delays and Transmission Rates

Consider a scenario with two LAN segments connected by a switch, where one segment operates at 10 Mb and the other at 100 Mb. In schemes where timestamps are struck at the end of the packet, errors can occur due to the different packet transmission times. Let dA be the packet time for A and dB be the packet time for B. For the driver timestamping scheme,

    offset = {[(T2 + dB) - T1] + [T3 - (T4 + dA)]} / 2,

which results in an offset error (dB - dA) / 2.
On the other hand,

    delay = [(T4 + dA) - T1] - [T3 - (T2 + dB)],

which results in a delay error dB + dA. Note that if a software timestamping scheme is chosen for the above example, then

    offset = {[(T2 + dB + dA) - T1] + [T3 - (T4 + dA + dB)]} / 2,

which results in a zero offset error. On the other hand,

    delay = [(T4 + dA + dB) - T1] - [T3 - (T2 + dB + dA)],

which results in a delay error 2 * (dB + dA). From this we can conclude that it is not only the timestamping schemes at A and B which must match; some consideration must also be given to the forwarding behavior of the switches connecting A and B when the link speeds differ.

Errors Due to Nonreciprocal Delays

As a practical software timestamping example, consider two servers, Sun Ultra pogo (A) and Intel Pentium rackety (B), both synchronized to PPS sources showing typical offset and jitter less than 5 us. Both are connected to bridged 100-Mb LAN segments with d = 20 us for two hops. The roundtrip delay measured by either machine is about 400 us and the jitter about 25 us. The measured offset of pogo relative to rackety is 89 us, while the measured offset of rackety relative to pogo is -97 us. The fact that the measured offsets are almost equal and of opposite sign suggests that the two servers agree closely on the PPS time. Of the measured roundtrip delay, 40 us is frame transmission times; the remaining 360 us must be due to buffering in the operating system and NICs.

From the above analysis, the offset error is consistent with one path having 200 us more overall delay than the other. If rackety is similar to the case of macabre and mort in the example above, this would suggest rackety accounts for 50 us, which would leave 250 us for pogo. The main difficulty with this analysis is that over 100 us is unaccounted for. The resulting time offset illustrates the importance of considering the measurement latencies on differing implementations even when the same software timestamping scheme is chosen.
However, it would be in principle possible to calibrate the delay contributions of each machine relative to the SOF and include this in the driver information along with the transmission rate and frame length.

Errors Due to Asymmetric Overall Transmission Rates

A transmission path will typically be two or more concatenated segments that might operate at different rates. Assume the packet time d = p + L/r, where p is the propagation time, L the packet length in bits and r the transmission rate in bits per second. Now consider the concatenated path

    D = (p1 + L/R1) + (p2 + L/R2) + ... + (pn + L/Rn),

where n is the number of segments. If we assume the outbound and return paths traverse the same segments (in any order), the total transmission time will be the same. If the reference timestamp is taken from the beginning of the packet, the delays are reciprocal and accuracy is not diluted. On the other hand, if R1 and Rn are not the same rate, the offset error will be L(1/R1 - 1/Rn). It is possible to back-time a timestamp struck at the end of the packet to the start time, but this is probably not justified for a 1000-Mb LAN and only marginally so for a 100-Mb LAN.

The above can be written

    D = sum(pi) + L sum(1/Ri) (i = 1...n),

where sum(1/Ri) is the reciprocal of the overall transmission rate, which in the example is considered the same in both directions. Now consider the case where the overall transmission rates are not the same in both directions, as is often the case with spacecraft links. Let r12 be the overall outbound transmission rate and r34 the overall inbound transmission rate. Calculate the offset and delay as above and add the correction

    t = [r34/(r12 + r34) - 1/2] * delay

to the offset.
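The two expressions in this section can be evaluated directly. The Python sketch below uses hypothetical function names and made-up link rates; it also assumes, for the numerical check only, that the roundtrip delay consists entirely of packet transmission time, in which case the term t equals half the difference between the outbound and inbound one-way delays.

```python
def path_time(prop_times, rates_bps, length_bits):
    """D = sum(p_i) + L * sum(1/R_i) over the concatenated segments."""
    return sum(prop_times) + length_bits * sum(1.0 / r for r in rates_bps)


def asymmetry_term(r12, r34, delay):
    """t = [r34 / (r12 + r34) - 1/2] * delay, as given in the text."""
    return (r34 / (r12 + r34) - 0.5) * delay


L = 1000  # nominal NTP packet length in bits

# The same segments traversed in either order give the same total time.
d_fwd = path_time([1e-6, 2e-6], [10e6, 100e6], L)
d_rev = path_time([2e-6, 1e-6], [100e6, 10e6], L)

# Hypothetical asymmetric link: 1 Mb/s outbound, 4 Mb/s inbound.
r12, r34 = 1e6, 4e6
out_delay = L / r12  # 1000 us
in_delay = L / r34   # 250 us
t = asymmetry_term(r12, r34, out_delay + in_delay)

print(d_fwd, d_rev)
print(t, (out_delay - in_delay) / 2)  # both about 375 us
```

With equal rates in both directions the term is zero, so the symmetric case reduces to (1) unchanged.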