TCP OPERATION

Initiating a connection

TCP connections are opened using a three-way handshake. The originator sends a segment with the SYN bit set and an initial sequence number, usually chosen using the local time-of-day clock. The receiving entity checks whether a process is listening at the specified destination port, in which case it will usually accept the connection. To do this it responds with another segment with both the SYN and ACK bits set, and includes a sequence number of its own, which it will use to number the return stream. To refuse the connection it sends a segment with RST set. Finally, the originator completes the three-way handshake by sending an ACK segment.
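The exchange above can be sketched as follows. This is a minimal illustration rather than a real TCP implementation: the Segment class and the two function names are hypothetical, and the sketch relies on the fact that a SYN consumes one sequence number, so each side acknowledges the peer's initial sequence number plus one.

```python
from dataclasses import dataclass

SEQ_MODULUS = 2**32  # sequence numbers are 32 bits wide


@dataclass
class Segment:
    """Hypothetical, heavily simplified view of a TCP segment header."""
    syn: bool = False
    ack: bool = False
    rst: bool = False
    seq: int = 0
    ack_no: int = 0


def respond_to_syn(syn_seg, listening, server_isn):
    """Server side: accept with SYN+ACK if a process is listening, else refuse with RST."""
    ack_no = (syn_seg.seq + 1) % SEQ_MODULUS  # the SYN consumes one sequence number
    if not listening:
        return Segment(rst=True, ack=True, ack_no=ack_no)
    return Segment(syn=True, ack=True, seq=server_isn, ack_no=ack_no)


def complete_handshake(syn_ack, client_isn):
    """Client side: the final ACK that completes the three-way handshake."""
    return Segment(
        ack=True,
        seq=(client_isn + 1) % SEQ_MODULUS,
        ack_no=(syn_ack.seq + 1) % SEQ_MODULUS,
    )


syn = Segment(syn=True, seq=1000)
reply = respond_to_syn(syn, listening=True, server_isn=5000)
final = complete_handshake(reply, client_isn=1000)
```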

Data transfer

During data transmission, each segment sent has to be acknowledged. When a segment is sent a retransmission timer is started, and if the timeout interval expires before the segment is acknowledged, the sender will retransmit. Segments must be delivered in the correct order, and the simplest approach if a timeout occurs is to adopt a go-back-N strategy. When acknowledging incoming segments, an entity may elect to wait for a returning segment on which to piggyback an acknowledgement. A TCP entity sending a segment with next expected sequence number i implicitly acknowledges all bytes received up to i-1, so it is not actually necessary to send an ACK for every segment which arrives (see Figure 1). For example, Windows employs a strategy known as delayed acknowledgements (RFC 1122), where only every second segment on a connection is acknowledged unless a delay of 200 ms passes without a further arrival.
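A rough sketch of cumulative, delayed acknowledgement follows. The class and method names are hypothetical, sequence number wraparound is ignored, and the 200 ms timer is assumed to be driven by the caller through on_delay_timer(); out-of-order segments are simply answered with the current next expected byte.

```python
class DelayedAckReceiver:
    """Acknowledge every second in-order segment, or when the delay timer fires."""

    def __init__(self, initial_seq):
        self.next_expected = initial_seq  # first byte not yet received
        self.unacked_segments = 0         # in-order segments not yet acknowledged

    def on_segment(self, seq, length):
        """Return the ACK number to send now, or None if the ACK is delayed."""
        if seq != self.next_expected:
            # Out of order: repeat the cumulative ACK immediately.
            return self.next_expected
        self.next_expected += length
        self.unacked_segments += 1
        if self.unacked_segments >= 2:
            self.unacked_segments = 0
            return self.next_expected  # acknowledges everything up to next_expected - 1
        return None

    def on_delay_timer(self):
        """Called by the caller when the delay (e.g. 200 ms) expires."""
        if self.unacked_segments:
            self.unacked_segments = 0
            return self.next_expected
        return None


receiver = DelayedAckReceiver(initial_seq=0)
acks = [
    receiver.on_segment(0, 100),    # first segment: ACK delayed
    receiver.on_segment(100, 100),  # second segment: one ACK covers both
    receiver.on_segment(200, 50),   # third segment: ACK delayed again
    receiver.on_delay_timer(),      # timer fires: ACK the outstanding segment
]
```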

Figure 1      Flow control in TCP

Go-back-N is simple but inefficient for connections with a long RTT. An option to use a form of selective retransmission called selective acknowledgement (SACK) was introduced in RFC 2018, allowing a receiver to acknowledge non-contiguous blocks of data. In this case segments arriving out of order are buffered until the missing ones are received. The feature is carried as a further option in the TCP header, which holds the start and end pointers of each data block being acknowledged.
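To illustrate the idea, the sketch below derives a cumulative ACK number plus a list of SACK blocks from the byte ranges a receiver has buffered. The function name and the (start, end) representation with an exclusive end are assumptions made for illustration; the limit on the number of blocks that fit in the option, the block ordering rules of RFC 2018, and wraparound are all ignored.

```python
def ack_with_sack(next_expected, received):
    """Return (cumulative_ack, sack_blocks) for the buffered byte ranges.

    received is a list of (start, end) ranges with end exclusive.
    """
    blocks = []
    for start, end in sorted(received):
        if end <= next_expected:
            continue  # already covered by the cumulative ACK
        if start <= next_expected:
            next_expected = end  # extends the contiguous data
        elif blocks and start <= blocks[-1][1]:
            # Touches or overlaps the previous block: merge them.
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], end))
        else:
            blocks.append((start, end))  # a new non-contiguous block
    return next_expected, blocks


# Bytes 1500-1999 and 3000-3999 are missing.
cumulative_ack, sack_blocks = ack_with_sack(
    1000, [(1000, 1500), (2000, 2500), (2500, 3000), (4000, 4500)]
)
```

The sender can then retransmit only the gaps between the cumulative ACK and the reported blocks, rather than going back to the first missing byte and resending everything after it.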

TCP uses a credit-based flow control mechanism, with the receiver using the window size field (see Figure 1) to inform the sender of how much buffer space it can allocate to the connection at this time (the credit). A TCP entity can exercise discretion as to when it sends on data that it has received, either from a peer or a user. This discretion ends if a PUSH operation is detected, as this forces the entity to send on the data that it has received before the PUSH. However, the receiving entity can only pass on as much data as the user process is willing to take, and this determines the rate at which it can free up buffer space and issue credits to the sender. If the receiving user accepts data in small bursts, the receiving TCP entity will send frequent tiny window revisions to the sender, each advertising very small credits. The result will be many segments with very small payloads and thus a significant overhead. This phenomenon is sometimes called the silly window syndrome. A solution is to require the receiver not to send credits until a reasonable space has been freed up (e.g. a segment's worth).
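The receiver-side remedy can be sketched in a few lines. The function name and parameters are hypothetical, and the threshold of one MSS follows the "segment's worth" suggestion above; real implementations typically apply additional conditions.

```python
def new_credit(free_space, current_credit, mss):
    """Return the credit (window) to advertise, holding back small increases.

    free_space:     bytes currently free in the receive buffer
    current_credit: credit the sender still holds from earlier advertisements
    mss:            maximum segment size for the connection
    """
    if free_space - current_credit >= mss:
        return free_space      # a worthwhile increase: advertise it
    return current_credit      # too small: keep quiet for now


small = new_credit(free_space=100, current_credit=0, mss=1460)
large = new_credit(free_space=1460, current_credit=0, mss=1460)
```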

Protection against wraparound

TCP assumes that a packet in the network will have a maximum lifetime known as the MSL (maximum segment lifetime), commonly assumed to be about 120 s. The sequence numbers used on TCP connections are only 32 bits long, corresponding to 4 Gbytes of data. On a fast link it would take much less than the MSL to exhaust this space, and it would then be possible for two segments with the same number to be in the network at the same time, resulting in potential misinterpretation. Protection against wrapped sequence numbers (PAWS) may be provided using the timestamp option.
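A short calculation shows how quickly the sequence space can be exhausted. The link speeds are example values chosen for illustration, and the sketch assumes the link is kept fully loaded by a single connection.

```python
SEQ_SPACE_BYTES = 2**32  # 32-bit sequence numbers, one per byte
MSL_SECONDS = 120        # commonly assumed maximum segment lifetime


def wrap_time_seconds(link_bits_per_second):
    """Time to send 2**32 bytes at the given rate, i.e. time until numbers repeat."""
    return SEQ_SPACE_BYTES * 8 / link_bits_per_second


for name, rate in [("10 Mbit/s", 10e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]:
    t = wrap_time_seconds(rate)
    verdict = "shorter" if t < MSL_SECONDS else "longer"
    print(f"{name}: wraps after {t:.1f} s ({verdict} than the MSL)")
```

At 10 Mbit/s the wrap time is close to an hour, so old duplicates have long expired; at 1 Gbit/s it is roughly 34 s, well inside the assumed MSL.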

Releasing a connection

To release a connection, a TCP entity simply sends a segment with FIN set. The other side responds with an ACK and this effectively half-closes the connection. The latter may continue to transmit for some time, but, when finished, will send its own FIN segment to which the former will also respond with a final ACK. When an entity sends a FIN, it sets a timeout on the response (usually within twice the maximum packet lifetime), and will terminate the connection in any case, if an ACK is not forthcoming. Even when all FINs and ACKs have been exchanged, the entities will wait for a packet lifetime, to ensure all activity has stopped before deleting the connection completely.
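The release sequence can be viewed as a small state machine. The sketch below uses the state names from the TCP specification (FIN_WAIT_1, CLOSE_WAIT and so on); the event names and the run() helper are hypothetical, and the case where both sides send FIN simultaneously is left out.

```python
# (state, event) -> next state, for an orderly release only.
TRANSITIONS = {
    # Side that closes first
    ("ESTABLISHED", "send_fin"): "FIN_WAIT_1",
    ("FIN_WAIT_1", "recv_ack"): "FIN_WAIT_2",    # connection is now half-closed
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",     # replies with the final ACK
    ("TIME_WAIT", "timer_expired"): "CLOSED",    # final wait before deletion
    # Side that closes second
    ("ESTABLISHED", "recv_fin"): "CLOSE_WAIT",   # replies with an ACK, may keep sending
    ("CLOSE_WAIT", "send_fin"): "LAST_ACK",
    ("LAST_ACK", "recv_ack"): "CLOSED",
}


def run(events, state="ESTABLISHED"):
    """Apply a sequence of events and return the resulting state."""
    for event in events:
        state = TRANSITIONS[(state, event)]
    return state


first_closer = run(["send_fin", "recv_ack", "recv_fin", "timer_expired"])
second_closer = run(["recv_fin", "send_fin", "recv_ack"])
```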

Congestion control

TCP includes a feature that attempts to control internal congestion in the network by shaping traffic according to how frequently segments fail to get acknowledged. This is possible since the credit system allows ACKs to be returned even if the receiver has no buffer space. Although acknowledgement failures may be due to packet corruption in the network, it is reasonable to assume that most losses are due to IP routers discarding packets in the presence of congestion. Rather than continuing to resend such segments, TCP slows its transmitter down. Note that in cases of mild congestion, ACKs may be able to return to a sender even if a segment is lost. In this case, the sender may become aware of the lost segment before the timeout interval expires, by seeing duplicate ACKs. The sender may then elect to retransmit before the timeout interval has expired, a strategy known as fast retransmit. In practice TCP will usually wait for a third duplicate ACK before triggering a retransmit.
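Duplicate ACK detection for fast retransmit might be sketched as below. The class name is hypothetical, and the sketch counts duplicates only by ACK number, whereas a real implementation also checks that the segment carries no data and does not change the window.

```python
class DupAckDetector:
    """Signal a fast retransmit on the third duplicate of the same ACK number."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack_no):
        """Return True exactly when the duplicate count reaches the threshold."""
        if ack_no == self.last_ack:
            self.dup_count += 1
            return self.dup_count == self.threshold
        # A new ACK number: remember it and start counting afresh.
        self.last_ack = ack_no
        self.dup_count = 0
        return False


detector = DupAckDetector()
# The original ACK for byte 1000, then four duplicates, then new data is acknowledged.
signals = [detector.on_ack(n) for n in [1000, 1000, 1000, 1000, 1000, 2000]]
```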

The transmitter maintains a congestion window, cwnd, analogous to its send window, which indicates how many bytes it may send without further acknowledgement. The sender will actually only send up to the limit dictated by the minimum of the current values of the congestion and send windows (the latter being the credit available from the receiver), a value known as the current window. When a connection is set up, the congestion window is set to the MSS for the connection. As each segment on the connection is successfully acknowledged, the congestion window grows by one MSS, so that it doubles roughly every round-trip time. This rapid rate of growth of traffic is called the slow start phase, and it continues until timeouts begin to occur or a set congestion threshold is reached. Once the congestion threshold is attained, the increase of the congestion window size slows to a linear increase of a fraction of an MSS per successful ACK (the fraction is MSS/cwnd), a phase called congestion avoidance. The congestion threshold begins at 64 Kbytes, but every time a timeout occurs or duplicate ACKs signal a loss, it is cut to half the current window size. ICMP source quench packets are treated as timeouts, and trigger the same reaction. In the case of a timeout, the congestion window is also reset to one MSS, and starts to grow again via slow start. However, if duplicate ACKs are detected, the sender may assume that the congestion is not so serious and simply cuts the congestion window in half without initiating slow start: this more gentle approach is called fast recovery.
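The window arithmetic can be sketched as follows. The class and method names are hypothetical, window sizes are in bytes, and the floor of two MSS on the threshold is an assumption borrowed from common practice rather than something stated in the text. The temporary window inflation that real fast recovery performs while duplicate ACKs keep arriving is omitted.

```python
class CongestionWindow:
    """Simplified slow start, congestion avoidance and fast recovery."""

    def __init__(self, mss, ssthresh=65535):
        self.mss = mss
        self.cwnd = mss           # starts at one MSS
        self.ssthresh = ssthresh  # congestion threshold, initially about 64 Kbytes

    def on_ack(self):
        if self.cwnd < self.ssthresh:
            self.cwnd += self.mss                          # slow start
        else:
            self.cwnd += self.mss * self.mss // self.cwnd  # congestion avoidance

    def _halve_threshold(self, send_window):
        current_window = min(self.cwnd, send_window)
        # Assumption: never let the threshold fall below two segments.
        self.ssthresh = max(current_window // 2, 2 * self.mss)

    def on_timeout(self, send_window):
        self._halve_threshold(send_window)
        self.cwnd = self.mss      # back to the beginning, grow via slow start

    def on_duplicate_acks(self, send_window):
        self._halve_threshold(send_window)
        self.cwnd = self.ssthresh  # fast recovery: skip slow start


cw = CongestionWindow(mss=1000, ssthresh=4000)
trace = []
for _ in range(4):
    cw.on_ack()
    trace.append(cw.cwnd)
```

In this run the first three ACKs each add a full MSS (slow start), and the fourth, arriving once cwnd has reached the threshold, adds only MSS*MSS/cwnd = 250 bytes.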

Timeouts

Timeouts, of course, are a much more awkward problem than at the link level, because it is very difficult to set a retransmission time that is reasonable for prevailing network conditions, which change dynamically. If the interval is set too short, timeouts will occur unnecessarily; if too long, there will be additional delay every time a packet is lost. In practice, the timeout value for the retransmission timer is adjusted dynamically. This is done by observing RTTs, adjusting the timeout value downwards whenever an ACK returns sooner than expected, and upwards when it returns later than expected.
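One common way to perform this adjustment is to keep exponentially weighted averages of the RTT and of its variation, as in the standard retransmission timer calculation of RFC 6298. The sketch below assumes that scheme, which the text above does not name; the class name is hypothetical, and the minimum timeout and clock granularity terms of the RFC are left out.

```python
class RtoEstimator:
    """Smoothed RTT estimator: timeout = SRTT + 4 * RTTVAR (times in seconds)."""

    ALPHA = 1 / 8  # weight of a new sample in the smoothed RTT
    BETA = 1 / 4   # weight of a new sample in the RTT variation

    def __init__(self):
        self.srtt = None
        self.rttvar = None

    def update(self, sample):
        """Feed one measured RTT and return the new timeout value."""
        if self.srtt is None:
            # First measurement initialises both estimates.
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - sample)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample
        return self.srtt + 4 * self.rttvar


estimator = RtoEstimator()
first = estimator.update(0.1)   # initial sample
second = estimator.update(0.1)  # ACK arrives as expected: timeout drifts down
third = estimator.update(0.5)   # ACK arrives late: timeout moves up
```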