Time: August 5, 2013
Category: tcp/ip internals

Transmission of packets from a TCP sender is restricted by the congestion window. On reception of an ack, the TCP sender may increase the congestion window. At the start of a TCP connection the sender is in slow start, wherein the congestion window grows exponentially from a small value (typically 2) until a threshold is reached. After reaching the threshold, TCP enters the congestion avoidance phase and increments the congestion window linearly.

Congestion avoidance state machine

Events such as the arrival of a dupack, a SACK, or an Explicit Congestion Notification indicate possible congestion caused by excess transmission of packets by the sender. These events are processed through a state machine which has the following states:

  1. TCP_CA_Open
  2. TCP_CA_Disorder
  3. TCP_CA_CWR
  4. TCP_CA_Recovery
  5. TCP_CA_Loss


TCP_CA_Open

This is the default start state for a TCP connection. In this state the connection increases the congestion window as per slow start and congestion avoidance by calling tcp_cong_avoid. Every ack is checked for being dubious by calling tcp_ack_is_dubious. If the ack is declared dubious, tcp_fastretrans_alert changes the CA state as per the event. A dubious ack still raises cwnd through tcp_cong_avoid if tcp_may_raise_cwnd permits it.


TCP_CA_Disorder

This state indicates that TCP is experiencing some disorder in the form of SACKs and dupacks. The heuristics of the connection are observed for some time to understand whether this is a genuine loss.


TCP_CA_CWR

This state indicates that there is some indication of congestion, such as ICMP source quench or local device congestion, due to which the TCP sender has slowed packet transmission.


TCP_CA_Recovery

This state indicates that the sender is fast retransmitting packets after detecting packet loss.


TCP_CA_Loss

This state is entered when the retransmission timer times out for some packet, or when the ack received indicates that the SACK information remembered by the sender is not in sync with the TCP receiver.
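The five states can be modelled in user space as a small enum. This is only an illustrative sketch: the names `ca_state`, `ca_state_name` and `ca_in_cwnd_reduction` are hypothetical and not kernel identifiers. The numeric order is chosen so that "Open or Disorder" can be tested with a single comparison, as the kernel excerpt for tcp_enter_loss later in this post does with `icsk_ca_state <= TCP_CA_Disorder`.

```c
#include <string.h>

/* Hypothetical user-space mirror of the congestion avoidance states. */
enum ca_state {
    CA_OPEN = 0,      /* normal operation */
    CA_DISORDER = 1,  /* dupacks/SACKs seen, loss not yet confirmed */
    CA_CWR = 2,       /* cwnd being reduced after a congestion notification */
    CA_RECOVERY = 3,  /* fast retransmit in progress */
    CA_LOSS = 4       /* retransmission timeout or SACK reneging */
};

/* Human-readable name, useful when tracing state transitions. */
static const char *ca_state_name(enum ca_state s)
{
    switch (s) {
    case CA_OPEN:     return "Open";
    case CA_DISORDER: return "Disorder";
    case CA_CWR:      return "CWR";
    case CA_RECOVERY: return "Recovery";
    case CA_LOSS:     return "Loss";
    default:          return "unknown";
    }
}

/* Nonzero while the sender is actively reducing cwnd (CWR or Recovery). */
static int ca_in_cwnd_reduction(enum ca_state s)
{
    return s == CA_CWR || s == CA_RECOVERY;
}
```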

Detailed functionality:

Raising the cwnd (tcp_cong_avoid)

During slow start phase:

  1. The congestion window, snd_cwnd, is incremented by one for every ack received. Effectively the congestion window doubles every round trip.
  2. The slow start threshold, snd_ssthresh, marks the maximum size to which snd_cwnd grows during the slow start phase.
  3. snd_cwnd is never allowed to grow beyond snd_cwnd_clamp.

During congestion avoidance phase:

  1. Once snd_cwnd reaches snd_ssthresh, it is incremented at a slower pace: by one per round trip.
  2. snd_cwnd is never allowed to grow beyond snd_cwnd_clamp.
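The two growth phases can be sketched as a small user-space model. This is not the kernel's tcp_cong_avoid; the struct and the functions `model_cong_avoid` and `model_round_trip` are hypothetical, and only the field names are borrowed from the text. The model assumes one ack per packet and no losses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model; field names follow the variables named in the text. */
struct cwnd_model {
    uint32_t snd_cwnd;        /* congestion window, in packets */
    uint32_t snd_cwnd_cnt;    /* acks counted toward the next linear increment */
    uint32_t snd_ssthresh;    /* slow start threshold */
    uint32_t snd_cwnd_clamp;  /* upper bound on snd_cwnd */
};

/* Process one ack that acknowledges new data. */
static void model_cong_avoid(struct cwnd_model *m)
{
    assert(m->snd_cwnd > 0);
    if (m->snd_cwnd < m->snd_ssthresh) {
        /* Slow start: +1 per ack, so the window doubles per round trip. */
        if (m->snd_cwnd < m->snd_cwnd_clamp)
            m->snd_cwnd++;
    } else if (++m->snd_cwnd_cnt >= m->snd_cwnd) {
        /* Congestion avoidance: +1 only after a full window of acks. */
        m->snd_cwnd_cnt = 0;
        if (m->snd_cwnd < m->snd_cwnd_clamp)
            m->snd_cwnd++;
    }
}

/* Simulate one round trip: one ack per packet in the window at its start. */
static void model_round_trip(struct cwnd_model *m)
{
    uint32_t acks = m->snd_cwnd;
    for (uint32_t i = 0; i < acks; i++)
        model_cong_avoid(m);
}
```

Starting from snd_cwnd = 2 with snd_ssthresh = 8, successive calls to model_round_trip give 4, 8, 9, 10: two doublings, then one packet per round trip.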

Checking for Dubious Ack (tcp_ack_is_dubious)

An ack is considered dubious if ANY of the following three conditions holds:

static inline bool tcp_ack_is_dubious(const struct sock *sk, const int flag)
{
  return !(flag & FLAG_NOT_DUP) || (flag & FLAG_CA_ALERT) ||
    inet_csk(sk)->icsk_ca_state != TCP_CA_Open;
}

  1. The CA state is not Open.
  2. ALL of the following are true (FLAG_NOT_DUP is clear): the packet received does not carry data && the ack does not update the receive window in the TCP header && the packet does not ack new data.
  3. The packet indicates congestion (FLAG_CA_ALERT), i.e. ANY of the following is true: the packet carries SACK data || the ECE bit is set.
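The three conditions can be exercised outside the kernel with a small model. The flag names and bit values below are assumptions made for illustration (they are not the kernel's FLAG_* definitions), and `ack_is_dubious` is a hypothetical helper that takes the "state is Open" check as a plain boolean.

```c
#include <stdbool.h>

/* Illustrative flag bits; names and values are assumptions, not the kernel's. */
enum {
    ACK_HAS_DATA   = 1 << 0,  /* segment carried payload */
    ACK_WIN_UPDATE = 1 << 1,  /* segment changed the advertised window */
    ACK_NEW_DATA   = 1 << 2,  /* segment acknowledged new data */
    ACK_SACKED     = 1 << 3,  /* segment carried SACK data */
    ACK_ECE        = 1 << 4   /* ECE bit was set */
};

#define ACK_NOT_DUP  (ACK_HAS_DATA | ACK_WIN_UPDATE | ACK_NEW_DATA)
#define ACK_CA_ALERT (ACK_SACKED | ACK_ECE)

static bool ack_is_dubious(bool ca_state_is_open, int flag)
{
    return !(flag & ACK_NOT_DUP)          /* condition 2: pure duplicate ack */
        || (flag & ACK_CA_ALERT) != 0     /* condition 3: SACK or ECE */
        || !ca_state_is_open;             /* condition 1: state is not Open */
}
```

An ack that only advances the acknowledged sequence in the Open state is not dubious; the same ack with the ECE bit set, or received in any other state, is.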

Permission for raising cwnd (tcp_may_raise_cwnd)

On receiving a dubious ack, the congestion window is allowed to grow if ALL of the following conditions hold:

static inline bool tcp_may_raise_cwnd(const struct sock *sk, const int flag)
{
  const struct tcp_sock *tp = tcp_sk(sk);

  return (!(flag & FLAG_ECE) || tp->snd_cwnd < tp->snd_ssthresh) &&
    !tcp_in_cwnd_reduction(sk);
}

  1. The CA state is not Recovery or CWR.
  2. Either the ECE flag is not set or cwnd is smaller than the slow start threshold.
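As a quick check of how the two conditions combine, here is a hedged stand-alone version. The function `may_raise_cwnd` and its parameters are hypothetical; the kernel function reads the same information from the socket.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-alone form of the permission check described above. */
static bool may_raise_cwnd(bool in_cwr_or_recovery, bool ece_seen,
                           uint32_t snd_cwnd, uint32_t snd_ssthresh)
{
    /* Condition 2: ECE blocks growth unless we are still below ssthresh. */
    bool ece_ok = !ece_seen || snd_cwnd < snd_ssthresh;

    /* Condition 1: never grow while cwnd is being reduced. */
    return ece_ok && !in_cwr_or_recovery;
}
```

With ECE set, a sender at cwnd 4 and ssthresh 8 may still grow, while a sender at cwnd 10 may not; in CWR or Recovery growth is refused regardless.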

Processing dubious ack event (tcp_fastretrans_alert)

1. SACK reneging (tcp_check_sack_reneging): When a SACK is received, the packets being SACKed are marked. A clearing ack would typically cover the sacked packets. However, if the ack received points to a remembered SACK, the sender's knowledge of the SACK is probably erroneous. The following actions are taken:

static bool tcp_check_sack_reneging(struct sock *sk, int flag)
{
  if (flag & FLAG_SACK_RENEGING) {
    struct inet_connection_sock *icsk = inet_csk(sk);

    /* Treat the reneged SACK as a loss and retransmit the queue head. */
    tcp_enter_loss(sk, 1);
    tcp_retransmit_skb(sk, tcp_write_queue_head(sk));
    inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto,
                              TCP_RTO_MAX);
    return true;
  }
  return false;
}

Enter loss (tcp_enter_loss):

void tcp_enter_loss(struct sock *sk, int how)
{
  const struct inet_connection_sock *icsk = inet_csk(sk);
  struct tcp_sock *tp = tcp_sk(sk);
  struct sk_buff *skb;
  bool new_recovery = false;

  /* Reduce ssthresh if it has not yet been made inside this window. */
  if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
      !after(tp->high_seq, tp->snd_una) ||
      (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
    new_recovery = true;
    tp->prior_ssthresh = tcp_current_ssthresh(sk);
    tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
    tcp_ca_event(sk, CA_EVENT_LOSS);
  }
  tp->snd_cwnd       = 1;
  tp->snd_cwnd_cnt   = 0;
  tp->snd_cwnd_stamp = tcp_time_stamp;

  if (tcp_is_reno(tp))
    tcp_reset_reno_sack(tp);

  tp->undo_marker = tp->snd_una;
  if (how) {
    /* Forget all SACK information (used for SACK reneging). */
    tp->sacked_out = 0;
    tp->fackets_out = 0;
  }

  tcp_for_write_queue(skb, sk) {
    if (skb == tcp_send_head(sk))
      break;

    if (TCP_SKB_CB(skb)->sacked & TCPCB_RETRANS)
      tp->undo_marker = 0;
    if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) || how) {
      /* Mark every packet that is not (or no longer) SACKed as lost. */
      TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
      TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
      tp->lost_out += tcp_skb_pcount(skb);
      tp->retransmit_high = TCP_SKB_CB(skb)->end_seq;
    }
  }

  tp->reordering = min_t(unsigned int, tp->reordering,
                         sysctl_tcp_reordering);
  tcp_set_ca_state(sk, TCP_CA_Loss);
  tp->high_seq = tp->snd_nxt;
  TCP_ECN_queue_cwr(tp);

  /* F-RTO RFC5682 sec 3.1 step 1: retransmit SND.UNA if no previous
   * loss recovery is underway except recurring timeout(s) on
   * the same SND.UNA (sec 3.2). Disable F-RTO on path MTU probing
   */
  tp->frto = sysctl_tcp_frto &&
    (new_recovery || icsk->icsk_retransmits) &&
    !inet_csk(sk)->icsk_mtup.probe_size;
}

  1. Remember the sequence number of the next packet to be sent (the snd_nxt value) at the onset of congestion as high_seq.
  2. Reduce cwnd to 1.
  3. Set undo_marker = snd_una.
  4. Queue CWR to indicate the congestion situation to the peer (ref TCP_ECN_queue_cwr).
  5. If ssthresh has not already been reduced within this window (for example, the state is Open or Disorder), prior_ssthresh is set to the current ssthresh and ssthresh is reduced.
  6. Change the CA state to Loss.

Retransmit the packet at the head of the write queue (which was erroneously marked as sacked earlier).
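The bookkeeping done on entering Loss can be traced with a small user-space model. Everything here is a simplification under stated assumptions: `loss_model` and `model_enter_loss` are hypothetical, ssthresh is reduced Reno-style to half of cwnd (minimum 2) rather than through the pluggable ssthresh callback, prior_ssthresh simply copies the old ssthresh, and the write-queue marking is left out.

```c
#include <stdbool.h>
#include <stdint.h>

enum ca_state { CA_OPEN, CA_DISORDER, CA_CWR, CA_RECOVERY, CA_LOSS };

/* Hypothetical subset of the sender state touched when entering Loss. */
struct loss_model {
    enum ca_state state;
    uint32_t snd_una;        /* oldest unacknowledged sequence number */
    uint32_t snd_nxt;        /* next sequence number to be sent */
    uint32_t high_seq;       /* snd_nxt at the onset of congestion */
    uint32_t undo_marker;    /* snd_una at the onset, used for undo */
    uint32_t snd_cwnd;
    uint32_t snd_ssthresh;
    uint32_t prior_ssthresh; /* ssthresh before the reduction */
    bool cwr_queued;         /* CWR notification pending for the peer */
};

static void model_enter_loss(struct loss_model *m)
{
    /* Reduce ssthresh only if it was not already reduced in this window. */
    if (m->state == CA_OPEN || m->state == CA_DISORDER) {
        uint32_t half = m->snd_cwnd / 2;

        m->prior_ssthresh = m->snd_ssthresh;
        m->snd_ssthresh = half > 2 ? half : 2;  /* assumed Reno-style halving */
    }
    m->snd_cwnd = 1;
    m->undo_marker = m->snd_una;
    m->high_seq = m->snd_nxt;
    m->cwr_queued = true;
    m->state = CA_LOSS;
}
```

Entering Loss from Open with cwnd 10 and ssthresh 64 leaves cwnd at 1, ssthresh at 5 and prior_ssthresh at 64, which is the information a later undo needs.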

2. Time to recover (tcp_time_to_recover): This function examines various parameters of the TCP connection (such as the number of packets lost) to decide whether it is the right time to move to the Recovery state. It is time to recover when the TCP heuristics suggest a strong possibility of packet loss in the network. The following checks are made:

static bool tcp_time_to_recover(struct sock *sk, int flag)
{
  struct tcp_sock *tp = tcp_sk(sk);
  __u32 packets_out;

  /* Trick#1: The loss is proven. */
  if (tp->lost_out)
    return true;

  /* Not-A-Trick#2 : Classic rule... */
  if (tcp_dupack_heuristics(tp) > tp->reordering)
    return true;

  /* Trick#4: It is still not OK... But will it be useful to delay
   * recovery more?
   */
  packets_out = tp->packets_out;
  if (packets_out <= tp->reordering &&
      tp->sacked_out >= max_t(__u32, packets_out/2, sysctl_tcp_reordering) &&
      !tcp_may_send_now(sk)) {
    /* We have nothing to send. This connection is limited
     * either by receiver window or by application.
     */
    return true;
  }

  /* If a thin stream is detected, retransmit after first
   * received dupack. Employ only if SACK is supported in order
   * to avoid possible corner-case series of spurious retransmissions
   * Use only if there are no unsent data.
   */
  if ((tp->thin_dupack || sysctl_tcp_thin_dupack) &&
      tcp_stream_is_thin(tp) && tcp_dupack_heuristics(tp) > 1 &&
      tcp_is_sack(tp) && !tcp_send_head(sk))
    return true;

  /* Trick#6: TCP early retransmit, per RFC5827.  To avoid spurious
   * retransmissions due to small network reorderings, we implement
   * Mitigation A.3 in the RFC and delay the retransmission for a short
   * interval if appropriate.
   */
  if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
      (tp->packets_out >= (tp->sacked_out + 1) && tp->packets_out < 4) &&
      !tcp_may_send_now(sk))
    return !tcp_pause_early_retransmit(sk, flag);

  return false;
}

  1. Some packets are lost (lost_out is non-zero).
  2. SACK is an acknowledgement for out-of-order packets. If the number of packets SACKed is greater than the reordering metric of the network, loss is assumed to have happened.
  3. If the following three conditions are true, the TCP sender is in a state where no more data can be transmitted and the number of packets acked is big enough to assume that the rest of the packets are lost in the network:
    1. The number of packets in flight is no more than the reordering metric (so condition 2 can never become true).
    2. At least half of the packets in flight have been SACKed by the receiver, and the number of packets SACKed is at least the fast retransmit threshold. (The fast retransmit threshold is the number of dupacks that the sender awaits before starting fast retransmission.)
    3. The sender cannot send any more packets, because either it is bound by the sliding window or the application has not delivered any more data to it in anticipation of an ack for the data already provided.

If it is declared to be the time to recover, the CA state switches to Recovery.
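The three listed checks can be tried out with a simplified user-space model. It follows the prose above rather than the full kernel function: the dupack heuristic is approximated by the SACKed count, the thin-stream and early-retransmit branches are left out, and `recover_model`, `fast_retrans_thresh` and `can_send_now` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the counters the decision depends on. */
struct recover_model {
    uint32_t packets_out;         /* packets in flight */
    uint32_t sacked_out;          /* packets SACKed by the receiver */
    uint32_t lost_out;            /* packets already marked lost */
    uint32_t reordering;          /* reordering metric of the path */
    uint32_t fast_retrans_thresh; /* dupacks awaited before fast retransmit */
    bool can_send_now;            /* window and application allow sending */
};

static bool model_time_to_recover(const struct recover_model *m)
{
    uint32_t need;

    /* Check 1: the loss is proven. */
    if (m->lost_out)
        return true;

    /* Check 2: more packets SACKed than reordering can explain. */
    if (m->sacked_out > m->reordering)
        return true;

    /* Check 3: small flight, mostly SACKed, and nothing more to send. */
    need = m->packets_out / 2;
    if (need < m->fast_retrans_thresh)
        need = m->fast_retrans_thresh;
    return m->packets_out <= m->reordering &&
           m->sacked_out >= need &&
           !m->can_send_now;
}
```

For example, with three packets in flight, all three SACK counts at the threshold of 3 and nothing left to send, the model reports that it is time to recover even though check 2 alone would not fire.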

3. Try to open (tcp_try_to_open): If it is not yet time to move to the Recovery state, tcp_try_to_open checks for the following:

static void tcp_try_keep_open(struct sock *sk)
{
  struct tcp_sock *tp = tcp_sk(sk);
  int state = TCP_CA_Open;

  /* Any sacked, lost or retransmitted packet keeps us in Disorder. */
  if (tcp_left_out(tp) || tcp_any_retrans_done(sk))
    state = TCP_CA_Disorder;

  if (inet_csk(sk)->icsk_ca_state != state) {
    tcp_set_ca_state(sk, state);
    tp->high_seq = tp->snd_nxt;
  }
}

static void tcp_try_to_open(struct sock *sk, int flag, const int prior_unsacked)
{
  struct tcp_sock *tp = tcp_sk(sk);

  if (!tcp_any_retrans_done(sk))
    tp->retrans_stamp = 0;

  if (flag & FLAG_ECE)
    tcp_enter_cwr(sk, 1);

  if (inet_csk(sk)->icsk_ca_state != TCP_CA_CWR) {
    tcp_try_keep_open(sk);
    if (inet_csk(sk)->icsk_ca_state != TCP_CA_Open)
      tcp_moderate_cwnd(tp);
  } else {
    tcp_cwnd_reduction(sk, prior_unsacked, 0);
  }
}

  1. If the packet indicates ECE, the state switches to CWR. cwnd is reduced by calling tcp_cwnd_down (the listing above calls tcp_cwnd_reduction for this), and the reduction is indicated to the peer by queuing a CWR notification.
  2. If the state is not CWR and there are any sacked or retransmitted packets, set the state to Disorder and call tcp_moderate_cwnd.
  3. tcp_moderate_cwnd: In the Disorder state TCP is still unsure whether the loss is genuine. After receiving acks with SACK there may be a clearing ack which acks many packets non-dubiously in one go. Such a clearing ack may cause a packet burst in the network; to avoid this, cwnd is reduced to allow no more than max_burst (usually 3) packets beyond those already in flight.
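The burst limit amounts to capping cwnd at "packets in flight plus max_burst". A minimal sketch, assuming that reading of the description; `moderate_cwnd` and its parameters are hypothetical names:

```c
#include <stdint.h>

/* Cap cwnd so that at most max_burst new packets can leave at once. */
static uint32_t moderate_cwnd(uint32_t snd_cwnd, uint32_t in_flight,
                              uint32_t max_burst)
{
    uint32_t limit = in_flight + max_burst;

    return snd_cwnd < limit ? snd_cwnd : limit;
}
```

With cwnd 20, 4 packets in flight and max_burst 3, the window is cut to 7, so a clearing ack can release at most 3 packets back to back. A window that is already below the limit is left unchanged.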

4. Update scoreboard (tcp_update_scoreboard): This function marks as lost all packets which were not sacked, up to the maximum sequence number sacked. Packets which have waited for an ack for an interval equivalent to the retransmission time are also marked as lost. The accounting for lost, sacked and left-out packets is also done in this function.
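The first marking rule can be illustrated on a toy retransmit queue. This sketch is an assumption-laden simplification: `pkt` and `mark_lost_below_highest_sack` are hypothetical, the queue is a plain array ordered by sequence number, and the timeout-based marking is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-packet scoreboard entry. */
struct pkt {
    bool sacked;  /* receiver reported this packet via SACK */
    bool lost;    /* sender considers this packet lost */
};

/* Mark every un-SACKed packet below the highest SACKed one as lost.
 * Returns the number of packets newly marked. */
static size_t mark_lost_below_highest_sack(struct pkt *q, size_t n)
{
    size_t highest = n;  /* n means "no SACKed packet found" */
    size_t newly_lost = 0;

    for (size_t i = 0; i < n; i++)
        if (q[i].sacked)
            highest = i;
    if (highest == n)
        return 0;

    for (size_t i = 0; i < highest; i++) {
        if (!q[i].sacked && !q[i].lost) {
            q[i].lost = true;
            newly_lost++;
        }
    }
    return newly_lost;
}
```

For a queue where packets 1 and 3 are SACKed, packets 0 and 2 are marked lost, while packet 4 (above the highest SACK) is left alone because it may simply not have arrived yet.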

5. Transmit packets (tcp_xmit_retransmit_queue)

  1. Packets which are marked as lost and not yet retransmitted are retransmitted until all lost packets have been retransmitted or the cwnd limit is reached.
  2. If the CA state is Recovery (which means packets were actually lost in the network):
    1. If permitted by cwnd, new packets are transmitted.
    2. If new packets cannot be transmitted although cwnd allows more transmissions, already transmitted packets are retransmitted (since there is evidence of a lossy network).

6. Processing subsequent dubious acks: The processing of subsequent dubious acks depends on the current CA state:
1 If the current state is Recovery
If a partial ack arrives acking some of the data in transit, call tcp_try_undo_partial:

  1. tcp_may_undo: Check whether any retransmissions were done. If timestamps are enabled, we may further check whether the ack was generated for the retransmitted packet or for the original packet. This indicates whether TCP unnecessarily switched the CA state and should undo the actions taken.
  2. If tcp_may_undo permits, call tcp_undo_cwr:
    1. Undo the cwnd reduction. If prior_ssthresh is zero, bring cwnd to the ssthresh level. If prior_ssthresh is non-zero, restore cwnd to its value at the time of the cwnd reduction and restore ssthresh to prior_ssthresh. (prior_ssthresh may be set to zero when cwnd undo is not intended, e.g. on ECE reception.)
    2. Moderate cwnd to prevent a packet burst in the network (tcp_moderate_cwnd).
    3. Call TCP_ECN_withdraw_cwr.

Set is_dupack to false in order to prevent retransmission of further packets.
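The undo of the cwnd reduction can be sketched as follows. This is a model under assumptions, not the kernel routine: `undo_model` and `model_undo_cwr` are hypothetical, and the "value at the time of the reduction" is approximated Reno-style as twice the reduced ssthresh, since the reduction is assumed to have halved the window.

```c
#include <stdint.h>

/* Hypothetical subset of sender state involved in undoing a reduction. */
struct undo_model {
    uint32_t snd_cwnd;
    uint32_t snd_ssthresh;
    uint32_t prior_ssthresh;  /* 0 means "undo of ssthresh not intended" */
};

static uint32_t max_u32(uint32_t a, uint32_t b)
{
    return a > b ? a : b;
}

static void model_undo_cwr(struct undo_model *m)
{
    if (m->prior_ssthresh) {
        /* Assumed Reno-style: pre-reduction cwnd was about 2 * ssthresh. */
        m->snd_cwnd = max_u32(m->snd_cwnd, 2 * m->snd_ssthresh);
        if (m->prior_ssthresh > m->snd_ssthresh)
            m->snd_ssthresh = m->prior_ssthresh;
    } else {
        /* No saved threshold: only bring cwnd up to the ssthresh level. */
        m->snd_cwnd = max_u32(m->snd_cwnd, m->snd_ssthresh);
    }
}
```

After a reduction that left cwnd at 1, ssthresh at 5 and prior_ssthresh at 64, the undo restores cwnd to 10 and ssthresh to 64. A burst limit such as tcp_moderate_cwnd would then be applied on top of the restored window.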
2 If the current state is Loss

  1. tcp_try_undo_loss: If a partial ack arrives acking some of the data in transit, call tcp_undo_cwr to restore cwnd if allowed by tcp_may_undo.
  2. If undo is not allowed, moderate cwnd again (tcp_moderate_cwnd) and retransmit packets by calling tcp_xmit_retransmit_queue.

3 If the current state is not Recovery or Loss

  1. If the state is Disorder, call tcp_try_undo_dsack. [Whenever a D-SACK arrives, it indicates that a packet was received twice by the receiver, which means the retransmission of that packet was unnecessary. The undo_retrans variable is decremented for each D-SACK.] If undo_retrans reaches zero, call tcp_undo_cwr.
  2. Call tcp_time_to_recover to check whether it is time to recover; if not, call tcp_try_to_open and wait for subsequent acks.
  3. If it is time to recover, switch to the Recovery state and retransmit more packets by calling tcp_xmit_retransmit_queue.

7. Clearing ack received: When the sending TCP receives an ack for the highest sequence number that had been transmitted at the time the CA state switched away from Open, the CA state machine exits towards the Open state.
1 If the current state is Loss
Attempt recovery (tcp_try_undo_recovery):

  1. tcp_may_undo is called to see whether the ack was for the original packet and the changes may be undone. If permitted by tcp_may_undo, tcp_undo_cwr is called to restore the cwnd and ssthresh values.
  2. After undoing the cwnd reduction: if the ack received covers packets beyond high_seq, switch to the Open state; otherwise wait for the processing of subsequent acks before announcing a state switch.

2 If the current state is CWR

  1. If a sequence number greater than high_seq is acked, it indicates that the CWR indication has reached the peer TCP. Call tcp_complete_cwr to bring cwnd down to the ssthresh value.
  2. Switch to the Open state.

3 If the current state is Disorder

  1. Call tcp_try_undo_dsack to check whether the cwnd changes can be undone.
  2. When an ack is received for a packet higher than high_seq, switch to the Open state. This is a safe point for switching the state, as dupacks are no longer expected.

4 If the current state is Recovery

  1. Call tcp_try_undo_recovery to restore cwnd and switch to the Open state.
  2. Call tcp_complete_cwr to bring cwnd down to the ssthresh value.
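The exit paths can be summarised in a simplified dispatcher. This is a sketch with several assumptions: `exit_model`, `seq_after` and `model_clearing_ack` are hypothetical, the undo attempts are omitted, and "acked beyond high_seq" is used as the exit condition for every state, which is slightly stricter than the general rule stated at the start of this step. Sequence numbers are compared with wraparound in mind.

```c
#include <stdbool.h>
#include <stdint.h>

enum ca_state { CA_OPEN, CA_DISORDER, CA_CWR, CA_RECOVERY, CA_LOSS };

/* Hypothetical subset of sender state needed to leave a non-Open state. */
struct exit_model {
    enum ca_state state;
    uint32_t snd_una;   /* highest sequence number acked so far */
    uint32_t high_seq;  /* snd_nxt when the state left Open */
    uint32_t snd_cwnd;
    uint32_t snd_ssthresh;
};

/* Wraparound-safe test for "a is later than b" on 32-bit sequence numbers. */
static bool seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

static void model_clearing_ack(struct exit_model *m)
{
    if (m->state == CA_OPEN || !seq_after(m->snd_una, m->high_seq))
        return;  /* nothing to exit, or old data is still outstanding */

    if (m->state == CA_CWR || m->state == CA_RECOVERY) {
        /* Complete the reduction: bring cwnd down to ssthresh. */
        if (m->snd_cwnd > m->snd_ssthresh)
            m->snd_cwnd = m->snd_ssthresh;
    }
    m->state = CA_OPEN;
}
```

A sender in CWR with high_seq 5000 stays in CWR while acks are at or below 5000; the first ack beyond it completes the reduction and returns the state to Open.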

