Blog

KB: 24032014-001: Dealing with TIME WAIT exhaustion (no more TCP connections)

Article keyword index

Sympton

1) You have received a time_wait_checker notification 2) or after a review, you find that all services are up and working but they cannot accept more connections (even though they should) or some services cannot connect to internal services like database servers (like MySQL). At the same time, if you run the following command you get more than 30000 entries:

>> netstat -n | grep TIME_WAIT | wc -l

Another sympton is that specific services like squid fails with the following error:

commBind: Cannot bind socket FD 98 to *:0: (98) Address already in use

Affected releases

All releases may suffer this problem. It’s not a bug but a resource exhaustion problem.

Background

The problem is triggered because some service is creating connections faster than they are collected for reuse. After TCP connection is closed, a period is started to receive lost packages for that closed connection, to avoid mixing them for newly created connections with the same location.

Every time a TCP connection is created, a local port is needed for the local end point. This local end point port is taken from the ephemeral ports range as defined at:

/proc/sys/net/ipv4/ip_local_port_range

If that range is exhausted, no more TCP connections can be created because there is no more ephemeral (local port) available.

There are several ways to solve this problem. The following solutions are listed in the order as they are recommended:

  1. Identify the application that is creating those connections to review if there is a problem.
  2. If it is not possible, increase ephemeral port range. See next section.
  3. If increasing ephemeral port range does not solve the problem, try reduce the amount of time a connection can be in TIME WAIT. See next section.
  4. If that does not solve the problem, try activating TCP time wait reuse option which will cause the system to reuse ports that are in TIME WAIT. See next section.

Solution

In the case you cannot fix the application producing these amount of TIME_WAIT connections, use the following options provided by the time_wait_checker to configure the system to better react to this situation.

  1. First, select the machine and then click on Actions (at the top-right of the machines’ view):actions
  2. Now, click on “Show machine’s checkers”:
  3. Then select the “time_wait_checker” and after that, click on:
    configure.
  4. After that, the following window will be showed to configure various TIME_WAIT handling options:Core-Admin Time-wait's checker options

Now use options available as described and using them in the recommended order.

Some notes about Tcp time wait reuse and recycle options

Special mention to Tcp time wait recycle  ( /proc/sys/net/ipv4/tcp_tw_recycle ) option is that it is considered more aggressive than Tcp time wait reuse  ( /proc/sys/net/ipv4/tcp_tw_reuse ). Both can cause problems because they apply “simplifications” to reduce the wait time and to reuse certain structures. In the case of Tcp time wait recycle, given its nature, it can cause problems with devices behind a NAT by allowing connections in a random manner (just one device will be able to connect to the server with this option enabled). As indicated by the tool, “Observe after activation”. More information about Tcp time wait recycle and how it relates to NATed devices at http://troy.yort.com/improve-linux-tcp-tw-recycle-man-page-entry/

In general, both options shouldn’t be used if not needed.

Long term solutions

The following solutions are not quick and requires preparation. But you can consider them to avoid the problem in the long term.

SOLUTION 1: If possible, update your application to reuse connections created. For example, if those connections are because internal database connections, instaed of creating, querying and closing, try to reuse the connection as much as possible. That will reduce a lot the number of pending TIME_WAIT connection in many cases.

SOLUTION 2: Another possible solution is to use several IPs for the same service and load-balance it through DNS (for example). That way you expand possible TCP location combination that are available and thus, you expand the amount of ephemeral ports available. Every IP available and serving the service double your range.

In any case, SOLUTION 1 by far best than SOLUTION 2. It is better to have a service consuming fewer resources.

Posted in: KB, Linux Networking

Leave a Comment (0) ↓

Leave a Comment

You must be logged in to post a comment.