For many years now I’ve been running a Prosody XMPP server at home for private messaging between family members and friends. Together with Conversations on Android devices and the Dino app on Linux, it’s the perfect solution. There is one client device, however, to which messages sometimes take a couple of minutes to be delivered. I also noticed that it frequently reconnects to the XMPP server. That device is usually connected to a French mobile network, so I assumed the operator has a very short NAT timeout on its gateway that kills the TCP session to the server before either the client or the server sends some sort of keep-alive message. Not a big deal so far, but since Conversations has been extended with voice and video calls, call establishment to that device fails every now and then. Time to have a closer look.
And indeed, the NAT gateway of that mobile network operator has a super short timeout, well below the 10-minute keep-alive timer Conversations uses on the application layer. So I had a look at whether I could make the Linux server that hosts Prosody send TCP keep-alive messages on layer 4. By default, the Linux kernel sets the TCP keep-alive timer to 7200 seconds, i.e. 2 hours. That’s a bit long. But there’s a simple way to set it to another value:
# tee runs as root; a plain '>' redirect would be performed by the unprivileged shell
echo 270 | sudo tee /proc/sys/net/ipv4/tcp_keepalive_time
# To make this permanent, add the following line to /etc/sysctl.conf (Ubuntu)
net.ipv4.tcp_keepalive_time=270
In practice, changing the value by writing the number of seconds to this virtual file takes immediate effect, even for already established TCP connections. As I’m not keen on having too many TCP keep-alives, for the sake of mobile device power saving, I started my experiments at 9 minutes. That mobile network, however, wouldn’t have it. I finally ended up with a meager 270 seconds, i.e. 4.5 minutes. Anything more and the NAT gateway would kill the TCP session and thus cause a connection outage and reconnect. Seriously!?
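To double-check what the kernel is actually using after such a change, something along these lines could help. This is only a minimal sketch: the function name show_keepalive and its optional directory argument are made up for illustration. Besides tcp_keepalive_time it also prints the two related kernel settings tcp_keepalive_intvl and tcp_keepalive_probes, which this post does not touch.

```shell
# Print the kernel's three TCP keep-alive settings as name=value lines.
# The optional first argument is a hypothetical convenience for pointing the
# function at another directory; by default it reads the real /proc files.
show_keepalive() {
    dir="${1:-/proc/sys/net/ipv4}"
    for name in tcp_keepalive_time tcp_keepalive_intvl tcp_keepalive_probes; do
        printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
    done
}

show_keepalive
```

To see the keep-alive timer counting down on the live client connections, ss -tno state established '( sport = :5222 )' should list them together with their timers, assuming Prosody listens on the default XMPP client port 5222.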
So a TCP keep-alive every 4.5 minutes on ALL connections is the price for keeping the connection to the mobile device in this network alive. Not ideal, but I can live with that. The picture at the beginning of the post, by the way, shows how the network still removed the NAT entry even when a TCP keep-alive was sent after 7 minutes.
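To put a rough number on that price, here is a back-of-the-envelope calculation. It is a sketch under the assumption that a connection sits completely idle all day, so it gives an upper bound per connection. The helper name probes_per_day is made up for illustration.

```shell
# Upper bound on keep-alive probes per day for one fully idle connection,
# given the keep-alive time in seconds (integer division).
probes_per_day() {
    echo $(( 86400 / $1 ))
}

probes_per_day 270   # the 4.5 minutes that finally worked: prints 320
probes_per_day 540   # the 9 minutes I tried first: prints 160
```

So the shorter timer doubles the number of wake-ups compared to my first attempt. Any real traffic on the connection resets the idle timer, so in practice the count will be lower.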