A tale of VoIP, NAT and some confused Engineers

Foreword

NAT and PAT, in their many forms, are so common these days that almost all of CloudCall’s customers use them. SIP-based VoIP was designed long before NAT was commonplace, so it doesn’t work particularly well when NAT is involved. As such, VoIP Service Providers like us implement a number of mechanisms to work around the complexities of NAT whilst trying our best not to compromise security.

RTP and NAT

RTP is just a continuous stream of UDP packets from one VoIP device to another. As we know, UDP packets have a source IP and port as well as a destination IP and port. These are negotiated between the VoIP devices in SIP and SDP. When a VoIP device is behind NAT, the IP and port that it puts in SDP are usually wrong, as the NAT router will change them when the RTP packets leave the network. As such, a common mechanism used by VoIP Service Providers is to wait for some RTP packets to be received from the remote VoIP device and use their source IP and port in preference to those sent in SDP.
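To make the mechanism concrete, here is a minimal Python sketch of the idea, not our production code. The `MediaLeg` class, its field names and the example addresses are all made up for illustration; a real media server would do this per stream and would validate the packets first.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Address = Tuple[str, int]  # (IP, UDP port)


@dataclass
class MediaLeg:
    """One remote party as seen by a hypothetical media server."""
    sdp_address: Address                       # what the device advertised in SDP
    learned_address: Optional[Address] = None  # where RTP actually came from

    def on_rtp_received(self, source: Address) -> None:
        # Remember the real source of the RTP (the NAT's public IP and port).
        self.learned_address = source

    def send_target(self) -> Address:
        # Prefer the observed source over the address advertised in SDP.
        return self.learned_address or self.sdp_address


leg = MediaLeg(sdp_address=("192.168.1.50", 16384))  # private address from SDP
before = leg.send_target()
leg.on_rtp_received(("203.0.113.7", 40122))          # NAT-translated source
after = leg.send_target()
print(before, "->", after)
```

Until the first packet arrives, the server can only send to the (probably unreachable) SDP address; once it has seen RTP, it sends back to wherever that RTP came from.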

The RTP+NAT Security Conundrum

The more observant/mischievous of you might be thinking that the aforementioned mechanism is open to abuse. It certainly is. If an attacker were to, for example, flood every UDP port on a VoIP Provider’s media server with RTP then it would theoretically be possible to disrupt, or even redirect, the audio from that server. A few mechanisms exist to help mitigate this but none of them are a silver bullet.

You might think, for example, that you should only accept RTP from the same IP as the SIP came from. This may cause issues with more complex networks and when CGNAT is in use, as you cannot always guarantee that the IP will remain consistent across different flows.

A fairly common mechanism to reduce the attack surface is to pair monitoring, rate limiting and IPS with some sort of RTP session locking. For example, once you have received 3 valid RTP packets from a given IP/Port, don’t accept RTP packets from another IP/Port until the original stream has stopped for a defined period of time.
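The locking rule can be sketched in a few lines of Python. This is an illustration under assumptions rather than any vendor's implementation: the `RtpSessionLock` class and its parameter names are hypothetical, "valid" packets are assumed to have been checked already, and the 3 packet and 10 second values are just the defaults used here.

```python
from typing import Optional, Tuple

Address = Tuple[str, int]  # (IP, UDP port)


class RtpSessionLock:
    """Lock onto one source after N packets; release after a quiet period."""

    def __init__(self, lock_after: int = 3, timeout: float = 10.0) -> None:
        self.lock_after = lock_after
        self.timeout = timeout
        self.locked: Optional[Address] = None
        self.last_seen = 0.0                   # last packet from the locked source
        self.candidate: Optional[Address] = None
        self.count = 0

    def accept(self, source: Address, now: float) -> bool:
        # Release the lock if the locked stream has been silent for too long.
        if self.locked is not None and now - self.last_seen >= self.timeout:
            self.locked, self.candidate, self.count = None, None, 0
        if self.locked is not None:
            if source == self.locked:
                self.last_seen = now
                return True
            return False                       # different IP/Port while locked
        # Not locked yet: count consecutive packets from the same source.
        if source == self.candidate:
            self.count += 1
        else:
            self.candidate, self.count = source, 1
        if self.count >= self.lock_after:
            self.locked, self.last_seen = source, now
        return True


lock = RtpSessionLock()
phone, other = ("203.0.113.7", 40122), ("198.51.100.9", 5004)
first_three = [lock.accept(phone, t) for t in (0.0, 0.02, 0.04)]  # locks on 3rd
during_lock = lock.accept(other, 5.0)    # rejected: phone was heard 4.96s ago
after_quiet = lock.accept(other, 11.0)   # accepted: phone silent for over 10s
print(first_three, during_lock, after_quiet)
```

Note that the same rule that keeps an attacker out also keeps out a legitimate phone whose source port changes after the lock has been taken.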

The Headache

Following the implementation of some new media servers which use the aforementioned RTP session locking mechanism (3 packets + a timeout of 10 seconds), a small handful of customers reported that a few of their calls didn’t have any audio for the first 10 seconds. This was confusing because monitoring showed that there was RTP flowing in both directions. Upon further inspection, it became clear that the RTP source port on the customer’s side was changing after the first 3 packets were sent. Port changes mid-call are rare but we do see them from time to time as customers’ routers bug out or flip between active/passive devices in an HA pair. However, the way that this issue presented itself across multiple customers didn’t stack up.

Debugging

We managed to partially replicate the issue by running a softphone behind a Juniper SRX firewall. We could see the UDP source port change a few times at the start of the call, but it stabilised after a few packets and there was no 10 seconds of audio loss, as reported by the customers.

Taking a Wireshark trace locally on the PC running the softphone showed that the softphone itself was not changing the source port, so it must be the NAT router. The trace also showed ICMP Port Unreachable messages coming back in response to the first RTP packets.

After seeing the ICMP Port Unreachable messages, things started to become a bit clearer. When some NAT devices receive a Port Unreachable, they will tear down the state for that session and the next RTP packet will create a new state (usually from a new source port).

But why was the media server sending ICMP Port Unreachable messages? That is because this particular softphone starts sending RTP before the call is fully established (i.e. before it sends the 200 OK). The media server is not ready to receive these packets so it hasn’t yet opened up that port.

Root Cause

The way ports are assigned by NAT/PAT routers differs by vendor. Some vendors will assign random ports whereas others will always assign the lowest available port. In the case of the customers who had the reported issue, their routers assigned the lowest available port.

On the whole, this was OK as every time the session got closed by the ICMP message, it was opened again on the same port. But if it just so happens that after the 3rd RTP packet the port that was previously being used was taken by another session on the network… the port would change after the media servers had locked the session. As such, the 10 second timer kicked in and there was no audio for that period of time.
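The port behaviour is easy to reproduce in a toy model. The Python sketch below is a simplification under assumptions: `LowestPortNat`, the flow names and the starting port of 1024 are invented for illustration, and the exact packet timing of the real incident is not modelled.

```python
from typing import Dict, List


class LowestPortNat:
    """Toy PAT device that always hands out the lowest free public port."""

    def __init__(self, first_port: int = 1024) -> None:
        self.first_port = first_port
        self.mappings: Dict[str, int] = {}  # internal flow name -> public port

    def _lowest_free(self) -> int:
        used = set(self.mappings.values())
        port = self.first_port
        while port in used:
            port += 1
        return port

    def send(self, flow: str) -> int:
        # Create state on the first packet of a flow; reuse it afterwards.
        if flow not in self.mappings:
            self.mappings[flow] = self._lowest_free()
        return self.mappings[flow]

    def icmp_port_unreachable(self, flow: str) -> None:
        # Some NAT devices tear down the state when this message arrives.
        self.mappings.pop(flow, None)


nat = LowestPortNat()
ports: List[int] = []
for _ in range(3):
    ports.append(nat.send("rtp"))      # state is recreated on the same port...
    nat.icmp_port_unreachable("rtp")   # ...and torn down again by the ICMP
nat.send("other-session")              # another flow grabs the freed port
ports.append(nat.send("rtp"))          # the RTP now leaves from a new port
print(ports)  # [1024, 1024, 1024, 1025]
```

The first three packets all appear to come from the same port, which is enough for the media server to lock onto it. The fourth comes from a different port and is rejected until the lock times out.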

The Solution

The best solution was simply to disable the ICMP Port Unreachable messages. If they weren’t sent and the early RTP was silently ignored, the problem wouldn’t exist. 10 seconds of audio loss also seemed like a long time on the rare occasions that ports legitimately change mid-call, since such a lengthy drop in audio is likely to cause one party to hang up, so the timeout was also lowered to a more acceptable level. Although lowering the timeout does reduce security somewhat, a phone legitimately not sending RTP for 10 seconds whilst an attacker is sending RTP to that same port is reasonably unlikely.

Of course, the increasingly widespread adoption of IPv6 removes the complexities presented by NAT, though a lot of the problems we currently attribute to NAT, such as pinhole timeouts, still exist within stateful firewalls.

VoIP is Hard 🙁

As a SaaS company, if you employ a strategy of continuous automated testing then you can largely eradicate customer-affecting bugs in production. VoIP isn’t quite so simple. You have to contend with a huge variety of phone and network vendors and, sometimes, they just don’t play nicely.