
In my simple universe, I always assumed that all apps using Android’s built-in WebRTC stack for voice and video calls would use pretty much the same parameters, and hence that their encrypted data streams would look pretty much the same. It turns out, however, that this is not at all the case. This discovery sent me off on quite a tangent, and I decided to have a look at a number of related things, including cellular network handover behavior. But let’s start at the beginning: the use of WebRTC by different ‘Over the Top’ voice applications. Personally, I use two messengers, Signal and Conversations, and it turns out they use WebRTC quite differently!
The graph above shows the combined transmit and receive data rate of a Signal messenger voice call on the left and the same for a Conversations messenger voice call on the right. Both send encrypted voice packets over the WebRTC protocol, but there are two very interesting differences:
First, the data rate of the Signal voice stream is significantly lower than the data rate used by the Conversations messenger. There might be several reasons for that: Signal and Conversations could use the same voice codec but at different data rates, or they could use different voice codecs altogether. There is no way to see this from the outside, as the voice stream is encrypted. What can be seen from the outside, however, is that Signal sends a voice packet every 60 milliseconds, while Conversations uses 20-millisecond packetization of the voice data. Since every packet carries the same fixed IP/UDP/WebRTC headers no matter how much voice data it contains, Signal’s header overhead relative to the bits that carry the voice data is lower than that of Conversations. This comes at the expense of an additional speech delay of 40 ms, as 60 ms of audio have to be collected before a packet can be sent.
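To get a feeling for how much the packetization interval matters, here is a rough back-of-the-envelope sketch in Python. The header sizes are my assumptions, not values measured from these calls: an IPv4 header without options, a UDP header, a fixed RTP header without extensions and an 80-bit SRTP authentication tag. The codec rate is purely hypothetical, since the real one can't be seen from the outside.

```python
# Assumed per-packet header sizes in bytes (not measured from the calls above).
IPV4_HEADER = 20     # IPv4 without options; IPv6 would be 40
UDP_HEADER = 8
RTP_HEADER = 12      # fixed RTP header, no extensions
SRTP_AUTH_TAG = 10   # assumption: 80-bit authentication tag

HEADER_BYTES = IPV4_HEADER + UDP_HEADER + RTP_HEADER + SRTP_AUTH_TAG


def header_overhead_bps(packet_interval_ms: float) -> float:
    """Header bits per second in one direction for a given packet interval."""
    packets_per_second = 1000 / packet_interval_ms
    return packets_per_second * HEADER_BYTES * 8


def overhead_share(packet_interval_ms: float, codec_bps: float) -> float:
    """Fraction of the total stream that consists of headers."""
    overhead = header_overhead_bps(packet_interval_ms)
    return overhead / (overhead + codec_bps)


# Hypothetical codec rate; the real rates are hidden by the encryption.
ASSUMED_CODEC_BPS = 24_000

for interval_ms in (20, 60):
    overhead = header_overhead_bps(interval_ms)
    share = overhead_share(interval_ms, ASSUMED_CODEC_BPS)
    print(f"{interval_ms} ms: {overhead / 1000:.1f} kbit/s headers, "
          f"{share:.0%} of the stream")
```

Under these assumptions, the headers alone cost 20 kbit/s per direction at 20 ms packetization, but only about 6.7 kbit/s at 60 ms. So even with an identical codec rate, a noticeable part of the difference in the graph could come from packetization alone.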
The other thing that is visible in this graph is that Signal reduces the data rate during silence periods on the voice channel much more heavily than Conversations does. Both show dips in the data rate while I am silent, so both have silence detection, but it seems that silence is encoded in different ways. Another possible reason could be that Signal’s silence detection is more aggressive. Again, this is hard to tell from the outside, other than that there is an observable difference.
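For those who want to reproduce such a data rate curve from their own packet capture, the basic idea is to sum up the packet sizes per time interval. Below is a minimal sketch, assuming the capture has already been exported as a list of (timestamp in seconds, size in bytes) pairs. The function name and the sample values are made up for illustration and are not taken from my traces.

```python
from collections import defaultdict


def data_rate_per_bin(packets, bin_seconds=1.0):
    """Return {bin start time in s: data rate in kbit/s} for the given packets.

    `packets` is an iterable of (timestamp_seconds, size_bytes) pairs.
    Bins without any packets are simply absent from the result.
    """
    bytes_per_bin = defaultdict(int)
    for timestamp, size_bytes in packets:
        bin_index = int(timestamp // bin_seconds)
        bytes_per_bin[bin_index] += size_bytes
    return {
        index * bin_seconds: total * 8 / bin_seconds / 1000
        for index, total in sorted(bytes_per_bin.items())
    }


# Made-up sample: larger packets while talking, a few small ones during silence.
sample_packets = [(0.00, 120), (0.02, 120), (0.04, 120), (1.00, 40), (1.50, 40)]
rates = data_rate_per_bin(sample_packets)

for start, kbps in rates.items():
    print(f"{start:.0f} s: {kbps:.2f} kbit/s")
```

With real traces, a silence period shows up as a run of bins with a much lower rate, which is exactly the kind of dip described above.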
So much for today. In the next post on the topic, I’ll have a look at this 20 vs 60 ms difference and how to visualize this.