Conversations: The Technology Behind Voice and Video Calls

After having had a look at the data rate of Conversations voice and video calls, the next thing I wanted to know was which underlying technology is used for the audio and video streams. Building such a thing from scratch, including authentication, encryption and NAT traversal, is a monumental task. So how was it done?

Image: WebRTC and DTLS info in packet

The first place I looked at was the data stream that flows over my TURN server. Thanks to Wireshark, one can see pretty quickly what is going on. For the media stream, WebRTC is used, which I first wrote about on this blog back in 2013. The UDP transport layer itself is secured by DTLS (Datagram Transport Layer Security). That makes sense, as the WebRTC documentation specifically mentions this. This is good news because WebRTC is by now a mature technology that is used in many other closed and open source products. Jitsi and BigBlueButton (BBB), for example, use it in the browser and in their apps!
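To illustrate how a tool can tell the different packet types of a WebRTC session apart, here is a minimal Python sketch. It assumes the first-byte demultiplexing ranges from RFC 7983, which WebRTC endpoints use when STUN, DTLS and RTP share one UDP port. The function name `classify_udp_payload` and the sample bytes are made up for this example, and this is not how Wireshark's dissectors are actually implemented. Also note that on the leg between a client and a TURN server, the media may be wrapped in TURN messages, so the sketch applies to the unwrapped payload.

```python
def classify_udp_payload(payload: bytes) -> str:
    """Guess the protocol of a UDP payload from its first byte (RFC 7983 ranges)."""
    if not payload:
        return "empty"
    first = payload[0]
    if first <= 3:
        return "STUN"
    if 16 <= first <= 19:
        return "ZRTP"
    if 20 <= first <= 63:
        return "DTLS"
    if 64 <= first <= 79:
        return "TURN channel"
    if 128 <= first <= 191:
        return "RTP/RTCP"
    return "unknown"


# Hypothetical sample payloads, shortened to their first few bytes.
dtls_handshake = bytes([0x16, 0xFE, 0xFD])  # content type 22 (handshake), DTLS 1.2
stun_binding = bytes([0x00, 0x01])          # STUN binding request
rtp_packet = bytes([0x80, 0x6F])            # RTP version 2

for sample in (dtls_handshake, stun_binding, rtp_packet):
    print(classify_udp_payload(sample))
```

A DTLS handshake record starts with content type 22, which falls into the DTLS range, so a packet like that is what shows up as DTLS in a trace.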

The next place I looked for information on how voice and video are implemented was Conversations’ source code and documentation on GitHub. There is a nice search function, so I searched for “WebRTC” and got all the places where the term is used in the source code. This further confirms that Conversations uses WebRTC. When looking through the code snippets and documentation pieces found, it becomes clear that Conversations uses Google’s Android WebRTC library and packages it inside its APK. That’s also the reason why the APK size went up from around 16 MB to 28 MB when audio/video was added:

**Note:** Starting with version 2.8.0 you will need to compile libwebrtc. [Instructions](https://webrtc.github.io/webrtc-org/native-code/android/) can be found on the WebRTC […]

For call establishment signaling, Conversations uses the Jingle protocol that is described in XEP-0166:

This specification defines an XMPP protocol extension for initiating and managing peer-to-peer media sessions between two XMPP entities in a way that is interoperable with existing Internet standards. The protocol provides a pluggable model that enables the core session management semantics (compatible with SIP) to be used for a wide variety of application types (e.g., voice chat, video chat, file transfer) and with a wide variety of transport methods (e.g., TCP, UDP, ICE, application-specific transports).
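To give an idea of what such a session setup looks like on the wire, here is a rough Python sketch that builds a stripped-down Jingle `session-initiate` stanza. The `urn:xmpp:jingle:1` namespace is the one defined in XEP-0166. The RTP description and ICE-UDP transport namespaces come from XEP-0167 and XEP-0176, and using them here is my assumption about a typical audio call, not something taken from the Conversations source. The function name, the addresses, the stanza id and the session id are all made up, and a real stanza would also carry payload types, ICE candidates and a DTLS fingerprint.

```python
import xml.etree.ElementTree as ET

JINGLE_NS = "urn:xmpp:jingle:1"                         # XEP-0166
RTP_NS = "urn:xmpp:jingle:apps:rtp:1"                   # XEP-0167 (assumed)
ICE_UDP_NS = "urn:xmpp:jingle:transports:ice-udp:1"     # XEP-0176 (assumed)


def build_session_initiate(initiator: str, responder: str, sid: str,
                           media: str = "audio") -> ET.Element:
    """Build a minimal, illustrative Jingle session-initiate IQ stanza."""
    iq = ET.Element("iq", {"from": initiator, "to": responder,
                           "type": "set", "id": "call1"})
    jingle = ET.SubElement(iq, f"{{{JINGLE_NS}}}jingle",
                           {"action": "session-initiate",
                            "initiator": initiator, "sid": sid})
    # One <content/> per media stream; it pairs an application type
    # (what is sent) with a transport method (how it is sent).
    content = ET.SubElement(jingle, f"{{{JINGLE_NS}}}content",
                            {"creator": "initiator", "name": media})
    ET.SubElement(content, f"{{{RTP_NS}}}description", {"media": media})
    ET.SubElement(content, f"{{{ICE_UDP_NS}}}transport")
    return iq


# Hypothetical addresses and session id, for illustration only.
stanza = build_session_initiate("alice@example.org/phone",
                                "bob@example.org/phone", "sid-1234")
print(ET.tostring(stanza, encoding="unicode"))
```

The `<description/>` and `<transport/>` children reflect the pluggable model mentioned in the abstract: the session management stays the same, while the application type and the transport method can be swapped.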

So that’s the protocol stack used and I’m glad it’s all based on open source software!