Facebook Real-time Chat Architecture Scaling With Over Multi-Billion Messages Daily
This is my second write-up on Facebook architecture. In the first, I covered the databases, the persistence tech used @Facebook.
In this write-up, I will talk about the real-time chat architecture of Facebook which scales with over multi-billion messages sent every single day.
What technologies does it use on the backend? What is the technology stack? What is the system architecture? What are the primary challenges a platform like Facebook, with such a massive number of users, has to face when rolling out & scaling a feature like this?
Let’s find out.
It all started at a hackathon where a few Facebook engineers wrote a chat prototype & presented it before their team. The feature was pretty basic with bare minimum features, the chat box floated on & could be dragged around the web page, it also persisted through page reloads and other page navigations.
Engineers at Fb took the prototype, evolved it into a full-blown real-time chat feature, one of the most heavily used features in the Facebook services ecosystem.
It facilitates billions of messages sent every single day all across the world. The engineering team of the social platform has scaled it pretty well with response time as less than 100 ms.
The feature is continually improved with the sole aim of providing a top-notch communication service to the users.
2. Why Write & Maintain the Chat Feature from Scratch? Why not Integrate a third-party Chat Service?
Besides having the generic features, the chat module is integrated with the Fb social graph. Users can easily pull out the list of their friends, other relevant information such as games they are playing and stuff.
All the information which is available to a user on the platform, in general, is also accessible on the chat module.
It’s easier, cleaner, secure & provides more control when writing things from scratch as opposed to making it work with third-party code.
3. Real-Time Chat Architecture & Technology Stack
The entire system consists of several loosely coupled modules working in conjunction with each other such the web tier, user interface, chat logger, user presence module & the channel cluster.
Long opened persistent connections are established between the client and the server with the help of Ajax.
Flash was dismissed purely due to two reasons. First, it would ask the users to install a plugin in their browsers which is not a good user experience. Second, Flash is not a preferred choice from a security standpoint.
The message fetch flow is a mix of PULL & PUSH based HTTP models.
Initially, the client sends a PULL request to get the first snapshot of the messages, at the same time subscribing to delta updates which is a PUSH based approach.
Once the user subscribes to the updates, the Facebook backend starts pushing the updates to the client whenever new updates are available.
Backend Web Tier
The web tier is powered by PHP. It deals with the vanilla web requests. Takes care of user authentication, friend’s privacy settings, chat history, updates made by friends & other platform features business logic.
User Presence Module
This module provides online availability information of the connections/friends of a user. It’s written in C++ & is the most heavily pinged module of the system.
The module aggregates the online info of the users in-memory & sends the information to the client when requested.
Channel Servers take care of message queuing and delivery. The functionality is written using Erlang.
Erlang is a concurrent functional programming language used for writing real-time scalable & highly available systems like instant messaging, fintech apps, online telephony etc.
The run-time system for Erlang has built-in support for concurrency, distribution & fault-tolerance.
The channel servers leverage Mochi Web library. It is an Erlang library for building lightweight HTTP servers. The messages sent by users are queued in the channel servers. Each message has a sequence number to facilitate synchronous communication between any two or more users.
Logging of chat meta & other information is done via the chat logging module. It’s written in C++ & logs information between UI page loads.
4. Service Scalability & Deployment
User Presence & Chat logging data is replicated across all of the data centres at Facebook while the Channel servers data is stored at just one dedicated data centre to ensure a strong consistency of messages.
All the backend modules are loosely coupled as you can see the diagram above. They communicate with each other via Thrift.
Thrift is a communication protocol which facilitates communication between services running on heterogeneous technologies.
The Most Resource Intensive Operation
The most resource-intensive operation in the entire system is not sending billions of messages across but keeping the user informed about his connections/friends online status.
This is important as a person would begin a conversation only when he sees a connection of his online.
To achieve this, one option was to send notifications to the users of their connections being online. But this process wasn’t scalable by any means considering the number of users the platform has.
This operation has a worst case complexity of O(average number of friends users have * number users at the time of peak traffic * frequency of users going offline & re-connecting online) messages/second.
During the peak hours, the number of concurrent users on the site is in several million. Keeping all the user presence information up to date was technically just not feasible.
Besides the users who weren’t even chatting put a lot of load on the servers by just asynchronously polling the backend for their connections’ active status.
To scale the user presence backend, the cluster of channel servers keeps a record of users available to chat which it sends to the presence servers via regular batch updates.
The upside of this process is with only one single query the entire list of a user’s connections who are available to chat can be fetched.
Considering the crazy amount of information exchanged between the modules, the channel servers compress all the information before streaming it to the presence servers.
The number of load balancers was increased to manage the sheer number of user connections. The ability of the infrastructure to manage concurrent user connections increased significantly after this. This was one bottleneck which caused chat service outages on & off for a while at peak times.
5. Synchronization Of Messages & Storage
To manage synchronous communication, as I stated earlier, every message has a sequence number.
Besides this Facebook created a Messenger Sync Protocol that cut down the non-media data usage by 40%. This reduced the congestion on their network, getting rid of errors happening due to that.
The engineering team at Facebook wrote a service called Iris, which enables the message updates to be organized in an ordered queue.
The queue has different pointers to it which help track the message updates that have been read by the user & the ones still pending.
The recent messages are sent from Iris’s memory and the older conversations are fetched from the traditional storage. Iris is built on top of MySQL & Flash memory.
Hbase was used initially as the messenger storage but later the storage was migrated to MyRocks. It’s an open-source database project written by Facebook that has RocksDB as a MySQL storage engine.
For more information on this read, what databases does Facebook use?
6. More On the Blog
Well, Guys!! This is pretty much it on the real-time chat messenger architecture of Facebook. If you liked the article, share it with your folks.
You can subscribe to 8bitmen on social media & the browser notifications to stay notified of any new content on the blog.
I’ll see you in the next write-up.
- Product Development Roadmap – Insights, By A Developer, Into The Process Of Developing New Products From The Bare Bones
- Are You Certain You Need A Mobile App For Your Business? – Bootstrapping Your Startup
- Google Databases: How Do Google Services Store Petabyte-Exabyte Scale Data?
- YouTube Database – How Does It Store So Many Videos Without Running Out Of Storage Space?
- Web Application Architecture & Software Architecture 101 Course