Hello there!! How is it going?
Welcome to 8bitmen.com
This write-up is a comprehensive insight, like a thousand feet deep dive into the persistence layer of Facebook& the first writeup of the real-life software architecture series. What database does Facebook use? Is it a NoSQL or a SQL DB? Is it just one or multiple databases? Or a much more sophisticated polyglot database architecture? How is Facebook’s backend handling billions of data fetch operations every single day?
This article will help us understand everything about it.
So, without any further ado.
Let’s get started.
1. What Database Does Facebook Use?
If you are really keen to jump right to the databases. I’ll begin with the short answer first.
MySQL is the core, main database used by Facebook for storing all the social data. It started with the InnoDB MySQL database engine & then wrote MyRocksDB, which was eventually used as the MySQL Database engine.
Memcache sits in front of MySQL & is used for caching.
To Handle BigData Facebook uses Apache Hadoop, HBase, Hive, Apache Thrift, PrestoDB. All these are used for data ingestion, warehousing & running analytics.
Apache Cassandra is used for the inbox search.
Beringei & Gorilla, a high-performance time series storage engines for infrastructure monitoring.
LogDevice, a distributed data store for storing logs.
Now let’s delve into the specifics of things.
1.1 Facebook Has A PolyGlot Persistence Architecture
Facebook, today, is not a monolithic architecture. It might have been at one point in time just like LinkedIn was but definitely not today.
Facebook as a whole, consists of several different loosely coupled components plugged in together like Lego blocks.
For instance, photo sharing, messenger, social graph, post etc. are all different microservices loosely coupled running in conjunction with each other.
And each & every component has a separate persistence layer to keep things sorted out, easy to manage. Technically we can call this as an abstraction of components.
Now all these different services are plugged together & run in conjunction, still if one goes down it doesn’t take the whole system down. Others are up, unimpacted. Also, not affecting the giant portal as a whole, much.
Facebook is a Polyglot Persistence System.
1.2 What is a Polyglot Persistence System?
Put in very simple words, Polyglot Persistence is using different databases having different unique features for different business requirements.
For instance, Cassandra serves different requirements than a traditional MySQL Db. Redis or Memcache serve different purposes.
If we have ACID requirements like for a financial transaction, MySQL Db would serve best. On the other hand, when we need to access data really fast, we would go for Memcache or when we are not so much concerned about data consistency & duplicate data but need a highly available, eventually consistent persistence system, a NoSQL solution would fit best.
So, now we kind of have a clue that there is no silver bullet. Well, naturally we aren’t hunting werewolves here. We are put upto a much harder task. Connecting the world online.
Similarly, Facebook too uses several different databases to fulfil its different data model & persistence requirements.
Also, the polyglot persistence system is a pretty detailed topic & needs a dedicated write-up for itself.
1.3 Does Facebook Use A Relational Database System?
The core database which Facebook uses is MySQLwith different engines. Different engines? I’ll get to that.
Facebook leverages the social graph to track & manage all the user events on the portal. Such as who liked whose post. Mutual friends. Which of your friends already ate at the restaurant you visited etc. And the social graph is powered by MySQL.
Initially, Facebook engineering team started with MySQL InnoDB engine to persist the social data. Data was compressed & stored even then it just took too much space. And more space usage meant more hardware usage which naturally spiked the data storage costs.
InnoDB MySQL Storage Engine
InnoDB is a default MySQL storage engine providing high reliability & performance.
MyRocks MySQL Storage Engine Written by Facebook
To deal with the space issues. The engineering team at Facebook wrote a new MySQL database engine MyRocks. Which reduced the space usage & improved the write-efficiency by a large extent. It helped cut their storage usage by 50%.
Over time they migrated their user-facing database from InnoDB to MyRocks. Migration wasn’t proved to be that difficult since the core tech MySQL was the same. Just the DB engines changed.
After the migration, for a long while, the engineering team ran the data consistency verification checks to check if everything was smooth.
Several benchmark processes were run to evaluate the DB performance & the results stated that MyRocks instance size turned out to be 3.5 times smaller than the InnoDB instance uncompressed & 2 times smaller than InnoDB instance compressed.
MySQL is the most popular persistence technology ever & is naturally deployed by big guns.
WebScale SQL is a collaboration amongst engineers from several different companies such as Google, Twitter, LinkedIn, Alibaba running MySQL in production at scale to build & enhance MySQL features that are required to run in large scale production environments.
Facebook has one of the largest MySQL deployments in the world. And it shares a common WebScale SQL development codebase with the other companies.
The engineering team is Facebook is preparing to move it’s production tested versions of table, user & compression statistics into WebScaleSQL.
2. RocksDB: A Persistent Key-Value Store for Flash and RAM Storage
Initially, Facebook wrote an embeddable persistent key-value store for fast storage called RocksDB. Which being a key-value store had some advantages over MySQL. RocksDB was inspired by LevelDB a datastore written by Google.
It was sure fast but did not support replication or an SQL layer & Facebook engineering team wanted those MySQL features in RocksDB. Eventually, they built MyRocks, an open-source project which had RocksDB as a MySQL storage engine. Pretty geeky!!
RocksDB fits best when we need to store multiple terabytes of data in one single database.
Some of the typical use cases for RocksDB
1. Implementing a message-queue that supports a large number of inserts & deletes.
2. Spam detection where you require fast access to your dataset.
3. A graph search query that needs to scan a data set in real-time.
3. Memcache – Distributed Memory Caching System
Memcache is being used at Facebook right from the start. It sits in front of the database, acts as a cache & intercepts all the data requests intended for the database.
Memcache helps reduce the request latency by a large extent. Eventually providing a smooth user experience.
It powers the Facebook social graph having trillions of objects & connections, growing every moment.
Memcache is a distributed memory caching system, used by big guns in the industry such as Google cloud.
Facebook Caching Model
When a user changes the value of an object, the new value is written to the database & deleted from Memcache. The next time user requests that object, the value is fetched from the database & written to Memcache. Now after this for every request, the value is served from Memcache until it is modified.
This flow appears pretty solid until the database & the cache are deployed in a distributed environment 😊 Now the concept of eventual consistency comes into effect.
The instances of an app are geographically distributed. When one instance, the node of a distributed database is updated in say in Asia, it takes a while for the changes to cascade to all of the instances of the database running. To get a uniform consistent value across all the instances. This is known as eventual consistency.
Now right at the point in time when the value is updated in the Asia instance if a person requests the object from America, he will receive the old value from the cache.
Well, this is typically a tradeoff between high availability & data consistency.
4. How Does Facebook Manage Big Data?
4.1 Apache Hadoop
Facebook has some real crazy amount of data that grows with every moment. Obviously, this shouldn’t come across as a surprise. And the dev team has an infrastructure in place to manage such an ocean of data.
Read this article on data ingestion to understand why it is super important for businesses to manage & make sense of large amounts of data?
Apache Hadoop is the ideal open source utility to manage big data & Facebook uses it for running analytics, storage for a distributed database & for storing MySQL database backups.
Besides Hadoop, there are also other tools like Apache Hive, HBase, Apache Thrift which are used for data processing
Facebook has open sourced the exact versions of Hadoop which they run in production. They have possibly the biggest implementation of the Hadoop cluster in the world. Processing approx. 2 petabyte of data per day in multiple clusters at different data centres.
Facebook messages use a distributed database called Apache HBase to stream data to Hadoop clusters.
Another use case is collecting user activity logs in real-time in Hadoop clusters.
4.2 Apache HBase Deployment at Facebook
HBase is an open source, distributed database, non-relational in nature, inspired by Google’s BigTable. It is written in Java.
Facebook Messaging Component is built on HBase, running on top of HDFS. It was chosen by the engineering team due to the high write throughput & low latency it provided. The other features being a distributed project included horizontal scalability, strong consistency & high availability.
HBase is also used in production by other services such as the internal monitoring system, search indexing, streaming data analysis & data scraping.
4.3 Apache Cassandra – A Distributed Wide-Column Store
Apache Cassandra is a distributed wide-column store built in house at Facebook for the Inbox search system. Cassandra was written to manage structured data & scale to a very large size across multiple servers with no single point of failure.
The project runs on top of an infrastructure of hundreds of nodes spread across many data centres. Cassandra is built to maintain a persistent state in case of node failures. Being distributed features like scalability, high performance, high availability are inherent.
4.4 Apache Hive – Data Warehousing, Query & Analytics
Apache Hive is a data warehousing software project built on top of Hadoop for running data query & analytics.
At Facebook, it is used to run data analytics on petabytes of data. The analysis is used to gain an insight into the user behaviour, it helps in writing new component & services, & understanding user behaviour for Facebook Ad Network.
Hive inside Facebook is used to convert SQL queries to a sequence of map reduce jobs that are then executed on Hadoop. Writing programmable interfaces & implementations of common data formats & types, to store metadata etc.
5. Presto DB – A High Performing Distributed Relational Database
PrestoDB is an open source performant, distributed RDBMS primarily written for running SQL queries against massive amounts, like petabytes of data. It’s an SQL query engine for running analytics queries. A single presto query can combine data from multiple sources, enabling analytics across the organization’s system.
The DB has a rich set of capabilities enabling data engineers, scientists, business analysts process Tera to Petabytes of data.
Facebook uses PrestoDB to process data via a massive batch pipeline workload in their Hive warehouse. It also helps in running custom analytics with low latency & high throughput. The project has also been adopted by other big guns such as Netflix, Walmart, Comcast etc.
The below diagram shows the system architecture of Presto
The client sends SQL query to the Presto co-ordinator. The coordinator then parses, analyzes & plans the query execution. The scheduler assigns the work to the nodes located closest to the data & monitors the processes.
The data is then pulled back by the client at the output stage for results. The entire system is written in Java for speed. Also, it makes it really easy to integrate with the rest of the data infrastructure as they too are written in Java.
Presto connectors are also written to connect with different data sources.
6. Beringei: A High-Performance Time Series Storage Engine
Beringei is a time series storage engine & a component of the monitoring infrastructure at Facebook. The monitoring infrastructure helps in detecting the issues & anomalies as they arise in real time.
Facebook uses the storage engine to store system measurements such as product stats like how many messages are sent per minute, the service stats, for instance, the rate of queries hitting the cache vs the MySQL database. Also, the system stats like the CPU, memory & network usage.
All the data goes into the time series storage engine & is available on dashboards for further analysis. Ideally in the industry Grafana is used to create custom dashboards for running analytics.
7. Gorilla: An In-Memory Time Series Database
Gorilla is Facebook’s in-memory time series database primarily used in the monitoring & analytics infrastructure. It is intelligent enough to handle failures ranging from a single node to entire regions with little to no operational overhead. The below figure shows how gorilla fits in the analytics infrastructure.
Since deployment Gorilla has almost doubled in size twice in the 18-month period without much operational effort which shows the system is pretty scalable. It acts as a write-through cache for monitoring data gathered across all of Facebook’s systems. Gorilla reduced Facebook’s production query latency by over 70x when compared with the previous stats.
8. LogDevice: A Distributed DataStore For Logs
Logs are the primary way to track the bugs occurring in production, they help in understanding the context & writing a fix. No system can run without logs. And a system of the size of Facebook where so many components are plugged in together generates a crazy amount of logs.
To store & manage all these logs Facebook uses a distributed data store for logs called LogDevice.
It’s a scalable and fault-tolerant distributed log system. In comparison to a file system which stores data as files, LogDevice stores data as logs. The logs are record oriented, append-only & trimmable. The project has been written from ground up to serve multiple types of logs with high reliability & efficiency at scale.
The kind of workloads supported by LogDevice are event logging, stream processing, ML training pipelines, transaction logging, replicated state machines etc.
Guys this is pretty much it. I did quite a bit of research on the persistence layer of Facebook & I think I’ve covered all the primary databases deployed at Facebook. If you have any more information regarding the architecture. Please do share.
Also, this article will be continually updated as Facebook’s systems evolve. I will also cover the architecture & flows of different components of Facebook such as the messenger, news feed etc. in separate articles.
More On the Blog
If you liked the article, share it with your folks.
On this blog, I write on real-life software architectures, cloud & the distributed systems. You can subscribe to the browser notifications to be notified of the latest content.
I’ll see you in the next article.
- Latest Software Technology, Programming News, New Programming Languages Feed Vol.2
- Latest Software Technology, Programming News, New Programming Languages Feed Index
- Distributed Data Processing 101 – The Only Guide You’ll Ever Need
- What is Lift & Shift Migration to the Cloud? – An In-Depth Insight
- Distributed Cache 101 – The Only Guide You’ll Ever Need