Instagram is the most popular photo-oriented social network on the planet today. With over a billion users, it has become the first choice for businesses to run their marketing campaigns on.

This write-up is a deep dive into its platform architecture & addresses questions like: what technologies does Instagram use on the backend? Which databases does the platform leverage? How does it store billions of photos while serving millions of queries per second? How does it search for content in the massive data it has? Let’s find out.

For a full list of all the real-world software architecture posts on the blog, here you go.


1. What Technology Does Instagram Use on the Backend?

The server-side code is powered by Django, a Python web framework. All the web & async servers run in a distributed environment & are stateless.

The diagram below shows the architecture of Instagram.

Instagram architecture 8bitmen.com
Image source: Instagram

The backend uses various storage technologies such as Cassandra, PostgreSQL, Memcache, Redis to serve personalized content to the users.

Asynchronous Behavior

RabbitMQ & Celery handle asynchronous tasks such as sending notifications to the users & other system background processes.

Celery is an asynchronous task queue based on distributed message passing, focused on real-time operations. It supports scheduling too. The recommended message broker for Celery is RabbitMQ.

RabbitMQ, on the other hand, is a popular open-source message broker that implements AMQP (Advanced Message Queuing Protocol).

Gearman is used to distribute tasks across several nodes in the system, as well as for asynchronous task handling such as media uploads. It’s an application framework for farming out tasks to other machines or processes that are better suited to execute them. It has a gamut of applications ranging from highly available websites to the transport of database backup events.


Zero to Software/Application Architect learning track is a series of four courses that I am writing with an aim to educate you, step by step, on the domain of software architecture & distributed system design. The learning track takes you right from having no knowledge in it to making you a pro in designing large scale distributed systems like YouTube, Netflix, Google Stadia & so on. Check it out.


Computing Trending HashTags On The Platform

The trending backend is a stream processing application that contains four nodes/components connected linearly.

Instagram trending hashtag architecture
Image Source: Instagram

The role of the nodes is to consume a stream of event logs and produce the ranked list of trending content, i.e. hashtags and places.

Pre-processor Node

The pre-processor node enriches the original media events with the metadata needed to apply filters further down the pipeline.

Parser Node

The parser node extracts all the hashtags attached to an image and applies filters to them.

Scorer Node

The scorer node keeps time-based counters for each hashtag. All the counter data is kept in a cache & also persisted for durability.

Ranker Node

The role of the ranker node is to compute the trending scores of hashtags. The trends are served from a read-through cache (Memcache), with Postgres as the backing database.
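The four stages above can be sketched as a toy, in-process pipeline. The real system is a distributed stream processor; the filtering rules & scoring below are made up purely for illustration:

```python
# Toy sketch of the four-stage trending pipeline: pre-process ->
# parse -> score -> rank. Filter rules & scoring are illustrative.
from collections import Counter

def preprocess(event):
    # Attach/normalize data a later filter needs (here: lowercase caption).
    event["caption"] = event["caption"].lower()
    return event

def parse(event):
    # Extract hashtags from the caption & drop filtered ones.
    tags = [w.lstrip("#") for w in event["caption"].split() if w.startswith("#")]
    return [t for t in tags if t not in {"spam"}]

def score(events):
    # Keep a counter per hashtag; production keeps these in a cache
    # & persists them for durability.
    counters = Counter()
    for event in events:
        counters.update(parse(preprocess(event)))
    return counters

def rank(counters, top_n=3):
    # Compute a trending score; here simply the raw count.
    return [tag for tag, _ in counters.most_common(top_n)]

events = [
    {"caption": "Sunset vibes #Sunset #Beach"},
    {"caption": "At the beach #beach"},
    {"caption": "#sunset again"},
]
print(rank(score(events)))  # ['sunset', 'beach']
```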

To educate yourself on software architecture from the right resources, master the art of designing large-scale distributed systems that scale to millions of users & understand what tech companies really look for in a candidate during their system design interviews, read my blog post on mastering system design for your interviews or web startup.


Databases Used @Instagram

PostgreSQL is the primary database of the application; it stores most of the platform’s data, such as user data, photos, tags, meta-tags etc.

As the platform gained popularity & the data grew huge over time, the engineering team at Insta evaluated different NoSQL solutions for scaling & finally decided to shard the existing PostgreSQL database, as that best suited their requirements.
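Instagram has publicly described generating 64-bit IDs inside each shard by packing a millisecond timestamp, a logical shard ID & a per-shard sequence number into a single integer, so every ID carries its own routing information. The sketch below is an illustrative Python version of that idea; the epoch & field widths are assumptions for the example, not their production PL/pgSQL code:

```python
# Illustrative sketch of shard-aware 64-bit ID generation, modeled on
# the scheme Instagram described publicly: 41 bits of millisecond
# timestamp, 13 bits of logical shard ID, 10 bits of per-shard sequence.
# The custom epoch & field widths here are assumptions for the example.
EPOCH_MS = 1_314_220_021_721  # illustrative custom epoch

def make_id(now_ms, shard_id, sequence):
    ts = now_ms - EPOCH_MS
    return (ts << 23) | ((shard_id % 8192) << 10) | (sequence % 1024)

def shard_of(generated_id):
    # Recover the logical shard, to route the query back to the right DB.
    return (generated_id >> 10) & 0x1FFF

some_id = make_id(now_ms=1_387_263_000_000, shard_id=1341, sequence=7)
print(shard_of(some_id))  # 1341
```

The nice property is that IDs remain roughly time-sortable (the timestamp occupies the high bits) while each shard can mint them independently, with no central ID service.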

Speaking of scaling the database via sharding & other means, this article YouTube Database – How Does It Store So Many Videos Without Running Out Of Storage Space? is an interesting read.

The main database cluster of Instagram contains 12 replicas in different zones, running on 12 quadruple extra-large memory instances.

Hive is used for data archiving. It’s a data warehousing software built on top of Apache Hadoop for data query & analytics capabilities. A scheduled batch process runs at regular intervals to archive data from PostgreSQL DB to Hive.

Vmtouch, a tool for learning about & managing the file system cache of Unix & Unix-like systems, is used to manage in-memory data when moving from one machine to another.

Using PgBouncer to pool PostgreSQL connections from the backend web servers resulted in a huge performance boost.
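For flavour, a representative pgbouncer.ini fragment showing the idea: many application connections are funnelled into a small pool of real PostgreSQL connections. The database name & pool sizes are illustrative, not Instagram’s settings:

```ini
; Illustrative pgbouncer.ini fragment -- names & sizes are examples.
[databases]
insta = host=127.0.0.1 port=5432 dbname=insta

[pgbouncer]
listen_port = 6432
pool_mode = transaction      ; reuse server connections per transaction
max_client_conn = 1000       ; app connections PgBouncer will accept
default_pool_size = 20       ; actual PostgreSQL connections per db/user
```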

Redis, an in-memory database, is used to store the activity feed, sessions & other real-time data of the app.
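An activity feed in Redis is typically a capped, newest-first list per user, built with LPUSH followed by LTRIM. The stdlib sketch below mirrors that behaviour so it runs anywhere; the class & limits are illustrative:

```python
# The capped activity-feed pattern; in Redis this would be
# LPUSH feed:<user> <item> followed by LTRIM feed:<user> 0 N-1.
# This stdlib sketch mirrors that behaviour so it runs anywhere.
from collections import deque

class ActivityFeed:
    def __init__(self, max_items=100):
        self.feeds = {}            # user_id -> capped list, newest first
        self.max_items = max_items

    def push(self, user_id, item):                 # ~ LPUSH + LTRIM
        feed = self.feeds.setdefault(user_id, deque(maxlen=self.max_items))
        feed.appendleft(item)

    def latest(self, user_id, n=10):               # ~ LRANGE feed 0 n-1
        return list(self.feeds.get(user_id, []))[:n]

feed = ActivityFeed(max_items=3)
for post in ["p1", "p2", "p3", "p4"]:
    feed.push("alice", post)
print(feed.latest("alice"))  # ['p4', 'p3', 'p2'] -- oldest entry evicted
```

Capping the list keeps memory bounded per user: old activity silently falls off the end instead of accumulating forever.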

Memcache, an open-source distributed memory caching system, is used for caching throughout the service.
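A common way such a cache is used is the cache-aside pattern: try the cache first, fall back to the database on a miss, then populate the cache so the next read is a hit. A minimal sketch, with a dict standing in for Memcache & a function standing in for PostgreSQL:

```python
# Sketch of the cache-aside pattern a cache like Memcache is used for.
# The dict stands in for Memcache & the db dict for PostgreSQL.
cache = {}
db = {"user:1": {"name": "alice"}}
db_reads = 0

def fetch_from_db(key):
    global db_reads
    db_reads += 1
    return db.get(key)

def get(key):
    if key in cache:               # cache hit: no database round trip
        return cache[key]
    value = fetch_from_db(key)     # cache miss: go to the database
    if value is not None:
        cache[key] = value         # populate so the next read is a hit
    return value

get("user:1"); get("user:1")
print(db_reads)  # 1 -- the second read was served from the cache
```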


Data Management in the Cluster

Data across the cluster is eventually consistent; cache tiers are co-located with the web servers in the same data centre to avoid latency.

The data is classified into global & local data, which helps the team scale. Global data is replicated across data centres in different geographical zones. On the other hand, local data is confined to specific data centres.

Initially, the backend of the app was hosted on AWS (Amazon Web Services) but was later migrated to Facebook’s data centres. This eased the integration of Instagram with other Facebook services, cut down latency & let the team leverage the frameworks & tools for large-scale deployments built by Facebook’s engineering team.


Monitoring

With so many instances powering the service, monitoring plays a key role in ensuring the health & availability of the service.

Munin is an open-source resource, network & infrastructure monitoring tool used by Instagram to track metrics across the service & get notified of any anomalies.

StatsD, a network daemon, is used to track statistics like counters and timers. Counters at Instagram track events like user signups, number of likes etc. Timers time the generation of feeds & other operations users perform on the app. These statistics are near real-time & enable the developers to evaluate system & code changes immediately.
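The StatsD wire format itself is tiny: plain-text datagrams of the form name:value|type fired at the daemon over UDP. A stdlib sketch that builds such datagrams; the metric names are illustrative:

```python
# The StatsD wire format: "name:value|type" sent as a UDP datagram.
# Metric names below are illustrative examples.
def counter(name, value=1):
    return f"{name}:{value}|c"          # e.g. user signups, likes

def timer(name, millis):
    return f"{name}:{millis}|ms"        # e.g. feed generation time

print(counter("instagram.signups"))       # instagram.signups:1|c
print(timer("instagram.feed_gen", 187))   # instagram.feed_gen:187|ms

# A real client would fire these at the daemon over UDP, fire-and-forget:
#   import socket
#   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   sock.sendto(counter("instagram.signups").encode(), ("localhost", 8125))
```

Because UDP is fire-and-forget, instrumenting hot paths this way adds negligible overhead, which is why the stats can be emitted from nearly everywhere in the codebase.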

Dogslow, a Django middleware, watches the running processes; a snapshot is taken of any process taking longer than the stipulated time & the file is written to disk.

Pingdom is used for website’s external monitoring, ensuring expected performance & availability. PagerDuty is used for notifications & incident response.

Now let’s move on to the search architecture.


How Does Instagram Run A Search For Content Through Billions of Images?

Instagram initially used Elasticsearch for its search feature but later migrated to Unicorn, a social graph aware search engine built by Facebook in-house.

Unicorn powers search at Facebook & has scaled to indexes containing trillions of documents. It allows the application to save locations, users, hashtags etc. & the relationships between these entities.

Speaking of Insta’s search infrastructure, it has denormalized data stores for users, locations, hashtags, media etc.

These data stores can be thought of as documents, which are grouped into sets to be processed by efficient set operations such as AND, OR & NOT.
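A toy illustration of the set operations such an index enables: each term maps to the set of document IDs containing it, and queries combine those sets with AND, OR & NOT. The index data here is made up:

```python
# Toy illustration of inverted-index set operations of the kind a
# graph-aware search engine performs. The index contents are made up.
index = {
    "#sunset":    {1, 2, 5},
    "#beach":     {2, 3, 5},
    "user:alice": {1, 2},
}

both      = index["#sunset"] & index["#beach"]         # AND
either    = index["#sunset"] | index["#beach"]         # OR
not_alice = index["#sunset"] - index["user:alice"]     # NOT

print(sorted(both))       # [2, 5]
print(sorted(either))     # [1, 2, 3, 5]
print(sorted(not_alice))  # [5]
```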

The search infrastructure has a system called Slipstream which breaks up the user-uploaded data, streams it through a Firehose & adds it to the search indexes.

The data stored by these search indexes is more search-oriented as opposed to the regular persistence of uploaded data to PostgreSQL DB.

Below is the search architecture diagram

Instagram search architecture 8bitmen.com
Image source: Instagram

If you aren’t aware of Hive, Thrift or Scribe, do go through this write-up: what database does Facebook use? A deep dive. It will give you an insight into how Facebook stores user data.

Recommended Read: Master System Design For Your Interviews Or Your Web Startup


Read: Best resources to learn software architecture & system design
I’ve put together a list of resources (online courses + books) that I believe are super helpful in building a solid foundation in software architecture & designing large scale distributed systems like Facebook, YouTube, Gmail, Uber & so on.  


Subscribe to the newsletter to stay notified of the new posts.



If you liked the article, share it with your folks. You can follow 8bitmen on social media to stay notified of the new content published – 
Twitter     
Facebook
LinkedIn 

I am Shivang, here is my LinkedIn profile in case you want to say Hello!


More On the Blog

Web Application Architecture & Software Architecture 101 Course

Data Analytics in E-Sports – Future Prospects – Jobs – Everything You Should Know

An Insight Into How Uber Scaled From A Monolith To A Microservice Architecture

How Does PayPal Process Billions of Messages Per Day with Reactive Streams?

How Hotstar scaled with 10.3 million concurrent users – An architectural insight