In this article, I’ve put together a list of resources that I believe are super helpful in building a solid foundation in software architecture and designing large-scale distributed systems like Facebook, YouTube, Gmail, Uber and such.  

I’ll start with the courses, then cover a GitHub repo and a few research papers, and finish with the books.


Affiliate Disclaimer: Some resources stated in this article contain affiliate links. That means if you find these resources helpful and worthy of spending your money on, and you buy them, I get a small cut without you paying anything extra.

I recommend these resources to you because I think the content they offer is pretty good and these will assist you big time in upskilling yourself, enabling you to soar in your career.


Featured Platforms/Courses

CodeCrafters

CodeCrafters lets you build tools like Redis, Docker, Git and more from scratch. With their hands-on courses, you not only gain an in-depth understanding of distributed systems and advanced system design concepts but can also compare your project with those of the community and then navigate the official source code to see how it’s done.

Get 40% off with this link.


Zero to Mastering Software Architecture

Zero to Mastering Software Architecture is a learning path of three courses I have authored to teach you, step by step, software architecture, cloud infrastructure and distributed system design.

This learning path offers you a structured learning experience, taking you from having no knowledge of the domain to being a pro at designing web-scale distributed systems like YouTube, Netflix, ESPN and the like.


MongoDB University

Free MongoDB courses: practice your skills with hands-on labs and quizzes, and earn MongoDB certification. Learn in your programming language of choice with Node, Python, C#, PHP and Java developer courses.


Neo4j GraphAcademy

Master Neo4j (a graph database) with free, hands-on courses. Learn how to read from and write to Neo4j, including the more advanced Cypher functionality, APOC, and everything in between.

The Neo4j platform includes Neo4j Graph Data Science (an enterprise-ready analytics workspace for graph data), the graph visualization and exploration tool Bloom, the Cypher query language, and numerous tools, integrations and connectors to help developers and data scientists build graph-based solutions.


GitHub Repo

CDN Up & Running

With this repo, you can understand how CDNs work by coding one from scratch. The CDN it builds uses Nginx, Lua, Docker, Docker Compose, Prometheus, Grafana, and wrk.

The repo starts with a single backend service and expands from there to a multi-node, latency-simulated, observable, and testable CDN. Each section discusses the challenges and trade-offs of building, managing and operating a CDN.


Research Papers

Efficiently Archiving Photos under Storage Constraints

This paper addresses the data storage problem for image data (photos) by proposing which photos to archive in order to meet an online storage budget. The decision is based on factors such as usage patterns and their relative importance, the quality and size of a photo, the relevance of a photo to a usage pattern, the similarity between different photos, and policy requirements on which photos must be retained.


Near-Realtime Server Reboot Monitoring and Root Cause Analysis in a Large-Scale System

This paper presents an at-scale, near-realtime reboot monitoring framework built with multiple state-of-the-art data infrastructures, along with machine learning-based anomaly detection and automated root cause analysis across hundreds of server attribute combinations. The goal is to ensure the continuous availability of hardware in large-scale internet services that run on a fleet of distributed servers.


A Design Framework for Highly Concurrent Systems

This paper presents a general-purpose design framework for building highly concurrent systems, based on three design components — tasks, queues, and thread pools — which encapsulate the concurrency, performance, fault isolation, and software engineering benefits of both threads and events.

It also contains a discussion on a set of design patterns that can be applied to map an application onto an implementation using these components.
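To make the three components concrete, here is a minimal sketch in Python rather than the paper's own framework. The names `run_stage`, `inbox` and `handler` are made up for illustration: tasks go into a queue, and a fixed pool of threads drains it.

```python
import queue
import threading


def run_stage(tasks, worker_count, handler):
    """Run handler over tasks using a fixed thread pool fed by one queue."""
    inbox = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            task = inbox.get()
            if task is None:  # sentinel: no more work for this worker
                break
            value = handler(task)
            with results_lock:
                results.append(value)

    pool = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in pool:
        thread.start()
    for task in tasks:
        inbox.put(task)
    for _ in pool:
        inbox.put(None)  # one sentinel per worker
    for thread in pool:
        thread.join()
    return results


squares = run_stage(range(5), worker_count=3, handler=lambda n: n * n)
print(sorted(squares))  # completion order varies, so sort before printing
```

The queue decouples whoever produces tasks from the threads that execute them, and the fixed pool size caps concurrency. A real system would also need error handling and back-pressure, which this sketch leaves out.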


The Tail at Scale

This article outlines some of the common causes of high-latency episodes in large online services and describes techniques that reduce their severity or mitigate their impact on whole-system performance. In many cases, tail-tolerant techniques can take advantage of resources already deployed to achieve fault tolerance, resulting in low additional overhead. The authors show that these techniques allow system utilization to be driven higher without lengthening the latency tail, avoiding wasteful over-provisioning.
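One of the tail-tolerant techniques the article covers is the hedged request: send the request to one replica, and if it has not answered after a short delay, send it to another and take whichever answer comes first. Below is a rough sketch of that idea, not the authors' implementation. The function `hedged_call`, the two fake replicas and the `hedge_after` parameter are all assumptions made up for illustration, and the replica list is assumed to be non-empty.

```python
import concurrent.futures as cf
import time


def hedged_call(replicas, request, hedge_after):
    """Try replicas in order, adding one more every hedge_after seconds
    until any of them answers. Returns the first answer that arrives."""
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    try:
        pending = set()
        for replica in replicas:
            pending.add(pool.submit(replica, request))
            done, pending = cf.wait(
                pending, timeout=hedge_after, return_when=cf.FIRST_COMPLETED
            )
            if done:
                return next(iter(done)).result()
        # Every replica has been asked; block until the first one answers.
        done, _ = cf.wait(pending, return_when=cf.FIRST_COMPLETED)
        return next(iter(done)).result()
    finally:
        pool.shutdown(wait=False)  # do not wait for the slower replicas


def slow_replica(request):  # hypothetical replica stuck in the latency tail
    time.sleep(0.5)
    return f"slow:{request}"


def fast_replica(request):  # hypothetical healthy replica
    time.sleep(0.01)
    return f"fast:{request}"


print(hedged_call([slow_replica, fast_replica], "GET /photo/1", hedge_after=0.05))
```

The hedge delay is the main knob: a longer delay means fewer duplicate requests but a longer wait whenever the first replica is slow.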


Books

Building Secure & Reliable Systems – Google SRE

This ebook provides insights about system design, implementation, and maintenance from practitioners who specialize in security and reliability. It targets folks who design, implement and maintain systems.

Security is crucial to the design and operation of scalable systems in production, as it plays an important part in product quality, performance, and availability. The book encourages us to think about the fundamentals of reliability and security from the very beginning of the development process and to integrate those principles early in the system lifecycle.


Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Designing Data-Intensive Applications by Martin Kleppmann is one of the best sellers in the domain of designing large-scale applications. This book helps you understand the pros and cons of picking different technologies for processing and storing data in your application. It discusses the fundamentals of data processing and also takes a deep dive into concepts like scalability, high availability, consistency, reliability, different kinds of databases, distributed systems and more.

If you work on the backend, use databases when developing mobile or web apps, or want to understand how to make data systems scalable, this book will help you big time in developing a good foundation in large-scale system design.


The Art of Scalability: Scalable Web Architecture, Processes, and Organizations for the Modern Enterprise

The Art of Scalability is written by industry consultants who teach you how to scale products and services for different requirements. The authors discuss case studies from their consulting practice, giving readers insights into cloud transitions, NoSQL, DevOps, business metrics, measuring availability, capacity, load and performance and more. Their insights and recommendations reflect more than thirty years of experience at companies such as eBay, Visa, Salesforce and Apple.


Web Scalability For Startup Engineers

This book discusses core concepts and best practices for developing scalable applications in a startup environment. It describes how infrastructure and software architecture blend together when building scalable systems. The book also contains diagrams and real-world examples to help understand the concepts better.

Readers of this book will learn the key principles of software design for scalable systems, concurrency and throughput, designing APIs, implementing caching, how to leverage asynchronous processing, messaging, event-driven architecture and more.


Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing

Data streaming, both in real time and in batches, is a key component of modern web applications. This book helps readers understand the underlying architecture and fundamentals of streaming systems, starting from an introduction to how data processing streams function. It is a practical guide with real-world examples for software developers, data engineers and data scientists on how to work with streaming data in a conceptual and platform-agnostic way.


Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale

Architecting Modern Data Platforms contains in-depth information on big data technologies. It takes a practical approach to teaching the reader how to build big data infrastructure both on-premises and in the cloud.

It walks you through the different component layers in a modern data platform and covers concepts like high availability, disaster recovery, deployment, operations, security and more.


Database Internals: A Deep Dive Into How Distributed Data Systems Work

Database Internals, as the title says, takes a deep dive into how distributed data systems work. This book is a practical guide to the concepts behind modern databases and the internals of their storage engines. You’ll understand how storage is organized and how the data is distributed across the system.

The book talks about storage engines, explaining concepts like storage classification and B-Tree based and immutable log-structured storage engines with their respective use cases. It also covers how database files are organized to build efficient storage using components such as the page cache, buffer pool and write-ahead log. You’ll learn how nodes and processes work in conjunction with each other in distributed systems, how data consistency models work and so on.
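To give a feel for one of these ideas, here is a toy sketch of a write-ahead log in Python. It is not taken from the book and is far simpler than a real storage engine. The class `TinyStore`, the JSON-lines record format and the file name `store.wal` are all assumptions made up for illustration.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy key-value store: every write is logged to disk before it is applied."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()  # rebuild in-memory state from the log
        self.log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log_file:
            for line in log_file:
                record = json.loads(line)
                self.data[record["key"]] = record["value"]

    def put(self, key, value):
        # 1. Append the change to the log and force it to disk.
        self.log.write(json.dumps({"key": key, "value": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2. Only then apply it to the in-memory state.
        self.data[key] = value

    def close(self):
        self.log.close()


path = os.path.join(tempfile.mkdtemp(), "store.wal")
store = TinyStore(path)
store.put("user:1", "Ada")
store.close()

recovered = TinyStore(path)  # simulates a restart: state comes back from the log
print(recovered.data)
recovered.close()
```

Because the log record reaches disk before the in-memory update, a crash between the two steps loses nothing: the replay on startup reapplies the write. Real engines also handle partially written records, checkpoints and log truncation, which this sketch ignores.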


Designing Distributed Systems: Patterns and Paradigms for Scalable, Reliable Services

Designing Distributed Systems discusses patterns used in the development of reliable distributed systems. The author, who is the director of engineering at Microsoft Azure, explains how we can adapt existing software design patterns for designing and building reliable distributed applications. System engineers and application developers will learn how they can improve the quality of their systems using the patterns discussed in the book.

The book also touches upon the distributed system patterns for large-scale batch data processing involving work queues, event-based processing and coordinated workflows.


Building Microservices: Designing Fine-Grained Systems

This book educates the reader on the techniques of modeling, integrating, testing, deploying and monitoring a microservice. All the concepts are discussed with the help of an example of a fictional company.

The book discusses key concepts & challenges involved in scaling the microservices architecture, managing security with the user-to-service and service-to-service models, dealing with complexities of testing and monitoring distributed services, deploying microservices through continuous integration, splitting monolithic codebases into microservices and more.


Microservice Architecture

Microservice Architecture discusses the right way to approach microservices architecture. It discusses technologies and methodologies involved in building microservices from the ground up along with the experiences of large-scale services that have adopted microservices architecture.

The book is split into three parts that discuss:

How microservices work & what it means to build a system using the microservices architecture.

A design-based approach for implementing the microservices architecture.

Best practices on how to handle the challenges of introducing the microservices architecture in your organization.


Site Reliability Engineering – How Google Runs Production Systems

The Site Reliability Engineering book discusses the entire application deployment lifecycle, which includes building, deploying, monitoring and maintaining the services at Google. Readers will learn the principles and practices that enable Google engineers to make their services more scalable, reliable and efficient.

The book is split into four parts. The first part gives an introduction to site reliability engineering (SRE) and how it differs from traditional IT practices. The next two parts talk about the patterns and behavior involved in the day-to-day work of an SRE when building and operating large-scale distributed computing systems. The last part touches upon Google’s best practices for running its infrastructure.

You can read the book online here.


This list of software engineering resources will be continually updated as I find new quality resources in the domain.
