YouTube Architecture – How Does It Serve High-Quality Videos With Low Latency
Earlier I wrote an article on how YouTube stores petabyte-scale data every single day. How does it store so many videos without running out of storage space? This write-up is an insight into the video processing & delivery architecture of YouTube. How does it serve high-quality videos with low latency?
So, without further ado, let’s get on with it.
Below is the diagram of the high-level backend architecture of the video-sharing platform for reference. I’ve discussed it in the YouTube data storage article linked above.
A key element in the storage and delivery of high-quality videos on YouTube is video transcoding. When a video is uploaded to YouTube, it’s first transcoded from its original format to a temporary intermediate format, which makes it easier to convert the content into different resolutions & formats. More on this later. Before that, let’s look at what video transcoding is.
Video transcoding is the technique of converting a video into multiple formats and resolutions to make it playable across different devices and bandwidths. The technique is also loosely referred to as video encoding. This enables YouTube to stream videos in resolutions such as 144p, 240p, 360p, 480p, 720p, 1080p & 4K.
Delivery of content based on the network bandwidth and the device type of the end-user is known as adaptive streaming. Over the years YouTube has excelled at this. The idea is to reduce the buffering as much as possible.
Imagine streaming a 4K video at only its original resolution, with no lower resolutions on offer. Without adaptive streaming, viewers on a low-bandwidth network have no way to watch that 4K stream without constant buffering. This is definitely not the end-user experience anyone would want on a platform, is it?
Large video files are compressed into smaller sizes with the help of codecs. Codecs contain efficient algorithms that compress and decompress video. One of the most widely used video codecs today is H.264. According to Wikipedia, it is the most commonly used video compression standard, used by over 91% of video industry developers.
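To make the idea of adaptive streaming concrete, here is a minimal sketch of how a player could pick a resolution from the bandwidth it measures. The bitrate ladder, the `pick_rendition` function and the safety factor are my own illustrative assumptions, not YouTube’s actual values or algorithm.

```python
# Hypothetical rendition ladder: (label, bitrate needed in kbit/s).
# These numbers are illustrative assumptions, not YouTube's real ladder.
LADDER = [
    ("144p", 100),
    ("240p", 300),
    ("360p", 700),
    ("480p", 1_500),
    ("720p", 3_000),
    ("1080p", 6_000),
    ("4K", 20_000),
]


def pick_rendition(measured_kbps: float, safety_factor: float = 0.8) -> str:
    """Return the highest rendition whose bitrate fits the usable bandwidth.

    Only a fraction of the measured bandwidth (safety_factor) is treated as
    usable, so a small dip in throughput does not immediately cause buffering.
    """
    usable_kbps = measured_kbps * safety_factor
    chosen = LADDER[0][0]  # always fall back to the lowest rendition
    for label, required_kbps in LADDER:
        if required_kbps <= usable_kbps:
            chosen = label
    return chosen


if __name__ == "__main__":
    for bandwidth in (50, 1_000, 5_000, 30_000):
        print(bandwidth, "kbit/s ->", pick_rendition(bandwidth))
```

A real player re-runs this kind of decision continuously as the measured bandwidth changes, which is what lets it switch resolutions in the middle of a video.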
Lossless & Lossy Transcoding
Transcoding is possible in two ways: lossless and lossy. You may have heard these terms in connection with data or image compression. Lossless means that no data is lost when transcoding from the original format to a new format. As a result, the video in the new format will be of roughly the same size as the original.
In the lossy approach, some data is dropped from the original video in order to reduce the size of the new format. Lost data cannot be regained. It’s gone forever. You might have experienced this when you upload a high-resolution DSLR camera image to a social network and, after the upload, the image doesn’t look as good and detailed as the original.
This is for one simple reason: the platform compressed the image, losing some data in order to reduce its size so that your connections can view it without experiencing download latency. Videos, too, are a series of still image frames. When you render an animation, the animation software generates it frame by frame, and the frames are then combined to create the final video.
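The difference between the two approaches can be shown on a single fake frame. This is only a toy sketch: it uses zlib from the Python standard library in place of a real video codec, and the `quantize` helper and the synthetic frame are assumptions I made up for illustration.

```python
import zlib

# A tiny stand-in for one grayscale frame: 64 x 64 pixels, values 0-255.
frame = bytes((x * 7 + y * 13) % 256 for y in range(64) for x in range(64))

# Lossless: compress, then decompress. Every byte comes back unchanged.
lossless = zlib.compress(frame)
restored = zlib.decompress(lossless)


def quantize(data: bytes, step: int = 16) -> bytes:
    """Lossy step: round every pixel down to a multiple of `step`.

    The discarded low-order detail cannot be recovered afterwards.
    """
    return bytes((value // step) * step for value in data)


# Lossy: throw away fine detail first, then compress what is left.
lossy_frame = quantize(frame)
lossy = zlib.compress(lossy_frame)

if __name__ == "__main__":
    print("original bytes:  ", len(frame))
    print("lossless bytes:  ", len(lossless), "identical:", restored == frame)
    print("lossy bytes:     ", len(lossy), "identical:", lossy_frame == frame)
```

Real codecs are far smarter about which detail to drop, but the trade-off is the same: the lossy version no longer matches the original, and in exchange it usually compresses to fewer bytes.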
Costs Associated With Video Transcoding
When a video is converted into multiple formats and resolutions, all the different versions need to be separately stored in the database. This has storage costs. Also, sophisticated codec algorithms that convert these videos into different resolutions have high computational costs. And then switching between different resolutions based on the client’s network bandwidth in real-time has network delivery costs. YouTube’s video encoding pipeline keeps a balance between the three factors. Additional efforts for further compressing the videos are made by the platform only for the highly popular videos.
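The storage side of this trade-off is simple arithmetic: the size of one version is roughly its bitrate multiplied by the video duration. Below is a rough sketch; the bitrates per resolution are assumed values for illustration only, and container overhead and audio are ignored.

```python
def rendition_size_mb(bitrate_kbps: float, duration_s: float) -> float:
    """Approximate size in megabytes.

    kbit/s * seconds = kilobits; / 8 = kilobytes; / 1000 = megabytes.
    """
    return bitrate_kbps * duration_s / 8 / 1000


# Assumed bitrates in kbit/s, not YouTube's actual encoding settings.
bitrates_kbps = {"360p": 700, "720p": 3_000, "1080p": 6_000}
duration_s = 600  # a 10-minute video

sizes_mb = {
    label: rendition_size_mb(kbps, duration_s)
    for label, kbps in bitrates_kbps.items()
}
total_mb = sum(sizes_mb.values())

if __name__ == "__main__":
    for label, size in sizes_mb.items():
        print(f"{label}: {size} MB")
    print(f"total for all versions: {total_mb} MB")
```

Every extra resolution or format adds another row to this sum, which is why storing all versions of every uploaded video adds up quickly.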
Okay, let’s move on to the video upload and rendering flow.
YouTube’s Video Delivery Architecture
All the videos that are uploaded to YouTube are first transcoded into the formats and resolutions set by the platform. During the transcoding process, the video is broken down into segments, and each segment is converted into multiple resolutions. The processing of the segments is spread across multiple machines to parallelize the work and increase throughput. If a video goes viral, it is subject to another round of video compression. This second round of compression delivers the same visual quality at a much smaller size.
When encoding videos, YouTube chooses a bitrate within the limits that the codec allows. A video with a high bitrate has better quality, but there is a sweet spot beyond which increasing the bitrate yields no significant visual improvement even though the video size keeps growing.
YouTube also takes into account the playback statistics of the video player on the client’s device, such as how often the player switches to a lower resolution relative to the streaming bandwidth available on the client. It then predicts whether higher-resolution content can be pushed to the user, given their streaming bandwidth limit. This performance estimation enabled YouTube to cut down on bandwidth while simultaneously increasing user engagement on the platform. More on this here & here.
For encoding its videos, YouTube uses VP9 – an open-source codec that compresses videos with HD & 4K quality at half the bandwidth used by other codecs.
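The segment-and-parallelize idea described above can be sketched as follows. This is a simplified illustration under my own assumptions: a thread pool stands in for the fleet of machines, `transcode_segment` is a placeholder that only returns a label instead of calling a real encoder, and the 10-second segment length is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_segments(duration_s, segment_s):
    """Return (start, end) pairs in seconds that cover the whole video."""
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        segments.append((start, end))
        start = end
    return segments


def transcode_segment(segment, resolution):
    """Placeholder for a real encoder call; returns a label for the output."""
    start, end = segment
    return f"{resolution}:{start:.0f}-{end:.0f}"


def transcode_video(duration_s, resolutions, segment_s=10.0, workers=4):
    """Transcode every segment into every resolution using a worker pool."""
    segments = split_into_segments(duration_s, segment_s)
    jobs = [(seg, res) for res in resolutions for seg in segments]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the submitted jobs.
        results = list(pool.map(lambda job: transcode_segment(*job), jobs))
    output = {res: [] for res in resolutions}
    for (_, res), result in zip(jobs, results):
        output[res].append(result)
    return output


if __name__ == "__main__":
    print(transcode_video(25, ["360p", "720p"]))
```

Because each segment is independent, adding more workers shortens the total processing time, and the per-resolution outputs can be stitched back together in segment order afterwards.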
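Here is one way to think about that sweet spot in code. Given quality measurements at several bitrates, the sketch returns the bitrate after which the next step up no longer buys a meaningful improvement. The measurements, the 0-100 quality score and the `find_sweet_spot` function are hypothetical; I am not claiming this is how YouTube picks its bitrates.

```python
def find_sweet_spot(points, min_gain):
    """Return the bitrate after which quality gains become negligible.

    points:   list of (bitrate_kbps, quality_score) sorted by bitrate.
    min_gain: smallest quality improvement that justifies the next bitrate.
    """
    for (bitrate, quality), (_, next_quality) in zip(points, points[1:]):
        if next_quality - quality < min_gain:
            return bitrate
    # Every step was worth it, so the highest bitrate is the answer.
    return points[-1][0]


# Made-up measurements: quality rises quickly at first, then flattens out.
measurements = [
    (1_000, 70.0),
    (2_000, 82.0),
    (4_000, 90.0),
    (8_000, 93.0),
    (16_000, 94.0),
]

if __name__ == "__main__":
    print(find_sweet_spot(measurements, min_gain=5.0))
```

With these numbers, going from 4,000 to 8,000 kbit/s doubles the size of the video for only 3 extra quality points, so 4,000 kbit/s is returned as the sweet spot.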
Once the videos are transcoded and stored in the database, they are spread across the cache network of the platform. When the user requests a video, the platform checks the viewer’s device type, screen size, processing capability, the network bandwidth and then delivers the fitting video version in real-time from the nearest Edge location.
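The delivery step can be sketched as a lookup across edge locations. The edge names, the latency numbers and the fallback to an "origin" store on a cache miss are assumptions I added for illustration; the actual routing logic of the platform is not public in this detail.

```python
def choose_source(video_id, rendition, edges):
    """Return the name of the lowest-latency edge that has the version cached.

    Falls back to "origin" (assumed central storage) when no edge has it.
    """
    key = (video_id, rendition)
    candidates = [edge for edge in edges if key in edge["cached"]]
    if not candidates:
        return "origin"
    return min(candidates, key=lambda edge: edge["latency_ms"])["name"]


# Hypothetical edge locations as seen from one viewer.
edges = [
    {"name": "edge-mumbai", "latency_ms": 12,
     "cached": {("v1", "720p")}},
    {"name": "edge-singapore", "latency_ms": 45,
     "cached": {("v1", "720p"), ("v1", "1080p")}},
]

if __name__ == "__main__":
    print(choose_source("v1", "720p", edges))
    print(choose_source("v1", "1080p", edges))
    print(choose_source("v2", "720p", edges))
```

The closest edge wins only when it actually holds the requested version, which is why spreading popular videos across the cache network matters so much for latency.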
When the video is streamed LIVE, it’s not really LIVE. There is a slight processing delay required by the transcoding and other processes involved in the video delivery. Also, it’s tricky to cache LIVE videos across the network since it’s being streamed LIVE. I’ll discuss that in my next write-up in the architecture series.
Datacenter as a computer – A whitepaper by Google research.
I am Shivang. You can read more about me here.