How Digital Video Compression Works
H.264 is a video compression standard. It is used overwhelmingly for internet video, phones, Blu-ray movies, security cameras, drones, pretty much everything. It is also the final delivery codec that Bonafide Film House uses.
H.264 is an impressive piece of technology. It is the result of 30+ years of work with one single goal in mind: to reduce the bandwidth required for full-motion video transmission.
The purpose of this post is to give you insight into some of the higher-level details of how it works. I will try not to bore you too much.
Why would we want to compress anything in the first place?
Uncompressed video is huge. While editing, I export pre-rendered videos in Apple ProRes 422. A 30-minute video is typically 30 gigabytes, which is far too large to deliver to a client. A truly uncompressed video file (even larger than the Apple ProRes files I work with) contains an array of 2D buffers holding the pixel data for each frame, so it is a 3D array of bytes (2 spatial dimensions and 1 temporal). Each pixel takes 3 bytes to store: one byte each for the three primary colors (red, green and blue). A 1080p video (1920×1080 pixels) at 60 frames a second therefore comes to about 370 megabytes of raw data every second.
This is next to impossible to deal with. A 50GB Blu-ray disc would only hold about 2 minutes of video. That’s not going to work.
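If you want to check that arithmetic yourself, here is a small sketch. It uses only the figures from the paragraphs above, plus my assumption that a gigabyte is 10^9 bytes.

```python
# Raw (uncompressed) video size, using the figures from the text.
WIDTH, HEIGHT = 1920, 1080      # one 1080p frame
BYTES_PER_PIXEL = 3             # one byte each for red, green and blue
FPS = 60                        # frames per second

bytes_per_frame = WIDTH * HEIGHT * BYTES_PER_PIXEL
bytes_per_second = bytes_per_frame * FPS

BLU_RAY_BYTES = 50 * 10**9      # assumption: 1 GB = 10^9 bytes
seconds_on_disc = BLU_RAY_BYTES / bytes_per_second

print(f"{bytes_per_second / 10**6:.0f} MB of raw video per second")
print(f"{seconds_on_disc / 60:.1f} minutes fit on a 50 GB disc")
```

The result is roughly 373 MB per second and a little over 2 minutes per disc, which matches the rounded numbers above.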
Imagine you’re building a car for racing. You obviously want to go fast, so what is the first thing you do? You shed some weight. Say your car weighs 3000 pounds. You throw away stuff you don’t need. The carpets? Gone. That radio? Get it out of here. Heater? Sure don’t need that. Engine? Probably should keep that. You remove everything except the things that matter.
The idea of throwing away bits you don’t need to save space is called lossy compression. H.264 is a lossy codec: it throws away less important information and keeps only the important bits.
Important stuff? How does H.264 know what’s important?
There are a few obvious ways to reduce the size of images. Maybe the top right quarter of the frame is useless all the time, so we can zero out those pixels and discard that area. At that point we would only be using 3/4 of the space we need. Or maybe we can crop out a thick border around the edges of the frame, since the important stuff is happening in the middle of the frame anyway. Yes, you could do this as well, but this isn’t how H.264 works.
What does H.264 actually do?
H.264, like other lossy image algorithms, discards detail information.
Compare these two images. See how the compressed one does not show the holes in the speaker grills of the MacBook Pro? If you don’t zoom in, you probably wouldn’t even notice the difference. The image on the right weighs 7% of the original. That is a huge difference already, and we haven’t even started.
7%? How did you pull that off?
If you paid attention in your information theory class, you might vaguely remember information entropy. Information entropy is the number of bits required to represent some information. Note that it is not simply the size of a dataset; it is the minimum number of bits that must be used to represent all the information contained in the dataset. For example, if your dataset is the result of a single coin toss, you need 1 bit of entropy. If you have to record two coin tosses, you’ll need 2 bits. Make sense?
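To put a number on that, here is a minimal sketch of the standard Shannon entropy formula applied to coin tosses. The function name is my own, and it assumes the tosses are independent, so it only looks at how often each symbol occurs.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(data):
    """Shannon entropy of the symbol frequencies in `data`, in bits per symbol."""
    counts = Counter(data)
    total = len(data)
    # Sum of p * log2(1 / p) over every distinct symbol, with p = count / total.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

fair = "HTHTTHHT"      # heads and tails come up equally often
rigged = "HHHHHHHH"    # always heads

fair_bits = entropy_bits_per_symbol(fair)      # 1 bit per toss
rigged_bits = entropy_bits_per_symbol(rigged)  # 0 bits per toss
print(fair_bits, rigged_bits)
```

A fair coin needs a full bit per toss. A coin that always lands on heads carries no information at all, no matter how many tosses you record.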
Suppose you have a coin. You’ve tossed it 10 times, and every time it lands on heads. How would you describe this dataset to someone? You wouldn’t say “HEADS HEADS HEADS HEADS HEADS HEADS HEADS HEADS HEADS HEADS”. You would just say “10 tosses, all heads”. Well there you go, you’ve just compressed a large dataset. This is obviously an oversimplification, but you’ve transformed some data into another, shorter representation of the same information. You’ve reduced data redundancy. The information in this dataset has not changed. This type of encoder is called an entropy encoder: it’s a general-purpose lossless encoder that works for any type of data.
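The “10 tosses, all heads” trick can be sketched in a few lines. What follows is run-length encoding, which I am using as a simple stand-in for lossless redundancy removal. It is not the entropy coder H.264 uses, and the function names are mine.

```python
from itertools import groupby

def rle_encode(tosses):
    """Collapse each run of identical symbols into a (symbol, run_length) pair."""
    return [(symbol, len(list(run))) for symbol, run in groupby(tosses)]

def rle_decode(pairs):
    """Expand (symbol, run_length) pairs back into the original string."""
    return "".join(symbol * count for symbol, count in pairs)

tosses = "H" * 10                 # ten tosses, all heads
encoded = rle_encode(tosses)      # a single pair: ("H", 10)
decoded = rle_decode(encoded)     # identical to the original tosses

print(encoded, decoded == tosses)
```

Decoding gives back exactly what went in, which is what makes this lossless.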
Now that you understand information entropy, let’s move on to transformations of data. There are some fundamental units that are used to represent data. If you use binary, you have 0 and 1. If you use hex, you have 16 characters. You can easily transform between the two systems. They are essentially equivalent.
Now imagine you can transform any dataset that varies over space, something like the brightness values of an image, into a different coordinate space. So instead of x-y coordinates, let’s say we have frequency coordinates. Frequency X and Frequency Y are the axes now. This is called a frequency domain representation. There is another mathematical theorem that states you can do this for any data, and that the transformation is perfectly lossless as long as Frequency X and Frequency Y are high enough.
What are Frequency X and Frequency Y?
Frequency X and Frequency Y are another kind of base unit. Just like when we switch from binary to hex, we have different fundamental units: we’re switching from the familiar X-Y to Frequency X and Frequency Y. Here is what our image looks like in the frequency domain.
The fine grill of the MacBook Pro puts most of its information in the higher frequency components of that image. Finely varying content = high frequency components. Any sort of gradual variation in color and brightness, such as a gradient, ends up in the low frequency components of that image. Anything in between falls in between. So fine details equal high frequency, and gentle gradients equal low frequency.
In the frequency domain representation, the low frequency components are near the center of the image. The higher frequency components are toward the edge of the image.
Why do all this?
Because now you can take that image containing all of the frequency domain information, mask out the edges, and discard what was there, which is the information in the high frequency components. If you convert back to your regular x-y coordinates, you’ll find that the resulting image looks similar to the original but has lost some of the fine details. But now the image only occupies a fraction of the space. By controlling how big your mask is, you can tune precisely how detailed you want your output images to be.
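Here is a minimal sketch of that masking idea, done on a single row of 16 brightness values instead of a full 2D image, to keep it short. It uses a plain discrete Fourier transform written out by hand. The sample values and function names are made up for illustration, and real codecs use a block-based transform rather than this. One difference from the picture described above: in this unshifted layout the low frequencies sit at the two ends of the list rather than in the center, but the masking is the same idea.

```python
import cmath
import math

def dft(samples):
    """Discrete Fourier transform: pixel values -> frequency coefficients."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def inverse_dft(coeffs):
    """Inverse transform: frequency coefficients -> pixel values (real part)."""
    n = len(coeffs)
    return [(sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def low_pass(coeffs, keep):
    """Zero every coefficient whose frequency is higher than `keep`."""
    n = len(coeffs)
    # Bins k and n - k hold the same frequency, so min(k, n - k) is the
    # distance from the zero-frequency bin.
    return [c if min(k, n - k) <= keep else 0 for k, c in enumerate(coeffs)]

N = 16
gradient = [100 + 50 * math.cos(2 * math.pi * t / N) for t in range(N)]  # gentle variation
grill = [10 * math.cos(2 * math.pi * 7 * t / N) for t in range(N)]       # fine detail
row = [g + d for g, d in zip(gradient, grill)]

masked = low_pass(dft(row), keep=2)   # only 5 of the 16 bins survive
restored = inverse_dft(masked)

# The fine "grill" ripple is gone; the gentle gradient comes back intact.
print(max(abs(r - g) for r, g in zip(restored, gradient)))
```

Raising or lowering `keep` is the knob for how much detail survives.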
The numbers represent the information entropy of that image as a fraction of the original. Even at 2%, you don’t notice the difference at this zoom level. 2%! Your race car now weighs 60 pounds!
So that’s how you shed weight. This process in lossy compression is called quantization.
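In practice, codecs usually don’t apply a hard mask. They divide each frequency coefficient by a step size and round the result, so small (typically high frequency) coefficients collapse to zero on their own. Here is a rough sketch of that idea. The coefficient values and the step size are made up, and the function names are mine.

```python
def quantize(coeffs, step):
    """Divide each coefficient by the step size and round to the nearest integer."""
    return [round(c / step) for c in coeffs]

def dequantize(levels, step):
    """Scale the integer levels back up; the rounding error is lost for good."""
    return [level * step for level in levels]

# Made-up frequency coefficients, ordered from low to high frequency.
coeffs = [812.0, -95.5, 41.2, 7.9, -3.1, 1.4, -0.6, 0.2]

levels = quantize(coeffs, step=16)    # [51, -6, 3, 0, 0, 0, 0, 0]
approx = dequantize(levels, step=16)  # [816, -96, 48, 0, 0, 0, 0, 0]
print(levels, approx)
```

A bigger step throws away more detail and leaves longer runs of zeros, which are very cheap to store.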
The human eye and brain system is not very good at resolving finer details in color. It can detect minor variations in brightness very easily, but not color. So there must be some way to discard color information to shed even more weight.
In a TV signal, RGB (red, green, blue) color information gets transformed to YCbCr. The Y is the luminance (essentially brightness, from black to white), and Cb and Cr are the chrominance (color) components. RGB and YCbCr are equivalent in terms of information entropy.
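For the curious, here is a sketch of the conversion in both directions. I am assuming the full-range BT.601 coefficients used by JPEG. Video standards use slightly different constants and value ranges, so treat the numbers as illustrative.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 (JPEG-style) RGB -> YCbCr, all values in 0..255."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def ycbcr_to_rgb(y, cb, cr):
    """Inverse of rgb_to_ycbcr, up to small rounding error."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return r, g, b

gray = rgb_to_ycbcr(128, 128, 128)     # no color: Cb and Cr sit at the midpoint, 128
orange = rgb_to_ycbcr(200, 120, 40)
round_trip = ycbcr_to_rgb(*orange)     # back to roughly (200, 120, 40)
print(gray, orange, round_trip)
```

Nothing is lost in this step. It only rearranges the data so that brightness and color can be treated differently later.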
Why complicate matters? Why not use RGB?
Back before we had color television, we only had the Y signal. When color TVs started coming along, engineers had to figure out a way to transmit RGB color along with Y. Instead of using two separate data streams, they decided to encode the color information into Cb and Cr and transmit that along with the Y information. That way, black-and-white televisions only look at the Y component, while color televisions also look at the chrominance components and convert to RGB internally. But here’s the trick: the Y component gets encoded at full resolution, and the C components only at a quarter resolution. Because the eye and brain are terrible at detecting color variations, you can get away with this. By doing this, you reduce total bandwidth by one half, with very little visual difference. So we’ve reduced the data by half! Your race car now weighs 30 pounds.
This process of discarding some of the color information is called chroma subsampling. It is not specific to H.264 and has been around for decades, but it is used almost universally.
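Here is a small sketch of quarter-resolution chroma, along with the sample count that shows where the “one half” comes from. Averaging each 2×2 block is one simple way to do the downsampling. The function names and the tiny 4×4 plane are mine, and I am assuming even frame dimensions.

```python
def subsample_2x2(plane):
    """Average each 2x2 block of a chroma plane (even width and height assumed)."""
    return [
        [(plane[y][x] + plane[y][x + 1] + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
         for x in range(0, len(plane[0]), 2)]
        for y in range(0, len(plane), 2)
    ]

def sample_count(width, height, subsampled):
    """Total samples per frame: one Y plane plus the Cb and Cr planes."""
    luma = width * height
    chroma = (width // 2) * (height // 2) if subsampled else width * height
    return luma + 2 * chroma

cb = [[100, 102,  50,  50],
      [ 98, 100,  50,  50],
      [ 10,  10, 200, 204],
      [ 10,  10, 196, 200]]
small_cb = subsample_2x2(cb)    # [[100.0, 50.0], [10.0, 200.0]]

full = sample_count(1920, 1080, subsampled=False)
reduced = sample_count(1920, 1080, subsampled=True)
print(small_cb, reduced / full)  # the ratio is 0.5
```

Y keeps every sample, Cb and Cr keep one in four each, so the total goes from 3 samples per pixel to 1.5.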
Those are the big weight shedders for lossy compression. Our frames are now tiny, since we discarded most of the detail information and half of the color information.
Can we take this even further?
Yes, in fact we can. Weight shedding is only the first step. So far we’re only looking at the spatial domains within a single frame. Now it’s time to explore temporal compression – where we look at a group of frames across time.
H.264 is a motion compensation compression standard.
Imagine you’re watching a tennis match. The camera is fixed on a certain angle. The only thing moving is the ball back and forth. How would you encode this information? The same method using a 3D array of pixels, two dimensional in space and one in time?
No. Most of the image is the same. The court, the net and the crowds are all static. The only thing moving is the ball. What if you could just have one static image of everything in the background and then one moving image of just the ball? That would save lots of space.
This is exactly what H.264 does. It splits the image up into macroblocks, typically 16×16 pixel blocks, that it uses for motion estimation. It encodes one static image, typically called an I-frame (intra frame). This is a full frame containing all the bits required to construct the frame. Subsequent frames are either P-frames (predicted) or B-frames (bi-directionally predicted). P-frames encode a motion vector for each of the macroblocks relative to the previous frame, so a P-frame has to be constructed by the decoder based on previous frames. The decoder starts with the last I-frame in the video stream and then walks through every subsequent frame, adding up the motion vector deltas as it goes along, until it arrives at the current frame.
B-frames are even more interesting: the prediction happens bi-directionally, both from past frames and from future frames. Since you’re only encoding motion vector deltas, this technique is extremely efficient for any video with motion. Now we’ve covered both spatial and temporal compression! So far we saved a ton of space with quantization. Chroma subsampling further halved the space required. On top of that, we have motion compensation that stores only 3 frames for the 300 we had in that video.
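To make motion vectors a bit more concrete, here is a toy sketch of block matching: for one block of the current frame, search the previous frame for the best-matching block and remember the offset. I am using tiny 8×8 frames, a 2×2 block instead of 16×16, and an exhaustive search scored by the sum of absolute differences. All names are mine, and a real encoder is far more sophisticated (it also stores the leftover difference after prediction).

```python
def block_at(frame, top, left, size):
    """Cut a size x size block out of a frame stored as a list of rows."""
    return [row[left:left + size] for row in frame[top:top + size]]

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def find_motion_vector(reference, current, top, left, size, search):
    """Return (dy, dx) from the current block to its best match in the reference."""
    target = block_at(current, top, left, size)
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if 0 <= y <= height - size and 0 <= x <= width - size:
                cost = sad(target, block_at(reference, y, x, size))
                if best is None or cost < best[0]:
                    best = (cost, dy, dx)
    return best[1], best[2]

def make_frame(ball_top, ball_left, size=8):
    """Black frame with a bright 2x2 'ball' at the given position."""
    frame = [[0] * size for _ in range(size)]
    for y in range(ball_top, ball_top + 2):
        for x in range(ball_left, ball_left + 2):
            frame[y][x] = 255
    return frame

reference = make_frame(2, 1)   # previous frame: ball at row 2, column 1
current = make_frame(2, 3)     # current frame: ball moved 2 pixels to the right

dy, dx = find_motion_vector(reference, current, top=2, left=3, size=2, search=2)
predicted = block_at(reference, 2 + dy, 3 + dx, 2)   # rebuild the block from the old frame
print((dy, dx), predicted == block_at(current, 2, 3, 2))
```

The vector comes out as (0, -2): the ball in the current frame can be found 2 pixels to the left in the previous frame, so the block itself never has to be stored again.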
Looks pretty good, now what?
We use a traditional lossless entropy encoder to finish it off.
The I-frames, after the lossy steps, still contain redundant information. So do the motion vectors for the macroblocks in the P-frames and B-frames: there are entire groups of them with the same values, since several macroblocks move by the same amount when the image pans in our test video.
An entropy encoder will take care of the redundancy. And since it is a general purpose lossless encoder, we don’t have to worry about what tradeoffs it’s making. We can recover all the data that goes in.
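As a rough illustration, here is what a general-purpose lossless compressor does to a pile of identical motion vectors. I am using Python’s zlib (DEFLATE) as a stand-in. H.264 has its own entropy coders (CAVLC and CABAC), and the vector values below are made up.

```python
import struct
import zlib

# Made-up motion vectors for 300 macroblocks during a pan: all move by (4, 0).
vectors = [(4, 0)] * 300

# Pack each vector as two signed bytes: 600 bytes in total.
raw = b"".join(struct.pack("bb", dx, dy) for dx, dy in vectors)

packed = zlib.compress(raw)
restored = zlib.decompress(packed)

print(len(raw), len(packed), restored == raw)
```

The packed version is a small fraction of the 600 raw bytes, and decompressing it gives back every byte exactly.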
And we’re done! At the core of it, this is how H.264 works. These are its tricks.
I am massively oversimplifying several decades of intense research in this field. If you want to know more, the Wikipedia Page is pretty descriptive.
By Justin Kietzman
Director and Editor at Bonafide Film House
Published September 1, 2016