1 The Bare Fucking Minimum Audio and Video Storage

1.1 Audio

Do you need to know audio processing? Fuck no, it's completely irrelevant.

1.2 Video

1.2.1 Frame Coding

Video is made up of pictures called frames. A frame can be progressive, where it is one single picture, or it can be interlaced, where it is 2 'fields' woven together (every second line belongs to a different field, usually) to produce a single picture. Each field is usually encoded separately though. One of them is called the top field, the other the bottom field. Generally the top field starts at the top of the picture and contains the first line of pixels. Video is then coded either top field first or bottom field first, which essentially tells you which field of each frame comes first in the stream. It is also a little important for field matching, which will be covered in the next section.

Video coding is how frames are arranged in a stream. The footage itself can be progressive or interlaced, and it can be coded as either progressive or interlaced, even if the actual video is the other kind. What I mean here is that you can have interlaced video stored progressively, or progressive video stored as interlaced. The latter can often be seen when you have 1080i50 bluray footage. You don't really need to understand video coding, just be aware that video can be stored either way, regardless of whether it really is that kind of footage or not.

1.2.2 Frame Rates

There are several systems of framerates in use. Framerate is the number of full frames shown in a second, whether they are progressive or interlaced. Framerate is expressed in frames per second, but timecodes, the way encoded video sees them, are usually in fractions of a second. For example, 25fps video timecodes go up in increments of 0.040 seconds. The important thing about framerate is that while it needs to match motion, it doesn't need to be constant, and when editing you can almost ignore it entirely because your editor reads individual frames, not motion.
It's up to you to not bork your motion, and there are many ways to do this, but if you keep a constant framerate you shouldn't need to worry about it.

The main standards for SD are called NTSC and PAL. PAL is used in Australia, New Zealand, and parts of Europe and is 25fps. As interlaced footage it is 50 fields (half images) a second, often called 50i, and this rate is based on the frequency of the electrical network in these countries. In America, Japan, and a number of other countries, the framerate used is 29.970fps, or 0.1% slower than 30fps. As interlaced content this is called 60i and is 59.94 fields a second. The 0.1% slowdown is due to colour information being added to the original standard at a later date.

A third framerate called FILM is also important to know about. This is usually 23.976fps but can sometimes be 24fps. Anime is usually handled at this rate, as well as most modern content. However, given that the TV networks use NTSC rates, it must be converted. This is done with a process called telecine.

The most important thing to know about dealing with framerate conversions such as inverse telecine (IVTC, removing telecine) is how telecine works. Regardless of whether you actually IVTC or not, if you understand it, most other framerate related procedures kinda make sense, as you need to 'get' how frames interact to make IVTC work.

The idea of telecine is to repeat fields (or whole frames, but this is really rare) by splitting each frame into its fields and then repeating some of those fields in a fixed pattern. For example, let's do the standard FILM->NTSC telecine pattern of 2:3. This is important as it provides a good understanding of the telecine process. The idea is that you need to turn 24 frames into 30 (actually 23.976 into 29.970, but let's ignore the 0.1% there). These framerates both have 6 as one of their factors. This means that for any 4 frames, we need to turn them into 5 (4x6=24, 5x6=30).
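If you want to check the numbers yourself, here is a minimal sketch of the framerate arithmetic. The exact NTSC-family rates are 30000/1001 and 24000/1001 (which is where 29.970 and 23.976 come from); the constant and function names are made up for this example.

```python
from fractions import Fraction

# Exact rates as fractions, so nothing gets lost to rounding.
NTSC = Fraction(30000, 1001)   # ~29.970 fps
FILM = Fraction(24000, 1001)   # ~23.976 fps
PAL = Fraction(25, 1)          # 25 fps


def frame_duration(fps):
    """Seconds that one frame lasts at the given framerate."""
    return 1 / fps


# NTSC needs 5 frames for every 4 FILM frames.
ratio = NTSC / FILM

# How much slower 30000/1001 is than a flat 30 fps, as a percentage.
slowdown_percent = float((1 - NTSC / 30) * 100)

print(ratio)                       # 5/4
print(round(slowdown_percent, 3))  # 0.1
print(float(frame_duration(PAL)))  # 0.04
```

The 5/4 ratio is the whole reason the 2:3 pattern works: 4 frames go in, 5 come out.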
If frames are encoded with 2 fields for one frame, 3 fields for the next, and so on in repetition, we can do this rather easily. Confused? Let's look at this frame representation. We have 4 frames, A through D. Each of them is made of 2 fields, so we really have frames AA, BB, CC, and DD. But let's say we repeat one field of every second frame, so that some of the new frames mix fields from two different source frames. The pattern then becomes AA, BB, BC, CD, DD. If we start undoing this with the first step of IVTC, a process called field matching, we get AA, BB, CC, DD, DD. We now have a duplicate, but the original fields are still present. This duplicate can then be deleted in a process called decimation. Still confused? All you really need to know is that you copy half of every frame either once or twice to make each new frame, so each source frame becomes either 2 or 3 fields, always in that order.

There are other kinds of telecine but they are not that important. Perhaps it is also good to note euro telecine, as some European localisation companies (DynIt in Italy for example) use this form of telecine to get from FILM to PAL on DVD transfers. Euro telecine is essentially a 2:2:2:2:2:2:2:2:2:2:2:3 pattern, where 12 frames are spread over 25 fields, so 24 frames fill the 50 fields of one second of PAL.

Another thing to know with telecine is that it can be applied soft, hard, or both. Soft telecine uses something called RFF (the Repeat First Field flag) to tell the playback device whether a field should be repeated or not. It can be turned off by instructing the decoder to ignore RFF data. Hard telecine however is encoded directly into the bit stream. This means that the framerate of the actual video is the same as what is displayed, rather than being adjusted at playback like soft telecine. Hard telecine cannot be turned off and needs processing to remove. You can also have content that uses a mix of both hard and soft, or even a mix of interlaced and progressive footage in a mix of hard and soft, so that it is all brought to a single rate at playback.
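Going back to the basic 2:3 pattern, here is a toy sketch of the AA, BB, BC, CD, DD example. It treats each field as just the letter of the frame it came from and ignores top/bottom parity entirely, so it is a model of the idea and not how a real IVTC filter works. All function names are made up for this example.

```python
def telecine_23(frames):
    """Apply a 2:3 pattern: source frames alternately become 2 and 3 fields."""
    fields = []
    for index, frame in enumerate(frames):
        fields.extend([frame] * (2 if index % 2 == 0 else 3))
    # Pair consecutive fields into output frames; a trailing odd field is dropped.
    return [(fields[i], fields[i + 1]) for i in range(0, len(fields) - 1, 2)]


def field_match(frames):
    """Toy field matching: a mixed frame keeps its second field and takes that
    field's partner, which in this pattern always sits in the next frame."""
    return [(a, b) if a == b else (b, b) for a, b in frames]


def decimate(frames):
    """Toy decimation: drop a frame identical to the one before it.
    Real decimators drop one frame per cycle based on a similarity metric."""
    kept = []
    for frame in frames:
        if not kept or frame != kept[-1]:
            kept.append(frame)
    return kept


def show(frames):
    """Render frames the way the text writes them, e.g. 'AA, BB, BC'."""
    return ", ".join(a + b for a, b in frames)


telecined = telecine_23("ABCD")
matched = field_match(telecined)
restored = decimate(matched)

print(show(telecined))  # AA, BB, BC, CD, DD
print(show(matched))    # AA, BB, CC, DD, DD
print(show(restored))   # AA, BB, CC, DD
```

Note that the toy decimation would also eat genuinely repeated source frames, which is one reason real tools look at picture similarity over a cycle of 5 frames instead.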
You don't have to understand how mixed hard and soft telecine works, just be aware of it and how to identify it, which will be covered later.

1.2.3 Containers vs Codecs

One thing a lot of people get confused on is the difference between codecs and containers. A codec is software that uses a particular algorithm to (usually) compress a given media type, whether it is audio, video, or something else. Codec is short for coder/decoder and essentially means the software that either takes an input and outputs a given format, or deconstructs an input into a raw output that other software can understand. That sounds a bit technical but it really just means compressing and/or decompressing a particular format.

A container on the other hand is also a 'format' as such, but it is a storage format for one or more streams. It doesn't encode, it just holds. Think of it like a bucket. You can put water in a bucket, or you can encode it to ice, but it still goes in a bucket. The only difference is that another tool, say a hose, won't work with ice, so it needs to be 'decoded' into water again for use. Ice however takes up slightly less space in a way. That's pretty much how it works, with a really terrible metaphor.

So the gist here is that when talking about formats, keep yourself clear on what a codec is and where it's used, and when you really mean a container.

1.2.4 Colourimetry, Bit Depth, and Subsampling

Here's another thing people get messed up on. Video can be stored as a mapping of colour intensity values of Red, Green, and Blue (RGB), or it can be stored in the more useful and friendlier to the human eye format of brightness plus two colour channels, roughly cyan-yellow and red-green (YUV). Video is usually compressed in this format because, due to how the eye works, it is far more efficient. Displays however work in RGB, so the video has to be converted to RGB to be shown.
This conversion often presents problems, and is fixed by using the correct colour conversion matrix that the content was designed for. The main matrices here are called Rec.601 and Rec.709, but you may see them referred to as BT601 and BT709 occasionally. Although flags to tell you what the content is exist, almost everything ignores them, so you should pay attention to some basic guidelines. Anything that is 576 pixels high or less is 601, while greater than 576 pixels is 709. This essentially means that HD content is 709 and SD is 601.

The Rec and BT prefixes mean the same thing (the full names are ITU-R Recommendation BT.601 and BT.709) and tell you nothing about colour range, which is a separate property. PC displays have usually got a higher range and can show values of 0-255 (on 8bit content, I'll cover depth later) while TV displays usually only do 16-235. This is of course for older displays as far as TV colourimetry goes; most modern LCD or Plasma displays can show full PC range. Keep in mind what your content was made for and it should be clear what is what, but if you're unsure, try both. If there is a difference in the depth of blacks for example, then you have PC range. If it looks the same, then it is TV range, or you're doing your check/conversion wrong.

There are other matrices than Rec.709 and Rec.601 but they are few and far between, and more often than not they are the same thing but with a name given by a different standardisation group than the ITU.

Bit depth is the number of bits allocated to a pixel value. Usually, content is 8bit, so you are given 256 values. If you have 10bit content however, you get 1024 values per pixel channel/plane, which gives you a much larger range for accurate compression and content display. I won't go into how codecs handle this because it's kind of irrelevant, but you should know that in 8bit RGB, each pixel is 8 bits per channel, making it 3 bytes, or 4 bytes in the case of an alpha channel (RGBA or RGB32).
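To put numbers on the bit depth stuff, here is a small sketch. The function names are made up for this example, and the TV range scaling for depths above 8bit (shifting 16-235 up by the extra bits, giving 64-940 at 10bit) is the usual convention, stated here as an assumption.

```python
def levels(bit_depth):
    """Number of distinct values one channel can hold."""
    return 2 ** bit_depth


def tv_range(bit_depth):
    """Limited (TV) luma range, assuming 16-235 at 8bit scaled up by the extra bits."""
    shift = bit_depth - 8
    return (16 << shift, 235 << shift)


def rgb_frame_bytes(width, height, alpha=False):
    """Size of one raw 8bit RGB frame: 3 bytes per pixel, 4 with an alpha channel."""
    return width * height * (4 if alpha else 3)


print(levels(8), levels(10))                    # 256 1024
print(tv_range(8), tv_range(10))                # (16, 235) (64, 940)
print(rgb_frame_bytes(1920, 1080))              # 6220800
print(rgb_frame_bytes(1920, 1080, alpha=True))  # 8294400
```

So a single raw 1080p RGB frame is already about 6MB, which is why nobody stores video that way.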
YUV on the other hand, if raw, is the same size, but is often 'compressed' by a process called subsampling. The idea here is that while the eye is very sensitive to light, it doesn't recognise colour variations nearly as well. Because of this, colour values can be shared across blocks of 2 horizontal pixels (YUY2, or 4:2:2), or even across 2 of those blocks stacked vertically into a square of 4 pixels (YV12, or 4:2:0), and still appear mostly the same. YV12 is common and used in almost every piece of footage you will need to access.

The way YUV video works is by making each colour channel range between two opposing colours. The colours red and green are opposites and you can never see 'red-green', so setting the values of 0-255 as representing this axis is a good way to show these colours. The same goes for yellow and cyan. The colour channels only carry colour; brightness is done in an entirely separate channel. Because of this, you can compress the colours to be smoother and still maintain great detail by controlling the brightness more tightly.

1.2.5 Aspect Ratio

Seriously, leave this shit alone. If you really need to mess around with it, keep your PAR at 1:1 and do a proper calculation of your DAR. There are good websites for this which are linked all over the place and I really can't be assed to link one here right now hey.
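If you'd rather do the calculation yourself than hunt for a website, the DAR is just the stored width over height multiplied by the PAR. A minimal sketch follows; the function name is made up, and the 32:27 PAR in the second example is only an illustrative value, so use whatever your source actually has.

```python
from fractions import Fraction


def display_aspect_ratio(width, height, par=Fraction(1, 1)):
    """DAR = (stored width / stored height) * pixel aspect ratio."""
    return Fraction(width, height) * par


# Square pixels (PAR 1:1): the DAR is simply width over height.
print(display_aspect_ratio(1920, 1080))  # 16/9
print(display_aspect_ratio(640, 480))    # 4/3

# Non-square pixels: 720x480 stored with an illustrative PAR of 32:27.
print(display_aspect_ratio(720, 480, Fraction(32, 27)))  # 16/9
```

With PAR at 1:1 the whole thing collapses to width divided by height, which is why keeping it there saves you the headache.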