Parsing a complete digital film is an immensely complex task. Since your question is mostly about WebM – a container format – I’ll concentrate on that.
You always start with individual streams containing the payload data: video (e.g. H.264, VP9), audio (e.g. AAC, Opus) and subtitles (e.g. SubRip, Blu-ray PGS). Tied to those streams is metadata needed for correct playback – for example, timestamps that keep the streams synchronized.
As a simple example, imagine a WebM file containing a VP9 video stream and an Opus audio stream.
The WebM container acts as a wrapper around the VP9 and Opus streams: it makes it possible to put them into a single file and still access each one conveniently. It also carries additional data, such as the types of streams it holds and checksums for error detection.
Naively, you could store the streams one after the other, each in a single chunk. That’s a horrible solution for streaming: if the video stream is stored first, the player has to download practically the whole file before it can play synchronized audio. That’s one reason why streams are interleaved – the file stores a small chunk of video followed by a small chunk of audio (maybe half a second each) and repeats that pattern throughout.
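To make that layout concrete, here’s a toy sketch – the chunk sizes are invented numbers, not anything a real file would contain – that prints where each chunk would sit in an interleaved file:

```c
#include <stdio.h>

/* Toy illustration only: the chunk sizes are invented, and a real
 * container frames each chunk with headers and timestamps. Prints
 * where each chunk would sit in an interleaved layout. */
int main(void) {
    const long video_chunk = 200000; /* assumed bytes per 0.5 s of video */
    const long audio_chunk = 6000;   /* assumed bytes per 0.5 s of audio */
    long offset = 0;

    for (int i = 0; i < 4; i++) {
        printf("%8ld: video chunk %d (%.1fs-%.1fs)\n",
               offset, i, i * 0.5, (i + 1) * 0.5);
        offset += video_chunk;
        printf("%8ld: audio chunk %d (%.1fs-%.1fs)\n",
               offset, i, i * 0.5, (i + 1) * 0.5);
        offset += audio_chunk;
    }
    return 0;
}
```

With these toy numbers, a player that has downloaded the first ~206 kB already holds half a second of both video and audio and can start playback.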
What do you need to parse such a file?
- A WebM parser to process the container and extract the payload streams.
- A VP9 parser (probably as a part of a full VP9 decoder) to process the video stream.
- An Opus parser (again probably as a part of a full Opus decoder) to process the audio stream.
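To show how those three pieces fit together, here’s a minimal sketch of the playback loop. All names (`webm_read_packet`, `vp9_decode`, `opus_decode`) are hypothetical placeholders, not a real API; the demuxer is simulated with a canned packet sequence:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Everything here is a hypothetical placeholder, not a real API; a
 * real player would sit on top of e.g. libvpx and libopus. */
typedef enum { TRACK_VIDEO, TRACK_AUDIO } TrackType;

typedef struct {
    TrackType track;     /* which stream this packet belongs to */
    double    timestamp; /* presentation time in seconds        */
} Packet;

/* Simulated demuxer: hands out packets in the interleaved order in
 * which they are laid out in the file. */
static bool webm_read_packet(Packet *pkt) {
    static const Packet sequence[] = {
        { TRACK_VIDEO, 0.0 }, { TRACK_AUDIO, 0.0 },
        { TRACK_VIDEO, 0.5 }, { TRACK_AUDIO, 0.5 },
    };
    static size_t next = 0;
    if (next >= sizeof sequence / sizeof sequence[0]) return false;
    *pkt = sequence[next++];
    return true;
}

/* Stub decoders standing in for VP9 and Opus. */
static void vp9_decode(const Packet *p)  { printf("video frame @ %.1fs\n", p->timestamp); }
static void opus_decode(const Packet *p) { printf("audio frame @ %.1fs\n", p->timestamp); }

int main(void) {
    Packet pkt;
    /* The container parser yields packets; each one is dispatched to
     * the decoder responsible for its stream. */
    while (webm_read_packet(&pkt)) {
        if (pkt.track == TRACK_VIDEO) vp9_decode(&pkt);
        else                          opus_decode(&pkt);
    }
    return 0;
}
```

In a real player the decoded frames and samples go into separate queues, and a clock uses the timestamps to decide when to present each of them – that’s how the synchronization metadata mentioned above is actually used.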
WebM is a subset of Matroska; you can find the full specification of the format on the Matroska website. The parser you link to seems extremely simplistic at first glance, but it might be a good enough starting point. For a complete implementation you should take a closer look at the reference implementation, libmatroska. It’s used, for example, in the de-facto standard Matroska muxing application mkvmerge.
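To give you a feel for what such a parser deals with: Matroska (and therefore WebM) is built on EBML, where every element consists of an ID, a size and a payload, and the size is stored as a variable-length integer. Here’s a minimal sketch of decoding such a size field; a real parser additionally has to decode element IDs (stored in a similar way) and handle the special “unknown size” encoding, among other things:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Decode one EBML variable-length integer, as used for element
 * sizes. The position of the first set bit in the leading byte
 * gives the total length (1xxxxxxx = 1 byte, 01xxxxxx = 2 bytes,
 * ... up to 8 bytes); the marker bit is stripped from the value.
 * Returns the number of bytes consumed, or 0 on error. */
static size_t read_ebml_size(const uint8_t *buf, size_t n, uint64_t *out) {
    if (n == 0 || buf[0] == 0)
        return 0;                      /* 0x00 is not a valid vint */

    size_t len = 1;
    uint8_t mask = 0x80;
    while (!(buf[0] & mask)) { len++; mask >>= 1; }
    if (len > n)
        return 0;                      /* truncated input */

    uint64_t value = buf[0] & (uint8_t)(mask - 1); /* strip marker bit */
    for (size_t i = 1; i < len; i++)
        value = (value << 8) | buf[i];

    *out = value;
    return len;
}

int main(void) {
    /* 0x41 0x23: leading byte 01000001 -> 2-byte vint, value 0x123. */
    const uint8_t example[] = { 0x41, 0x23 };
    uint64_t size;
    size_t used = read_ebml_size(example, sizeof example, &size);
    printf("consumed %zu bytes, size = %llu\n",
           used, (unsigned long long)size);
    return 0;
}
```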
Btw: “Muxing” is short for “multiplexing”. The long form is rarely used, though.