KeyJ’s Blog : Blog Archive

Last week, I read a paper on how to partially encrypt MPEG Audio data. That is, to modify an existing audio file so that it is still syntactically correct, but sounds more or less broken. For example, imagine an online music shop that offers free, but partially encrypted music downloads: The files are of poor quality, and you have to pay to restore the full fidelity. But I digress.
The point is: that paper was inspiring. I decided to try the presented method using MPEG-1 Audio Layer II (»MP2«) as a basis. I chose this format because it’s the simplest audio compression scheme that is still in broad use today (for example in VCD/SVCD, DAB and most prominently DVB). Layer III (»MP3«), AAC and Vorbis are considerably more complex. And it just so happened that I had a copy of ISO 11172-3 (MPEG-1 Audio) on my hard disk :)
While working on the project, I thought that it’d be cooler to write a full decoder instead of this mere proof-of-concept »look what I can do to my MP2 files« hack. So I developed a small MPEG-1 Audio Layer II decoding library called kjmp2 which eventually evolved into a less-than-4k MP2 player application …

How MP2 works

Basically, MPEG Audio Layers I and II are built around a polyphase quadrature filter that transforms 32 consecutive time-domain samples into 32 frequency-domain values (»subband samples«). This process is performed over a 512-sample window. 36 of these 32-sample runs are packed together in one 1152-sample frame, which is the smallest atomic data unit in the stream that can be decoded independently.
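The frame arithmetic above can be written down in a few lines of C. This is only a sketch: the names are made up for illustration and are not taken from kjmp2. The byte-size formula (144 × bitrate / sample rate, plus an optional padding byte) is the usual one for MPEG-1 Layer II and is not spelled out in the text above.

```c
/* Illustrative constants; the names are hypothetical, not from kjmp2. */
enum {
    SUBBANDS          = 32,  /* values produced by one filterbank run   */
    RUNS_PER_FRAME    = 36,  /* 32-sample runs packed into one frame    */
    SAMPLES_PER_FRAME = SUBBANDS * RUNS_PER_FRAME   /* 36 * 32 = 1152   */
};

/* Size of one Layer II frame in bytes.
 * 1152 samples per frame / 8 bits per byte gives the factor 144.
 * `padding` is the padding bit from the frame header (0 or 1). */
int mp2_frame_bytes(int bitrate_bps, int sample_rate_hz, int padding)
{
    return (SAMPLES_PER_FRAME / 8) * bitrate_bps / sample_rate_hz
         + (padding ? 1 : 0);
}
```

For example, a 192 kbit/s stream at 48 kHz has frames of `mp2_frame_bytes(192000, 48000, 0)` = 576 bytes each.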

In the encoder, the subband samples are normalized and quantized. Normalization means that for 12 consecutive samples of each of the 32 subbands a scalefactor is stored, so each subband gets three scalefactors per 36-sample frame. The sample values are then transmitted relative to the scalefactor. Quantization then reduces these relative sample values to something between 2 and 16 bits. The quantization parameters are stored on a per-frame basis and are also known as allocation information. Using allocation, subbands can also be eliminated completely. In fact, MP2 never transmits all 32 subbands: Even in the best case, the spectrum is cut off at subband 30 (which equals roughly 20.7 kHz at a 44.1 kHz sample rate).
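To make normalization and quantization a bit more concrete, here is a deliberately simplified encoder-side sketch. It assumes the scalefactor is just the peak magnitude of the group, whereas real MP2 picks it from a fixed table in the standard, and it assumes an odd number of quantization levels. All names are hypothetical. The second function shows where the cutoff frequency of a subband comes from.

```c
#define GROUP 12  /* samples of one subband that share a scalefactor */

/* Simplified sketch: returns the scalefactor (here: the peak magnitude)
 * and writes one code in 0..levels-1 per sample into q[]. */
double normalize_and_quantize(const double in[GROUP], int levels, int q[GROUP])
{
    double scale = 0.0;
    for (int i = 0; i < GROUP; i++) {
        double a = (in[i] < 0.0) ? -in[i] : in[i];
        if (a > scale) scale = a;
    }
    for (int i = 0; i < GROUP; i++) {
        /* sample relative to the scalefactor, in -1..+1 */
        double rel = (scale > 0.0) ? in[i] / scale : 0.0;
        /* map -1..+1 onto 0..levels-1 and round to nearest */
        q[i] = (int)((rel + 1.0) * 0.5 * (levels - 1) + 0.5);
    }
    return scale;
}

/* Lower edge of subband `sb` in Hz: the 32 subbands split the range
 * from 0 to half the sample rate into equal parts. */
double subband_edge_hz(int sb, double sample_rate_hz)
{
    return sb * (sample_rate_hz / 2.0) / 32.0;
}
```

With 15 levels, a group whose peak is 2.0 maps the values 2.0, -2.0 and 0.0 to the codes 14, 0 and 7. `subband_edge_hz(30, 44100.0)` gives 20671.875 Hz, which is the cutoff mentioned above.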

Stereo data can either be represented as two completely independent channels or using the so-called joint stereo encoding. In this mode, all subbands below a certain threshold are encoded like normal independent stereo signals, while the upper subbands may have different scalefactors for the left and right channels, but share the same sample values. This process is called intensity stereo coding and basically generates a mono signal with some panning for the affected subbands.
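A rough decoder-side sketch of this idea, with hypothetical names and floating-point values for readability (this is not how kjmp2 stores its data): below the threshold each channel uses its own sample, above it both channels reuse one shared sample and differ only in their scalefactors.

```c
/* Reconstruct the left (0) and right (1) value of one subband sample.
 * `bound` is the first subband that uses intensity stereo.
 * For shared subbands only sample[0] is meaningful. */
void stereo_reconstruct(int sb, int bound,
                        const double sample[2],
                        const double scalefactor[2],
                        double out[2])
{
    for (int ch = 0; ch < 2; ch++) {
        /* independent below the bound, shared sample at or above it */
        int src = (sb < bound) ? ch : 0;
        out[ch] = sample[src] * scalefactor[ch];
    }
}
```

Because the shared sample is multiplied by two different scalefactors, the affected subbands end up as a mono signal that is louder on one side than the other, which is exactly the panning effect described above.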

Fighting with the standard

The ISO 11172-3 standard is quite old and presumably never existed in digital form. However, there are some Word documents floating around on the internet which are supposedly scanned and OCR’ed versions of the original documents. This is very good, because the 512 reconstruction window coefficients (specified as decimal fractions, though the values are actually 17-bit integers divided by 65536) would have been a real pain to type in :)
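Turning such a decimal fraction back into its integer form is a one-liner. The helper below is a hypothetical illustration, not part of kjmp2: it multiplies by 65536 and rounds to the nearest integer, which absorbs the rounding of the printed decimals.

```c
/* Convert a coefficient printed as a decimal fraction into the integer
 * numerator n, so that the coefficient equals n / 65536. */
int window_coeff_to_int(double c)
{
    double scaled = c * 65536.0;
    /* round to nearest, away from zero for exact halves */
    return (int)(scaled < 0.0 ? scaled - 0.5 : scaled + 0.5);
}
```

A printed value of -0.000015259 comes out as -1, and 1.144989014 comes out as 75038, a magnitude that needs 17 bits.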

Beyond that, the standard has other flaws: While it gives a great overview of how the decoder works, it is sparse on details. Some things have to be derived from common sense or experience with other coding schemes (my video coding knowledge helped a lot there). On the other hand, there are also places with lots of redundant information, like long tables that turn out to follow a simple one-line rule. The most conspicuous example is the allocation tables: There are four huge tables, each covering a certain samplerate/bitrate range. The funny thing is that every two of them are completely identical, except for the cutoff subband. I replaced them with a 4-level hierarchical table and voilà, I got it down to 185 sparsely used bytes of table data.

However, my biggest gripe was the specification of the renormalization process (the inverse of the quantization step). The wording from the standard, »a two’s complement integer with the MSB meaning -1«, wasn’t at all helpful. This was the only point during the whole implementation that I really looked at the ffmpeg source code to figure out what’s going on. The code there wasn’t really understandable either (the whole ffmpeg source is a mess!), but at least it gave me some clue on how to solve the puzzle myself. I ended up with an implementation that is neither the one from the standard, nor the one from ffmpeg – it’s simply what it ought to be: renormalization. Instead of doing some obscure binary fraction math, I just read the number from the bitstream and scale it from (0..2), (0..6), (0..30), (0..16382) or something like that to (-32768..32767). Period. And guess what? It works just as well.
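The scaling idea can be sketched as a plain linear mapping. This is not the actual kjmp2 code, nor the formula from the standard or from ffmpeg: the function name, the sign convention and the choice of a symmetric ±32767 output range (so that the middle code lands exactly on zero, one step short of -32768) are all assumptions made for this example.

```c
#include <stdint.h>

/* Map a raw code in 0..nlevels-1 (nlevels odd, e.g. 3, 7, 31, 16383)
 * linearly onto -32767..+32767. */
int16_t renormalize(int raw, int nlevels)
{
    /* centre the code: -(nlevels-1) .. +(nlevels-1), in steps of 2 */
    int64_t centered = 2 * (int64_t)raw - (nlevels - 1);
    /* stretch to the 16-bit range; 64-bit math avoids overflow */
    return (int16_t)(centered * 32767 / (nlevels - 1));
}
```

With three levels, the codes 0, 1 and 2 become -32767, 0 and 32767. The same function handles every other level count without any special-case bit fiddling.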

The result

I got the basic decoder working after about 14 hours of work this weekend (not counting the sleep I had in the middle). To my great surprise, it worked almost from the start. There were some obvious errors like wrong loop indexes, but only one real logic error, which was easy to fix: The sound was too quiet by a factor of 1024. So the only thing that really went wrong was a miscalculation of the bit precision of my fixed-point integers. I adjusted some shifts here and there and finally I got a correct signal. Yesterday, I fixed two other obvious bugs, and now kjmp2 sounds reasonable on all input files I tested with.
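A factor of 1024 is exactly 10 bits, which is what a misjudged shift in fixed-point code produces. The text does not say which shift was wrong or which number format was used, so the following is a generic illustration with a hypothetical helper, not a reconstruction of the actual bug.

```c
#include <stdint.h>

/* Multiply two fixed-point numbers that each carry `frac_bits`
 * fractional bits. The product carries twice as many, so it has to be
 * shifted back down. Non-negative inputs are assumed here, because
 * right-shifting negative values is implementation-defined in C. */
int32_t fixmul(int32_t a, int32_t b, int frac_bits)
{
    return (int32_t)(((int64_t)a * b) >> frac_bits);
}
```

With 16 fractional bits, `fixmul(1 << 16, 3 << 16, 16)` yields `3 << 16`, i.e. 1.0 × 3.0 = 3.0. Shifting by 26 instead of 16 yields `3 << 6`, which is the same signal but 1024 times too quiet.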

The final result is a library that consists of fewer than 400 lines of code. To make the decoder usable, I also wrote example player applications for Linux and Windows. The Linux one uses OSS for output and is around 7 KiB (UPX’ed). The Windows version is substantially more interesting, though: Thanks to Crinkler, I got the whole application down to 3.63 KiB! (And this is despite the fact that things like command line parsing and audio output are much harder to do in Win32 :)

Download

kjmp2.zip (23k) — the source code and compiled example applications
example track: coming soon :)