©  Jan Newmarch 2017

Jan Newmarch, Linux Sound Programming, 10.1007/978-1-4842-2496-0_3

3. Sound Codecs and File Formats

Jan Newmarch

(1)Oakleigh, Victoria, Australia

There are many different ways of representing sound data. Some of these involve compressing the data, which may or may not lose information. Data can be stored in the file system or transmitted across the network, which raises additional issues. This chapter considers the major sound codecs and container formats.

Overview

Audio and video data needs to be represented in digital format to be used by a computer. Audio and video data contains an enormous amount of information, so digital representations of this data can occupy huge amounts of space. Consequently, computer scientists have developed many different ways of representing this information, sometimes in ways that preserve all the information (lossless) and sometimes in ways that lose information (lossy).

Each way of representing the information digitally is known as a codec. The simplest way, described in the next section, is to represent it as “raw” pulse-code modulated (PCM) data. Hardware devices such as sound cards can deal with PCM data directly, but PCM data can use a lot of space.

Most codecs will attempt to reduce the memory requirements of PCM data by encoding it to another form, called encoded data. It can then be decoded back to PCM form when required. Depending on the codec algorithms, the regenerated PCM may have the same information content as the original PCM data (lossless) or may contain less information (lossy).

Encoded audio data may or may not contain information about the properties of the data. This information may be about the original PCM data such as the number of channels (mono, stereo), the sampling rate, the number of bits in the sample, and so on. Or it may be information about the encoding process itself, such as the size of framed data. The encoded data along with this additional information may be stored in a file, transmitted across the network, and so on. If this is done, the encoded data plus the additional information is amalgamated into a container.

It is important at times to know whether you are dealing with just the encoded data or with a container that holds this data. For example, files on disk will normally be containers, holding additional information along with the encoded data. But audio data manipulation libraries will typically deal with the encoded data itself, after the additional data has been removed.

PCM

This definition comes from Wikipedia: “Pulse-code modulation is a method used to digitally represent sampled analog signals. It is the standard form for digital audio in computers and various Blu-ray, DVD, and CD formats, as well as other uses such as digital telephone systems. A PCM stream is a digital representation of an analog signal, in which the magnitude of the analog signal is sampled regularly at uniform intervals, with each sample being quantized to the nearest value within a range of digital steps.”

PCM streams have two basic properties that determine their fidelity to the original analog signal: the sampling rate, which is the number of times per second that samples are taken, and the bit depth, which determines the number of possible digital values that each sample can take.
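
These two properties can be illustrated in a few lines of code. The following sketch samples an analog signal (a sine wave; the 440 Hz tone and the 8,000 Hz rate are assumptions chosen for illustration) and quantizes each sample to a 16-bit signed integer:

```python
import math

# Quantize a 440 Hz sine wave at an 8,000 Hz sampling rate to
# 16-bit signed samples -- the two properties that fix PCM fidelity.
SAMPLE_RATE = 8000                    # samples per second
BIT_DEPTH = 16                        # bits per sample
MAX_AMP = 2 ** (BIT_DEPTH - 1) - 1    # 32767 for 16-bit signed

def sample_sine(freq_hz, duration_s):
    """Return quantized PCM samples of a sine wave."""
    n = int(SAMPLE_RATE * duration_s)
    return [int(MAX_AMP * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE))
            for t in range(n)]

samples = sample_sine(440, 0.01)      # 10 ms of audio -> 80 samples
print(len(samples))                   # 80
```

Doubling the sampling rate or the bit depth doubles the amount of data, which is why the trade-off between fidelity and size matters so much to the codecs below.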

PCM data can be stored in files as “raw” data. In this case, there is no header information to say what the sampling rate and bit depth are. Many tools such as sox use the file extension to determine these properties.

According to http://sox.sourceforge.net/soxformat.html , “f32 and f64 indicate files encoded as 32- and 64-bit (IEEE single and double precision) floating-point PCM, respectively; s8, s16, s24, and s32 indicate 8, 16, 24, and 32-bit signed integer PCM, respectively; u8, u16, u24, and u32 indicate 8, 16, 24, and 32-bit unsigned integer PCM, respectively.”

But note that the file extension is only an aid: it tells tools such as sox some of the PCM codec parameters and how the samples are stored, since the file itself carries no header to describe them.
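
As a small illustration (the file name is an assumption, and sox itself is not invoked here), 16-bit signed little-endian samples can be written as "raw" PCM with Python's struct module. Because there is no header, the file size is exactly two bytes per sample:

```python
import os
import struct
import tempfile

# Write 16-bit signed little-endian samples as headerless "raw" PCM.
# A tool such as sox would rely on the .s16 extension (or explicit
# flags) to know the encoding, rate, and channel count.
samples = [0, 1000, -1000, 32767, -32768]
raw = struct.pack("<%dh" % len(samples), *samples)

path = os.path.join(tempfile.mkdtemp(), "tone.s16")
with open(path, "wb") as f:
    f.write(raw)

print(os.path.getsize(path))   # 10 -- two bytes per sample, no header
```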

WAV

WAV is a container format: a file wrapper around audio data, which is often PCM. The file format is based on the Resource Interchange File Format (RIFF). While it is a Microsoft/IBM format, it does not seem to be encumbered by patents.

A good description of the format is given by Topherlee ( www.topherlee.com/software/pcm-tut-wavformat.html ). The WAV file header contains information about the PCM codec and also about how it is stored (for example, little- or big-endian).

Because WAV files usually contain uncompressed audio data, they are often huge: at CD quality (stereo, 16-bit, 44,100 Hz), around 30MB for a three-minute song.
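
The stdlib wave module in Python writes the RIFF/WAV header for you, which makes it easy to see the header fields in action. A minimal sketch (the CD-quality parameters are an assumption for illustration):

```python
import os
import struct
import tempfile
import wave

# Write a tiny WAV file; the wave module produces the RIFF header
# describing the PCM parameters (channels, sample width, rate).
path = os.path.join(tempfile.mkdtemp(), "demo.wav")

with wave.open(path, "wb") as w:
    w.setnchannels(2)              # stereo
    w.setsampwidth(2)              # bytes per sample -> 16-bit
    w.setframerate(44100)          # CD-quality sampling rate
    w.writeframes(struct.pack("<4h", 0, 100, -100, 0))  # 2 stereo frames

# Read the header back: the parameters survive the round trip.
with wave.open(path, "rb") as w:
    params = (w.getnchannels(), w.getsampwidth() * 8, w.getframerate())
print(params)                      # (2, 16, 44100)

# The uncompressed payload for three minutes at these settings:
print(44100 * 2 * 2 * 180)         # 31752000 bytes, about 30MB
```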

MP3

The MP3 and related formats are covered by a patent (actually, a whole lot of patents). To use an encoder or decoder, users should pay a license fee to an organization such as the Fraunhofer Society. Most casual users neither do this nor are aware that they should, but it is reported by Fraunhofer ( www.itif.org/files/2011-fraunhofer-boosting-comp.pdf ) that in 2011 the MP3 patent “generates annual tax revenue of about $300 million.” The Fraunhofer Society has currently chosen not to pursue free open source implementations of encoders and decoders for royalties.

The codec used by MP3 is the MPEG-1 Audio Layer III ( http://en.wikipedia.org/wiki/MP3 ) audio compression format. This includes a header component that gives all the additional information about the data and the compression algorithm. There is no need for a separate container format.
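
The fixed fields of that header can be decoded with a few bit operations. The following is a minimal sketch (the function name is mine, and only the MPEG-1 Layer III bitrate and sample-rate tables are handled, not the full family of MPEG versions and layers):

```python
# Standard MPEG-1 Layer III tables; indexes 0 and 15 are invalid/free.
BITRATES_KBPS = [0, 32, 40, 48, 56, 64, 80, 96, 112,
                 128, 160, 192, 224, 256, 320, 0]
SAMPLE_RATES = [44100, 48000, 32000, 0]

def parse_mp3_header(b):
    """Decode the fixed fields of a 4-byte MPEG audio frame header."""
    if b[0] != 0xFF or (b[1] & 0xE0) != 0xE0:
        raise ValueError("no frame sync")
    version = (b[1] >> 3) & 0x3          # 3 -> MPEG-1
    layer = (b[1] >> 1) & 0x3            # 1 -> Layer III
    bitrate = BITRATES_KBPS[b[2] >> 4] * 1000
    rate = SAMPLE_RATES[(b[2] >> 2) & 0x3]
    return version, layer, bitrate, rate

# A typical 128 kbps, 44.1 kHz frame header:
print(parse_mp3_header(bytes([0xFF, 0xFB, 0x90, 0x00])))
# (3, 1, 128000, 44100)
```

Because every frame begins with such a header, an MP3 stream is self-describing and can be cut or joined at frame boundaries, which is why no separate container is needed.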

Ogg Vorbis

Ogg Vorbis is one of the “good guys.” According to Vorbis.com, “Ogg Vorbis is a completely open, patent-free, professional audio encoding and streaming technology with all the benefits of open source.”

The names break down as follows:

  • Ogg: Ogg is the name of Xiph.org’s container format for audio, video, and metadata. It puts the stream data into frames that are easier to manage in files and other transports.

  • Vorbis: Vorbis is the name of a specific audio compression scheme that’s designed to be contained in Ogg. Note that other formats are capable of being embedded in Ogg such as FLAC and Speex.

The extension .oga is preferred for Ogg audio files, although .ogg was previously used.

At times it is necessary to be closely aware of the distinction between Ogg and Vorbis. For example, OpenMAX IL has a number of standard audio components including one to decode various codecs. The LIM component with the role “audio decoder ogg” can decode Vorbis streams. But even though the component includes the name ogg, it cannot decode Ogg files, which are the containers of Vorbis streams. It can only decode the Vorbis stream. Decoding an Ogg file requires using a different component, referred to as an “audio decoder with framing.”

WMA

From the standpoint of open source, WMA files are evil. WMA files are based on two Microsoft proprietary formats. The first is the Advanced Systems Format (ASF) file format, which describes the “container” for the music data. The second is the Windows Media Audio 9 codec.

ASF is the primary problem. Microsoft has a published specification ( www.microsoft.com/en-us/download/details.aspx?id=14995 ) whose license is strongly antagonistic to anything open source. The license states that if you build an implementation based on that specification, then you:

  • Cannot distribute the source code

  • Can only distribute the object code

  • Cannot distribute the object code except as part of a “solution” (in other words, libraries seem to be banned)

  • Cannot distribute your object code for no charge

  • Cannot set your license to allow derivative works

And what’s more, you are not allowed to begin any new implementation after January 1, 2012, and (at the time of writing) it is already 2017!

Just to make it a little worse, Microsoft has patent 6041345, “Active stream format for holding multiple media streams” ( www.google.com/patents/US6041345 ), which was filed in the United States on March 7, 1997. The patent appears to cover the same ground as many other such formats that were in existence at the time, so the standing of this patent (were it to be challenged) is not clear. However, it has been used to block the GPL-licensed project VirtualDub ( www.advogato.org/article/101.html ) from supporting ASF. The status of patenting a file format is a little suspect but may become a little clearer now that Oracle has lost its claim to patent the Java API.

The FFmpeg project ( http://ffmpeg.org/ ) has nevertheless done a clean-room implementation of ASF, reverse-engineering the file format and not using the ASF specification at all. It has also reverse-engineered the WMA codec. This allows players such as MPlayer and VLC to play ASF/WMA files. FFmpeg itself can also convert from ASF/WMA to better formats such as Ogg Vorbis.

There is no Java handler for WMA files, and given the license, there is unlikely to be one unless it is a native-code one based on FFmpeg.

Matroska

According to the Matroska web site ( http://matroska.org/ ), Matroska aims to become the standard of multimedia container formats. It was derived from a project called MCF but differs from it significantly because it is based on the Extensible Binary Meta Language (EBML), a binary derivative of XML. It incorporates features you would expect from a modern container format, such as the following:

  • Fast seeking in the file

  • Chapter entries

  • Full metadata (tags) support

  • Selectable subtitle/audio/video streams

  • Modularly expandable

  • Error resilience (can recover playback even when the stream is damaged)

  • Streamable over the Internet and local networks (HTTP, CIFS, FTP, and so on)

  • Menus (like DVDs have)
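
EBML, the base of Matroska, encodes element sizes as variable-length integers: the number of leading zero bits in the first byte (plus one) gives the total length, and the length-marker bit is masked off. The following is a minimal sketch of that decoding rule (the helper is hypothetical and covers only size decoding, not full EBML parsing):

```python
def read_vint(data):
    """Decode one EBML variable-length integer; return (value, length)."""
    first = data[0]
    length = 1
    mask = 0x80
    # The position of the first set bit gives the total byte length.
    while length <= 8 and not (first & mask):
        length += 1
        mask >>= 1
    value = first & (mask - 1)        # strip the length-marker bit
    for b in data[1:length]:
        value = (value << 8) | b
    return value, length

print(read_vint(bytes([0x82])))        # (2, 1): one-byte encoding of 2
print(read_vint(bytes([0x40, 0x02])))  # (2, 2): two-byte encoding of 2
```

The same value can be encoded in several widths, which is one of the properties that makes Matroska modularly expandable and resilient to damage.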

I hadn’t come across Matroska until I started looking at subtitles,1 which can be (optionally) added to videos, where it seems to be one of the major formats.

A tool to create and manage subtitles in Matroska file format (MKV) files is mkvmerge, part of MKVToolNix ( https://mkvtoolnix.download/ ) and available in the Ubuntu repositories; MKVToolNix also provides a GUI for handling MKV files.

Conclusion

There are many codecs for sound, and more are being devised all the time. They vary between being codecs, containers, or both, and they come with a variety of features, some with encumbrances such as patents.

Footnotes

1 Subtitles and closed captions are similar but distinct. According to https://www.accreditedlanguage.com/2016/08/18/subtitles-and-captions-whats-the-difference/ , “Subtitling is most frequently used as a way of translating a medium into another language so that speakers of other languages can enjoy it. Captioning, on the other hand, is more commonly used as a service to aid deaf and hearing-impaired audiences.”
