Since the GIMP is freely available on Windows, some would say that it's pointless to keep writing basic open-source or freeware image editors. I'm not one of those people, mind you, but some questions keep occurring to me: why are there so many image manipulation programs available and in development, especially when MS Paint is a vast improvement on many of them? And why are there so few audio editors, when such basic functionality as Wave file division is missing from the easily available freeware?
My guess is that people feel more comfortable working with images. After all, there are loads of pre-existing image controls, and comparatively few audio ones. There also seems to be a continuing refrain throughout the coding community: "audio manipulation is a black art". With this series of articles, I'll describe the design and creation of a basic (but hopefully, robust and powerful) command line "swiss army knife" for audio file manipulation. We'll be building the tool modularly, so it should be easy for you to make (and hopefully release!) your own contributions.
In this article, we'll discuss the RIFF file format, and more specifically the PCM RIFF-Wave. We'll detail the most common data structures that compose it, and briefly discuss the variants you might see. Finally, we'll develop a "profiler" that parses the relevant file data, loads it into memory, and outputs it as XML.
RIFF is an all-purpose multimedia file format created by Microsoft and IBM, way back in 1991. Wave audio isn't the only multimedia stored in a RIFF file; AVI video uses RIFF, too. (For more information on the history of RIFF and its Amiga ancestor, IFF, see Wikipedia.)
Every RIFF file starts with a twelve-byte header containing three four-byte fields. The data structure is this:
public string sGroupID; // Surprisingly enough, this is always "RIFF"
public uint dwFileLength; // File length in bytes, measured from offset 8
public string sRiffType; // In wave files, this is always "WAVE"
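If you'd like to peek at these fields yourself before we build the real reader, here's a minimal sketch using .NET's BinaryReader, which reads multi-byte values in little-endian order (as the Wave format requires). The file name is hypothetical, error handling is omitted, and you'll need using System.IO;:
// Read the twelve-byte RIFF header from the start of a wave file.
BinaryReader br = new BinaryReader(File.OpenRead("test.wav"));
string sGroupID = new string(br.ReadChars(4));  // should be "RIFF"
uint dwFileLength = br.ReadUInt32();            // file length in bytes, minus 8
string sRiffType = new string(br.ReadChars(4)); // should be "WAVE"
br.Close();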
RIFFs are composed of sections called, awkwardly enough, "chunks". Each chunk starts with an eight-byte header:
public string sChunkID; // Four bytes: "fmt ", "data", "fact", etc.
public uint dwChunkSize; // Length of the chunk's data, in bytes
Unfortunately, while some official documents established the basics of the file format, no official standard was ever published for the wave file. In the absence of official documents, people did what they do best: improvise. As a result, there are many different chunk types, many of which duplicate and triplicate functionality. For the time being, we'll ignore most of these chunk types, and focus on the two that are guaranteed to be in every wave file: the format chunk and the data chunk.
The format chunk details all the necessary information about the audio data, including the format of the audio (we assume, for now, uncompressed Pulse Code Modulation audio), the number of channels (mono, stereo, quadraphonic, 5-channel), the audio's sampling frequency, the number of bits per audio sample (usually 8 or 16), and the number of bytes in a frame. The data structure for this chunk is this:
public string sChunkID; // Four bytes: "fmt "
public uint dwChunkSize; // Length of the chunk's data, in bytes
public ushort wFormatTag; // 1 if uncompressed Microsoft PCM audio
public ushort wChannels; // Number of channels
public uint dwSamplesPerSec; // Sampling frequency of the audio in Hz
public uint dwAvgBytesPerSec;// dwSamplesPerSec * wBlockAlign; for buffer estimation
public ushort wBlockAlign; // Sample frame size in bytes
public uint dwBitsPerSample; // Bits per sample, per channel
Wait just a minute, you say. Frame? A frame is the same thing as a sample, but not the same thing as a sample. Understand? You've got to keep this straight; it's very important. OK, OK, I'll explain. A frame is one whole multichannel audio sample. The dwSamplesPerSec field actually gives the number of frames per second. The dwBitsPerSample field, on the other hand, refers to the number of bits in a single channel of one frame.
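To keep the terms straight, here's the arithmetic for the 16-bit stereo, 44.1 kHz file we'll profile below; nothing here is new, it just restates the relationships between the fields as code:
ushort wChannels = 2;         // stereo
uint dwBitsPerSample = 16;    // bits in one channel of a frame
uint dwSamplesPerSec = 44100; // frames per second, despite the name

uint frameSize = wChannels * (dwBitsPerSample / 8); // wBlockAlign: 4 bytes
uint bytesPerSec = dwSamplesPerSec * frameSize;     // dwAvgBytesPerSec: 176400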
The other chunk is even more essential: the data chunk. As you might expect, it contains all the PCM audio data. It has a very simple data structure:
public string sChunkID; // Four bytes: "data"
public uint dwChunkSize; // Length of the chunk's data, in bytes
// Different arrays for the different frame sizes
public byte[] byteArray; // 8-bit unsigned data; or...
public short[] shortArray; // 16-bit signed data
What effect does the signed/unsigned convention have on our data? Well, that's a very good question, and one we'll address when we take apart a sample file below.
There's yet another complication, though: there's no guaranteed order for the chunks! Since no standard was ever published, it's technically legal to put the data chunk, which stores the actual audio data, in front of the fmt (format) chunk that tells the user how to process it. Though this is almost never done, a well-written audio program will account for it anyway. A more common mistake made by new audio programmers is to assume that the fmt header comes first in a file; while this is usually the case, there are several audio programs out there that generate non-compliant Wave files.
One final thing to note with regard to RIFF chunks: they must occupy an even number of bytes. A chunk whose data has an odd length must be padded out with a single zero byte, which is not counted in the chunk size. For our immediate purposes, there's only one case in which this can happen: the data stream of an 8-bit mono file.
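In code, the padding rule is a one-liner. A small sketch (the helper name is my own invention):
// A chunk whose data length is odd is followed by a single zero pad byte;
// the pad byte is not counted in dwChunkSize.
static uint PaddedSize(uint dwChunkSize)
{
    return dwChunkSize + (dwChunkSize & 1);
}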
This is all well and good, but what do things look like inside the file? Will the old Lilliputian struggle ("Big end!" "No, little end!") rear its sinister head again? If you guessed yes, you're right, and you can skip ahead two paragraphs. If you have no idea what those terms mean, here's a little digression.
Long ago, in a country called Intellia, a microprocessor designer said, "Fa! These Motorolia engineers have made things too easy with their Big-Endian memory storage. When their processors write an integer to disk, its bytes go, one by one, in order, onto the disk. The stack fills downwards. Things work the way an assembly programmer would enjoy. We must end this!" He stopped, for a moment, contemplating evil and counterintuitive ways of writing assembly code, to ponder a way to make memory storage more difficult and confusing for the programmer. "I have it!" he cried, "we'll make the stack fill upwards, and store sequential bytes... BACKWARD! Bwahahaha!"
Well, the true story has more to do with inconvenient things called "patents", but the important thing to remember is this: big-endian systems (using Motorola chips and their descendants) store data to disk in the same byte order it's arranged in memory. If you have a short value 0x4567 and write it to disk on a big-endian system, it will be stored on disk as 0x4567. On a little-endian system (using Intel chips and their relatives), it will be stored as 0x6745. The bits of each byte stay in the same order, but the order of the bytes changes.
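You can watch this from C# itself; a tiny sketch (BitConverter uses whatever byte order the machine it runs on uses, so the results shown assume ordinary Intel-style hardware):
byte[] bytes = BitConverter.GetBytes((short)0x4567);
Console.WriteLine("{0:X2} {1:X2}", bytes[0], bytes[1]); // prints "67 45"
Console.WriteLine(BitConverter.IsLittleEndian);         // True on little-endian systems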
So, is the Wave file's data stored in little-endian or big-endian format? Given that the format was put together by Microsoft and IBM, it should be no surprise that it uses little-endian format for both field and audio data.
In the image below, you see all the headers for the file: the three 32-bit (double-WORD) fields of the RIFF header (highlighted in red), the fields of the format chunk (highlighted in green), the fields of the fact chunk (in blue; see the end of this article for a very brief discussion of this chunk), and the very beginning of the data chunk (in yellow).
When you convert the hex values to decimal, you should obtain the following values:
<?xml version="1.0"?>
<WaveFile xmlns:xsd="http://www.w3.org/2001/XMLSchema"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <maindata>
    <sGroupID>RIFF</sGroupID>
    <dwFileLength>407534</dwFileLength>
    <sRiffType>WAVE</sRiffType>
  </maindata>
  <format>
    <sChunkID>fmt </sChunkID>
    <dwChunkSize>18</dwChunkSize>
    <wFormatTag>1</wFormatTag>
    <wChannels>2</wChannels>
    <dwSamplesPerSec>44100</dwSamplesPerSec>
    <dwAvgBytesPerSec>176400</dwAvgBytesPerSec>
    <wBlockAlign>4</wBlockAlign>
    <dwBitsPerSample>16</dwBitsPerSample>
  </format>
  <fact>
    <sChunkID>fact</sChunkID>
    <dwChunkSize>4</dwChunkSize>
    <dwNumSamples>101871</dwNumSamples>
  </fact>
  <data>
    <sChunkID>data</sChunkID>
    <dwChunkSize>407484</dwChunkSize>
  </data>
</WaveFile>
[Why XML? Well, for one reason, because it's easy to navigate the output. Also, though, because an XML document may come in handy later on when we want to implement more advanced features -- the XML document may save peak values, a record of changes, etc. .NET's XML serialization support makes XML a very convenient format for organized data storage and access. Finally, it's a good thing to have some experience with. CodeProject is all about learning, so if you've never worked with XML before, you have a reason to start now.]
Hopefully, you understand how we retrieved all the various information in the XML above. Just one more comment on the diagram above and we'll look at some of the actual code. As you can see, there is one final double-word after the chunkID and chunkSize double-words in the data chunk. As you might guess, this is the first frame. Since the file is 16-bit stereo, we know which bytes are what: F4 06 3E FF is the first frame, 0x06F4 is the first sample in the left stereo channel, and 0xFF3E is the first sample in the right stereo channel. It might be a useful exercise to figure out what these four bytes would mean in other configurations.
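If you'd rather let the framework do the byte-shuffling, BitConverter assembles the samples for you; a little sketch with those four bytes hard-coded (again assuming a little-endian machine):
byte[] frame = { 0xF4, 0x06, 0x3E, 0xFF };    // the first frame from the file above
short left = BitConverter.ToInt16(frame, 0);  // 0x06F4 =  1780
short right = BitConverter.ToInt16(frame, 2); // 0xFF3E = -194 as signed 16-bit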
If you're interested in even more information on RIFF and RIFF/Wave files, check out the SonicSpot guide and an amateur (in the good sense!) attempt at a comprehensive RIFF/Wave specification.
If, on the other hand, you'd like to just get to the code, read on.
Although we're going to call the software WaveEdit from the beginning, it might be more appropriately called "WaveInfo", since this first version merely gets the Wave data and writes it to XML. Before we dive into the code base, though, let's define a few requirements.
There are a few standard operations every decent audio editor must do: volume adjustment, file truncation, pitch/tempo control, and maybe fade-ins/fade-outs. In addition to these functions, which we'll be adding in the rest of the series, I'm adding Wave file division. If you've ever recorded on your own, or transferred a vinyl record or an audio tape to CD, you'll understand why this is necessary. It's a relatively straightforward function, but none of the currently available freeware audio software seems to support it.
Most of the code is very straightforward (and, I believe, fairly readable). We'll spotlight a few sections of code: the interesting sections, the confusing sections, and the sections I feel like discussing.
The first thing you'll notice in EntryPoint.cs is the initialization of the XML serializer. Let's put all the XML code in one place:
XmlSerializer xmlout = new XmlSerializer(typeof(WaveFile));
Stream writer = new FileStream(args[1], FileMode.Create);
...
xmlout.Serialize(writer, contents);
The XML serializer creates a "template" for the XML file depending on the class type that's passed to it. We pass it WaveFile, which has the following class definition:
public class WaveFile {
public riffChunk maindata;
public fmtChunk format;
public factChunk fact;
public dataChunk data;
}
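Before we move on, one small caution about the serializer snippet above: the FileStream should be closed once serialization finishes. A minimal variant with deterministic cleanup might look like this (the output file name is hypothetical):
XmlSerializer xmlout = new XmlSerializer(typeof(WaveFile));
using (Stream writer = new FileStream("output.xml", FileMode.Create))
{
    xmlout.Serialize(writer, contents); // writes the whole WaveFile graph as XML
} // the using block flushes and closes the file, even if Serialize throws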
These "chunkdata" data structures are defined in Structs.cs and contain (mostly) the data you've already seen. The riffChunk class includes a field to store the filename:
public class riffChunk {
public string FileName;
// These three fields constitute the riff header
public string sGroupID; // RIFF
public uint dwFileLength; // In bytes, measured from offset 8
public string sRiffType; // WAVE, usually
}
The dataChunk class includes four new fields:
public class dataChunk {
public string sChunkID; // Four bytes: "data"
public uint dwChunkSize; // Length of the chunk's data, in bytes
public long lFilePosition; // Position of data chunk in file
public uint dwMinLength; // Length of audio in minutes
public double dSecLength; // Length of audio in seconds
public uint dwNumSamples; // Number of audio frames
}
lFilePosition is used to store the position of the beginning of the audio data in the file; this will aid us in editing later. dwMinLength and dSecLength are primarily for human benefit in the XML file. Finally, dwNumSamples duplicates a field from the fact header, which is discussed very briefly at the end of this article.
We use a custom FileReader called WaveFileReader to retrieve the data from the Wave files. In addition to conforming to good coding conventions, this streamlines the code: in the EntryPoint class, we just look at the "big picture", while in the WaveFileReader class, we only care about what's going on in one small place at a time. The resulting code is very easy to understand:
WaveFileReader reader = new WaveFileReader(args[0]);
WaveFile contents = new WaveFile();
contents.maindata = reader.ReadMainFileHeader();
contents.maindata.FileName = args[0];
How do we solve the problem of reading in chunks in a possibly random order? A while loop and a series of if statements will serve our purposes nicely:
string temp;
while (reader.GetPosition() < (long)contents.maindata.dwFileLength)
{
    temp = reader.GetChunkName();
    if (temp == "fmt ")
    {
        contents.format = reader.ReadFormatHeader();
        if (reader.GetPosition() +
            contents.format.dwChunkSize ==
            contents.maindata.dwFileLength)
            break;
    }
    else if (temp == "fact")
    {
        contents.fact = reader.ReadFactHeader();
        if (reader.GetPosition() +
            contents.fact.dwChunkSize ==
            contents.maindata.dwFileLength)
            break;
    }
    else if (temp == "data")
    {
        contents.data = reader.ReadDataHeader();
        if (reader.GetPosition() +
            contents.data.dwChunkSize ==
            contents.maindata.dwFileLength)
            break;
    }
    else
    {   // This provides the required skipping of unsupported chunks.
        reader.AdvanceToNext();
    }
}
Finally, we'll dig into the WaveFileReader code. WaveFileReader has the same fields as WaveFile, for reasons that will soon become clear. There's also the BinaryReader reader, which is what we'll use to access the Wave file. We initialize reader with a custom constructor.
public class WaveFileReader : IDisposable
{
    BinaryReader reader;

    riffChunk mainfile;
    fmtChunk format;
    factChunk fact;
    dataChunk data;

    public WaveFileReader(string filename)
    {
        reader = new BinaryReader(new FileStream(filename,
            FileMode.Open, FileAccess.Read, FileShare.Read));
    }

    // IDisposable requires a Dispose method; we use it to
    // release the underlying file stream.
    public void Dispose()
    {
        if (reader != null)
            reader.Close();
    }
}
None of the fields in WaveFileReader are public (that's what WaveFile is for!), so we need to write interface methods where appropriate. We especially need methods to deal with reader, since it's the most important piece of the whole structure. At the very minimum, we need functions to report the current position in the file, read the name of the next chunk, and skip over a chunk we don't recognize:
public long GetPosition() { return reader.BaseStream.Position; }
public string GetChunkName() { return new string(reader.ReadChars(4)); }
public void AdvanceToNext()
{
    // Get the size of the chunk we're skipping
    long NextOffset = (long)reader.ReadUInt32();
    // Seek that many bytes forward from the current position
    reader.BaseStream.Seek(NextOffset, SeekOrigin.Current);
}
These "general filestream" functions are in the General Utilities #region of WaveFileReader.cs.
Finally, we have the header extraction functions. These are largely the same, so we'll just look at the most complicated of them... which isn't very complicated after all.
public dataChunk ReadDataHeader()
{
    data = new dataChunk();

    data.sChunkID = "data";
    data.dwChunkSize = reader.ReadUInt32();
    // ReadUInt32 is the most important function here.
    // Once we've read in the ChunkSize,
    // we're at the start of the actual data.
    data.lFilePosition = reader.BaseStream.Position;

    // If the fact chunk exists, we don't have to calculate
    // the number of samples ourselves.
    if (fact != null)
        data.dwNumSamples = fact.dwNumSamples;
    else
        data.dwNumSamples = data.dwChunkSize /
            (format.dwBitsPerSample / 8 * format.wChannels);
    // The above could be written as data.dwChunkSize / format.wBlockAlign,
    // but I want to emphasize what the frames look like.

    data.dwMinLength = (data.dwChunkSize / format.dwAvgBytesPerSec) / 60;
    data.dSecLength = ((double)data.dwChunkSize /
        (double)format.dwAvgBytesPerSec) -
        (double)data.dwMinLength * 60;

    return data;
}
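Plugging in the numbers from the profiled file shows how those last two lines behave; a quick check with the values from the XML output above:
uint dwChunkSize = 407484;      // data chunk size, from the XML
uint dwAvgBytesPerSec = 176400; // from the format chunk

uint dwMinLength = (dwChunkSize / dwAvgBytesPerSec) / 60; // 2 / 60 = 0 minutes
double dSecLength = (double)dwChunkSize / dwAvgBytesPerSec
                  - dwMinLength * 60.0;                   // ~2.31 seconds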
At this point, I have enough material and motivation for a three-part series. The next part will cover pitch and volume adjustment; the third will cover truncation and file division. If there's a lot of interest, though, the series can be extended to cover many things, from digital signal processing to fast Fourier transforms (for viewing the frequency spectra of the file). See you in part 2!
The fact header

The fact header is alarmingly straightforward. The data structure is merely:
public string sChunkID; // Four bytes: "fact"
public uint dwChunkSize; // Length of the chunk's data, in bytes
public uint dwNumSamples; // Number of audio frames
The number of samples can be calculated by multiplying format.dwSamplesPerSec by the length of the audio in seconds.
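As a quick sanity check against the profiled file (all values come from the XML output above):
// 407484 data bytes at 4 bytes per frame gives 101871 frames,
// matching the dwNumSamples reported in the fact chunk.
uint dwNumSamples = 407484 / 4;          // 101871
double seconds = dwNumSamples / 44100.0; // ~2.31 seconds of audio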