Notes on Audio Granulation

Retiming audio with granulation, plus a simple sample project as a Unity WebGL example.

If we had the data for an audio clip and asked someone how to change the playback speed, we’d likely get one of two common answers: “change the rate at which we process and play the audio” or “resample the audio data to go through the samples faster or slower.” The problem is that naively changing the timing also changes the playback frequency.

By going through the audio data twice as fast, we double the frequency of the audio content.
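To make that concrete, here’s a minimal sketch of the naive approach in C#. The NaiveResample class and its Resample method are names made up for this illustration (they aren’t from the project), and linear interpolation is just one simple way to read between samples.

```csharp
using System;

public static class NaiveResample
{
    // Hypothetical helper, not part of the project.
    // Steps through the source at `speed` source samples per output sample,
    // linearly interpolating between neighbours. speed = 2 halves the
    // duration (and doubles every frequency); speed = 0.5 does the opposite.
    public static float[] Resample(float[] src, float speed)
    {
        if (speed <= 0f)
            throw new ArgumentOutOfRangeException(nameof(speed));

        int outCount = (int)(src.Length / speed);
        float[] dst = new float[outCount];

        for (int i = 0; i < outCount; ++i)
        {
            float pos = i * speed;                        // Read position in the source.
            int i0 = Math.Min((int)pos, src.Length - 1);  // Sample at or before pos.
            int i1 = Math.Min(i0 + 1, src.Length - 1);    // Next sample, clamped at the end.
            float t = pos - i0;                           // Fraction between the two.
            dst[i] = src[i0] + (src[i1] - src[i0]) * t;
        }
        return dst;
    }
}
```

With a speed of 2 the output has half as many samples, so a tone that needed one second to get through 100 cycles now gets through them in half a second, which is 200 Hz instead of 100 Hz.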

So how do we change the timing of the audio, while keeping the same pitch?

Examples

We’ll start with a few audio examples to define the problem and preview the ideal result. Here’s the original audio.

Sample audio.

And here’s that same audio sample with the audio rate divided by two and multiplied by two. Note that this changes not only the length of the playback but also the audio frequencies.

Sample audio at half speed.
Sample audio at double speed.

How do we change the speed of the audio content without changing the frequency? The answer is audio granulation, where audio is chopped into little segments (i.e., grains), processed, retimed, and then stitched back together.
Is it just me, or is it odd that shredding something into grains is spelled granulation instead of grainulation?

Sample audio at a higher speed, but same pitch.
Two video applications that allow slowing down the playback speed – where slowing it down doesn’t change the audio pitch.
Left) VLC.
Right) YouTube.

Graining The Audio

For example, let’s take this audio waveform:

The example waveform.

Let’s cut it into segments. For this example, we’ll use uniform grain widths. For each grain, we slice out a region of audio and tag it with a timestamp of where it originally came from.

Cutting the example waveform into grains.

In this example, the grains are very wide, but it helps make things more visible.
And it creates less work for me to chop up the image.
The multiple rows in the diagrams don’t mean anything special; they just make it easier to show overlapping elements on a timeline.

Notice how we have multiple layers that overlap each other. Now if we wanted to make this audio sample slower, we make it take longer by separating out the grains:

Scaling the grain start times with a value over 1.0, to make them cover more time.

And to do the opposite, to speed up the audio, let’s squish the grains:

Scaling the grain times with a value less than 1.0, to make them cover less time.

You can play around with the grain size you extract, and you can also take the knowledge of how you’re going to scale the grains into consideration when figuring out the grain width.
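As one way of taking the scale into account, here’s a small sketch. It assumes uniform grains whose starts are a fixed hop apart in the original audio; the GrainSizing class and the hop, scale, and edge parameters are made up for this illustration and aren’t taken from the project.

```csharp
using System;

public static class GrainSizing
{
    // Hypothetical helper, not part of the project.
    // hop:   seconds between consecutive grain starts in the original audio.
    // scale: factor the start times will be multiplied by (over 1.0 slows down).
    // edge:  how long, in seconds, neighbouring grains should overlap to crossfade.
    //
    // After retiming, grain k starts at k * hop * scale. For it to reach
    // `edge` seconds into grain k + 1, it has to be at least this wide.
    public static float MinWidth(float hop, float scale, float edge)
    {
        return hop * scale + edge;
    }
}
```

For example, grains cut every 0.05 seconds and slowed down by 2.0 with a 0.01 second crossfade need to be at least 0.11 seconds wide, or gaps start to open up between them.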

Restitching The Grains

In the previous diagrams, when the grains are spread apart there are moments where only 1 grain is playing; when they are squished together there are moments where up to 3 grains overlap.

To make sure the loudness of the audio doesn’t jump around when we enter or exit regions of overlapping grains, the grains are given envelopes. These make it so that as we approach the edges of the grain, the gain attenuates to zero. This allows grains to crossfade into each other.

An example of crossfading, where transparency is used to indicate a grain’s weight envelope. Note that this envelope is a weighting and not the loudness: its contribution is relative to whatever else it overlaps with at a given time, combined as a weighted average.

If you’re smart about it, you can make it so that there are always 2 overlapping grains that perfectly crossfade into each other. But you can also blindly throw in overlapping grains and perform a weighted average to allow for an arbitrary number of overlapping grains at any point, for any random envelope value.

In the image below, we ensure that no more than 2 grains overlap and that the envelopes at any moment sum to 1.0, so stitching simply involves adding their enveloped samples.

An example of envelopes that crossfade perfectly into each other. Stitching the audio then only requires adding all the samples at each moment in time.
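Here’s a small sketch of why that works, assuming linear fades. The Envelope class is made up for this illustration; its inEdge and outEdge parameters mirror the grain fields of the same name, but this isn’t the project’s code.

```csharp
using System;

public static class Envelope
{
    // Hypothetical helper, not part of the project.
    // t is the time, in seconds, since the start of the grain.
    // The weight ramps linearly from 0 to 1 over inEdge, holds at 1,
    // then ramps back down to 0 over outEdge.
    public static float Weight(float t, float width, float inEdge, float outEdge)
    {
        if (t < 0f || t > width)
            return 0f;

        float w = 1f;
        if (inEdge > 0f)
            w = Math.Min(w, t / inEdge);
        if (outEdge > 0f)
            w = Math.Min(w, (width - t) / outEdge);
        return w;
    }
}
```

If every grain has the same width and edge, and each grain starts (width - edge) seconds after the previous one, then inside an overlap the outgoing weight is (width - t) / edge and the incoming weight is (t - width + edge) / edge. Those two add up to edge / edge, which is 1.0, so the samples can simply be summed.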

Example

So here’s a raw C# implementation in Unity. The interface is a bit primitive and rigamaroley, but it does the job. The source for the project can be accessed from the GitHub repo.

To test the sample:

  • Provide a short audio sample with a mic (microphone support is a bit touch-and-go) or use the included sample.
  • Select how the audio should be granulated, and then press the “Granulate” button.
  • Select a restitching playback speed, and hit the “Reconstruct” button to rebuild the audio and play it.

There’s also some weird clinking sound in the browser version – no idea what that is. Actually, the browser has weird finicky issues in general – if it starts permanently acting up, try refreshing the webpage.

Code Snippets

Some code samples from the project. I won’t go too in-depth with this; that’s what the actual project is for.

Here’s a shortened version of how a grain is represented.

public class Grain
{
    // The cached original start of where it was taken from the audio.
    public float origStart;
    // The time, in seconds, on when it should start adding content to the stream.
    public float start;

    public float width; // in seconds.
    public float inEdge; // Entrance envelope time duration.
    public float outEdge; // Exit envelope time duration.
    public float [] samples; // PCM; sample count equals width * sample rate.
}

Extracting grains from the original PCM just involves taking the start value and multiplying it by the sample rate to find the starting sample – and then filling up the grain’s PCM copy samples array.
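A minimal sketch of that extraction step, assuming mono PCM and uniform grains. The GrainExtractor class and the hop and edge parameters are made up for this illustration; the project has its own options for how the audio gets granulated.

```csharp
using System;
using System.Collections.Generic;

public class Grain
{
    public float origStart;  // Where it was taken from the audio, in seconds.
    public float start;      // When it should start playing, in seconds.
    public float width;      // In seconds.
    public float inEdge;     // Entrance envelope time duration.
    public float outEdge;    // Exit envelope time duration.
    public float[] samples;  // PCM copy.
}

public static class GrainExtractor
{
    // Hypothetical helper, not part of the project.
    // width: seconds per grain. hop: seconds between grain starts.
    // edge:  fade time, in seconds, used for both envelopes.
    public static List<Grain> Extract(float[] pcm, int sampleRate, float width, float hop, float edge)
    {
        int widthSamples = (int)Math.Round(width * sampleRate);
        int hopSamples = (int)Math.Round(hop * sampleRate);
        if (widthSamples <= 0 || hopSamples <= 0)
            throw new ArgumentException("width and hop must cover at least one sample.");

        var grains = new List<Grain>();
        for (int first = 0; first < pcm.Length; first += hopSamples)
        {
            // The last grains may run off the end of the clip, so clamp them.
            int count = Math.Min(widthSamples, pcm.Length - first);

            var g = new Grain();
            g.origStart = (float)first / sampleRate;
            g.start = g.origStart;
            g.width = (float)count / sampleRate;
            g.inEdge = Math.Min(edge, g.width * 0.5f);
            g.outEdge = g.inEdge;
            g.samples = new float[count];
            Array.Copy(pcm, first, g.samples, 0, count);
            grains.Add(g);
        }
        return grains;
    }
}
```

Making hop smaller than width is what gives us the overlapping layers from the earlier diagrams.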

To retime the grains, we just multiply the start values.

/// <summary>
/// Given a list of grains, scale the start time of the entire list
/// by a constant value.
/// </summary>
/// <param name="grains">The grains whose start times will be scaled.</param>
/// <param name="scale">The amount to scale the start times.</param>
/// <param name="compound">If true, scale by current grain time, else scale by the original time.</param>
public static void ScaleGrainTime(List<Grain> grains, float scale, bool compound)
{
    for(int i = 0; i < grains.Count; ++i)
    {
        if(compound)
            grains[i].start *= scale;
        else
            grains[i].start = grains[i].origStart * scale;
    }
}

And I’m going to pass over the restitching code because it’s a bit long. It’s not too complex, just tedious.
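For the gist without opening the project, here’s a minimal sketch of the weighted-average flavor of restitching described earlier. It is not the project’s code: the GrainStitcher class is made up for this illustration, it assumes mono PCM and linear fades, and it samples the envelope at the middle of each sample so a weight is never exactly zero inside a grain.

```csharp
using System;
using System.Collections.Generic;

public class Grain
{
    public float origStart;  // Where it was taken from the audio, in seconds.
    public float start;      // When it should start playing, in seconds.
    public float width;      // In seconds.
    public float inEdge;     // Entrance envelope time duration.
    public float outEdge;    // Exit envelope time duration.
    public float[] samples;  // PCM copy.
}

public static class GrainStitcher
{
    // Linear fade in and out, evaluated at the centre of sample i.
    static float Weight(Grain g, int i, int sampleRate)
    {
        float t = (i + 0.5f) / sampleRate;
        float w = 1f;
        if (g.inEdge > 0f)
            w = Math.Min(w, t / g.inEdge);
        if (g.outEdge > 0f)
            w = Math.Min(w, (g.width - t) / g.outEdge);
        return Math.Max(w, 0f);
    }

    static int StartSample(Grain g, int sampleRate)
    {
        return (int)Math.Round(g.start * sampleRate);
    }

    // Hypothetical helper, not part of the project.
    // Mixes any number of overlapping grains as a weighted average.
    public static float[] Stitch(List<Grain> grains, int sampleRate)
    {
        // The output ends where the last grain ends.
        int total = 0;
        foreach (Grain g in grains)
            total = Math.Max(total, StartSample(g, sampleRate) + g.samples.Length);

        float[] sum = new float[total];     // Weighted samples.
        float[] weight = new float[total];  // Total weight per output sample.

        foreach (Grain g in grains)
        {
            int first = StartSample(g, sampleRate);
            for (int i = 0; i < g.samples.Length; ++i)
            {
                int dst = first + i;
                if (dst < 0)
                    continue;  // Grain starts before the output does.
                float w = Weight(g, i, sampleRate);
                sum[dst] += g.samples[i] * w;
                weight[dst] += w;
            }
        }

        // Divide by the total weight; anywhere no grain lands stays silent.
        for (int i = 0; i < total; ++i)
            sum[i] = weight[i] > 1e-6f ? sum[i] / weight[i] : 0f;
        return sum;
    }
}
```

Because of the division, a grain playing on its own comes through at full level even during its fades, and the envelopes only matter where grains actually overlap.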

Other Parameters

Besides retiming, there are other things we can tweak between grains, or on a per-grain level. For example, we can change the sample rate of the grains, allowing us to change the timing and pitch independently.
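As a sketch of that idea, each grain’s samples can be resampled on their own while its start time is left untouched. The GrainPitch class and its Repitch method are made up for this illustration (this isn’t the project’s code), and linear interpolation is just the simplest option.

```csharp
using System;

public class Grain
{
    public float origStart;  // Where it was taken from the audio, in seconds.
    public float start;      // When it should start playing, in seconds.
    public float width;      // In seconds.
    public float inEdge;     // Entrance envelope time duration.
    public float outEdge;    // Exit envelope time duration.
    public float[] samples;  // PCM copy.
}

public static class GrainPitch
{
    // Hypothetical helper, not part of the project.
    // pitch = 2 plays the grain's content an octave up, 0.5 an octave down.
    // The grain's start is not changed, so the overall timing is preserved;
    // only the grain itself gets shorter or longer.
    public static void Repitch(Grain g, float pitch)
    {
        if (pitch <= 0f)
            throw new ArgumentOutOfRangeException(nameof(pitch));
        if (g.samples.Length == 0)
            return;

        float[] src = g.samples;
        int outCount = Math.Max(1, (int)(src.Length / pitch));
        float[] dst = new float[outCount];

        for (int i = 0; i < outCount; ++i)
        {
            float pos = i * pitch;                        // Read position in the grain.
            int i0 = Math.Min((int)pos, src.Length - 1);
            int i1 = Math.Min(i0 + 1, src.Length - 1);
            float t = pos - i0;
            dst[i] = src[i0] + (src[i1] - src[i0]) * t;
        }

        g.samples = dst;
        // Keep the timing fields in step with the new sample count.
        g.width /= pitch;
        g.inEdge /= pitch;
        g.outEdge /= pitch;
    }
}
```

Since raising the pitch shortens every grain, it’s worth extracting wider grains beforehand so they still overlap after being repitched.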

Artefacts

The biggest issue you’ll probably run into is choppiness when the audio is slowed down a lot – a.k.a. the Funk Soul Bro-o-o-O-O-o-o-ther effect (technical term). Past a certain amount of slowdown, you can easily hear the individual grains. This seems unavoidable. Audio and video players such as YouTube, browser audio players, Audacity, VLC, and others all suffer from it.

Another issue is that if too many grains overlap, you’ll get a heavily reverberated texture when restitching the audio. This one is easy to fix, though: just limit the number of overlapping grains.

– Stay strong, code on. William Leu.
Tested in Chrome. Built with Unity 2019.4.16f1
