How to make a spatial mix from a stereo track?

2024-05-13 · 8 min read

What is spatial audio?

As a brief reminder, "Spatial Audio" encompasses a range of audio playback technologies designed to replicate real-world sound experiences in three dimensions. In contrast to stereo mixes, where the left and right channels confine audio sources to a small scene in front of the listener, spatial audio configurations aim to envelop us from all directions, providing a heightened sense of immersion.

Where does spatial audio come from?

Ever since we went from mono output to two-channel stereo sound (modern stereophonic technology was invented in the 1930s), the ambition to come closer to the natural listening experience has been the obvious next step.

But it was not until the home video boom of the early eighties, and more decisively the LaserDisc and DVD era of the mid-nineties, that surround sound arrived. Starting with the 5.1 setup, it added a front/back dimension, covering the whole horizontal plane, plus a low-frequency channel routed to a subwoofer (the ".1"). It was soon followed by 7.1, which reinforced the left/right dimension.

Remember, it is commonly accepted (at least among industry experts) that we can speak of "spatial" sound only when it is rendered in three dimensions: left/right, front/back, and up/down, or put differently, azimuth, distance, and elevation. This third dimension was introduced by the 7.1.4 setup, whose four added top speakers place sound sources at different heights.

Three representations are possible when it comes to spatial audio mix:

  • Channel-based
    When the audio mix is done for an already defined number of speakers. The master file contains as many channels as there are speakers. At playback, a processing unit must downmix or upmix the audio file according to the user's audio device. This format is used for traditional stereo, 5.1, and 7.1, for example.
  • Object-based
    When the audio mix is done by positioning each individual stem (instrument or audio effect) on the soundstage. The master file contains as many channels as there are objects. At playback, a processing unit must recreate the sound scene by mixing the objects together and generating a channel-based output that matches the user's audio device. This format is used in audio gaming, for example.
  • Scene-based
    The fundamental concept behind scene-based audio (Ambisonics, for example) involves treating an audio scene as a complete 360-degree sphere of sound emanating from various directions around a central point. This central point is the location of the microphone during recording, or the listener's "sweet spot" during playback. At playback, a processing unit must generate a channel-based output according to the user's audio device.
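The channel-based case can be made concrete with the downmix step mentioned above. Here is a minimal sketch of a standard 5.1-to-stereo downmix using ITU-style coefficients (the channel order is an assumption; real files vary):

```python
import math

def downmix_51_to_stereo(frame):
    """One sample per channel [FL, FR, C, LFE, Ls, Rs] -> (L, R)."""
    fl, fr, c, lfe, ls, rs = frame
    g = 1 / math.sqrt(2)          # -3 dB gain for center and surround channels
    left = fl + g * c + g * ls
    right = fr + g * c + g * rs   # the LFE channel is typically dropped
    return left, right
```

A processing unit in a stereo playback chain would apply this to every frame of a 5.1 master; the reverse direction (upmixing) is far less well defined, which is precisely the subject of the rest of this article.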

The spatial audio production workflow

Not your usual few-step process.

Whether a track will comply with a spatial audio standard must be decided at the very early stages of production. It requires recording and mixing the composition with the spatial dimension in mind. In this context, each track or instrument has the potential to occupy its own distinct space within a three-dimensional sound field, adhering to the specifications outlined by the original composer or mixer.

And there’s more where the Dolby Atmos format is concerned:

On the production side, sound engineers monitor their mix by listening over speakers and headphones using the Atmos renderer*. With a DAW (digital audio workstation), they can adjust audio settings to control the final mix. They then export the mixing session as an Atmos BWF master file to be sent to music streaming services and download stores.

(*) The Atmos renderer converts the channel- and object-based mix into a channel-based stream according to the user's audio device.

The problem with spatial audio

As one can easily imagine, these additional production steps involve two scarce resources: time and money. And since spatial audio encompasses several industry formats (Dolby Atmos, Sony 360 Reality Audio, Ambisonics, the upcoming IAMF), each one requires its own process.

You need to book a studio equipped with the appropriate software and gear, along with a skilled sound engineer. Depending on the engineer's availability and the studio's schedule, it can take as long as three weeks to spatialize an album.

As time is money, budget follows. Hiring a certified Dolby Atmos sound engineer costs even more, though it is a highly advisable move: Dolby Atmos is the spatial standard chosen by the biggest streaming platform offering spatial audio to its customers (Apple Music). Taking the album as a unit, it will cost around $6,000, and up to $600 for a single track. Can independent artists or record labels afford spatial audio under these conditions? Clearly not.

Besides, crafting spatial tracks with truly pristine sound quality can be exceedingly challenging, particularly when working from stems, which is the conventional approach these days. An engineer producing a spatial version in the studio will usually manipulate exported stems that were not designed with a spatial dimension in mind. The result is often a disappointing experience for the listener, sounding artificial, especially when played back on headphones.

This is why we’re convinced that an alternative spatialization process for record labels and artists, one that is seamless, affordable, and device-agnostic, will remove these barriers and establish spatial audio as the go-to format for listeners, on whatever device.

And what about older tracks, and all the back catalogs worldwide, obviously not recorded in spatial audio and often without stems available in the first place? Should we resign ourselves to keeping them locked in the stereo age?

The spatialization from stereo solution

Spatializing from the stereo master file has the advantage of sticking to the artist’s original vision, mixed and mastered all the way through in a stereo context.

Basically, a stereo file contains two channels intended for two speakers. The lazy approach would consist of simply sending the same signal to however many speakers are available. You would be surrounded by N copies of the same signal, but it would definitely not result in an immersive experience.

In reality, it is all about routing dedicated information to each channel, and thus to each speaker, to create the spatial landscape. That is the principle we need to build on, and there are different paths to get there, some basic and some more innovative.
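To illustrate what "dedicated information" can mean, here is one classic building block for deriving distinct signals from a stereo pair: mid/side decomposition. This is only an illustration of the principle, not Ircam Amplify's actual method; the mid could feed a center speaker while the side content feeds the surrounds, instead of copying L/R everywhere.

```python
def mid_side(left, right):
    """Split stereo sample lists into mid (common) and side (difference) signals."""
    mid = [(l + r) * 0.5 for l, r in zip(left, right)]   # what L and R share
    side = [(l - r) * 0.5 for l, r in zip(left, right)]  # what makes them differ
    return mid, side
```

Note that mid and side together carry exactly the information of the original pair (L = mid + side, R = mid - side): the signals are distinct per speaker, yet nothing has been invented.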

The simplified way to spatialize a stereo file

So far, the industry has come up with two ways of processing a stereo file up to a spatial format (remember the word “up” for later):

  • Adding reverb

    It will sound immersive, as if a room were reverberating the sound, though the room doesn’t exist. The main criticism we can level at this method is that it alters the artist’s intention; think of an intimate voice propelled into a deep reverberating space such as a cathedral. Moreover, no single reverb will sound good on every song; it is not a one-size-fits-all option. Yet this is the sole trick used by (too) early automated spatialization offers and the dubious so-called spatial audio engines you may come across on the internet.

  • Place instruments in a different location than the original file

    The stereo signal is split and reallocated across the sound stage. For example, you extract the following three stems: vocal, drums, and “other”. You then place the vocal in the center speaker, the drums (without the voice) on the front left and front right, and all the rest (“other”) on the side speakers. This is what we call “upmixing” the stereo (hence the earlier note about the word “up”): separating the stems and splitting them further to obtain enough tracks to send to the different speakers.
    Yet we’re still talking about stereo!

    As far as stem extraction is concerned, it is definitely not the right starting point or criterion. Distributing stems differently in space will not result in spatial audio, even when the original stems are available in the first place.

    With true spatial audio, to really immerse the listener, each point in space receives its own additional data on top of the data contained in the original signal. If we were to downmix a spatial audio track to stereo, we would obtain more data than in the original stereo. Now compare with the upmixed stereo process: nothing is added, nothing is removed, and the sum of the channels equals the original stereo.

    If we focus on the Dolby Atmos format, it looks like this in the Dolby renderer (the dedicated Atmos rendering suite); pay attention to the low number of channels used:
[Screenshot: Dolby Atmos renderer view of an upmixed stereo track, showing few active channels]
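The stem-routing "upmix" described above can be sketched as follows, with hypothetical pre-separated stems. It makes the earlier point concrete: the output channels are just a reshuffling of the input, so summing them back recovers the original stereo content, and nothing spatial has been added.

```python
def upmix(vocal, drums, other):
    """Each stem is a (left, right) pair of sample lists.

    Routing: vocal -> center (folded to mono), drums -> front L/R,
    everything else -> side L/R. Assumes a center-panned vocal, so the
    mono fold loses nothing.
    """
    v_l, v_r = vocal
    return {
        "C":  [(a + b) * 0.5 for a, b in zip(v_l, v_r)],
        "FL": drums[0], "FR": drums[1],
        "SL": other[0], "SR": other[1],
    }
```

Every output channel is built purely from the input samples; no per-position data is created, which is exactly why this is still stereo spread over more speakers rather than spatial audio.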

The clues to detect poor stereo spatialization

  • Redundancy or duplication of information
  • Stem separation
  • Reverb added
  • Few channels used on the Dolby Renderer view (for Dolby Atmos format)

How we spatialize from stereo at Ircam Amplify

What triggered our vision for enabling spatial audio from a stereo file is the famous IRCAM lab’s software suite for spatialization, SPAT, and, through it, the decisive step of setting up a commercial offshoot of the groundbreaking research lab: Ircam Amplify. The SPAT technology relies on a psychoacoustic description of how our ears listen in a spatial environment, instead of the traditional physical-geometrical approach. This made us take a different route from the usual spatialization process, which aims to distribute the signal across a restricted number of speakers. We could take down physical barriers and remove walls, creating a continuous, non-reverberating space, as if we were positioned in the middle of a desert.

From the stereo, we create a sound bubble around the listener. It is composed of a multitude of particles made from the original stereo, and therefore originating from the same source. However, each one plays a different role by performing its own part; some define the stereo scene more precisely, while others enhance the immersive effect by applying the psychoacoustic phenomena derived from SPAT.

It’s downmixing time

As a result, the listener is surrounded by a continuum of immersive elements. Finally, to make the content compliant with industry formats like Atmos, we downmix it, distributing it between the physical speakers and the objects.

What you can expect

An immersive effect from stereo you won’t find anywhere else.

We even added personalization settings per genre, because we’re music fans first and foremost, and not all genres are equal where spatialization is concerned.

All the channels and objects (or most, depending on the personalization we apply) receive data for rendering, like this:

[Screenshot: Dolby Atmos renderer view of an Ircam Amplify spatialized track, with most channels and objects receiving data]

Enough words, time to listen

👉 Sign up and give it a try through the “Tasks” feature on your user dashboard

Think we're on the same wavelength?