2,437 — that is roughly the number of individual sound reflections that arrive at a listener within the first 200 milliseconds inside Vienna's Musikverein, the concert hall consistently ranked among the finest in the world. Each reflection arrives at a listener's two ears with a slightly different timing, amplitude, and frequency spectrum, and the human auditory system decodes this intricate pattern into a three-dimensional perception of space, warmth, clarity, and envelopment. This perception cannot be reduced to a single number like RT60 or STI. It must be heard.
Auralization is the technology that makes this possible before a single brick is laid. By combining a simulated room impulse response with an anechoic audio recording, auralization allows architects, acoustic consultants, and clients to listen to a proposed room design through headphones and hear how speech, music, or any sound source would sound in the completed space. This guide covers the complete technical framework: from the physics of how humans localise sound (HRTFs), through the generation of binaural room impulse responses, to the practical workflow of creating auralizations for architectural design evaluation.
Part 1: How Humans Perceive Spatial Sound
The Two-Ear Advantage
Human spatial hearing relies on two physical mechanisms:
Interaural Time Difference (ITD): A sound source to the left of the listener reaches the left ear before the right ear. The maximum ITD is approximately 0.7 milliseconds (for a source at 90° azimuth), corresponding to the acoustic path length difference around the head (approximately 24 cm at 343 m/s). The auditory system uses ITD primarily for localising low-frequency sounds (below approximately 1500 Hz) where the wavelength is long enough for the phase difference between ears to be unambiguous.
Interaural Level Difference (ILD): The head casts an acoustic shadow, attenuating sound at the far ear. The shadow effect is frequency-dependent: at 200 Hz (λ = 1.7 m), the head is acoustically transparent (ILD ≈ 0 dB). At 6000 Hz (λ = 0.057 m), the head attenuates the far-ear signal by 10–20 dB. ILD is the dominant localisation cue above approximately 1500 Hz.
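The ITD figures above can be approximated analytically. A minimal sketch using Woodworth's spherical-head formula (the head radius of 8.75 cm is an assumed average, not a value from this guide):

```python
import math

def woodworth_itd(azimuth_deg, head_radius=0.0875, c=343.0):
    """Woodworth's spherical-head approximation of the ITD.

    azimuth_deg: source azimuth, 0 = straight ahead, 90 = fully lateral.
    head_radius: assumed head radius in metres (~8.75 cm adult average).
    c: speed of sound in m/s.
    """
    theta = math.radians(azimuth_deg)
    # Path difference around a sphere: a * (theta + sin(theta))
    return (head_radius / c) * (theta + math.sin(theta))

# Maximum ITD occurs at 90 deg azimuth: ~0.66 ms, consistent with the
# "approximately 0.7 ms" figure quoted above.
print(f"{woodworth_itd(90) * 1000:.2f} ms")
```

Real heads are not spheres, so measured ITDs deviate by tens of microseconds, but the formula captures the cue's magnitude and angular dependence.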
Head-Related Transfer Function (HRTF)
The HRTF is the complete frequency-dependent transfer function from a point in space to the eardrum, including all diffraction, reflection, and resonance effects of the pinnae (outer ears), head, torso, and shoulders. Each individual has a unique HRTF determined by their head size, ear shape, and upper body geometry.
Mathematically, the HRTF is defined as:
H(f, θ, φ) = P_ear(f, θ, φ) / P_free(f)
where P_ear is the sound pressure at the eardrum for a source at azimuth θ and elevation φ, and P_free is the sound pressure at the same point in free field (without the listener present). The HRTF is a complex-valued function — it contains both magnitude and phase information.
A complete HRTF dataset contains transfer functions for hundreds of source directions. The MIT KEMAR HRTF dataset (one of the most widely used) contains 710 measurements at elevations from -40° to +90° in 10° steps, with azimuthal resolution of 5° at ear level. More recent datasets (CIPIC, HUTUBS, ARI) provide finer spatial resolution.
HRTF Personalisation vs Generic HRTFs
Individual HRTFs vary significantly — particularly in the pinna resonances that provide elevation cues (the notches and peaks between 4,000 and 16,000 Hz). Using a non-individualised HRTF can cause front-back confusion (sounds intended to be in front are perceived as behind) and reduced elevation perception.
For architectural auralization, generic HRTFs (from dummy heads like the Neumann KU 100 or the KEMAR) provide adequate spatial impression for most applications. Personalised HRTFs improve realism but require measurement with in-ear microphones or estimation from ear photographs (a developing research area).
Part 2: Room Impulse Responses
Monaural Room Impulse Response (RIR)
A monaural RIR h(t) is the time-domain signal that characterises a room's acoustic response from a specific source position to a specific receiver position. It contains three temporal regions:
- Direct sound (0–5 ms): The first-arriving wavefront, unaffected by room reflections. Amplitude determined by source-receiver distance and geometric spreading.
- Early reflections (5–80 ms): Discrete reflections from nearby surfaces (walls, ceiling, floor). These reflections reinforce speech intelligibility (per ISO 3382-1 §4.3, the 50 ms boundary defines clarity C50 for speech and the 80 ms boundary defines C80 for music).
- Late reverberation (80 ms – RT60): A dense, statistically random decay of energy. The decay rate determines the RT60. The transition from discrete early reflections to diffuse late reverberation occurs at the mixing time, typically 50–100 ms after the direct sound.
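The three temporal regions can be sketched as a toy synthetic RIR: a direct impulse, a few discrete early reflections, and a noise tail whose exponential envelope produces the target RT60. All delays and gains below are illustrative placeholders, not values from any real room:

```python
import numpy as np

def synthetic_rir(fs=44100, rt60=0.5, direct_delay_ms=5.0,
                  early=((12.0, 0.5), (18.0, 0.35), (25.0, 0.3)),
                  mixing_time_ms=80.0, seed=0):
    """Toy monaural RIR: direct sound + early reflections + diffuse tail.

    early: (delay_ms_after_direct, linear_gain) pairs, purely illustrative.
    """
    n = int(fs * rt60 * 1.5)
    h = np.zeros(n)
    d0 = int(fs * direct_delay_ms / 1000)
    h[d0] = 1.0                                  # direct sound
    for delay_ms, gain in early:                 # discrete early reflections
        h[d0 + int(fs * delay_ms / 1000)] += gain
    # Late tail: -60 dB over rt60 seconds => e-folding time rt60 / (3 ln 10)
    t = np.arange(n) / fs
    tau = rt60 / (3 * np.log(10))
    tail = np.random.default_rng(seed).standard_normal(n) * np.exp(-t / tau)
    tail[: d0 + int(fs * mixing_time_ms / 1000)] = 0.0   # start at mixing time
    h += 0.2 * tail
    return h
```

The decay constant follows from the RT60 definition: a 60 dB drop means the pressure envelope falls by a factor of 1000, i.e. exp(-RT60/τ) = 10⁻³.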
Binaural Room Impulse Response (BRIR)
A BRIR consists of two impulse responses — one for the left ear and one for the right ear. Each reflection in the room's response is filtered by the HRTF corresponding to the direction from which it arrives at the listener. The result is a pair of signals that, when convolved with an anechoic source and played through headphones, reproduces the spatial sound field at the listener's ear positions.
The BRIR at the left ear for a source at position s and listener at position r is:
h_left(t) = h_direct(t) * HRTF_left(θ_direct, φ_direct) + Σ_i h_reflection_i(t) * HRTF_left(θ_i, φ_i)
where * denotes convolution, and (θ_i, φ_i) is the arrival direction of each reflection at the listener.
In practice, BRIRs are generated either by simulation (ray tracing or image source method producing reflection directions, which are then filtered by HRTFs) or by measurement (recording in the actual room using a binaural dummy head or in-ear microphones).
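The simulation route can be sketched for one ear. The reflection list and the azimuth-indexed HRIR dictionary below are hypothetical input formats (a real image-source pass would also report elevation, and HRTF lookup would interpolate rather than snap to the nearest measured direction):

```python
import numpy as np

def assemble_brir(reflections, hrirs, fs=44100, length=None):
    """Assemble one ear's BRIR from a simulated reflection list.

    reflections: list of (delay_s, linear_gain, azimuth_deg) tuples,
                 e.g. the output of an image-source pass (hypothetical format).
    hrirs: dict mapping azimuth_deg -> head-related impulse response
           (np.ndarray) for this ear, e.g. extracted from a SOFA dataset.
    """
    if length is None:
        length = int(fs * (max(d for d, _, _ in reflections) + 0.05))
    brir = np.zeros(length)
    azimuths = np.array(sorted(hrirs))
    for delay, gain, az in reflections:
        # Nearest-neighbour HRTF selection (real engines interpolate)
        nearest = azimuths[np.argmin(np.abs(azimuths - az))]
        hrir = hrirs[nearest]
        start = int(delay * fs)
        end = min(start + len(hrir), length)
        brir[start:end] += gain * hrir[: end - start]
    return brir
```

Running the same loop with the right-ear HRIR set yields the second channel of the BRIR pair.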
Part 3: Auralization — The Convolution Process
The Fundamental Operation
Auralization is conceptually simple: take a "dry" (anechoic) audio recording and convolve it with the room impulse response to produce a "wet" (room-affected) signal.
y(t) = x(t) * h(t)
where x(t) is the anechoic source signal and h(t) is the room impulse response. In the frequency domain, this becomes a multiplication: Y(f) = X(f) × H(f). Modern auralization software uses FFT-based convolution for efficiency.
For binaural auralization:
- y_left(t) = x(t) * h_left(t)
- y_right(t) = x(t) * h_right(t)
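A minimal sketch of the operation, using numpy's FFT for the "fast convolution" the text describes (function names are my own, not from any particular auralization package):

```python
import numpy as np

def fft_convolve(x, h):
    """Linear convolution via zero-padded FFT: Y(f) = X(f) * H(f)."""
    n = len(x) + len(h) - 1
    nfft = 1 << (n - 1).bit_length()          # next power of two >= n
    y = np.fft.irfft(np.fft.rfft(x, nfft) * np.fft.rfft(h, nfft), nfft)
    return y[:n]

def auralize_binaural(x, h_left, h_right):
    """Dry mono signal + BRIR pair -> 2-channel binaural signal."""
    return np.stack([fft_convolve(x, h_left),
                     fft_convolve(x, h_right)], axis=1)
```

For a 2-second BRIR this is orders of magnitude faster than direct time-domain convolution, which is why FFT-based convolution is the standard approach.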
Anechoic Source Material
The quality of auralization depends critically on the source material. Options include:
- Anechoic speech recordings: Male and female speech recorded in an anechoic chamber. Standard signals include the Harvard sentences (IEEE 1969), the International Speech Test Signal (ISTS, Holube et al. 2010), and phonetically balanced word lists. These are available from audio research databases (IRCAM, DTU, TU Berlin).
- Anechoic music recordings: Individual instruments or ensembles recorded in anechoic conditions. The DTU (Technical University of Denmark) anechoic orchestral recordings and the EBU SQAM (Sound Quality Assessment Material) are widely used. Complete orchestral recordings in anechoic conditions are rare — most auralization uses close-miked studio recordings with room ambience removed.
- Synthetic audio: Generated signals (pink noise, speech-shaped noise) for technical evaluation rather than perceptual presentation.
Real-Time vs Offline Auralization
Offline auralization: The complete BRIR is pre-computed (from simulation or measurement), convolution is performed on the full source signal, and the result is saved as an audio file. This produces the highest quality but does not allow interactive parameter changes. A 30-second speech excerpt convolved with a 2-second BRIR takes approximately 0.5–2 seconds to compute.
Real-time auralization: The convolution is performed in real-time using a block-processing approach (overlap-add or overlap-save FFT convolution). This enables the designer to change room parameters (material absorption, source position, receiver position) and hear the effect immediately. Real-time auralization requires computational resources but is now achievable on standard hardware with buffer sizes of 256–1024 samples (6–23 ms latency at 44.1 kHz).
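The block-processing idea can be sketched as a streaming overlap-add convolver (a simplification: production renderers typically partition the long BRIR as well, so the FFT size stays near the buffer size):

```python
import numpy as np

def overlap_add_stream(blocks, h, block_size=512):
    """Stream blocks of input through overlap-add FFT convolution with h,
    yielding one output block per input block, as a real-time renderer would."""
    nfft = 1 << (block_size + len(h) - 2).bit_length()   # >= linear conv length
    H = np.fft.rfft(h, nfft)
    tail = np.zeros(nfft - block_size)                   # carry-over between blocks
    for block in blocks:
        y = np.fft.irfft(np.fft.rfft(block, nfft) * H, nfft)
        y[: len(tail)] += tail          # add the overlap from the previous block
        tail = y[block_size:].copy()    # save this block's tail for the next one
        yield y[:block_size]
```

Each yielded block is ready within one buffer period, which is where the 6–23 ms latency figures above come from.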
AcousPlan provides both modes: pre-rendered WAV files for download and sharing, and browser-based real-time auralization using the Web Audio API with convolver nodes.
Part 4: Standards for Auralization
AES69-2015 — AES Standard for File Exchange
AES69-2015 (AES Standard for File Exchange — Spatial Acoustic Data File Format) defines a standardised format for storing room impulse responses and HRTFs. The SOFA (Spatially Oriented Format for Acoustics) format encodes impulse response data with associated metadata including source position, receiver position, sampling rate, and coordinate system. SOFA files are supported by MATLAB, Python (pysofaconventions), and major auralization software.
ISO 3382-1:2009 §6 — Measurement for Auralization
ISO 3382-1 §6 specifies that impulse response measurements intended for auralization must be conducted with an omnidirectional source and either an omnidirectional receiver (for monaural auralization) or a binaural receiver (for binaural auralization). The source must produce a signal with sufficient signal-to-noise ratio (> 45 dB) and dynamic range to capture both early reflections and late reverberation. Exponential sine sweeps (ESS) are the preferred excitation signal.
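An exponential sine sweep and its inverse filter can be generated directly from Farina's formulation. A sketch (the amplitude-tilted inverse filter omits the absolute gain normalisation a measurement system would apply):

```python
import numpy as np

def ess(f1=20.0, f2=20000.0, duration=10.0, fs=48000):
    """Exponential sine sweep (Farina) plus its inverse filter.

    Convolving the recorded room response with the inverse filter
    recovers the impulse response, with harmonic distortion products
    separated out ahead of the linear IR.
    """
    t = np.arange(int(fs * duration)) / fs
    R = np.log(f2 / f1)                      # log frequency ratio
    sweep = np.sin(2 * np.pi * f1 * duration / R
                   * (np.exp(t * R / duration) - 1))
    # Inverse filter: time-reversed sweep with a +6 dB/octave tilt,
    # i.e. exponential amplitude envelope compensating the sweep's
    # pink (1/f) energy distribution.
    inv = sweep[::-1] * np.exp(-t * R / duration)
    return sweep, inv
```

The sweep's instantaneous frequency rises exponentially from f1 to f2 over the duration, giving equal energy per octave — the property that makes ESS robust against background noise and distortion.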
ITU-R BS.1116-3 — Listening Tests
ITU-R BS.1116-3 defines the methodology for subjective listening tests to evaluate auralization quality. It specifies: listening room conditions (RT60 < 0.4 s, background noise < NC 15), headphone equalisation, listener training, test signal selection, and statistical analysis of results. This standard is essential for research auralization but rarely applied in architectural practice.
Part 5: Multi-Source Auralization
Real rooms contain multiple sound sources — a speaker and a projector in a meeting room, an orchestra with 70+ instruments in a concert hall, multiple speakers and audience noise in a classroom. Multi-source auralization models each source independently, generating a separate BRIR for each source position, and convolves each source signal with its corresponding BRIR.
y_left(t) = Σ x_n(t) * h_left_n(t)
where n indexes the sound sources. AcousPlan supports up to 5 simultaneous sources (tier-dependent: Free 1, Pro 3, Studio 5), each positioned independently in the room model. The individual convolved signals are summed to produce the final binaural output.
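The summation above translates directly into code. A sketch, assuming each source arrives as a (dry signal, left BRIR, right BRIR) tuple (a hypothetical interface, chosen for clarity):

```python
import numpy as np

def multi_source_auralize(sources):
    """Sum per-source binaural convolutions into one stereo output.

    sources: list of (x, h_left, h_right) tuples, one per sound source,
             where each BRIR pair was generated for that source's position.
    """
    n = max(len(x) + len(hl) - 1 for x, hl, _ in sources)
    y = np.zeros((n, 2))
    for x, h_left, h_right in sources:
        for ch, h in enumerate((h_left, h_right)):
            yc = np.convolve(x, h)           # per-source, per-ear convolution
            y[: len(yc), ch] += yc           # superposition of sources
    return y
```

Because convolution is linear, summing the convolved signals is exactly equivalent to the sources playing simultaneously in the room.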
Spatial Audio Encoding
For immersive playback beyond headphones, auralization can be encoded in:
- Ambisonics: Spherical harmonic encoding (first-order: 4 channels; higher-order: (N+1)² channels). Decoded to any loudspeaker configuration.
- Binaural over loudspeakers (crosstalk cancellation): Two-channel playback through stereo speakers with crosstalk cancellation filters to approximate headphone binaural.
- Object-based audio (Dolby Atmos, MPEG-H): Individual sources positioned as audio objects, rendered by the playback system for the specific speaker configuration.
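First-order Ambisonics encoding of a mono source is a handful of trigonometric gains. A sketch using the FuMa convention (W attenuated by 1/√2; SN3D/AmbiX, the more common modern convention, instead uses unit W gain and a different channel order):

```python
import numpy as np

def encode_foa(signal, azimuth_deg, elevation_deg):
    """Encode a mono signal as first-order Ambisonics B-format (W, X, Y, Z)."""
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    w = signal / np.sqrt(2)                   # omnidirectional component (FuMa)
    x = signal * np.cos(az) * np.cos(el)      # front-back figure-of-eight
    y = signal * np.sin(az) * np.cos(el)      # left-right figure-of-eight
    z = signal * np.sin(el)                   # up-down figure-of-eight
    return np.stack([w, x, y, z])             # 4 channels
```

A decoder then forms loudspeaker feeds as weighted sums of these four channels for whatever array is available, which is what makes the encoding layout-independent.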
Part 6: Worked Example — Conference Room BRIR Generation
Room: 8 m × 6 m × 3 m conference room (V = 144 m³) with acoustic ceiling (α_w = 0.90), plasterboard walls, carpet floor, and a 4 m² glass partition on one wall.
Step 1: Generate RIR using AcousPlan's statistical engine.
Using the Sabine equation: RT60 = 0.161 × 144 / A
| Surface | Area (m²) | α at 500 Hz | Absorption (m²) |
|---|---|---|---|
| Ceiling (acoustic) | 48 | 0.90 | 43.2 |
| Floor (carpet) | 48 | 0.15 | 7.2 |
| Walls (plasterboard) | 80 | 0.05 | 4.0 |
| Glass partition | 4 | 0.04 | 0.16 |
| Conference table (wood) | 6 | 0.10 | 0.60 |
| 12 upholstered chairs | 12 × 0.45 m² Sab. | — | 5.40 |
| Total | — | — | 60.56 |
RT60 = 0.161 × 144 / 60.56 = 0.38 seconds (occupied, mid-frequency)
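The calculation above can be reproduced in a few lines (a sketch, with the chairs handled as objects contributing Sabine absorption directly rather than via an area × α product):

```python
# Surface areas (m^2) and absorption coefficients at 500 Hz, from the table
surfaces = {
    "ceiling (acoustic)":   (48, 0.90),
    "floor (carpet)":       (48, 0.15),
    "walls (plasterboard)": (80, 0.05),
    "glass partition":      (4,  0.04),
    "table (wood)":         (6,  0.10),
}
A = sum(area * alpha for area, alpha in surfaces.values())
A += 12 * 0.45                 # 12 chairs at 0.45 m^2 Sabine absorption each

V = 8 * 6 * 3                  # room volume, m^3
rt60 = 0.161 * V / A           # Sabine equation
print(f"A = {A:.2f} m^2 Sab., RT60 = {rt60:.2f} s")
```

This confirms the hand calculation: total absorption 60.56 m² Sabine and a mid-frequency RT60 of 0.38 s.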
Step 2: Generate synthetic impulse response.
AcousPlan generates a stochastic impulse response matching the predicted RT60 at each octave band, with early reflections computed from the image source method for the first 50 ms (rectangular room geometry enables exact calculation):
- Direct sound: 0 ms, 0 dB
- Floor reflection: 4.4 ms, -3 dB (carpet α = 0.15 reflects most energy)
- Ceiling reflection: 5.8 ms, -8 dB (acoustic ceiling α = 0.90 absorbs most energy)
- Side wall reflections: 6.2–9.5 ms, -2 to -4 dB
- Late reverberation onset: ~30 ms, exponential decay at rate determined by RT60
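The band-by-band RT60 matching in the late tail can be sketched as follows. The brick-wall FFT band split is a simplification of my own (real engines use proper octave-band filters), and the band RT60 values passed in would come from repeating the Sabine calculation per octave band:

```python
import numpy as np

def band_decaying_tail(rt60_bands, fs=44100, length_s=0.6, seed=1):
    """Late-reverb tail whose decay rate varies per octave band.

    rt60_bands: dict mapping octave-band centre frequency (Hz) -> RT60 (s).
    """
    n = int(fs * length_s)
    t = np.arange(n) / fs
    noise = np.random.default_rng(seed).standard_normal(n)
    spectrum = np.fft.rfft(noise)
    freqs = np.fft.rfftfreq(n, 1 / fs)
    tail = np.zeros(n)
    for fc, rt60 in rt60_bands.items():
        # Isolate one octave band (fc/sqrt(2) .. fc*sqrt(2)) by masking bins
        band = spectrum * ((freqs >= fc / np.sqrt(2)) & (freqs < fc * np.sqrt(2)))
        tau = rt60 / (3 * np.log(10))          # e-folding time for a 60 dB decay
        tail += np.fft.irfft(band, n) * np.exp(-t / tau)
    return tail
```

Summing the bands gives a tail whose high frequencies (controlled by the absorptive ceiling) die away faster than the mids, matching the octave-band RT60 predictions.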
Step 3: Apply HRTFs. For a speaker at position (4, 0.5, 1.5) m and a listener at (4, 5, 1.2) m (seated, facing the speaker), the direct sound arrives from 0° azimuth, 5° elevation. Each early reflection arrives from a calculable direction based on the image source geometry. The HRTF for each direction is applied to create left and right ear impulse responses.
The resulting BRIR pair has:
- Length: 0.38 × 44100 = approximately 16,758 samples per channel
- Interaural differences: ITD of 0 µs for the direct sound (frontal source), varying ITD and ILD for lateral reflections
- Spatial impression: the listener perceives the speaker directly ahead, with reflections from the side walls creating a sense of room width
Step 4: Convolve with anechoic source material. A 10-second anechoic speech sample is convolved with both BRIR channels. The resulting stereo audio file, when played through headphones, produces the impression of a speaker in a moderately damped conference room with short reverberation and good clarity. The STI of the auralized signal can be evaluated by running the STIPA analysis on the convolved output — expected value: approximately 0.72 (Good) based on the RT60 and assumed background noise.
Step 5: Compare design alternatives.
By modifying the room parameters (removing the acoustic ceiling, changing wall materials, adding more glass) and regenerating the BRIR, the designer can aurally compare:
- Conference room with acoustic ceiling vs exposed concrete soffit
- Effect of doubling the glass area (reflections increase, clarity decreases)
- Impact of adding wall panels on the rear wall
Part 7: Auralization Quality and Limitations
What Auralization Gets Right
- Reverberation character: The perceived liveness, warmth, and decay rate are accurately reproduced when the RIR is correct.
- Speech intelligibility: Auralized speech predicts whether real speech in the built room will be intelligible, with STI correlation ≥ 0.95 between auralized and measured results in validated studies (Vorländer, 2008).
- Comparative judgement: "Room A sounds better than Room B" is reliably perceived through auralization, even when absolute realism is imperfect.
What Auralization Gets Wrong
- Bone conduction: Real hearing includes vibration through the skull, which headphone auralization does not reproduce.
- Head movements: Static BRIR assumes the listener's head does not move. In real rooms, head rotation provides dynamic localisation cues. Head-tracked auralization (using head-tracking sensors and real-time HRTF interpolation) addresses this but adds complexity.
- Non-individualised HRTFs: Generic HRTFs cause front-back confusion in approximately 20–30% of listeners for specific source positions.
- Low-frequency perception: Bass frequencies are felt through the body and through the floor — headphone auralization reproduces only the airborne component.
- Visual-auditory interaction: How a room looks affects how it sounds (perceptually). Auralization without visual context may produce different subjective judgements than the same auralization with a VR visual model. Combined audio-visual auralization (using VR headsets with spatial audio) is the state of the art for client presentations.
Related Reading:
- Acoustic Modelling Methods: Ray Tracing vs Image Source vs Statistical vs AI — how room impulse responses are generated
- RT60 Complete Reference — the physics behind the decay that auralization reproduces
- STI Complete Technical Reference — how to evaluate speech intelligibility from auralized signals