This section reviews elementary spectral models for sound
synthesis. Spectral models are well matched to audio perception
because the ear is a kind of spectrum analyzer [293].
For periodic sounds, the component sinusoids are all harmonics of a
fundamental at radian frequency $\omega_1 = 2\pi/P$:

    x(t) = \sum_{k=1}^{K} A_k \sin(\omega_k t + \phi_k),
    \qquad \omega_k \triangleq k\,\omega_1 = k\,\frac{2\pi}{P}            (11.15)

where $t$ denotes time in seconds,
$\omega_k$ is the $k$th harmonic radian frequency,
$P$ is the period in seconds,
$A_k$ is the amplitude of the $k$th sinusoidal component,
$\phi_k$ is its phase, and
$K$ is the number of the highest audible harmonic.
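As a concrete illustration of (11.15), the following Python/NumPy sketch sums
the harmonics of a 220 Hz fundamental at a 44.1 kHz sampling rate; the 1/k
amplitude rolloff, zero phases, and truncation at the Nyquist limit are
illustrative assumptions, not part of the model:

    import numpy as np

    fs = 44100.0                      # sampling rate in Hz (illustrative)
    P  = 1.0 / 220.0                  # period in seconds (220 Hz fundamental)
    w1 = 2.0 * np.pi / P              # fundamental radian frequency, omega_1
    K  = int((fs / 2.0) * P)          # highest harmonic kept (here, below Nyquist)
    t  = np.arange(int(fs)) / fs      # one second of sample times

    x = np.zeros_like(t)
    for k in range(1, K + 1):
        A_k, phi_k = 1.0 / k, 0.0     # illustrative amplitude and phase choices
        x += A_k * np.sin(k * w1 * t + phi_k)   # Eq. (11.15) sampled at t = n/fs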
Aperiodic sounds can similarly be expressed as a continuous
sum of sinusoids at potentially all frequencies in the range of
human hearing:

    x(t) = \int_0^{\omega_u} A(\omega) \sin[\omega t + \phi(\omega)]\, d\omega        (11.16)

where $\omega_u = 2\pi f_u$ denotes the upper bound of human hearing
(nominally $f_u = 20$ kHz).
Sinusoidal models are most appropriate for ``tonal'' sounds such as
spoken or sung vowels, or the sounds of musical instruments in the
string, wind, brass, and ``tonal percussion'' families. Ideally, one
sinusoid suffices to represent each harmonic or overtone. To represent the
``attack'' and ``decay'' of natural tones, sinusoidal components are
multiplied by an amplitude envelope that varies over time. That is, the
amplitude $A_k$ in (11.15) is a slowly varying function of time; similarly,
to allow pitch variations such as vibrato, the phase $\phi_k$ may be
modulated in various ways. Sums of amplitude- and/or frequency-enveloped
sinusoids are generally called additive synthesis (discussed further in
§11.4.1 below).
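The following sketch (again NumPy, with all envelope and vibrato parameters
chosen arbitrarily for illustration) shows additive synthesis in this sense:
each partial shares an attack/decay amplitude envelope and a common vibrato
obtained by integrating a slowly varying instantaneous frequency into the
running phase:

    import numpy as np

    fs = 44100.0
    t  = np.arange(int(2 * fs)) / fs        # two seconds of sample times
    f1 = 220.0                              # fundamental in Hz (illustrative)

    # Attack/decay amplitude envelope: 20 ms linear attack, exponential decay.
    env = np.minimum(t / 0.02, 1.0) * np.exp(-3.0 * t)

    # Vibrato: 5 Hz sinusoidal frequency deviation of +/- 3 Hz,
    # cumulatively summed (integrated) to obtain the phase.
    inst_freq = f1 + 3.0 * np.sin(2.0 * np.pi * 5.0 * t)
    phase = 2.0 * np.pi * np.cumsum(inst_freq) / fs

    x = np.zeros_like(t)
    for k in range(1, 11):                  # first ten harmonics
        x += (env / k) * np.sin(k * phase)  # amplitude- and frequency-enveloped partial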
Sinusoidal models are ``unreasonably effective'' for tonal
audio. Perhaps the main reason is that the ear focuses most
acutely on peaks in the spectrum of a sound
[179,306]. For example, when there is a strong
spectral peak at a particular frequency, it tends to mask lower-level
sound energy at nearby frequencies. As a result, the ear-brain system
is, to a first approximation, a ``spectral peak analyzer''. In modern
audio coders [16,200], exploiting
masking results in an order-of-magnitude data compression, on
average, with no loss of quality, according to listening tests
[25]. Thus, we may say more specifically that,
to first order, the ear-brain system acts like a ``top ten percent
spectral peak analyzer''.
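To make the ``spectral peak analyzer'' picture concrete, the sketch below
picks the largest local maxima from the magnitude spectrum of one windowed
frame; the Hann window, frame handling, and fixed number of returned peaks
are arbitrary illustrative choices (real coders and sinusoidal analyzers do
considerably more, e.g., masking-threshold computation and peak
interpolation):

    import numpy as np

    def top_spectral_peaks(frame, fs, num_peaks=10):
        """Return (frequency_Hz, magnitude) pairs for the largest local maxima
        of one windowed frame's magnitude spectrum (illustrative only)."""
        N = len(frame)
        X = np.abs(np.fft.rfft(frame * np.hanning(N)))
        freqs = np.fft.rfftfreq(N, d=1.0 / fs)
        is_peak = (X[1:-1] > X[:-2]) & (X[1:-1] > X[2:])    # larger than both neighbors
        bins = np.where(is_peak)[0] + 1
        bins = bins[np.argsort(X[bins])[::-1][:num_peaks]]  # keep the largest peaks
        return list(zip(freqs[bins], X[bins]))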
For noise-like sounds, such as wind, scraping sounds, unvoiced speech,
or breath-noise
in a flute, sinusoidal models are relatively expensive, requiring many
sinusoids across the audio band to model noise. It is therefore
helpful to combine a sinusoidal model with some kind of noise model,
such as pseudo-random numbers passed through a filter
[249]. The ``Sines + Noise'' (S+N) model was developed to
use filtered noise as a replacement for many sinusoids when modeling
noise (to be discussed in §11.4.3 below).
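A minimal sketch of the noise half of such a model, assuming SciPy is
available: pseudo-random numbers are shaped by a fixed fourth-order lowpass
filter (in an actual S+N analysis/synthesis system, the filter would be
time-varying and fit to the noise residual):

    import numpy as np
    from scipy.signal import butter, lfilter

    fs = 44100.0
    white = np.random.default_rng(0).standard_normal(int(fs))  # 1 s of pseudo-random noise

    # Shape the noise with a filter; the fixed 2 kHz lowpass is purely illustrative.
    b, a = butter(4, 2000.0 / (fs / 2.0))
    noise_part = lfilter(b, a, white)

    # A complete S+N signal would be sinusoidal_part + noise_part.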
Another situation in which sinusoidal models are inefficient is at
sudden time-domain transients in a sound, such as percussive
note onsets, ``glitchy'' sounds, or ``attacks'' of instrument tones
more generally. From Fourier theory, we know that transients, too, can
be modeled exactly, but only with large numbers of sinusoids at
exactly the right phases and amplitudes. To obtain a more compact
signal model, it is better to introduce an explicit transient model
which works together with sinusoids and filtered noise to represent
the sound more parsimoniously. Sines + Noise + Transients (S+N+T)
models were developed to handle transients separately (§11.4.4).
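The transient model itself is deferred to §11.4.4; purely as an illustration
of why transients are treated as separate events, the following sketch
locates candidate transients with a simple spectral-flux measure (a common
heuristic, chosen here as an assumption rather than the method of that
section):

    import numpy as np

    def spectral_flux(x, frame_len=1024, hop=512):
        """Frame-to-frame increase in spectral magnitude; peaks in the
        returned curve suggest transient (onset) locations."""
        win = np.hanning(frame_len)
        prev = np.zeros(frame_len // 2 + 1)
        flux = []
        for start in range(0, len(x) - frame_len, hop):
            mag = np.abs(np.fft.rfft(win * x[start:start + frame_len]))
            flux.append(np.sum(np.maximum(mag - prev, 0.0)))  # count only increases
            prev = mag
        return np.array(flux)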
An advantage of the explicit transient model in S+N+T models is that
transients can be preserved during time-compression or
expansion. That is, when a sound is stretched (without altering its
pitch), it is usually desirable to preserve the transients (i.e., to
keep their local time scales unchanged) and simply translate them to
new times. This topic, known as Time-Scale Modification (TSM),
will be considered further in §11.5 below.
In addition to S+N+T components, it is useful to superimpose
spectral weightings to implement linear filtering directly in
the frequency domain; for example, the formants of the human
voice are conveniently impressed on the spectrum in this way (as
illustrated in §11.3 above)
[174]. We refer to the general class of such
frequency-domain signal models as spectral models, and sound
synthesis in terms of spectral models is often called spectral
modeling synthesis (SMS).
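As a sketch of such a frequency-domain weighting, the function below
multiplies one frame's spectrum by a smooth envelope built from Gaussian
bumps at formant-like center frequencies; the center frequencies and
bandwidth are placeholders rather than measured vocal formants, and a
practical implementation would window the frames and overlap-add the
results:

    import numpy as np

    def apply_spectral_weighting(frame, fs, centers=(700.0, 1200.0, 2600.0), bw=150.0):
        """Impose a formant-like spectral envelope on one frame by multiplying
        its spectrum with a sum of Gaussian bumps (illustrative parameters)."""
        N = len(frame)
        X = np.fft.rfft(frame)
        f = np.fft.rfftfreq(N, d=1.0 / fs)
        weight = np.zeros_like(f)
        for fc in centers:
            weight += np.exp(-0.5 * ((f - fc) / bw) ** 2)
        return np.fft.irfft(X * weight, n=N)    # weighted frame back in the time domain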
The subsections below provide a summary review of selected aspects of
spectral modeling, with emphasis on applications in musical sound
synthesis and effects.