A formant synthesizer is a source-filter model in which the
source models the glottal pulse train and the filter models the
formant resonances of the vocal tract. Constrained linear prediction
can be used to estimate the parameters of formant synthesis models,
but more generally, formant peak parameters may be estimated directly
from the short-time spectrum (e.g., [257]). The filter in a
formant synthesizer is typically implemented using cascade or parallel
second-order filter sections, one per formant. Most modern rule-based
text-to-speech systems descended from software based on this type of
synthesis model [257,258,259].
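As a rough illustration of the filter structure described above, the following sketch drives one two-pole resonator per formant with an impulse train, in both parallel and cascade arrangements. The sample rate, the formant frequencies, bandwidths, and gains, and all function names are assumptions chosen for illustration; they are not taken from the cited references.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs):
    """Feedback coefficients of a two-pole resonator (one formant)."""
    r = math.exp(-math.pi * bw_hz / fs)      # pole radius set by bandwidth
    theta = 2.0 * math.pi * freq_hz / fs     # pole angle set by center frequency
    return 2.0 * r * math.cos(theta), -r * r

def resonate(x, freq_hz, bw_hz, fs):
    """y[n] = x[n] + c1*y[n-1] + c2*y[n-2]."""
    c1, c2 = resonator_coeffs(freq_hz, bw_hz, fs)
    y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = xn + c1 * y1 + c2 * y2
        out.append(yn)
        y2, y1 = y1, yn
    return out

def impulse_train(n_samples, period):
    """Unit impulses every `period` samples (a crude glottal source)."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def parallel_formants(x, formants, fs):
    """Sum of the gain-weighted outputs of one resonator per formant."""
    out = [0.0] * len(x)
    for freq, bw, gain in formants:
        for n, v in enumerate(resonate(x, freq, bw, fs)):
            out[n] += gain * v
    return out

def cascade_formants(x, formants, fs):
    """Resonators connected in series (per-formant gains are ignored)."""
    y = list(x)
    for freq, bw, _gain in formants:
        y = resonate(y, freq, bw, fs)
    return y

FS = 16000  # assumed sample rate in Hz
# (center frequency Hz, bandwidth Hz, gain): illustrative values only
FORMANTS = [(700.0, 80.0, 1.0), (1200.0, 90.0, 0.5), (2600.0, 120.0, 0.25)]

source = impulse_train(1600, 160)            # 100 Hz pitch at FS = 16 kHz
voice_parallel = parallel_formants(source, FORMANTS, FS)
voice_cascade = cascade_formants(source, FORMANTS, FS)
```

In the parallel form each formant amplitude is set independently through its gain, while in the cascade form the relative formant levels follow from the resonator responses themselves.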
Another type of formant-synthesis method, developed specifically for
singing-voice synthesis, is called the FOF method [389].
It can be considered an extension of the VOSIM voice synthesis
algorithm [220].
In the FOF method, the formant filters are implemented in the
time domain as parallel second-order sections; thus, the vocal-tract
impulse response is modeled as a sum of three or so exponentially
decaying sinusoids. Instead of driving this filter with a glottal
pulse wave, a simple impulse train is used, thereby greatly reducing
computational cost. A convolution of an impulse response with an
impulse train is simply a periodic superposition of the impulse
response. In the VOSIM algorithm, the impulse response was trimmed to
one period in length, thereby avoiding overlap and further reducing
computations.
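The equivalence between convolution with an impulse train and periodic superposition can be checked numerically. The sketch below builds an impulse response as a sum of three exponentially decaying sinusoids, overlap-adds it once per pitch period, and compares the result with a direct convolution. It also forms a variant trimmed to one period, as the text describes for VOSIM. The formant values, sample rate, and function names are illustrative assumptions, and the onset taper used by FOF is omitted here.

```python
import math

def decaying_sinusoid_sum(formants, fs, length):
    """Impulse response modeled as a sum of exponentially decaying sinusoids."""
    h = []
    for n in range(length):
        t = n / fs
        h.append(sum(g * math.exp(-math.pi * bw * t) * math.sin(2.0 * math.pi * f * t)
                     for f, bw, g in formants))
    return h

def superpose(h, period, n_samples):
    """Add a copy of h starting at every multiple of `period` samples."""
    out = [0.0] * n_samples
    for start in range(0, n_samples, period):
        for k, hk in enumerate(h):
            if start + k >= n_samples:
                break
            out[start + k] += hk
    return out

def convolve_with_impulse_train(h, period, n_samples):
    """Direct convolution of h with a unit impulse train, for comparison."""
    x = [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    y = []
    for n in range(n_samples):
        acc = 0.0
        for k in range(min(n + 1, len(h))):
            acc += h[k] * x[n - k]
        y.append(acc)
    return y

FS = 16000       # assumed sample rate in Hz
PERIOD = 160     # pitch period in samples (100 Hz)
N = 800          # output length in samples
# (center frequency Hz, bandwidth Hz, gain): illustrative values only
FORMANTS = [(700.0, 80.0, 1.0), (1200.0, 90.0, 0.5), (2600.0, 120.0, 0.25)]

h_full = decaying_sinusoid_sum(FORMANTS, FS, 400)        # longer than one period
overlapped = superpose(h_full, PERIOD, N)                # copies overlap
direct = convolve_with_impulse_train(h_full, PERIOD, N)  # same signal, more work
trimmed = superpose(h_full[:PERIOD], PERIOD, N)          # one period, no overlap

max_diff = max(abs(a - b) for a, b in zip(overlapped, direct))
```

The superposition needs only additions, whereas the direct convolution performs a multiply-add for every pair of input and impulse-response samples. With the trimmed response, each output sample is a single table lookup.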
The FOF method also tapers the beginning of the impulse response using
a rising half-cycle of a sinusoid. This qualitatively reduces the
"buzziness" of the sound, and compensates for having replaced the
glottal pulse with an impulse. In practice, however, the synthetic
signal is matched to the desired signal in the frequency domain, and the
details of the onset taper are adjusted to optimize audio quality more
generally, including by broadening the formant resonances.
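A minimal sketch of such an onset taper follows, assuming the taper is a raised half-cycle of a cosine, 0.5(1 - cos(beta t)), applied over an attack time of pi/beta seconds. This is one common way to write the FOF envelope and is an assumption here. The function name, the single formant, and the parameter values are illustrative only.

```python
import math

def fof_grain(freq_hz, bw_hz, attack_s, fs, length):
    """One decaying sinusoid with an optional smooth onset taper.

    With attack_s == 0 the taper is disabled and the grain is a plain
    exponentially decaying sinusoid.
    """
    alpha = math.pi * bw_hz                    # decay rate set by bandwidth
    grain = []
    for n in range(length):
        t = n / fs
        if attack_s > 0.0 and t < attack_s:
            beta = math.pi / attack_s          # taper completes half a cycle
            taper = 0.5 * (1.0 - math.cos(beta * t))
        else:
            taper = 1.0
        grain.append(taper * math.exp(-alpha * t) * math.sin(2.0 * math.pi * freq_hz * t))
    return grain

FS = 16000  # assumed sample rate in Hz
tapered = fof_grain(700.0, 80.0, 0.002, FS, 320)   # 2 ms attack (illustrative)
untapered = fof_grain(700.0, 80.0, 0.0, FS, 320)   # abrupt onset
```

The two grains are identical once the attack has finished. They differ only in how smoothly they rise from zero, which is the part of the waveform the taper is meant to control.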
One of the difficulties of formant synthesis methods is that formant
parameter estimation is not always easy [411].
The problem is particularly difficult when the fundamental frequency
is so high that the formants are not adequately "sampled" by
the harmonic frequencies, such as in high-pitched female voice
samples. Formant ambiguities due to insufficient spectral sampling
can often be resolved by incorporating additional physical constraints
to the extent they are known.
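The sampling problem can be made concrete with a little arithmetic: the harmonics of a voiced sound lie at integer multiples of the fundamental, so a higher pitch leaves fewer spectral samples of the formant envelope. The sketch below counts the harmonics below 4 kHz and measures how far the nearest harmonic lies from a formant, for a low and a high fundamental. The 700 Hz formant, the 4 kHz limit, the two pitches, and the function names are hypothetical values chosen for illustration.

```python
def harmonics_up_to(f0_hz, fmax_hz):
    """Harmonic frequencies k*f0 that do not exceed fmax."""
    count = int(fmax_hz // f0_hz)
    return [k * f0_hz for k in range(1, count + 1)]

def nearest_harmonic_offset(f0_hz, formant_hz):
    """Distance in Hz from a formant center to the closest harmonic."""
    k = max(1, round(formant_hz / f0_hz))
    return abs(k * f0_hz - formant_hz)

FORMANT_HZ = 700.0   # hypothetical first-formant center frequency
FMAX_HZ = 4000.0     # hypothetical upper limit of the formant region

low_pitch = harmonics_up_to(100.0, FMAX_HZ)    # 40 harmonics
high_pitch = harmonics_up_to(880.0, FMAX_HZ)   # 4 harmonics

low_offset = nearest_harmonic_offset(100.0, FORMANT_HZ)    # 0.0 Hz
high_offset = nearest_harmonic_offset(880.0, FORMANT_HZ)   # 180.0 Hz

print(len(low_pitch), low_offset)
print(len(high_pitch), high_offset)
```

At the 880 Hz fundamental, only four harmonics fall below 4 kHz and the closest one lies 180 Hz from the formant center, so the peak location and bandwidth are poorly determined by the harmonic amplitudes alone.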
Formant synthesis is an effective combination of physical and spectral
modeling approaches. It is a physical model in that there is an
explicit division between glottal-flow wave generation and the
formant-resonance filter, despite the fact that a physical model is
rarely used for either the glottal waveform or the formant resonator.
On the other hand, it is a spectral modeling method in that its
parameters are estimated by explicitly matching short-time audio
spectra of desired sounds. It is usually most effective for any
synthesis model, physical or otherwise, to be optimized in the "audio
perception" domain to the extent it is known how to do this
[315,166]. For an illustrative example, see [202].