KlattSynthesizer


Last updated at 10:19 am UTC on 24 September 2004

The Klatt synthesizer is a formant synthesizer with cascade and parallel configurations. The class KlattSynthesizer is used by the class KlattVoice.

References (from the class comment):
[1] Klatt, D.H. "Software for a cascade/parallel formant synthesizer", in the Journal of the Acoustical Society of America, pages 971-995, volume 67, number 3, March 1980.

[2] Klatt, D.H. and Klatt, L.C. "Analysis, synthesis and perception of voice quality variations among female and male talkers", in the Journal of the Acoustical Society of America, pages 820-857, volume 87, number 2, February 1990.

[3] Fant, G., Liljencrants, J., & Lin, Q. "A four-parameter model of glottal flow", Speech Transmission Laboratory Quarterly Progress Report 4/85, KTH.

[4] Alwan, A., Bangayan, P., Kreiman, J., and Long, C. "Time and Frequency Synthesis Parameters of Severely Pathological Voice Qualities."

Additional references:
Rutledge, J., Cummings, K., Lambert, D. & Clements, M. (1995), "Synthesizing styled speech using the Klatt synthesizer", in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, Detroit, Michigan, USA, pp. 648–651.

Links

Luciano
Kismet, the robot (DecTalk 4.5 uses the KlattSynthesizer as well)
Another Klatt Synthesizer implementation http://www.phon.ucl.ac.uk/resource/sfs/help/man/klattsyn.htm
List of speech synthesis systems
University of Munich link list (in German) (seems to have moved here)

Usually, the cascade branch is used to synthesize vowel-like sounds and the parallel branch is used to synthesize consonantal sounds, but the parallel branch alone can synthesize both vowels and consonants. There are 52 time-varying parameters that control the synthesizer:

Excitation source (voice, aspiration and friction) parameters:
- f0 Fundamental frequency (in Hz). Determines the glottal pulse length.
- flutter Amount of flutter (slow f0 fluctuation).
- jitter Amount of jitter (random variation in the length of the glottal pulse)
- shimmer Amount of shimmer (random variation in the amplitude of the glottal pulse)
- diplophonia Amount of diplophonia (periodic bimodal variation in the glottal pulse length)
- voicing Amplitude of voicing
- ro Relative duration of open phase of glottal pulse
- ra Relative duration of return phase of glottal pulse
- rk Symmetry of the glottal pulse
- aspiration Amplitude of aspiration
- friction Amplitude of friction
- turbulence Amplitude of turbulence (in open glottal phase)
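
As an illustration of the flutter parameter, reference [2] describes the slow f0 fluctuation as a sum of three low-frequency sinusoids scaled by the flutter amount and by f0. The sketch below is written in Python rather than Smalltalk and follows that published formulation. The function name `flutter_f0` is hypothetical, the 0 to 100 range assumed for `flutter` comes from the paper, and nothing here has been checked against the Squeak class, which may scale the parameter differently.

```python
import math


def flutter_f0(f0_hz, flutter, t_sec):
    """Return f0 (Hz) at time t_sec after adding slow flutter.

    Assumption: `flutter` is a percentage in the range 0..100, as in
    reference [2]; the Squeak KlattSynthesizer may use another scale.
    """
    # Three sinusoids with unrelated frequencies give a slow,
    # quasi-random wobble instead of an obviously periodic vibrato.
    wobble = (
        math.sin(2.0 * math.pi * 12.7 * t_sec)
        + math.sin(2.0 * math.pi * 7.1 * t_sec)
        + math.sin(2.0 * math.pi * 4.7 * t_sec)
    )
    delta_hz = (flutter / 50.0) * (f0_hz / 100.0) * wobble
    return f0_hz + delta_hz
```

With `flutter` set to 0 the function returns `f0_hz` unchanged. Because the three sinusoids sum to at most 3 in magnitude, the deviation never exceeds `3 * (flutter / 50) * (f0_hz / 100)` Hz.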

Formant frequencies and bandwidths:
- f1 Frequency of 1st formant
- b1 Bandwidth of 1st formant
- df1 Change in F1 during open portion of period
- db1 Change in B1 during open portion of period
- f2 Frequency of 2nd formant
- b2 Bandwidth of 2nd formant
- f3 Frequency of 3rd formant
- b3 Bandwidth of 3rd formant
- f4 Frequency of 4th formant
- b4 Bandwidth of 4th formant
- f5 Frequency of 5th formant
- b5 Bandwidth of 5th formant
- f6 Frequency of 6th formant
- b6 Bandwidth of 6th formant
- fnp Frequency of nasal pole
- bnp Bandwidth of nasal pole
- fnz Frequency of nasal zero
- bnz Bandwidth of nasal zero
- ftp Frequency of tracheal pole
- btp Bandwidth of tracheal pole
- ftz Frequency of tracheal zero
- btz Bandwidth of tracheal zero
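
Each frequency/bandwidth pair above describes one resonance. Reference [1] realizes such a resonance as a second-order digital filter whose coefficients are computed from the frequency, the bandwidth and the sampling rate. The Python sketch below follows that published recipe and chains several resonators in series, as the cascade branch does. The names `resonator_coefficients`, `resonate` and `cascade` are hypothetical, and the code is not a transcription of KlattSynthesizer>>samplesFromFrames:.

```python
import math


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2], per [1]."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    return a, b, c


def resonate(samples, freq_hz, bandwidth_hz, sample_rate_hz):
    """Filter a list of samples through one formant resonator."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz)
    y1 = y2 = 0.0  # the two previous output samples
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def cascade(samples, formants, sample_rate_hz):
    """Pass samples through resonators in series.

    `formants` is a list of (freq_hz, bandwidth_hz) pairs, for example
    the values of f1/b1, f2/b2, ... for one frame.
    """
    for freq_hz, bandwidth_hz in formants:
        samples = resonate(samples, freq_hz, bandwidth_hz, sample_rate_hz)
    return samples
```

For example, `cascade(source, [(500.0, 60.0), (1500.0, 90.0)], 10000.0)` shapes a source signal with two formants. The formant values in this call are illustrative only.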

Parallel Friction-Excited:
- a2f Amplitude of friction-excited parallel 2nd formant
- a3f Amplitude of friction-excited parallel 3rd formant
- a4f Amplitude of friction-excited parallel 4th formant
- a5f Amplitude of friction-excited parallel 5th formant
- a6f Amplitude of friction-excited parallel 6th formant
- bypass Amplitude of friction-excited parallel bypass path
- b2f Bandwidth of friction-excited parallel 2nd formant
- b3f Bandwidth of friction-excited parallel 3rd formant
- b4f Bandwidth of friction-excited parallel 4th formant
- b5f Bandwidth of friction-excited parallel 5th formant
- b6f Bandwidth of friction-excited parallel 6th formant

Parallel Voice-Excited:
- anv Amplitude of voice-excited parallel nasal formant
- a1v Amplitude of voice-excited parallel 1st formant
- a2v Amplitude of voice-excited parallel 2nd formant
- a3v Amplitude of voice-excited parallel 3rd formant
- a4v Amplitude of voice-excited parallel 4th formant
- atv Amplitude of voice-excited parallel tracheal formant
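
In the parallel branch each formant has its own amplitude control, and the resonator outputs are added together instead of being chained. The Python sketch below shows only that wiring. It assumes linear gains (the amplitude parameters above may well be expressed in dB), uses hypothetical names (`parallel_branch`, `_resonate`), and leaves out finer points of the parallel configuration described in [1].

```python
import math


def _resonate(samples, freq_hz, bandwidth_hz, sample_rate_hz):
    """Second-order resonator, y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def parallel_branch(source, formants, sample_rate_hz, bypass_gain=0.0):
    """Sum independently scaled resonator outputs.

    `formants` is a list of (freq_hz, bandwidth_hz, linear_gain)
    triples. `bypass_gain` scales the unfiltered source, loosely
    mirroring the bypass path listed above (an assumption).
    """
    out = [bypass_gain * x for x in source]
    for freq_hz, bandwidth_hz, gain in formants:
        if gain == 0.0:
            continue  # a silent formant contributes nothing
        filtered = _resonate(source, freq_hz, bandwidth_hz, sample_rate_hz)
        for i, y in enumerate(filtered):
            out[i] += gain * y
    return out
```

Because every resonator sees the same source, setting one gain to zero removes that formant without affecting the others. This independence is what lets the parallel branch approximate consonant spectra that a fixed series chain cannot.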

Overall gain:
- gain Overall gain

For background information on the Klatt synthesizer, see here.

An email, kept as a source for further documentation and examples. Whether what it states still applies to the most recent version of Squeak remains to be checked.

Date: Sun, 04 Mar 2001 13:27:06 -0600
Subject: Voice Synthesis Question
From: "Harry E. Fassl"

I've been experimenting with synthesizing speech from text contained in a
file external to the image.

I've had to make two modifications to Speaker>>say. (See below)

The first change is the modification of the literal 1000 to 110.
If I leave the code at 1000, the delay between reading the line and the
beginning of 'speaking' is unacceptably long.
My guess is that this is/was processor speed dependent and maybe needs an
instance variable, with a default at init time and a setter. (I'm running a
B&W G3 @450MHz)

The second change is to have Speaker>>say return events duration. This is
used to pause the Speaker between lines. Otherwise the next line read starts
playing before the previous line finishes.
(See the workspace code below.)

Running Squeak 3.0 image Latest update #3545, VM30Alph7MT

Questions are
Am I going to step on something else by making these changes?
(Methodology suggestions for researching this graciously accepted.)

Is there a better way to get the result I'm after?

Harry

H.E.Fassl
http://www.mcs.net/~hefassl

say: aString
    | events stream string |
    events _ CompositeEvent new.
    stream _ ReadStream on: (aString findTokens: '?' keep: '?').
    [stream atEnd] whileFalse: [
        string _ stream next.
        stream atEnd ifFalse: [string _ string , stream next].
        events addAll: (self eventsFromString: string)].
    events playOn: self voice delayed: events duration 110. "– Modified from 1000 –"
    self voice flush.
    ^ events duration "– Returned for use with Delay to pause between lines. –"

WORKSPACE SNIPPET

| voiceFile voiceText thisSpeaker thisVoice crcr timeDelay |

Transcript clear.
SoundPlayer stopReverb.
crcr _ String with: Character cr with: Character cr.
voiceFile _ StandardFileStream oldFileNamed: 'Nativity.text'.
thisVoice _ KlattVoice new tract: 11.7; breathiness: 0.32; shimmer: 0.1; ro: 0.7; rk: 0.45; ra: 0.008.
thisSpeaker _ Speaker new voice: thisVoice.
thisSpeaker pitch: 210.0; speed: 0.613.
[voiceFile atEnd] whileFalse: [
    voiceText _ ((voiceFile upTo: Character cr) copyReplaceAll: crcr with: '') withBlanksTrimmed.
    Transcript show: voiceText; cr.
    "Pause between lines to avoid the next line starting before this one is finished."
    timeDelay _ Delay forSeconds: (thisSpeaker say: voiceText) + 1.
    timeDelay wait].
voiceFile finalize.

Measuring performance
From: Phil Weichert
Subject: Speaker bigMan - Performance
Date: Sat, 18 Aug 2001 15:20:45 -0500

Performance enhancement ideas requested.

I have been looking into the voice synthesis stuff. In going through the
examples in the image, such as
Speaker bigMan say: 'You can cheat, but don''t get caught.'
I noticed that the speech pattern is slow, like a 45 rpm record played at
33 1/3. I tried several other examples and they all do the same. I
would hope that a 500 MHz PC should handle this, but apparently not. I
evaluated the following:
TimeProfileBrowser onBlock: [Speaker bigMan say: 'You can cheat, but don''t get caught.']

In the leaves section, a tremendous amount of time is lost in "hash" and
process resume. I am rusty on hash. Any practical suggestions on
improving the performance of hash or any of the other parts? I would
like to hear bigMan Speaker speak at a normal tempo.



' - 58 tallies, 1031 msec.

**Tree**
69.0% {711ms} Speaker>>say:
  |60.3% {622ms} CompositeEvent(VoiceEvent)>>playOn:delayed:
  |  |60.3% {622ms} CompositeEvent>>playOn:at:
  |  |  56.9% {587ms} PhoneticEvent>>playOn:at:
  |  |    |56.9% {587ms} KlattVoice>>playPhoneticEvent:at:
  |  |    |  56.9% {587ms} KlattVoice>>playEvent:segments:boundary:at:
  |  |    |    36.2% {373ms} KlattVoice>>playEvent:frames:at:
  |  |    |      |15.5% {160ms} KlattSynthesizer>>samplesFromFrames:
  |  |    |      |  |8.6% {89ms} primitives
  |  |    |      |  |6.9% {71ms} OrderedCollection>>do:
  |  |    |      |10.3% {106ms} KlattVoice(Voice)>>playBuffer:at:
  |  |    |      |  |10.3% {106ms} QueueSound(AbstractSound)>>play
  |  |    |      |  |  10.3% {106ms} SoundPlayer class>>playSound:
  |  |    |      |  |    10.3% {106ms} SoundPlayer class>>resumePlaying:
  |  |    |      |  |      10.3% {106ms} SoundPlayer class>>resumePlaying:quickStart:
  |  |    |      |  |        10.3% {106ms} SoundPlayer class>>startUpWithSound:
  |  |    |      |  |          10.3% {106ms} SoundPlayer class>>startPlayerProcessBu...e:rate:stereo:sound:
  |  |    |      |  |            10.3% {106ms} Process>>resume
  |  |    |      |3.4% {35ms} KlattVoice>>dBFromLinear:
  |  |    |      |3.4% {35ms} KlattVoice>>linearFromdB:
  |  |    |      |  3.4% {35ms} SmallInteger(Number)>>raisedTo:
  |  |    |    20.7% {213ms} KlattVoice>>currentFramesCount:
  |  |    |      20.7% {213ms} KlattSegment>>left:right:speed:pattern:
  |  |    |        15.5% {160ms} KlattSegment>>slopeWith:selector:speed:
  |  |    |          13.8% {142ms} Dictionary>>at:
  |  |    |            13.8% {142ms} Dictionary>>at:ifAbsent:
  |  |    |              12.1% {125ms} Dictionary(Set)>>findElementOrNil:
  |  |    |                12.1% {125ms} Dictionary>>scanFor:
  |  |    |                  8.6% {89ms} Symbol(SequenceableCollection)>>hash
  |  |    |                    6.9% {71ms} primitives
  |  |  3.4% {35ms} KlattVoice>>flush
  |  |    3.4% {35ms} KlattVoice>>playEvent:segments:boundary:at:
  |  |      3.4% {35ms} KlattVoice>>currentFramesCount:
  |  |        3.4% {35ms} KlattSegment>>left:right:speed:pattern:
  |8.6% {89ms} Speaker>>eventsFromString:
  |  5.2% {54ms} Clause>>accept:
  |    |3.4% {35ms} F0RenderingVisitor>>clause:
  |  3.4% {35ms} Speaker>>clauseFromString:
  |    3.4% {35ms} Speaker>>phraseFromString:
  |      3.4% {35ms} Speaker>>wordFromString:
  |        3.4% {35ms} PhoneticTranscriber>>transcriptionOf:
  |          3.4% {35ms} PhoneticRule>>matches:at:
31.0% {320ms} Speaker class>>bigMan
  22.4% {231ms} KlattVoice class(Voice class)>>new
    |22.4% {231ms} KlattVoice>>initialize
    |  22.4% {231ms} KlattSegmentSet class>>arpabet
    |    22.4% {231ms} KlattSegmentSet>>initializeArpabet
  8.6% {89ms} Speaker class>>new
    8.6% {89ms} Speaker>>initialize
      6.9% {71ms} PhoneticTranscriber class>>default
        6.9% {71ms} PhoneticTranscriber class>>english
          6.9% {71ms} PhoneticRule class>>english

**Leaves**
10.3% {106ms} String(SequenceableCollection)>>hash
10.3% {106ms} Process>>resume
8.6% {89ms} SmallInteger>>hashMultiply
8.6% {89ms} KlattSynthesizer>>samplesFromFrames:
6.9% {71ms} Dictionary>>scanFor:
6.9% {71ms} OrderedCollection>>do:
3.4% {35ms} KlattVoice>>dBFromLinear:
3.4% {35ms} Symbol>>=
3.4% {35ms} KlattSegmentParameter>>fixed:
3.4% {35ms} String class(Object)>>hash
'

Thanks,
Phil

Note: emacspeak may use DECTalk speech servers.