Communication is vital for one’s survival. Even animals communicate to a limited extent. Some neuroscientists and philosophers believe that one’s self-awareness as an individual has itself evolved because of spoken communication. Adequate stimulation and a good communication environment are required for a child’s cognitive development. Speech communication preceded written communication. In some ancient cultures, knowledge has been passed down through generations by oral tradition (as in the Vedic tradition). Even an illiterate person can speak without any knowledge of phonemes or the alphabet.

Language has two major components:

Words: A word is a unit of a language that carries a meaning. Sometimes one and the same word carries different meanings; the actual meaning is derived from the context. Words may be expressed physically using a sequence of speech sounds, a sequence of letters as in writing, or one or more gestures as in a sign language. A sequence of spoken words forms an utterance. Communication using speech sounds is referred to as speech communication. Improper pronunciation of speech sounds is a speech disorder. If a speech disorder has arisen from a lack of awareness and not from a neuro-muscular deficiency, it can be corrected by means of computerized speech training.

Grammar: A sentence consists of a group of words put in proper order. Grammar governs the relationships between the words in a sequence so as to bring out an unambiguous meaning. This is a mental or cognitive activity. Grammar is common to both spoken and written languages. Improper use of grammar is a language disorder.

In order to identify if spoken speech is normal or can be improved or has a deficiency, one has to be aware of the process of normal speech production, which is explained below. See Section on ‘Speech Disorders and Training’ for the different types of speech disorders. Certain speech disorders may be corrected by means of computerized speech training using software such as Vagmi. See under ‘Speech Trainer Modules in Vagmi’.

Sounds of speech are used to form words of a language. Although a human being can vocalize a variety of sounds, only a finite set of sounds (about 50) are recognized as speech sounds of a language. However, the sequence of speech sounds can be varied. Though any number of sequences is possible, only certain sequences of speech sounds form meaningful words. Speech sounds are specific to a language though some speech sounds are heard as very similar across languages. While speaking, transition from one speech sound to another is smooth and continuous.

In a written language, a word consists of a sequence of letters. A letter or a combination of letters represents a speech sound. One and the same letter may represent different speech sounds (Ex. ‘c’ as in ‘cat’ and ‘cell’; ‘th’ as in ‘thin’ and ‘this’). Specialists use internationally recognized special symbols, called phonetic alphabets, to avoid such ambiguity.

Sound and Space: Consider the example of ‘jala tarang’. Glass cups are filled with water to different levels. A sound is produced when the rim of a cup is gently struck. The sound quality of the note so produced depends on the quantity of water in the cup. By varying the amount of empty space in the cup, different notes can be produced. In other words, the sound quality of a note depends on the volume of emptiness in the cup. We draw an analogy between this and the production of speech sounds.

Articulation: The tongue, the lower jaw and the lips are called articulators. (For the role of the velum, see below.) The tongue itself can take several shapes: the tongue body lying low; the tongue tip rising towards the upper teeth ridge; the tongue curling backwards and reaching the roof of the mouth; and the tongue spread out horizontally and protruding towards the roof of the mouth.
The empty space from the larynx to the lips is called the vocal tract (VT). The complex 3-dimensional shape of the vocal tract can be altered by changing the positions of the articulators. Each distinct vocal tract shape is responsible for producing a distinct sound. Each of the articulators can be independently placed at any one of a large number of practically feasible positions, thus giving rise to a large number of sounds. However, only a finite set of positions (the so-called target positions of the articulators) is used to produce speech sounds.

Vowels and Consonants: Broadly, speech sounds are classified as vowels and consonants. During the production of a vowel (Ex. ‘aa’, ‘ee’, ‘oo’), the VT shape has a relatively large opening and there is no impediment to the free flow of air from the lungs to the lips. At the other extreme, during the production of the so-called stop consonants (‘p’, ‘t’, ‘k’), air flow is completely stopped momentarily. For example, during the production of ‘p’, both lips are held together, sealing the air passage. For other consonants, the opening in the VT lies between that for the vowels (relatively wide open) and that for the stops (sealed air passage).

Role of Velum: The velum is also an articulator. During the production of nasal sounds such as ‘m’ or ‘n’, the velum allows air (sound waves) to pass through the nasal tract. During non-nasal sounds, the velum seals the air passage to the nasal tract.

Articulatory Dynamics: During speech production, articulators move continually from one set of positions to another. Such a movement is called articulatory dynamics. Some sounds can be produced in a sustained manner (Ex. vowels, ‘m’, ‘s’). Some sounds are transitory (Ex. ‘k’, ‘p’, ‘t’). During speech production, articulators move relatively slowly for sustained sounds and rather rapidly during transient sounds.

Co-articulation: Imagine producing the syllables ‘key’ and ‘koo’, and observe the shape of the lips. To say ‘key’, the lips are already spread during the initial ‘k’ sound, as required for the following vowel ‘ee’. To say ‘koo’, the lips are already rounded during the initial ‘k’ sound, as required for the following vowel ‘oo’. This is called anticipatory co-articulation. In this example we say that lip shape is not critical for the ‘k’ sound. Co-articulation helps fluent speech production, as the transition from one sound to another is smooth. It also allows a faster rate of speech production. In the absence of co-articulation, speech would sound very artificial, as if jumping from one sustained sound to another (‘telegraphic speech’).

Acoustically, the initial ‘k’ in these two syllables, ‘key’ and ‘koo’, differs in its spectral characteristics. However, listeners have no difficulty in recognizing the initial ‘k’. Also, as the tongue moves from the position for ‘k’ to the target position of the following vowel, the spectral characteristics change continually, even during the vowel.

Similarly, during the production of a nasal sound, the velum may begin to open during the preceding sound itself, and it may close only during the following sound. Thus we have nasalized vowels preceding or following a nasal. This is also a case of co-articulation.

Manner of Production: A string instrument doesn’t produce musical notes unless it is plucked. The mechanical energy of the plucking action sets the string vibrating and thereby produces audible sound. In the same way, one can silently move the articulators without producing an audible sound; acoustic energy is required in order to produce audible sounds. In speech production, air from the lungs is the main supply of energy. However, a steady flow of air (as in blowing) doesn’t produce the audible frequencies required for speech sounds. The steady flow of air from the lungs must be disturbed by means of voluntary action. Such a disturbance is created in three distinct manners: (a) Voicing, (b) Frication and (c) Bursts.

Voicing: Air flow from the lungs is interrupted almost cyclically by means of the vibrating vocal folds. See ‘How is voice produced?’ in ‘Voice Awareness’. These interrupted air flow pulses (glottal pulses) are modified by the shape of the vocal tract to produce voiced speech sounds. All vowels are voiced. Semi-vowels (Liquids: ‘l’, ‘r’; Glides: ‘y’, ‘w’) and nasals (‘m’, ‘n’, ‘ng’) are also voiced.

Frication: The vocal folds are held apart (abducted) and a steady air flow is allowed to escape through the glottis. A partial obstruction is formed within the vocal tract, allowing air to flow only through a very narrow passage. A jet of air with a hissing noise is produced. This turbulent, noisy air flow is modified by the VT shape to produce what are called fricative sounds. For example, during the production of the ‘s’ sound, the tongue tip is held close to the upper teeth, leaving a narrow passage. Sounds such as ‘s’, ‘sh’, ‘ch’ and ‘f’ are fricatives. During the production of some sounds such as ‘z’, the vocal folds are kept partly open while part of their length is set into cyclic motion, and a narrow passage is formed within the VT as for the ‘s’ sound. Such sounds have both voicing and frication.

Bursts: The vocal folds are held apart (abducted) and a steady air flow is allowed to escape through the glottis. A complete obstruction is formed within the vocal tract. The flow of air is stopped so that pressure builds up behind the obstruction. Then the pressure is suddenly released. This is similar to blowing up a balloon and then pricking it with a pin. Sounds produced with this kind of acoustic energy are called unvoiced stop sounds (Ex. ‘k’, ‘t’, ‘p’). In the production of voiced stop sounds (Ex. ‘g’, ‘d’, ‘b’), the vocal folds are kept partly open during the pressure build-up while part of their length is set into cyclic motion. Air pressure still builds up behind the obstruction. Such sounds have both voicing and explosion.
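The three manners of production can be summarized as a small lookup table. The following is a minimal sketch in Python, using only sounds named in the paragraphs above; the names SOUND_SOURCES, is_voiced and sounds_with are hypothetical and introduced here purely for illustration, and the keys are this text's informal spellings rather than phonetic-alphabet symbols.

```python
# Energy sources per sound, following the descriptions of voicing,
# frication and bursts above. Keys are informal spellings, not IPA.
SOUND_SOURCES = {
    "aa": {"voicing"},               # vowel
    "m": {"voicing"},                # nasal
    "s": {"frication"},              # unvoiced fricative
    "f": {"frication"},
    "z": {"voicing", "frication"},   # voiced fricative
    "k": {"burst"},                  # unvoiced stop
    "p": {"burst"},
    "g": {"voicing", "burst"},       # voiced stop
    "b": {"voicing", "burst"},
}


def is_voiced(sound: str) -> bool:
    """True if vocal-fold vibration contributes to the sound."""
    return "voicing" in SOUND_SOURCES[sound]


def sounds_with(source: str) -> list[str]:
    """All listed sounds that use the given energy source."""
    return sorted(s for s, srcs in SOUND_SOURCES.items() if source in srcs)


print(sounds_with("burst"))  # ['b', 'g', 'k', 'p']
```

Representing each sound as a set of sources makes it easy to express sounds such as ‘z’ and ‘g’ that combine voicing with a second source.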

Place of Articulation: Sound quality depends on the place of the obstruction within the VT. There are five different places for stops:
Velar (‘k’, ‘g’): The back part of the tongue forms a hump and rises to the roof of the mouth.
Retroflex (‘t’ as in ‘take’, ‘d’ as in ‘do’): The tongue is curled upwards and rises to the roof of the mouth.
Palatal (‘ch’, ‘jh’): The tongue tip is held against the upper palate.
Dental (‘th’ as in ‘thin’, ‘th’ as in ‘this’): The tongue tip is held against the upper teeth ridge.
Labial (‘p’, ‘b’): The lips are closed.
Tongue position for nasal sounds: for ‘n’, it is the same as in a dental stop; for ‘m’, the closure is the same as in a labial stop; for ‘ng’, it is the same as in a velar stop.
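The places listed above, and the nasal-to-stop correspondence, can be written down as two small tables. This is a minimal Python sketch; STOPS_BY_PLACE, NASAL_PLACE and stops_sharing_place are hypothetical names, and each pair is ordered (unvoiced, voiced) following the text's own examples.

```python
# Place of articulation -> (unvoiced stop, voiced stop), as listed above.
# Spellings are the text's informal labels, not phonetic-alphabet symbols.
STOPS_BY_PLACE = {
    "velar": ("k", "g"),
    "retroflex": ("t", "d"),
    "palatal": ("ch", "jh"),
    "dental": ("th (thin)", "th (this)"),
    "labial": ("p", "b"),
}

# Each nasal shares its place of closure with one class of stops.
NASAL_PLACE = {"n": "dental", "m": "labial", "ng": "velar"}


def stops_sharing_place(nasal: str) -> tuple[str, tuple[str, str]]:
    """Return the place of a nasal and the stops made at the same place."""
    place = NASAL_PLACE[nasal]
    return place, STOPS_BY_PLACE[place]


print(stops_sharing_place("ng"))  # ('velar', ('k', 'g'))
```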

Stop sounds are transitory. Several micro-intervals are recognized during the production of stops:

(i) Closure Duration (CD): The interval during which air pressure builds up behind the obstruction within the VT. For unvoiced stops there is no sound during this interval; it is almost like a silence or pause. For voiced stops, however, the vocal folds are partially vibrating during the CD, and there is a barely audible sound. During an exaggerated ‘b’, a sound may be heard because of the vibrating cheek walls.

(ii) Stop Release or Burst: The interval when the built-up pressure is suddenly released, as in an explosion. A large-amplitude, transient, noise-like signal is produced during this interval. The spectrum of the burst depends both on the place of production and on the following vowel.

(iii) Aspiration Interval: After the release of the stop burst, there is a finite interval before the vocal folds are set into cyclic vibratory movement to produce the voicing for the following vowel. During this interval, a low-level, barely audible noise is sometimes produced at the glottis. This is referred to as the aspiration interval.

In some languages (including many Indian languages), the aspiration noise is intentionally made audible to produce what are called aspirated stops (Ex. ‘bh’ as in ‘Bharat’).

Voice Onset Time (VOT): The interval from the instant of release of the built-up pressure to the initiation of voicing for the following vowel is called the voice onset time or VOT. VOT differs for different stops and is language dependent. It is an important parameter for inferring both the ability to build up pressure and the agility of the articulators. For voiced stops, since the vocal folds are partially vibrating during the closure interval, there is no need to initiate voicing after the release. By convention, VOT for voiced stops is referred to as negative and corresponds to the CD itself.
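The sign convention for VOT can be made concrete with a small calculation. The sketch below assumes that the instants of burst release and of voicing onset have already been marked on a time axis in milliseconds; the function name voice_onset_time_ms and all the time values are hypothetical and chosen only for illustration.

```python
def voice_onset_time_ms(release_ms: float, voicing_onset_ms: float) -> float:
    """VOT = time of voicing onset minus time of burst release.

    Positive when voicing starts after the release (unvoiced stops);
    negative when voicing is already present before the release
    (voiced stops, by the convention described above).
    """
    return voicing_onset_ms - release_ms


# Hypothetical unvoiced stop: release at 180 ms, voicing starts at 240 ms.
print(voice_onset_time_ms(180, 240))  # 60

# Hypothetical voiced stop: closure (and voicing) starts at 100 ms,
# release at 180 ms, so the VOT equals minus the closure duration.
print(voice_onset_time_ms(180, 100))  # -80
```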

An Utterance: Like a garland made of different flowers, an utterance ties together a number of words to form a wholesome structure. An utterance consists of a fluent and natural style of saying a sequence of words using proper intonation, rhythm, duration of syllables etc. This is in contrast to reading one word at a time with a relatively longer pause between words. A pause is explicitly shown between words in writing. In a spoken utterance such pauses between words are not evident. Spoken language has an informal style of presentation with a greater fluency whereas reading follows a strict rhythm with a noticeable pause between words and a longer pause between sentences.

Differences in the style of presentation of a spoken language having a common grammar give rise to different dialects.

The terms prosodic features and supra-segmental features are used to signify the style of presentation of an utterance. The naturalness of an utterance is governed by the following attributes.

Intonation: During speech production, the fundamental frequency (F0), or pitch, is varied over the entire duration of an utterance. Strictly, F0 is defined only for voiced speech sounds. A plot of F0 versus time is called the F0-contour. Intonation, however, is a perceptual attribute and is treated as a hypothesized continuous curve over the entire utterance. The intonation pattern is language dependent and also depends on the syntactic structure of the sentence. Intonation further depends on the type of sentence: assertive, interrogative, exclamatory etc. With emotion, the average F0 level shifts.
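The difference between a measured F0-contour (defined only on voiced frames) and a continuous intonation curve can be illustrated with a short sketch. Linear interpolation across unvoiced gaps is an assumption made here for illustration only, not a method stated in this text; the function name fill_unvoiced and the frame values are likewise hypothetical.

```python
def fill_unvoiced(f0):
    """Turn a frame-wise F0 list (None = unvoiced frame) into a continuous curve.

    Gaps between voiced frames are filled by linear interpolation (an
    assumption); leading or trailing gaps copy the nearest voiced value.
    """
    voiced = [i for i, v in enumerate(f0) if v is not None]
    if not voiced:
        raise ValueError("at least one voiced frame is needed")
    filled = []
    for i, v in enumerate(f0):
        if v is not None:
            filled.append(float(v))
            continue
        left = max((j for j in voiced if j < i), default=None)
        right = min((j for j in voiced if j > i), default=None)
        if left is None:
            filled.append(float(f0[right]))
        elif right is None:
            filled.append(float(f0[left]))
        else:
            w = (i - left) / (right - left)
            filled.append(f0[left] + w * (f0[right] - f0[left]))
    return filled


# Hypothetical F0 values in Hz, one per analysis frame.
contour = [None, 100, None, 120, None]
print(fill_unvoiced(contour))  # [100.0, 100.0, 110.0, 120.0, 120.0]
```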

Linguistic Stress, Emphasis, Tones: In some languages, the same sequence of speech sounds (phonetically the same sequence) carries different meanings depending on the stress or pitch level over a vowel. For example, in English the noun versus verb distinction in words like ‘object’ and ‘record’ is made by a change of stress. Also, in languages like English, there are local changes in F0 over a specific syllable of a word.

Emphasis: When a listener’s attention is to be drawn to a specific word, that word is spoken with extra intensity and possibly an increased F0. This is called emphasis. Stress and emphasis share some common acoustic attributes. Hence, stress is sometimes used to signify emphasis even in languages which are not stress based.

In some so-called ‘tonal’ languages, the meaning of a word changes depending on the F0-contour used to pronounce the word.

Rhythm: If one imagines clapping in time with the spoken sounds, one clap is usually associated with one syllable. Thus words like ‘one’, ‘two’ and ‘sky’ have one syllable (one clap), whereas a word like ‘seven’ (‘se’-‘ven’) has two syllables. A word like ‘elephant’ can be timed as ‘e’-‘li’-‘phant’, which comprises three syllables. In English, the dominant syllable structure is CV or CVC, where C signifies a consonant and V signifies a vowel. In Indian languages the dominant pattern is CVCV (Ex. Hari, Rama etc).
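The C/V notation can be tried out with a short sketch. It works on a list of speech sounds rather than on spelling, since one letter may stand for different sounds. The vowel set VOWELS is a hypothetical, incomplete list of informal vowel labels, and counting one syllable per vowel is a simplifying assumption that matches the examples above; cv_pattern and syllable_count are illustrative names.

```python
# Hypothetical, incomplete set of informal vowel labels; anything else
# in the input is treated as a consonant.
VOWELS = {"a", "aa", "e", "ee", "i", "o", "oo", "u"}


def cv_pattern(sounds):
    """Map a sequence of speech sounds to a string of C and V symbols."""
    return "".join("V" if s in VOWELS else "C" for s in sounds)


def syllable_count(sounds):
    """Count syllables as one per vowel (a simplifying assumption)."""
    return cv_pattern(sounds).count("V")


print(cv_pattern(["h", "a", "r", "i"]))           # CVCV
print(cv_pattern(["s", "e", "v", "e", "n"]))      # CVCVC
print(syllable_count(["s", "e", "v", "e", "n"]))  # 2
```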

In stress dominant languages like English, the temporal spacing between the stressed vowels in syllables defines the rhythm. In Indian languages, the temporal spacing between syllables determines the rhythm.

Duration Rules: A given utterance can be spoken either rapidly or slowly. When producing speech at different rates, either the durations of both C and V are varied in equal proportion, or the duration of C is kept constant and only the duration of V is changed. The rule is language dependent.
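The two duration rules can be compared side by side in a short sketch. The function scale_durations, the rule names "proportional" and "vowel_only", and the segment durations are all hypothetical, chosen only to illustrate the difference between the rules.

```python
def scale_durations(segments, factor, rule):
    """Rescale segment durations for a change in speaking rate.

    segments: list of (label, kind, duration_ms) with kind "C" or "V".
    rule "proportional": C and V durations are both multiplied by factor.
    rule "vowel_only":   C durations are kept, only V durations are scaled.
    """
    if rule not in ("proportional", "vowel_only"):
        raise ValueError("rule must be 'proportional' or 'vowel_only'")
    scaled = []
    for label, kind, duration in segments:
        if rule == "proportional" or kind == "V":
            duration = duration * factor
        scaled.append((label, kind, duration))
    return scaled


# Hypothetical durations (ms) for the CVCV word 'Hari'.
hari = [("h", "C", 60), ("a", "V", 100), ("r", "C", 40), ("i", "V", 120)]

print(scale_durations(hari, 0.5, "proportional"))
# [('h', 'C', 30.0), ('a', 'V', 50.0), ('r', 'C', 20.0), ('i', 'V', 60.0)]
print(scale_durations(hari, 0.5, "vowel_only"))
# [('h', 'C', 60), ('a', 'V', 50.0), ('r', 'C', 40), ('i', 'V', 60.0)]
```

Under the second rule the consonants keep their original durations, so the same rate factor shortens the whole word less than under the first rule.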
