Next to the heart muscles, perhaps the most active muscles are the vocal folds in the larynx, which are responsible for voice production. Everyone should be aware of their voice, professional voice users even more so. Professional voice users are persons who use their voice extensively in their profession (teachers, advocates, politicians, shopkeepers, etc.) or whose livelihood itself depends on the use of voice (actors, singers, news readers, anchors, etc.). This awareness ensures that one does not inadvertently abuse or misuse one’s voice, and so avoids long-term complications. Awareness also helps one to realize the full potential of one’s voice and to make it pleasant and endearing to listeners without straining it.

We discuss below some important and basic aspects of voice.

Voice is inferred by listening to a person speak or sing. The voice of a person is independent of what is spoken or sung. Each person’s voice is unique. It is said that ‘beauty’ lies in the ‘eyes of the beholder’. We can say that ‘voice’ lies in the ‘ears of a listener’. Voice is an acoustic or sound image arising in the mind of a listener on hearing a person’s speech or singing. Voice quality ranges from ‘super-voice’ to ‘pathological voice’, with a lot of variation in between that includes the so-called ‘normal voice’, which is highly individualistic.

Some believe that voice reflects a person's personality. A ‘good voice’ reflects the speaker's confidence. It makes a positive impression on the listener. A vibrant voice can make the atmosphere lively. A listener can often tell from a voice whether the speaker is confident, depressed, cheerful and so on.
Sounds are acoustic pressure waves in the atmosphere, that is, systematic disturbances of the air molecules. These disturbances in air pressure are produced by modifying the air from the lungs using three physiological systems. Voice is produced by well-coordinated control of the respiratory system (lungs), the laryngeal system (larynx) and the articulatory system (mouth cavity).
These three systems are primarily used for vital or life-sustaining activities: (i) breathing, (ii) blocking food and liquid from entering the lungs, and (iii) chewing food, respectively. Human beings have adapted them for speaking and singing.

Normal breathing: breathing or respiration consists of two alternate phases: inspiration (filling-in of air into lungs) and expiration (exhaling air out of lungs). During normal breathing, inspiration is active and expiration is passive. The two phases are of approximately equal duration. Typically, we breathe about 16 times a minute.

Breath support for speaking/singing: During speaking or singing, air is filled into the lungs with a (silent) deep and rapid inspiration. This air is slowly released and converted into sound waves by the action of the vocal folds of the larynx. That is, voice is produced during a controlled expiratory phase.

An analogy to a balloon is given to appreciate the sound production mechanism. When air is blown into a balloon and if the neck of the balloon is opened wide, air gushes out. This is comparable to normal expiration. On the other hand, if the neck of the balloon is closed firmly and only a small opening is formed, air in the balloon escapes as a jet slowly, setting the walls near the neck to vibrate and produce a sound. This is comparable to ‘voice’ production.
The filling-in of the air into lungs thereby building up the lung pressure followed by a prolonged controlled expiratory phase with a slow release of air is called 'breath support'.

A good breath support ensures:

Lack of good breath support results in:
Several factors relating to breath support must be ensured so that one can speak or sing for a long period in a loud and clear voice without taking unnecessarily frequent breaths. The two main factors are (i) lung pressure and its control, and (ii) an adequate quantity of air in the lungs for speaking or singing, used without wastage.

In order to speak loudly enough to be heard by another person, the lung pressure has to be high. For the loudness of the voice to be steady, the pressure has to be released very gradually. One has to ensure an adequate supply of air by using abdominal breathing. The quantity of air available for speech production is referred to as vital capacity, which is measured indirectly through a parameter called 'Maximum Phonation Duration (MPD)' or 'Maximum Phonation Time (MPT)'. The MPD is the maximum duration for which a steady vowel can be uttered in a single breath. The S/Z ratio is another parameter used to infer the vital capacity. It is the ratio of the duration of a sustained 's' sound that can be produced in a single breath to the duration of a sustained 'z' sound that can be produced in a single breath.
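The two measures described above reduce to simple arithmetic on stopwatch readings. The following is a minimal sketch in Python; the function names and the sample durations are hypothetical, and taking the longest of several trials as the MPD is an assumption made here for illustration, not something stated in the text.

```python
def maximum_phonation_duration(trials_seconds):
    """Return the MPD, taken here (an assumption) as the longest sustained-vowel trial."""
    if not trials_seconds:
        raise ValueError("at least one trial duration is required")
    return max(trials_seconds)


def sz_ratio(s_seconds, z_seconds):
    """Return the duration of a sustained 's' divided by the duration of a sustained 'z'."""
    if s_seconds <= 0 or z_seconds <= 0:
        raise ValueError("durations must be positive")
    return s_seconds / z_seconds


# Hypothetical stopwatch readings, in seconds.
vowel_trials = [14.2, 16.8, 15.5]
print("MPD (s):", maximum_phonation_duration(vowel_trials))
print("S/Z ratio:", round(sz_ratio(18.0, 15.0), 2))
```

Both durations in the S/Z ratio are measured in a single breath each, so the ratio itself has no unit.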

Abdominal breathing: As one inhales air, the abdominal walls must expand like a balloon. Put your palm on the abdominal wall to feel the expansion. During speaking or singing, the abdominal walls must slowly collapse. This pattern of breathing is generally recommended.

The combined action of the lungs and the larynx produces the ‘voice’. The larynx contains two vocal folds, which can be held apart or brought together by applying a muscular force. The air gap between the vocal folds is called the ‘glottis’. During normal breathing, the glottis is wide open (the vocal folds are far apart). Hence air easily enters the lungs, and during exhalation air gushes out through the glottis as it does from the wide-open neck of an inflated balloon.

When a person wants to speak or sing, air is filled into the lungs with a deep inhalation and the lung pressure is built up. The involuntary and sudden expiratory phase of normal breathing is inhibited. A force (medial tension) is applied and the vocal folds are brought together, closing the glottis. The excess air pressure in the lungs gently pushes the vocal folds open (separating them horizontally), forming a narrow glottal opening. A jet of air escapes through the narrow glottis while the vocal folds are moving apart. As the air escapes through the glottis, the pressure separating the folds vanishes. The force that initially held the vocal folds together (medial tension) now brings them together again to close the glottis, just as a spring door closes after it is pushed open. After some interval, the excess lung pressure once again separates the folds.

This separating and closing of the vocal folds repeats cyclically, generating pulses of interrupted air flow through the glottis, which are called glottal pulses. These glottal pulses supply the energy for producing speech sounds. A signal that is representative of the glottal pulses can be derived from the recorded audio signal of the sound in the atmosphere, and it is called the ‘voice source’.
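The cyclic opening and closing can be pictured as a train of on/off pulses. The sketch below is a deliberately crude illustration in Python, not a model of real glottal flow: it marks each sample as 1 while the glottis is open and 0 while it is closed. The function names, the rectangular pulse shape, the sampling rate and the open fraction of 0.4 are all assumptions chosen for the example.

```python
def glottal_pulse_train(f0_hz, sample_rate_hz, duration_s, open_fraction=0.4):
    """Return 0/1 samples: 1 while the glottis is open, 0 while it is closed.

    Rectangular pulses are an illustrative simplification (assumption).
    """
    if f0_hz <= 0 or sample_rate_hz <= 0:
        raise ValueError("f0_hz and sample_rate_hz must be positive")
    if not 0.0 < open_fraction < 1.0:
        raise ValueError("open_fraction must lie between 0 and 1")
    samples_per_cycle = sample_rate_hz / f0_hz
    total_samples = int(round(duration_s * sample_rate_hz))
    train = []
    for n in range(total_samples):
        # Position within the current open-close cycle, from 0 up to 1.
        position_in_cycle = (n % samples_per_cycle) / samples_per_cycle
        train.append(1 if position_in_cycle < open_fraction else 0)
    return train


def count_pulses(train):
    """Count the closed-to-open transitions, one per glottal pulse."""
    pulses = 0
    previous = 0
    for sample in train:
        if sample == 1 and previous == 0:
            pulses += 1
        previous = sample
    return pulses


# 100 cycles per second for one tenth of a second gives 10 pulses.
train = glottal_pulse_train(f0_hz=100.0, sample_rate_hz=8000, duration_s=0.1)
print(count_pulses(train))
```

An open fraction below 0.5 keeps the closed phase longer than the open phase in every cycle.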

The interrupted pulses of air through the glottis (glottal pulses) travel through the mouth cavity and come out of lips as sound waves to produce audible sound. The shape of the mouth cavity can be altered to produce different speech sounds by changing the positions of the articulators (articulation or pronunciation). The phonetic quality of speech sounds is determined by the shape of the cavities above the glottis.

Efficient use of the laryngeal system requires: no leakage of air, complete closure of the glottis in every cycle, a closed phase that is relatively longer than the open phase, and an abrupt closing action.

An analogy to heart beats: Heart beats have three attributes: the loudness of the beats, the rate of the beats, and the sound quality associated with the beats. You can hear the heart beats as strong or weak (the loudness associated with the beats). If you listen to heart beats using a stethoscope, you hear rhythmic beats, which you count as the number of beats per minute. The number of beats per second is called the frequency. Just as weight is expressed in kilos or pounds and height in inches or centimeters, frequency is expressed in cycles per second or Hertz (abbreviated as Hz). A typical heart rate is 72 beats per minute, that is, 72/60 = 1.2 cycles per second, or 1.2 Hz. The heart beats of different persons sound different. An expert in cardiology can distinguish such differences in the quality of the sound associated with the beats, irrespective of the rate of the beats.
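The conversion between a rate per minute and a frequency in Hz is a division by 60. A minimal sketch in Python (the function names are chosen here for illustration):

```python
def per_minute_to_hz(events_per_minute):
    """Convert a rate in events per minute to cycles per second (Hz)."""
    return events_per_minute / 60.0


def hz_to_per_minute(frequency_hz):
    """Convert a frequency in Hz back to events per minute."""
    return frequency_hz * 60.0


print(per_minute_to_hz(72))  # 1.2
```

The same conversion applies to any rhythmic event, including the vibration of the vocal folds.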

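Similarly, voice has three broad attributes: volume (how loud or soft the voice is), pitch (the rate at which the vocal folds vibrate) and voice quality.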


Distinction between volume and pitch: Very often, people are not aware of these two dimensions of the voice. When a person is asked to increase the volume (speak louder), he/she increases the pitch and vice-versa. An awareness of this distinction as well as ability to independently control these two acoustic parameters (or dimensions) of voice is very important, especially for professional voice users like singers, actors etc. Voice exercises are useful in creating such an awareness and also to exercise an independent control of these two acoustic parameters.
Volume, Level, Intensity and Sound Pressure Level (SPL) in dB are all related terms. A loud voice has a higher volume than a soft voice. When you can hear and understand a speaker from a long distance, the speaker is probably shouting at the top of his/her voice. This is an example of speaking at a very high volume, or a very loud voice. With some speakers, you have to strain to understand what is being spoken even though you are standing very close. This is an example of a very soft voice. You can amplify a soft voice using an amplifier so that you can hear it from a distance. However, the quality of the voice is not the same as that of a loud voice. Scientifically speaking, loudness also depends on the dominant frequency in the sound. Thus, at the same low level, a high frequency sound is heard as louder than a low frequency sound.

If a speaker is speaking too loud, it is a matter of concern as it may lead to abuse of voice. Too soft a voice is also not desirable. It may arise due to a breathy voice quality or an inefficient voice. These have to be corrected by voice training.

Physiologically, the strength (or the volume) of a voice depends primarily on the lung pressure and secondarily on the force of impact of vocal folds when they close at the end of every cycle of vibration. Also, the resonances of the vocal tract must be properly tuned to the harmonic structure of the voice for the sound quality to be ‘bright’.

The term ‘Level’ is used more commonly in connection with a recorded signal. A signal with a high level is usually louder than a signal with a low level. The term ‘Volume’ might have originated in the context of devices like radio or TV, where we turn the volume up or down and thereby change the loudness level.

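Technically, the measurable parameter of a sound in a free field is the intensity (the acoustic energy passing through a unit area per unit time).
Sound pressure level (SPL) expresses this intensity on a logarithmic scale, as 10 times log10(I/I0), where I0 is a reference intensity (conventionally about 10^-12 watts per square meter, near the threshold of hearing). Its unit is the decibel, or dB. Log10(I/I0) gives the level in bels, and ten times that number gives the level in decibels, or simply dB.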

The reason for using the log scale is as follows. Loudness is a perceived attribute of a sound related to the energy in the sound. When the energy of a sound is increased ten times, the loudness is perceived as only about doubled. In other words, the perceived loudness follows roughly a logarithmic rule. Also, the ratio of the energy of a very loud sound to that of a very feeble sound is very large (e.g. 1000000). A logarithmic scale compresses the values. Thus a ratio of 1000000 is simply 6 in log to base 10 (log10), which is 6 bels or 60 dB.
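The compression offered by the log scale can be checked with a few lines of Python. This is a minimal sketch; the function names are chosen here for illustration, and the input is an intensity ratio (the intensity divided by whatever reference is being used).

```python
import math


def level_in_bels(intensity_ratio):
    """Return log10 of an intensity ratio, i.e. the level in bels."""
    if intensity_ratio <= 0:
        raise ValueError("intensity ratio must be positive")
    return math.log10(intensity_ratio)


def level_in_db(intensity_ratio):
    """Return ten times log10 of an intensity ratio, i.e. the level in decibels."""
    return 10.0 * level_in_bels(intensity_ratio)


# A tenfold increase in energy is a 10 dB step.
print(round(level_in_db(10), 3))
# A ratio of one million is 6 bels, or 60 dB.
print(round(level_in_bels(1_000_000), 3), round(level_in_db(1_000_000), 3))
```

Each further factor of ten in energy adds another 10 dB, which keeps the numbers manageable across the whole range from feeble to very loud sounds.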

Recorded signal level depends on many different factors: (i) Sensitivity of microphone (ii) Distance between the talker and the microphone (iii) Gain settings in an amplifier (iv) Software gain settings in a computer etc. In order that the SPL reading be comparable across different persons and different recording conditions, SPL measurement must be calibrated.

Calibrated SPL can be measured using a special instrument called an SPL meter. Since the sound level decreases with distance (inverse square law), it is standard practice to measure SPL at a distance of 1 meter. SPL measured at 10 cm is about 20 dB higher than SPL measured at 1 meter.
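The 20 dB figure follows from the inverse square law: intensity falls with the square of the distance, so the level changes by 20 times log10 of the distance ratio. The sketch below assumes an idealized point source in a free field with no reflections; the function name and the sample reading of 80 dB are hypothetical.

```python
import math


def spl_at_distance(spl_db, measured_at_m, target_m):
    """Estimate the SPL at target_m from a reading taken at measured_at_m.

    Assumes free-field, point-source spreading (inverse square law).
    """
    if measured_at_m <= 0 or target_m <= 0:
        raise ValueError("distances must be positive")
    return spl_db + 20.0 * math.log10(measured_at_m / target_m)


# A hypothetical reading of 80 dB at 10 cm corresponds to about 60 dB at 1 meter.
print(round(spl_at_distance(80.0, measured_at_m=0.10, target_m=1.0), 1))
```

In an ordinary room, reflections from the walls mean the real drop with distance is usually smaller than this estimate.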

Loudness, a perceived attribute, in addition to signal level, is also related to the frequency content of a sound. For the same signal level, a high frequency sound is heard as louder than a low frequency sound. For example, a shrill cry of an infant is heard as louder compared to a bass voice of an adult male speaker, though the signal level of the cry and male voice may be the same. If the frequency content of a sound is also considered then the loudness is expressed in a unit called ‘phon’ or ‘sone’. However, intensity in dB is more commonly used.
Pitch is technically referred to as fundamental frequency. Fundamental frequency is denoted by the symbol F0 (read as ‘F-naught’ or ‘F-zero’ and not ‘F-Oh!’).
The number of cycles of vibration (opening-closing) of vocal folds per second (not per minute) is the fundamental frequency associated with voice.
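As a small illustration of this definition, F0 can be obtained either by counting cycles over a known time span or from the duration of a single cycle. The Python sketch below uses hypothetical function names and made-up numbers.

```python
def f0_from_cycle_count(cycles, duration_s):
    """Return F0 in Hz: the number of vibration cycles divided by the time in seconds."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return cycles / duration_s


def f0_from_period(period_s):
    """Return F0 in Hz as the reciprocal of the duration of one cycle."""
    if period_s <= 0:
        raise ValueError("period must be positive")
    return 1.0 / period_s


# 220 open-close cycles in 2 seconds correspond to 110 Hz.
print(f0_from_cycle_count(220, 2.0))
# One cycle lasting 5 milliseconds corresponds to about 200 Hz.
print(round(f0_from_period(0.005), 1))
```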

F0-Level: F0 is not a constant value for a given speaker. F0 value can be altered voluntarily. During speaking, F0 is varied over an utterance, giving rise to intonation. During singing, a singer voluntarily changes F0 to produce different musical notes. When expressing emotions or when a speaker is under stress, F0-level changes.

When a person says a vowel at a comfortable level, the average value of F0 corresponds to the 'F0-level'. Usually, an adult male voice has a lower value of F0 level (90-130 Hz) compared to that of an adult female voice (180-240 Hz). A child’s voice has a very high F0-level (250-400 Hz). F0-level changes with age. During puberty, F0 drops compared to F0 used during childhood. This drop is much greater for a male compared to that of a female. Also, during old age, F0 usually drops.

Habitual F0 is the mean or the average of values of F0 over different voiced segments when a speaker reads a passage in an emotionally neutral state and at a comfortable level. This is the F0-level generally used in every day conversation. When we make comparative statements about F0 of different speakers we are implicitly comparing the habitual F0 values.
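The averaging described above can be sketched as follows. The input format is an assumption made for this example: a list of per-frame F0 estimates in Hz in which unvoiced frames are marked with 0 or None. The function name and the sample values are hypothetical.

```python
from statistics import mean


def habitual_f0(f0_track_hz):
    """Return the mean F0 over voiced frames only.

    Frames marked 0 or None are treated as unvoiced (an assumed convention).
    """
    voiced = [f for f in f0_track_hz if f is not None and f > 0]
    if not voiced:
        raise ValueError("no voiced frames in the F0 track")
    return mean(voiced)


# Hypothetical F0 track from a passage read at a comfortable level.
track = [0, 118.0, 122.0, None, 120.0, 0]
print(habitual_f0(track))
```

Leaving out the unvoiced frames matters, because counting them as zeros would pull the average far below the speaker's actual F0-level.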

Puberphonia and Androphonia: When an adult male retains his pre-pubertal F0-level and so sounds as though he has a female voice, it is called puberphonia. On the other hand, when the voice of a female speaker sounds like that of an adult male, it is called androphonia. Such cases come under functional voice disorders, unless they are due to an anatomical deformity. Use of an inappropriate F0-level can be avoided by voice training.

Optimum pitch: Some voice specialists believe that there is an optimum pitch or optimum F0 for every speaker. By keeping the F0-level at the optimum pitch, a speaker can speak or sing for a longer duration without straining the voice, and the voice quality will appear to be ‘bright’. Empirically, the optimum pitch or F0 is determined as follows: First, find the lowest F0 that the speaker can produce clearly. Find the key on a piano keyboard that corresponds to this lowest F0. The optimum F0 will correspond to 2 or 3 notes higher.
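The empirical procedure can be sketched in Python under two loudly stated assumptions that are not in the text: the piano is tuned in equal temperament with key 49 (A4) at 440 Hz, and each "note" is read as one adjacent key (a semitone). If scale notes are meant instead, the step sizes would differ. All names and the sample lowest F0 are hypothetical.

```python
import math

A4_HZ = 440.0  # assumed tuning reference
A4_KEY = 49    # A4 on a standard 88-key piano


def nearest_piano_key(f0_hz):
    """Return the key number whose equal-tempered frequency is closest to f0_hz."""
    return round(A4_KEY + 12 * math.log2(f0_hz / A4_HZ))


def key_frequency(key):
    """Return the equal-tempered frequency in Hz of a piano key number."""
    return A4_HZ * 2 ** ((key - A4_KEY) / 12)


def optimum_f0_candidates(lowest_f0_hz, steps=(2, 3)):
    """Return the frequencies 2 and 3 keys above the key nearest the lowest F0.

    Treating one 'note' as one key (a semitone) is an assumption.
    """
    lowest_key = nearest_piano_key(lowest_f0_hz)
    return [key_frequency(lowest_key + step) for step in steps]


# A hypothetical lowest clear F0 of 110 Hz sits on key 25 (A2).
for candidate in optimum_f0_candidates(110.0):
    print(round(candidate, 2))
```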

Vagmi Voice Trainer modules can be used to measure the habitual pitch, the optimum pitch and also to correct puberphonia and androphonia.

The term ‘timbre’ (a French word, usually pronounced 'tam-ber') is also used to signify ‘voice quality’. It is important to recall the difference between the terms ‘voice source’ and ‘voice quality’. We infer voice quality by listening to the speech or singing produced by a speaker. We cannot directly listen to the interrupted air flow coming out of the larynx. Hence, voice quality is determined by the combined effect of the voice source and pronunciation habits.

Just as the saying goes ‘beauty lies in the eyes of the beholder’, a listener’s choice of what is a good voice is subjective. This is similar to the taste one has for music. Hence there are no measurable units for quality of voice. Although there is no single measure that quantifies voice quality, it is generally understood that voice quality is determined by the spectral characteristics, their dynamics, and the inter-relation of the three attributes. See general awareness on 'Voice Culture' for more details.
During speaking, speakers continually change their frequency or pitch over the vocalic regions of speech. When you say ‘How are you?’ you are changing your fundamental frequency throughout the utterance. This change of fundamental frequency over an utterance is called intonation. If the change in pitch or fundamental frequency over an utterance is highly restricted, the voice sounds sleepy or monotonous. For a lively voice, both the level of the voice and the pitch have to vary over a wide range while rendering an utterance.

The average frequency over an utterance is the spoken fundamental frequency, which depends on the utterance, emphasis, emotion etc.
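One way to put numbers on intonation is to report both the average F0 of an utterance and how widely F0 moves within it. The sketch below is an illustration only: expressing the range in semitones (12 times log2 of the highest F0 over the lowest) is a convention chosen here rather than taken from the text, and the function names and F0 values are hypothetical.

```python
import math
from statistics import mean


def spoken_f0(voiced_f0_hz):
    """Return the average F0 in Hz over the voiced part of an utterance."""
    if not voiced_f0_hz:
        raise ValueError("at least one voiced F0 value is required")
    return mean(voiced_f0_hz)


def pitch_range_semitones(voiced_f0_hz):
    """Return the span between the lowest and highest F0, in semitones."""
    if not voiced_f0_hz or min(voiced_f0_hz) <= 0:
        raise ValueError("F0 values must be positive and non-empty")
    return 12 * math.log2(max(voiced_f0_hz) / min(voiced_f0_hz))


# Hypothetical F0 values (Hz) sampled across a lively utterance.
lively = [100.0, 150.0, 200.0]
print(spoken_f0(lively), round(pitch_range_semitones(lively), 1))

# A nearly flat contour gives a range close to zero, which sounds monotonous.
flat = [120.0, 121.0, 119.0]
print(round(pitch_range_semitones(flat), 2))
```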