Silence suppression in speech synthesis systems is achieved by detecting and processing only segments of voice activity. A segment is classified as "speech" if the energy of the signal is greater than an adaptively adjusted threshold. The adaptively adjusted threshold is preferably defined as the maximum of scaled values of two separate envelope parameters, which both track the variation in energy over the sequence of frames of speech data. One contour is a slow-rising fast-falling value, which is updated only during unvoiced speech frames, and therefore track a lower envelope of the energy contour. This parameter in effect tracks an ambiant noise level. The other parameter is a fast-rising slow-falling parameter, which is updated only during voiced speech frames, and thus tracks an upper envelope of the energy contour. (This in effect tracks the average speech level.) A nonsilent energy tracker and a silent energy tracker adjust corresponding energy values representing the energy contours.