A system and method are disclosed for enhancing interactive communication between video conferencing devices of the type in which a delay is inserted into the audio transmission path to synchronize the image and speech of the respective users (lip synchronization). Each video conferencing device includes a display device for displaying images of at least one communicating party and a speech communicating system for communicating with that party. In one embodiment of the invention, a speech detecting circuit detects an utterance by a first user of a first video conferencing apparatus, and an audible or visual indication is provided to at least a second user of a second video conferencing apparatus before the utterance is reproduced. As a result, the potential for simultaneous speaking by two or more users is substantially reduced. In an alternate embodiment, the amount of delay introduced into the audio signal transmission path is adjusted in accordance with the mode of operation of the video conferencing devices. An audio signal processing system detects, over predetermined intervals, whether an interactive conversation between two or more users is in progress. If no interactive conversation is detected, the devices are treated as operating in a lecture mode, and lip synchronization proceeds in a conventional manner by introducing a predetermined delay into the audio path. If an interactive conversation is detected, the inserted audio delay is minimized until operation returns to the lecture mode.
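
The delay selection of the alternate embodiment can be illustrated with a minimal sketch. The abstract does not say how an interactive conversation is detected or what delay values are used, so everything concrete here is an assumption made for illustration: the names `IntervalActivity`, `is_interactive` and `select_audio_delay`, the 200 ms lip-sync delay, the 0 ms minimum, and the heuristic that a conversation counts as interactive when both parties have spoken within the last few detection intervals.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical values; the abstract gives no figures.
LIP_SYNC_DELAY_MS = 200  # predetermined delay that aligns audio with video
MIN_DELAY_MS = 0         # minimized delay used during interactive conversation


@dataclass(frozen=True)
class IntervalActivity:
    """Speech activity observed at each end during one detection interval."""
    local_spoke: bool
    remote_spoke: bool


def is_interactive(history: List[IntervalActivity], window: int = 3) -> bool:
    """Assumed heuristic: both parties spoke within the last `window` intervals."""
    recent = history[-window:]
    return any(a.local_spoke for a in recent) and any(a.remote_spoke for a in recent)


def select_audio_delay(history: List[IntervalActivity], window: int = 3) -> int:
    """Return the audio delay (ms) to insert for the next interval.

    Lecture mode (no interactive conversation): full lip-sync delay.
    Interactive mode: minimized delay.
    """
    if is_interactive(history, window):
        return MIN_DELAY_MS
    return LIP_SYNC_DELAY_MS
```

With this sketch, a history in which only one party speaks keeps the full lip-sync delay. Once the other party speaks within the window, the delay drops to the minimum, and it returns to the lip-sync value after the window again contains speech from only one side.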