Multimedia conferencing systems are provided which include a plurality of audio/video terminals coupled together via a telecommunications network, the network including switches and an audio bridge. The audio/video terminals are provided with interface modules which receive local audio and video signals, process the signals, and provide separate streams of properly formatted audio data and video data to the network. The video data is switched (i.e., routed) through the network, which is preferably an ATM network, to its desired destination, while the audio data is first routed to the audio bridge for mixing and then to the desired destination. At the destination, the separate audio and video streams are processed and synchronized by the destination's interface module and provided to the audio/video terminal. Several synchronization methods for the audio and video data streams are disclosed. In a simple method, a fixed delay (e.g., 65 milliseconds) is added to the audio stream. In other methods, time stamps are used to determine either the video coding delay alone or both the video and audio coding delays, and the audio stream is delayed by the video coding delay or by the difference (delta) between the video and audio coding delays, respectively. In a preferred aspect of the invention, the audio and video data streams are mapped into ATM cells, and the ATM network is used to switch or multicast the video data stream.
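The synchronization methods summarized above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the names `CodingTimestamps` and `audio_delay_ms`, the choice to take time stamps at the input and output of each coder, and the clamping of negative results to zero are all assumptions made for illustration. Only the 65 millisecond example value and the three cases (fixed delay, video coding delay, video/audio delta coding delay) come from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Example fixed delay mentioned in the text (milliseconds).
FIXED_AUDIO_DELAY_MS = 65.0


@dataclass
class CodingTimestamps:
    """Hypothetical pair of time stamps taken around a coder, in milliseconds."""
    capture_ms: float   # assumed: when the raw signal entered the coder
    encoded_ms: float   # assumed: when the coded data left the coder

    def coding_delay_ms(self) -> float:
        return self.encoded_ms - self.capture_ms


def audio_delay_ms(video: Optional[CodingTimestamps] = None,
                   audio: Optional[CodingTimestamps] = None) -> float:
    """Return how long to hold back the audio stream.

    - no time stamps: use the fixed delay
    - video time stamps only: use the video coding delay
    - video and audio time stamps: use the delta (video minus audio)
    Negative results are clamped to zero (an assumption of this sketch).
    """
    if video is None:
        return FIXED_AUDIO_DELAY_MS
    video_delay = video.coding_delay_ms()
    if audio is None:
        return max(0.0, video_delay)
    return max(0.0, video_delay - audio.coding_delay_ms())


if __name__ == "__main__":
    print(audio_delay_ms())
    print(audio_delay_ms(CodingTimestamps(0.0, 80.0)))
    print(audio_delay_ms(CodingTimestamps(0.0, 80.0), CodingTimestamps(0.0, 15.0)))
```

In this sketch the delta case reduces to the video-only case when the audio coding delay is negligible, which is one way to see why a single fixed delay can serve as a rough approximation when coding delays are stable.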