A method and system are provided for time-aligning speech. Speech data representing speech signals from a speaker is input. An orthographic transcription, including a plurality of words transcribed from the speech signals, is also input. In response to the orthographic transcription, a sentence model is generated indicating a selected order of the words, and word models are generated, each associated with a respective one of the words. The orthographic transcription is then aligned with the speech data in response to the sentence model, the word models and the speech data.
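
The flow described above can be illustrated with a minimal sketch. It is not the claimed implementation: the abstract does not specify the form of the models, so this example assumes a toy setting in which each frame of speech data is a single number, each word model is a single prototype value taken from a hypothetical `lexicon`, and the sentence model is simply the ordered list of words. All function names (`build_sentence_model`, `build_word_models`, `align`) are illustrative. The alignment uses a standard dynamic-programming (Viterbi-style) search that assigns every frame to one word while preserving word order.

```python
from typing import Dict, List, Tuple


def build_sentence_model(transcription: str) -> List[str]:
    """Sentence model (assumed form): the words in their transcribed order."""
    return transcription.lower().split()


def build_word_models(words: List[str], lexicon: Dict[str, float]) -> Dict[str, float]:
    """Word models (assumed form): one prototype feature value per word."""
    return {w: lexicon[w] for w in set(words)}


def align(frames: List[float], sentence: List[str],
          word_models: Dict[str, float]) -> List[Tuple[str, int, int]]:
    """Return (word, first_frame, last_frame) for each word, in order."""
    n, m = len(frames), len(sentence)
    if m == 0 or n < m:
        raise ValueError("need at least one frame per word")
    inf = float("inf")

    def dist(t: int, j: int) -> float:
        # Squared distance between frame t and the model of word j.
        return (frames[t] - word_models[sentence[j]]) ** 2

    # cost[t][j]: best total cost with frame t assigned to word j.
    cost = [[inf] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]
    cost[0][0] = dist(0, 0)
    for t in range(1, n):
        for j in range(min(t + 1, m)):
            stay = cost[t - 1][j]                       # remain in word j
            move = cost[t - 1][j - 1] if j > 0 else inf  # advance from word j-1
            if move < stay:
                cost[t][j], back[t][j] = move + dist(t, j), j - 1
            else:
                cost[t][j], back[t][j] = stay + dist(t, j), j

    # Trace back from the last frame, which must belong to the last word.
    path = [m - 1]
    j = m - 1
    for t in range(n - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    path.reverse()

    # Collapse the per-frame path into one time segment per word.
    segments = []
    start = 0
    for t in range(1, n + 1):
        if t == n or path[t] != path[t - 1]:
            segments.append((sentence[path[start]], start, t - 1))
            start = t
    return segments


if __name__ == "__main__":
    lexicon = {"the": 0.0, "cat": 1.0, "sat": 2.0}   # hypothetical prototypes
    speech = [0.1, 0.0, 1.0, 0.9, 2.1]               # toy one-value frames
    sentence = build_sentence_model("The cat sat")
    models = build_word_models(sentence, lexicon)
    print(align(speech, sentence, models))
```

In a practical system the scalar prototypes would typically be replaced by statistical acoustic models and the frames by feature vectors, but the structure stays the same: the sentence model constrains word order, the word models score frames, and the search yields start and end times for each transcribed word.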