Systems and processes are disclosed for controlling television user interactions using a virtual assistant. A virtual assistant can interact with a television set-top box to control content shown on a television. Speech input for the virtual assistant can be received from a device with a microphone. User intent can be determined from the speech input, and the virtual assistant can execute tasks according to the user's intent, including causing playback of media on the television. Virtual assistant interactions can be shown on the television in interfaces that expand or contract to occupy a minimal amount of space while conveying desired information. Multiple devices associated with multiple displays can be used to determine user intent from speech input as well as to convey information to users. In some examples, virtual assistant query suggestions can be provided to the user based on media content shown on a display.
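By way of illustration only, the flow summarized above (speech input, determination of user intent, and execution of a task such as media playback on the television) can be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the speech input is assumed to have already been transcribed to text, and the names `Intent`, `determine_intent`, `SetTopBox`, and `execute` are hypothetical, as is the simple keyword matching used in place of real intent determination.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Intent:
    """Hypothetical container for a determined user intent."""
    action: str                  # e.g. "play_media", "pause_media", "unknown"
    target: Optional[str] = None # e.g. a media title, when one was named


def determine_intent(transcript: str) -> Intent:
    """Toy keyword matcher standing in for real intent determination.

    Assumes the speech input was already converted to text elsewhere.
    """
    cleaned = transcript.strip()
    lowered = cleaned.lower()
    if lowered.startswith("play "):
        # Everything after the leading "play " is treated as the media title.
        return Intent("play_media", cleaned[5:])
    if lowered in ("pause", "stop"):
        return Intent("pause_media")
    return Intent("unknown")


class SetTopBox:
    """Hypothetical stand-in for a television set-top box."""

    def __init__(self) -> None:
        self.now_playing: Optional[str] = None
        self.paused = False

    def play(self, title: str) -> None:
        self.now_playing = title
        self.paused = False

    def pause(self) -> None:
        self.paused = True


def execute(intent: Intent, box: SetTopBox) -> str:
    """Carry out the task for an intent and return a short status message
    suitable for a compact on-screen interface."""
    if intent.action == "play_media" and intent.target:
        box.play(intent.target)
        return f"Playing {intent.target}"
    if intent.action == "pause_media":
        box.pause()
        return "Paused"
    return "Sorry, I did not understand that."
```

A caller would pass each transcribed utterance through `determine_intent` and hand the result to `execute`, showing the returned message in whatever interface is currently displayed on the television.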