A behavior recognition system and method by combining an image and a speech are provided. The system includes a data analyzing module, a database, and a calculating module. A plurality of image-and-speech relation modules is stored in the database. Each image-and-speech relation module includes a feature extraction parameter and an image-and-speech relation parameter. The data analyzing module obtains a gesture image and a speech data corresponding to each other, and substitutes the gesture image and the speech data into each feature extraction parameter to generate image feature sequences and speech feature sequences. The data analyzing module uses each image-and-speech relation parameter to calculate image-and-speech status parameters. The calculating module uses the image-and-speech status parameters, the image feature sequences, and the speech feature sequences to calculate a recognition probability corresponding to each image-and-speech relation parameter, so as to take a maximum value among the recognition probabilities as a target parameter.