Embodiments of the present invention provide a method and a system for mapping a scene depicted in an acquired stream of video frames, which may be used by a machine-learning behavior-recognition system. A background image of the scene is segmented into a plurality of regions representing various objects of the background image. Statistically similar regions may be merged and associated. The regions are analyzed to determine their z-depth order relative to other regions and to the video capturing device that provides the stream of video frames, using occlusions between the regions and data about foreground objects in the scene. An annotated map describing the identified regions and their properties is created and updated.
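
By way of illustration only, the z-depth ordering step may be sketched as follows. This is a minimal sketch, not the claimed implementation. It assumes that occlusion evidence has already been reduced to pairs of the form (front, behind), and it swaps in a plain topological sort to derive a nearest-first ordering. The function names `z_order` and `annotate`, the region labels, and the `z_rank` field are hypothetical and do not appear in the embodiments.

```python
from collections import defaultdict, deque


def z_order(regions, occlusions):
    """Return region ids ordered nearest-to-camera first.

    Assumption: `occlusions` is an iterable of (front, behind) pairs,
    each meaning that region `front` occludes region `behind`.
    """
    behind_of = defaultdict(set)
    indegree = {r: 0 for r in regions}
    for front, behind in occlusions:
        # Count each distinct occlusion relation only once.
        if behind not in behind_of[front]:
            behind_of[front].add(behind)
            indegree[behind] += 1

    # Kahn's algorithm: start from regions that nothing occludes.
    queue = deque(sorted(r for r in regions if indegree[r] == 0))
    order = []
    while queue:
        region = queue.popleft()
        order.append(region)
        for b in sorted(behind_of[region]):
            indegree[b] -= 1
            if indegree[b] == 0:
                queue.append(b)

    # Regions left unordered imply contradictory (cyclic) evidence.
    if len(order) != len(indegree):
        raise ValueError("cyclic occlusion evidence")
    return order


def annotate(regions, occlusions):
    """Build a hypothetical annotated map: region id -> properties."""
    return {r: {"z_rank": i} for i, r in enumerate(z_order(regions, occlusions))}


if __name__ == "__main__":
    regions = ["sky", "building", "tree"]
    occlusions = [("tree", "building"), ("building", "sky")]
    print(annotate(regions, occlusions))
```

In this sketch a rank of 0 denotes the region nearest the video capturing device. Occlusion evidence that is only partial yields one of several consistent orderings, and contradictory evidence raises an error rather than being resolved.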