Visual Attention in Active Vision Systems: Attending, Classifying and Manipulating Objects

This is a dissertation from Stockholm: KTH Royal Institute of Technology

Abstract: This thesis presents a computational model for the combination of bottom-up and top-down attentional mechanisms, and demonstrates its use in a variety of machine and robotic vision applications. An attentional mechanism is imperative in any active vision system, machine as well as biological: it not only reduces the amount of information that needs further processing (say, for recognition or action), but, by processing only the attended image regions, also makes such tasks more robust to large amounts of clutter and noise in the visual field.

Using various feature channels such as color, orientation, texture, depth and symmetry as input, the presented model uses a pre-trained artificial neural network to modulate a saliency map for a particular top-down goal, e.g. visual search for a target object. More specifically, it dynamically combines the unmodulated bottom-up saliency with the modulated top-down saliency by means of a biologically and psychophysically motivated temporal differential equation. In this way the system is, for instance, able to detect important bottom-up cues even while in visual search mode (top-down) for a particular object. All the computational steps yielding the final attentional map, which ranks image regions according to their importance to the system, are shown to be biologically plausible.

It has also been demonstrated that the presented attentional model facilitates tasks other than visual search. For instance, using the covert attentional peaks that the model returns, scene understanding and segmentation can be improved through clustering or scattering of the 2D/3D components of the scene, depending on the configuration of these attentional peaks and their relations to other attributes of the scene. More specifically, this is performed by means of entropy optimization of the scene under varying cluster configurations, i.e. different groupings of the various components of the scene.

Qualitative experiments demonstrated the use of this attentional model on a humanoid robotic platform, controlling the overt attention of the robot in real time by specifying the saccadic movements of the robot head. These experiments also exposed another highly important aspect of the model: its temporal variability, as opposed to many other attentional (saliency) models that deal exclusively with static images. Here the dynamic aspects of the attentional mechanism allowed for a temporally varying trade-off between top-down and bottom-up influences, depending on changes in the robot's environment.

The thesis has also laid out systematic, quantitative, large-scale experiments on the actual benefits and uses of this kind of attentional model. To this end, a simulated 2D environment was implemented in which the system could not "see" the entire environment and had to perform overt shifts of attention (simulated saccades) in order to carry out a visual search task for a predefined target object. This allowed simple and rapid substitution of the system's core attentional model with comparable computational models designed by other researchers. Nine such contending models were tested and compared with the presented model in a quantitative manner. Under certain assumptions, these experiments showed that the attentional model presented in this work outperforms the other models in simple visual search tasks.
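The dynamic combination of unmodulated bottom-up and modulated top-down saliency via a temporal differential equation can be illustrated with a minimal sketch. The leaky-integrator form below, the mixing weight `k` and the time constant `tau` are illustrative assumptions for exposition, not the thesis's actual equation or parameters:

```python
import numpy as np

def combine_saliency(s_bu, s_td, a_prev, k=0.6, tau=0.2, dt=0.05):
    """One Euler step of a hypothetical leaky-integrator combination
    of a bottom-up saliency map (s_bu) and a top-down-modulated map
    (s_td). All maps are 2D arrays of equal shape; k weights the
    top-down influence and tau is the integration time constant."""
    target = (1.0 - k) * s_bu + k * s_td
    return a_prev + (dt / tau) * (target - a_prev)

# Toy example: a bottom-up pop-out at one location competes with a
# top-down bias (e.g. a search target) at another.
s_bu = np.zeros((4, 4)); s_bu[0, 0] = 1.0
s_td = np.zeros((4, 4)); s_td[3, 3] = 1.0
a = np.zeros((4, 4))
for _ in range(100):
    a = combine_saliency(s_bu, s_td, a)
peak = np.unravel_index(np.argmax(a), a.shape)  # top-down wins for k > 0.5
```

Because the map is integrated over time rather than recomputed from scratch, a sudden bottom-up transient (a spike in `s_bu`) can momentarily dominate the attentional map even during top-down search, which is the behaviour the abstract describes.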
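The simulated overt visual search used for the quantitative model comparison can also be sketched in simplified form. The greedy winner-take-all policy with inhibition of return below is a generic stand-in for whatever attentional model drives the saccades; it is not the thesis's simulator or any of the nine evaluated models:

```python
import numpy as np

def visual_search(saliency, target_pos, max_saccades=20):
    """Greedy overt-attention loop: repeatedly saccade to the most
    salient not-yet-visited location (inhibition of return) until the
    fixation lands on the target. Returns the number of saccades used,
    or None if the budget is exhausted."""
    s = saliency.copy()
    for n in range(1, max_saccades + 1):
        fix = np.unravel_index(np.argmax(s), s.shape)
        if fix == target_pos:
            return n
        s[fix] = -np.inf  # inhibition of return: never refixate
    return None

# Toy 2D "environment": the target sits at the third-highest peak,
# so a greedy policy needs exactly three saccades to find it.
saliency = np.zeros((5, 5))
saliency[0, 0], saliency[1, 1], saliency[2, 2] = 3.0, 2.0, 1.0
n_saccades = visual_search(saliency, (2, 2))  # -> 3
```

Counting saccades to target under a fixed budget, as here, is one natural performance measure for comparing competing attentional models in such a simulated environment.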