Task, Timing, and Representation in Visual Object Recognition

Albert L. Rothenstein (York University, Canada)
DOI: 10.4018/978-1-4666-2539-6.ch003

Abstract

Most biologically-inspired models of object recognition rely on a feed-forward architecture in which abstract representations are gradually built from simple representations, but recognition performance in such systems drops when multiple objects are present in the input. This chapter puts forward the proposal that by using multiple passes of the visual processing hierarchy, both bottom-up and top-down, it is possible to address the limitations of feed-forward architectures and explain the different recognition behaviors that primate vision exhibits. The model relies on the reentrant connections that are ubiquitous in the primate brain to recover spatial information, and thus allow for the selective processing of stimuli. The chapter ends with a discussion of the implications of this work, its explanatory power, and a number of predictions for future experimental work.

Introduction

The study of visual perception abounds with surprising results, and perhaps none has generated more controversy than the speed of object recognition. Some complex objects can be recognized remarkably quickly, even while attention is engaged by a different task, whereas some simple objects require lengthy attentional scrutiny, and performance on them breaks down in dual-task experiments (Koch & Tsuchiya, 2007). These results are fundamental to our understanding of the visual cortex, as they show clearly that not all stimuli are represented in the same way in the brain, and that not all visual recognition tasks are alike.

Most, if not all, biologically-inspired models of object recognition rely on a feed-forward architecture in which abstract representations are gradually built from simpler ones (e.g., Hummel & Stankiewicz, 1996; Riesenhuber & Poggio, 1999b; Hummel, 2001; Serre et al., 2007), but recognition performance in feed-forward systems drops when multiple objects are present in the input. This limitation has been demonstrated mathematically (Tsotsos, 1987, 1988; Rensink, 1989; Tsotsos, 1990; Grimson, 1990) by showing that pure data-directed approaches to vision (and, in fact, to perception in any sensory modality) are computationally intractable. Computational modeling studies (Walther et al., 2002; Kreiman et al., 2007) support this conclusion, as do behavioral studies (e.g., Duncan, 1984; Behrmann et al., 1998; VanRullen et al., 2004). Moreover, extensive experimental evidence shows that neural responses to an object in the inferior temporal (IT) visual area are typically reduced when additional objects are presented in the neuron's receptive field (e.g., Sato, 1989; Miller et al., 1993; Rolls & Tovee, 1995; Missal et al., 1999; Zoccolan et al., 2005; Meyers et al., 2010).
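The multi-object problem described above can be illustrated with a minimal sketch (not any published model): when a feed-forward stage pools feature responses over all positions in a large receptive field, the spatial origin of each feature is discarded, so the pooled representation of two objects matches neither object's signature. The feature vectors and the pooling rule here are hypothetical, chosen only to make the effect visible.

```python
# Illustrative sketch: spatial pooling in a purely feed-forward
# hierarchy blends the features of multiple objects in one
# receptive field, mirroring the reduced selectivity reported
# for IT neurons with multiple stimuli.

def max_pool(responses):
    """Pool feature responses over all spatial positions (MAX rule)."""
    return [max(vals) for vals in zip(*responses)]

# Hypothetical responses of four feature-tuned units to each of
# two objects placed at different positions in the same field.
object_a = [0.9, 0.1, 0.2, 0.0]
object_b = [0.1, 0.8, 0.0, 0.3]

pooled_single = max_pool([object_a])            # one object present
pooled_both   = max_pool([object_a, object_b])  # both objects present

# With one object, the pooled vector equals that object's signature;
# with two, it is a mixture that matches neither template.
print(pooled_single)  # [0.9, 0.1, 0.2, 0.0]
print(pooled_both)    # [0.9, 0.8, 0.2, 0.3]
```

The point of the sketch is only that the blend is irreversible from the pooled vector alone, which is why the chapter argues that spatial information must be recovered by other means.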

Initial work on the HMAX (also called the “standard”) model of object recognition (Riesenhuber & Poggio, 1999a) claimed that attention and binding were unnecessary; in fact, its feed-forward, max-like pooling mechanism precludes any effective top-down traversal, since decisions about the relevance of elements within a neuron’s receptive field are made early. A later extension showed that such a model can recognize objects in limited clutter, but that performance decreases rapidly as clutter increases. This led others to analyze the multi-stimulus performance of the HMAX model (Walther et al., 2002; Walther, 2006), showing that, despite the initial optimism, it suffers from the same limitations as other purely feed-forward object recognition models. To address this problem, saliency-based spatial visual attention (Koch & Ullman, 1985; Itti et al., 1998) was added to the HMAX system, with some limited feature sharing implemented to match the size of the selected area to that of the selected object (Walther et al., 2002).
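The remedy described above, i.e., using a bottom-up saliency map to restrict feed-forward processing to one spatial region at a time, can be sketched as follows. This is a toy illustration, not the implementation of Walther et al.: the saliency values, feature vectors, and function names are all hypothetical.

```python
# Illustrative sketch: saliency-based spatial selection restricts
# pooling to one attended region, so the recognizer again sees a
# single object's features rather than a blend.

def max_pool(responses):
    """Pool feature responses over the given positions (MAX rule)."""
    return [max(vals) for vals in zip(*responses)]

def most_salient(saliency):
    """Index of the spatial position with the highest saliency value."""
    return max(range(len(saliency)), key=lambda i: saliency[i])

positions = [[0.9, 0.1, 0.2, 0.0],   # feature vector of object A
             [0.1, 0.8, 0.0, 0.3]]   # feature vector of object B
saliency  = [0.7, 0.4]               # hypothetical bottom-up saliency

attended = most_salient(saliency)
pooled = max_pool([positions[attended]])

# Only the attended object's features survive pooling.
print(pooled)  # [0.9, 0.1, 0.2, 0.0]
```

Serializing recognition in this way trades the parallelism of the pure feed-forward pass for selectivity, which is exactly the trade-off the chapter's multi-pass proposal is meant to organize.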
