This chapter briefly describes APIs that could be used today for prototyping or implementing the proposed architecture and methods.
Introduction
Functional active vision systems are not new; they have been developed within the past decade. But they are based on a combination of traditional computational methods and rely heavily on GPS. Such systems still cannot guarantee adequate safety, either for humans or for themselves.
Autonomous vehicles and robots need much smarter perceptual and cognitive systems that will allow them to remain one hundred percent safe for humans and at the same time successfully accomplish more complex missions.
A distributed software environment that works with network-symbolic methods and models can be considered a new flavor of multi-agent system. We call its practical software realization an “Image/Video Understanding Engine”, by analogy with the “Speech Recognition Engines” that serve as components of speech recognition software.
The Engine needs to work with Model and Symbolic spaces, and this requires special APIs. The Model spaces API should be able to work with graphs; the Symbolic spaces API should be able to recognize patterns, create new implicit symbols, and provide hierarchical compression of the obtained results. Relevant Spaces should be loaded from the underlying repository, linked, and processed, and meaningful outcomes should be stored back to the repository. The system should also provide mechanisms similar to attention and the activation of relevant Spaces, as well as the needed transformations via a cycle of search, synthesis, and analysis.
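The load, link, process, and store cycle described above can be illustrated with a minimal sketch. It is not an existing API: the names `ModelSpace`, `SymbolicSpace`, `Repository`, and `compress`, the triple-based graph representation, and the sample data are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class ModelSpace:
    """Hypothetical model space: a graph stored as (source, relation, target) triples."""
    name: str
    edges: set = field(default_factory=set)

    def add(self, source, relation, target):
        self.edges.add((source, relation, target))

    def link(self, other):
        """Link another space by merging its relations into this one."""
        self.edges |= other.edges


class SymbolicSpace:
    """Hypothetical symbolic space: names recognized patterns with new symbols."""

    def __init__(self):
        self.symbols = {}  # symbol name -> frozenset of triples it stands for

    def compress(self, space, pattern, symbol):
        """Collapse a recognized sub-graph into one implicit symbol node."""
        pattern = frozenset(pattern)
        if not pattern <= space.edges:
            return False  # pattern is not present in this space
        members = {node for s, _, t in pattern for node in (s, t)}
        self.symbols[symbol] = pattern
        # Rewire the remaining relations so they point at the new symbol.
        space.edges = {
            (symbol if s in members else s, r, symbol if t in members else t)
            for s, r, t in space.edges - pattern
        }
        return True


class Repository:
    """Hypothetical in-memory repository of spaces (assumed, not a real store)."""

    def __init__(self):
        self._store = {}

    def save(self, space):
        self._store[space.name] = set(space.edges)

    def load(self, name):
        return ModelSpace(name, set(self._store.get(name, set())))


# Usage: store a space, load it back, compress a pattern, store the outcome.
repo = Repository()
road = ModelSpace("road")
road.add("wheel", "part_of", "car")
road.add("door", "part_of", "car")
road.add("car", "on", "lane")
repo.save(road)

scene = repo.load("road")
symbols = SymbolicSpace()
car_pattern = {("wheel", "part_of", "car"), ("door", "part_of", "car")}
symbols.compress(scene, car_pattern, "CAR")
repo.save(scene)
```

In this sketch the compression step keeps the detailed pattern in the symbolic space and leaves only the higher-level relation in the model space, which is one simple way to read "hierarchical compression"; a practical engine would need real pattern matching rather than exact subset tests.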
Besides that, the platform should have some low-level processing services. These comprise the visual and object buffers described above, and also a buffer that performs an active fusion of different visual features in the manner of the 2.5D Sketch. The system also needs to plan and predict its behavior based on the generated situation awareness, which would be equivalent to the motor programs in the human cortex.
All these ideas are presented in Figure 1. The engine should be installed on an active vision platform, which is mandatory for autonomous systems, where Action, Perception, and Cognition should work in a coherent manner. Figure 2 depicts the processes in such a system. To describe those processes, we will use terminology borrowed from human vision, although the technological implementation might be completely different.
Figure 1.
Image/video understanding engine
The model of the visual scene is obtained via saccades and fixations. Visual information appears in the system as a retinal flow. The retinal flow is transformed into features or symbols, which are combined into a primary relational network-symbolic structure, the 2.5D Sketch. This allows derivation of other network-symbolic structures such as objects, switching attention to different sub-structures, processing them one at a time, and building a more abstract hierarchical model of the visual scene, which is responsible for image understanding and situation awareness of the observer.
The schema can be represented as a scene graph. In real-world situations, the top-level abstract network-symbolic model of the scene disambiguates lower-level features and objects, which change in the flow. Feedback projections help create unambiguous network-symbolic structures of the visual scene and map the obtained “symbols” (understanding) back to the primary image structure for navigation and action tasks. All processes in the system are interdependent. Incremental changes in the visual scene drive motion in the environment. Motion creates changes and disambiguates visual information.
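A rough sketch of how a scene graph might support this top-down disambiguation is given below. Everything in it is an assumption for illustration: the `SceneNode` structure, the `CONTEXT_PRIOR` table, the scores, and the rule of multiplying a bottom-up score by a context prior are stand-ins, not the method of any particular engine.

```python
from dataclasses import dataclass, field


@dataclass
class SceneNode:
    """Hypothetical scene-graph node: an image region with competing labels."""
    region: tuple            # (x, y, width, height) in the primary image
    hypotheses: dict         # label -> bottom-up score (made-up values)
    children: list = field(default_factory=list)
    label: str = ""          # filled in by the feedback pass


# Assumed priors: how plausible each label is under a top-level scene model.
CONTEXT_PRIOR = {
    "street": {"car": 0.9, "boat": 0.1, "pedestrian": 0.8},
    "harbor": {"car": 0.3, "boat": 0.9, "pedestrian": 0.5},
}


def disambiguate(node, context):
    """Feedback pass: weight bottom-up scores by the scene-level context."""
    prior = CONTEXT_PRIOR[context]
    node.label = max(
        node.hypotheses,
        key=lambda name: node.hypotheses[name] * prior.get(name, 0.5),
    )
    for child in node.children:
        disambiguate(child, context)


def symbols_to_regions(node):
    """Map resolved symbols back to image regions for navigation and action."""
    found = [(node.label, node.region)]
    for child in node.children:
        found.extend(symbols_to_regions(child))
    return found


# Usage: the same ambiguous blob resolves differently under different contexts.
figure = SceneNode((10, 50, 20, 60), {"pedestrian": 0.6, "boat": 0.4})
blob = SceneNode((40, 60, 120, 80), {"car": 0.5, "boat": 0.55}, [figure])

disambiguate(blob, "street")
street_view = symbols_to_regions(blob)

disambiguate(blob, "harbor")
harbor_view = symbols_to_regions(blob)
```

Here the blob is labeled "car" in the street context and "boat" in the harbor context even though its bottom-up scores never change, which is the effect the feedback projections are meant to achieve.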
When there is a need for a new software system, the first step is to look for commercially available software that may either serve as a prototype or provide some APIs and/or components. This may help to avoid reinventing the wheel.
Figure 2.
A system of active vision with network-symbolic models