Perception Hierarchies
Computer Vision as a Sensor in a Hierarchical Control System
A key requirement for hierarchical intelligent systems is the ability to abstract multi-sensory perceptions into labels for the planning layers of the hierarchy, and to elaborate high-level instructions into detailed actions for all the agents. Consider the example of driving based on visual sensing. At the lowest level, the feedback loops of lateral and longitudinal control are supported by simple measurements from the video stream: lane position, orientation, and curvature for lateral control, and a stereo disparity map for longitudinal control. At a higher level, the data need to be grouped into individual vehicles with estimates of their velocities. At an even higher level, one would like to infer behaviors (such as "is that vehicle disabled?").
We will develop a framework for abstraction of visual processing in navigation tasks. At the lowest level, we have spatiotemporal measurements of image brightness values. The next level is the measurement of local properties by various neighborhood operators ("receptive fields" in biological terminology) that process color, texture, disparity, optical flow, etc. At the next stage, grouping based on spatiotemporal coherence is used to infer surfaces. Finally, we arrive at the level of distinct objects and groups of objects with associated behaviors. What makes this hierarchical grouping process hard is that each stage involves an estimation process, and the measurements or decisions made are correct only with some probability less than 1. We believe that the key to decision making lies in invoking additional models at each of the later stages of the processing. These models are more specific and more global (in terms of neighborhoods in the image), and hence they have greater statistical efficiency in reducing false positives and misses.
We are developing a system based on vision as a sensor technology for vehicle control. The novel feature of this project, compared to most previous approaches, is the extensive use of binocular stereopsis. First, it provides information for obstacle detection, grouping, and range estimation which is directly used for longitudinal control. Second, the obstacle--ground separation enables robust localization of partially occluded lane boundaries as well as the dynamic update of camera rig parameters to deal with vibrations and vertical road curvature. See the following figure for a sample result from [38].
![]()
Objects identified as being cars in the traffic in front of the test vehicle.
On the right side of the image is a bird's-eye view from above
the road surface showing the relative position of the tracked objects
with respect to the test vehicle.
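The use of stereo disparity for range estimation in longitudinal control can be illustrated with the standard pinhole relation Z = f B / d for a rectified stereo pair. The following is a minimal sketch under stated assumptions, not the implementation reported in [38]: the focal length, baseline, and the `time_gap` helper are hypothetical and chosen only for illustration.

```python
def range_from_disparity(disparity_px, focal_px, baseline_m):
    """Range Z = f * B / d for a rectified stereo pair (pinhole camera model)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite range")
    return focal_px * baseline_m / disparity_px


def time_gap(range_m, own_speed_mps):
    """Hypothetical headway measure: seconds needed to cover the current gap."""
    if own_speed_mps <= 0:
        return float("inf")
    return range_m / own_speed_mps


# Illustrative values only (assumptions, not parameters of the actual rig).
FOCAL_PX = 800.0     # focal length in pixels
BASELINE_M = 0.5     # distance between the two cameras in meters

lead_range = range_from_disparity(16.0, FOCAL_PX, BASELINE_M)  # 800 * 0.5 / 16 = 25.0 m
gap_seconds = time_gap(lead_range, own_speed_mps=20.0)         # 25.0 / 20.0 = 1.25 s
```

Because range is inversely proportional to disparity, a fixed disparity error produces a range error that grows with distance, which is one reason the grouping and tracking stages matter for distant vehicles.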
Early Vision: Algorithms and Cellular Neural Network Implementations
Early vision can be characterized as the process of starting from a video signal I(x,y,t) (or possibly two such signals for a binocular system) and estimating various scene properties such as depth, surface orientation, and curvature, together with the locations of discontinuities in these. Several different cues in the video signal make this estimation possible, with texture, stereoscopic disparity, and optic flow being the most important.
Research in Malik's group at Berkeley over the last several years has led to significant advances in each of these three areas [39,40,41]. The main innovation in the work was motivated by the observation that while most work in computer vision has been based on a first stage of edge detection, a key characteristic of natural scenes is the presence of texture on the various surfaces. This makes the traditional approach of first extracting edges or line segments rather ineffective: while many of the edges corresponding to object boundaries may be lost, most of the edge segments found correspond to inner texture microedges. This led us to an alternative framework that was initially motivated by the known physiology of visual processing in the retina and the V1 area of the cerebral cortex. The image is first convolved with a set of linear filters tuned to different orientations and scales, and the output of this analysis is used directly for subsequent computation. Though motivated by biology, this approach can be justified purely on analytic and computational considerations. Postponing the signal-to-symbol transformation is in keeping with the principle of least commitment, whose utility for reducing search has been well recognized in AI. The vector of filter outputs provides a rich local descriptor of the image, which makes matching problems much easier. This approach tries to extract maximum mileage from simple, local, and parallel computations, making VLSI implementations feasible.
To provide a solid justification, we have developed specific models for the various early vision modalities (stereopsis, texture, and motion) and demonstrated, both experimentally and theoretically, the advantages that accrue from this alternative view.
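The filtering stage described above, in which the image is passed through linear filters tuned to different orientations and scales, can be sketched as follows. This is an illustrative example only: the even-symmetric Gabor kernel, the coupling of wavelength to scale, the kernel size, and all function names are assumptions made here, not the filters used in [39,40,41]. Pure Python is used to keep the sketch dependency-free.

```python
import math


def gabor_kernel(size, sigma, wavelength, theta):
    """Even-symmetric Gabor kernel (size x size, size odd), shifted to zero mean."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Coordinate along the direction in which the filter oscillates.
            u = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * u / wavelength))
        kernel.append(row)
    mean = sum(map(sum, kernel)) / (size * size)
    return [[v - mean for v in row] for row in kernel]


def filter_valid(image, kernel):
    """Correlate image with kernel wherever the kernel fits entirely.

    These kernels are point-symmetric, so correlation equals convolution.
    """
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j] for i in range(k) for j in range(k))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def filter_bank(image, thetas, scales, size=9):
    """Return {(theta, sigma): response map}; wavelength = 2 * sigma is an assumption."""
    return {
        (theta, sigma): filter_valid(image, gabor_kernel(size, sigma, 2.0 * sigma, theta))
        for theta in thetas
        for sigma in scales
    }


def mean_abs(response):
    """Average magnitude of a response map."""
    values = [abs(v) for row in response for v in row]
    return sum(values) / len(values)


# Synthetic texture: brightness varies sinusoidally along x (vertical stripes).
image = [[math.cos(2.0 * math.pi * x / 4.0) for x in range(16)] for _ in range(16)]
bank = filter_bank(image, thetas=[0.0, math.pi / 2], scales=[2.0])
along_x = mean_abs(bank[(0.0, 2.0)])          # filter oscillating along x
along_y = mean_abs(bank[(math.pi / 2, 2.0)])  # filter oscillating along y
```

On this striped test image the filter tuned to variation along x responds far more strongly than the one tuned to variation along y. Stacking the responses at a pixel across all orientations and scales gives the kind of local descriptor vector referred to above.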
At this stage, we believe that the early vision modules are sufficiently well understood, and we will concentrate on developing VLSI implementations. We will draw on the expertise of Chua's group to translate low- to mid-level perception and data abstraction algorithms into low-power, extremely high-speed, and portable hybrid analog-digital architectures called Cellular Neural Networks (CNN). We propose to develop and implement VLSI CNN chips with tens of thousands of processor cells per chip, which can be integrated with decision-making hardware at various stages in the overall system.
The Cellular Neural Network is a novel parallel computer architecture capable of unprecedented speed and power in a compact low-power package. Early estimates indicate that a VLSI CNN chip can perform image processing operations many orders of magnitude faster than conventional digital techniques. This high speed allows the higher-level systems to change the parameters of the CNN system many times during the acquisition of a scene in order to find the optimal parameters. To develop this technology into practical systems, we propose to focus our research along the following three parallel lines: a theoretical foundation for the dynamics of CNNs; analog-logic algorithms for the CNN to perform low- to mid-level visual processing; and hardware implementations of the CNN chipset.
In the first project, we will build up the theoretical foundation for the dynamics of CNNs. The CNN template defines the interconnection weights among the cells and determines the functionality of the CNN. We propose to develop a template calculus that allows us to deduce the functionality of the CNN from the template elements. Preliminary research suggests that techniques from switching theory, mathematical morphology, and nonlinear signal processing are useful for this project.
The second project will deal with the algorithmic aspects of CNNs. We intend to develop and use software tools and techniques to design algorithms for the CNN for specific applications. The results from the first project will be used heavily to design the various CNN algorithms needed for data abstraction, fusion, and compression. Techniques for automated algorithm design, such as evolutionary programming and nonlinear programming, will also be explored.
The third project addresses hardware implementation issues of CNN VLSI chips. The goal of this project is to build VLSI chips utilizing the CNN Universal Machine (CNNUM) architecture. The CNN Universal Machine consists of the basic analog CNN array core, augmented with local analog and logic memory and a programming unit, resulting in a truly algorithmic analog-digital processor that combines the speed of analog arrays with the flexibility of digital processors. Tasks to be accomplished in this project include:
- Implementation of CNNUM arrays in VLSI technology. Initial research shows that single-chip arrays with thousands of cells are possible with current technologies. We plan to design and implement next-generation CNNUM chips with tens of thousands of cells.
- Implementation of a supporting chipset. This includes analog RAM for fast cache without the need for analog-to-digital conversion.
- Implementation of sensor arrays. As the CNN can input and output data in parallel, the sensor array should be tailored to the capabilities of the CNN chip. Other input modalities such as IR will also be investigated.
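To make the role of the template concrete, the following sketch simulates a small CNN in software. It assumes the standard Chua-Yang cell model, dx/dt = -x + sum(A * y) + sum(B * u) + z with the piecewise-linear output y = 0.5 (|x + 1| - |x - 1|), which is not spelled out in the text above. The forward-Euler integration, the fixed white boundary, and the particular edge-extraction template values are illustrative assumptions, and the code says nothing about how a CNNUM chip would be implemented or programmed.

```python
def saturate(x):
    """Piecewise-linear cell output: y = 0.5 * (|x + 1| - |x - 1|)."""
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))


def neighborhood_sum(template, grid, r, c, boundary):
    """Template-weighted 3x3 sum around (r, c); cells outside the grid take `boundary`."""
    total = 0.0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            inside = 0 <= rr < len(grid) and 0 <= cc < len(grid[0])
            value = grid[rr][cc] if inside else boundary
            total += template[dr + 1][dc + 1] * value
    return total


def run_cnn(u, A, B, z, dt=0.1, steps=100, boundary=-1.0):
    """Integrate the cell equations with forward Euler from state x = 0."""
    rows, cols = len(u), len(u[0])
    x = [[0.0] * cols for _ in range(rows)]
    # The feedforward (control) term depends only on the input, so compute it once.
    drive = [
        [neighborhood_sum(B, u, r, c, boundary) + z for c in range(cols)]
        for r in range(rows)
    ]
    for _ in range(steps):
        y = [[saturate(v) for v in row] for row in x]
        x = [
            [
                x[r][c] + dt * (-x[r][c] + neighborhood_sum(A, y, r, c, boundary) + drive[r][c])
                for c in range(cols)
            ]
            for r in range(rows)
        ]
    return [[saturate(v) for v in row] for row in x]


# Illustrative edge-extraction template (assumed values): black = +1, white = -1.
A = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]              # feedback template
B = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]      # control template
Z = -1.0                                           # bias

# Input: a 3x3 black square centered in a 5x5 white image.
u = [[1.0 if 1 <= r <= 3 and 1 <= c <= 3 else -1.0 for c in range(5)] for r in range(5)]
y = run_cnn(u, A, B, Z)
edges = [[1 if v > 0 else 0 for v in row] for row in y]
```

With this uncoupled template (only the self-feedback entry of A is nonzero), each cell settles to the sign of its constant drive, so the function can be read directly from the template: a cell ends up black only if its input is black and at least one of its eight neighbors is white. In the example this leaves the ring of eight boundary pixels black and turns the center pixel white, a small instance of deducing functionality from template elements.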
Computer Vision for Automated Surveillance
An automated surveillance system consists of a series of video cameras, positioned to observe dynamic but structured environments such as harbors, airports, freeways, or battlefields, and linked to a computer system that can detect unusual behavior, unauthorized traffic, or surprising and unexpected changes and alert a human operator. Automated surveillance systems have a wide range of applications in both military and civilian life. In our previous work [42], we developed a prototype real-time system for monitoring traffic scenes using video information. The objective is to estimate traffic parameters such as flow rates, speeds, vehicle itineraries, and link travel times, as well as to quickly detect disruptive incidents such as stalled vehicles and accidents. Such a system is illustrated in the following figure.
![]()
Our system concept for video traffic surveillance. Processing occurs in four stages. First,
optical-flow and image-differencing methods detect and group moving blobs corresponding to vehicles.
Then, vehicles are tracked to refine and update positions, velocities, and shape parameters. Next,
the system reasons from the track data in order to infer local traffic flow parameters. These
parameters, together with vehicle track information, are communicated to the TMC at regular
intervals. Finally, at the TMC, local traffic parameters from each site are collated, and global
information such as link travel times and vehicle itineraries is computed from the track data.
The results are then used in controlling signals, message displays, and other traffic control devices.
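The first stage in the caption, detecting and grouping moving blobs, can be sketched with image differencing alone. This is a minimal illustration, not the system of [42]: it omits optical flow, differences against a static reference frame, and uses a hypothetical threshold, a hypothetical `min_pixels` noise filter, and 4-connected grouping chosen for simplicity.

```python
from collections import deque


def moving_mask(reference, frame, threshold):
    """Mark pixels whose brightness differs from the reference by more than threshold."""
    return [
        [abs(b - a) > threshold for a, b in zip(ref_row, row)]
        for ref_row, row in zip(reference, frame)
    ]


def group_blobs(mask, min_pixels=1):
    """Group 4-connected marked pixels; return boxes as (top, left, bottom, right)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r0 in range(rows):
        for c0 in range(cols):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first flood fill collects one connected blob.
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            pixels = []
            while queue:
                r, c = queue.popleft()
                pixels.append((r, c))
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if 0 <= rr < rows and 0 <= cc < cols and mask[rr][cc] and not seen[rr][cc]:
                        seen[rr][cc] = True
                        queue.append((rr, cc))
            if len(pixels) >= min_pixels:  # discard blobs too small to be vehicles
                rs = [p[0] for p in pixels]
                cs = [p[1] for p in pixels]
                boxes.append((min(rs), min(cs), max(rs), max(cs)))
    return boxes


# Synthetic 6x8 scene: uniform background, two bright "vehicles", one noisy pixel.
reference = [[10] * 8 for _ in range(6)]
frame = [row[:] for row in reference]
for r, c in [(1, 1), (1, 2), (2, 1), (2, 2), (3, 5), (3, 6), (4, 5), (4, 6)]:
    frame[r][c] = 200
frame[0][7] = 120  # isolated noise

mask = moving_mask(reference, frame, threshold=50)
boxes = group_blobs(mask, min_pixels=2)  # [(1, 1, 2, 2), (3, 5, 4, 6)]
```

The bounding boxes produced here are the kind of per-frame detections that the tracking stage would then associate over time to refine positions, velocities, and shape parameters.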
This work, which illustrates the ideas behind our approach in more restricted settings than those we envisage, demonstrates the tractability of the problem and the effectiveness of our approach. The system would benefit from bringing additional constraints to bear; for example, geometric models of cars can be used to separate out shadows. Furthermore, we believe that a system can be designed and implemented that would make combined use of visual cues: low-level cues such as motion, texture, and color, as well as high-level cues derived from class, object, and activity models. The aim of our research is to construct such a system and to apply it to problems in automated surveillance.
At present, applications of computer vision technology are usually limited to areas such as offline military intelligence, where analysts can interact intensively with systems that operate on very carefully structured problems. Our proposed research will remove a number of these restrictions. In particular, by allowing recognition systems to use a wider range of geometrical cues, by incorporating cues such as color and motion, and by expanding the representations used, we will place the applications listed above within reach.