The last ten years have witnessed an astonishing revolution in AI, with deep neural networks suddenly approaching human-level performance on problems like recognizing objects in an image and words in an audio recording. But impressive as these feats are, they fall far short of human-like intelligence. The critical gap between current AI and human intelligence is that, beyond merely classifying patterns of input, humans build mental models of the world. This project begins with the problem of physical scene understanding: how one extracts not just the identities and locations of objects in the visual world, but also their physical properties, their velocities, their relationships to each other, the forces acting upon them, and the effects of forces that could be exerted on them. It is hypothesized that humans represent this information in a structured mental model of the physical world, and use that model to predict what will happen next, much as the physics engine in a video game generates physically plausible future states of virtual worlds.

To test this idea, computational models of physical scene understanding will be built and tested for their ability to predict future states of the physical world in a variety of scenarios. The performance of these models will then be compared to that of humans and of more traditional deep network models, both in terms of accuracy on each task and in terms of patterns of errors. Models that incorporate structured representations of the physical world will then be tested against standard convolutional neural networks in their ability to explain neural responses in the human brain (measured with fMRI) and in the monkey brain (measured with direct neural recording). These computational models will provide the first explicit theories of how physical scene understanding might work in the human brain, while at the same time advancing the ability of AI systems to solve the same problems.
Because the ability to understand and predict the physical world is essential for planning any action, this work is expected to advance many technologies that require such planning, from robotics to self-driving cars to brain-machine interfaces. The participating labs will also expand their established track records of recruiting, training, and mentoring women and members of under-represented minorities at the undergraduate, graduate, and postdoctoral levels. Finally, the collaborating laboratories will continue and increase their involvement in disseminating science to the general public via public talks, websites, and outreach activities.
Deep neural networks have revolutionized object recognition by computers, as well as the understanding of object recognition in the primate brain, but object recognition is just one aspect of vision, and the ventral stream is just one of many brain systems. Studying physical scene understanding is a step toward scaling this reverse-engineering approach up to the rest of the mind and brain. Predicting what will happen next and planning effective action require understanding the physical properties of, and physical relationships among, objects in the visual world; yet it is unknown how humans accomplish this, or how machines could. This project addresses both challenges by building image-computable, neurally mappable computational models of physical scene understanding and prediction (Thread I), and by using these models as explicit hypotheses for how the brain might accomplish these tasks, to be tested against behavioral and neural data from humans (Thread II) and non-human primates (Thread III). The project aims to make a transformative leap in understanding: from small-scale, special-case models and isolated experimental tests to an integrated, large-scale, general-purpose model of a major swathe of the primate brain that functionally explains much of the immediate content of our perceptual experience in every scene that confronts us. The work will advance theory by developing the first image-computable models capable of human-level physical scene understanding and prediction. Beyond understanding of the mind and brain, this research is directly relevant to AI and robotics (which require physical scene understanding) and to brain-machine interfaces (which require understanding of the relevant neural codes).
For the broader research community, the project will (a) develop public datasets, benchmark tasks, and challenges; (b) host adversarial collaborations to address those challenges; and (c) host interdisciplinary workshops linking research communities from psychology to AI to neuroscience to address the fundamental questions that span these fields.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.