Five papers by CSE researchers presented at ICML 2023

The papers authored by CSE researchers appearing at the conference cover a breadth of topics related to machine learning.

Five papers by CSE researchers have been accepted for presentation at the 2023 International Conference on Machine Learning (ICML), taking place July 23-29 in Honolulu, Hawaii. ICML is one of the world’s most prominent and fastest-growing conferences on artificial intelligence and machine learning, bringing together top experts in these areas to share their latest findings and innovations.

The research being presented at ICML 2023 spans a range of topics related to machine learning. In their papers, CSE researchers explore the phenomenon of neural collapse, propose a new system for hyperbolic image-text representation, improve explanations for out-of-distribution detectors, and more.

The papers appearing at the conference are as follows, with the names of CSE researchers in bold:

Are Neurons Actually Collapsed? On the Fine-Grained Structure in Neural Representations

Yongyi Yang, Jacob Steinhardt, Wei Hu

Abstract: Recent work has observed an intriguing “Neural Collapse” phenomenon in well-trained neural networks, where the last-layer representations of training samples with the same label collapse into each other. This appears to suggest that the last-layer representations are completely determined by the labels, and do not depend on the intrinsic structure of the input distribution. We provide evidence that this is not a complete description, and that the apparent collapse hides important fine-grained structure in the representations. Specifically, even when representations apparently collapse, the small amount of remaining variation can still faithfully and accurately capture the intrinsic structure of the input distribution. As an example, if we train on CIFAR-10 using only 5 coarse-grained labels (by combining two classes into one super-class) until convergence, we can reconstruct the original 10-class labels from the learned representations via unsupervised clustering. The reconstructed labels achieve 93% accuracy on the CIFAR-10 test set, nearly matching the normal CIFAR-10 accuracy for the same architecture. We also provide an initial theoretical result showing the fine-grained representation structure in a simplified synthetic setting. Our results show concretely how the structure of input data can play a significant role in determining the fine-grained structure of neural representations, going beyond what Neural Collapse predicts.
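
The reconstruction step described in the abstract can be pictured with a short sketch: given last-layer features from a model trained on the 5 coarse labels, cluster them into 10 groups and match the clusters to the original fine-grained labels. This is a minimal illustration of the idea, not the authors' code; the feature matrix and fine-label array are assumed inputs.

```python
# Illustrative sketch (not the authors' code): recover fine-grained CIFAR-10
# labels by clustering last-layer features of a model trained on 5 coarse
# super-classes. Assumes `features` (N x D) and `fine_labels` (N,) are given.
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def recover_fine_labels(features, fine_labels, n_fine=10, seed=0):
    fine_labels = np.asarray(fine_labels)

    # Cluster the (apparently collapsed) representations into 10 groups.
    clusters = KMeans(n_clusters=n_fine, random_state=seed, n_init=10).fit_predict(features)

    # Match clusters to the true fine labels with the Hungarian algorithm,
    # then report the best-case clustering accuracy.
    agreement = np.zeros((n_fine, n_fine), dtype=int)
    for c, y in zip(clusters, fine_labels):
        agreement[c, y] += 1
    row, col = linear_sum_assignment(-agreement)  # maximize agreement
    mapping = dict(zip(row, col))
    recovered = np.array([mapping[c] for c in clusters])
    accuracy = (recovered == fine_labels).mean()
    return recovered, accuracy
```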

Hyperbolic Image-Text Representations

Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, Ramakrishna Vedantam

Abstract: Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept “dog” entails all images that contain dogs. Despite being intuitive, current large-scale vision and language models such as CLIP do not explicitly capture such a hierarchy. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic spaces have suitable geometric properties to embed tree-like data, so MERU can better capture the underlying hierarchy in image-text datasets. Our results show that MERU learns a highly interpretable and structured representation space while being competitive with CLIP’s performance on standard multi-modal tasks like image classification and image-text retrieval.

Three images: the first of a labrador in the snow, the second of a cat and dog playing, and the third a closeup of a cat. The images are labeled hierarchically to show which labels attach to them and in what order. For instance, only the second photo is labeled “a cat and a dog playing,” but all three images are labeled “so cute.”
Images and text depict concepts and can be jointly viewed in a visual-semantic hierarchy, wherein the text ‘exhausted doggo’ is more generic than an image (which might have more details like a cat or snow). Our method MERU embeds images and text in a hyperbolic space that is well-suited to embed tree-like data.
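As a rough illustration of the geometry involved, the sketch below (an assumption for exposition, not MERU's actual implementation) lifts image and text feature vectors onto the Lorentz model of hyperbolic space and scores image-text pairs by their hyperbolic distance.

```python
# Minimal sketch (an assumption, not MERU's implementation): embed image and
# text feature vectors on the Lorentz model of hyperbolic space and compare
# pairs by hyperbolic distance; closer pairs are treated as better matches.
import torch

def exp_map_origin(v, curv=1.0):
    # Exponential map at the origin: lift a Euclidean vector v onto the
    # hyperboloid x_time^2 - |x_space|^2 = 1 / curv.
    sqrt_c = curv ** 0.5
    v_norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    x_space = torch.sinh(sqrt_c * v_norm) * v / (sqrt_c * v_norm)
    x_time = torch.cosh(sqrt_c * v_norm) / sqrt_c
    return x_time, x_space

def hyperbolic_distance(u, v, curv=1.0):
    # Distance from the Lorentzian inner product of the two lifted points.
    u_t, u_s = exp_map_origin(u, curv)
    v_t, v_s = exp_map_origin(v, curv)
    lorentz_inner = -u_t * v_t + (u_s * v_s).sum(dim=-1, keepdim=True)
    return torch.acosh((-curv * lorentz_inner).clamp_min(1.0)) / curv ** 0.5
```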

Text-To-4D Dynamic Scene Generation

Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, Yaniv Taigman

Abstract: We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions. Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency by querying a Text-to-Video (T2V) diffusion-based model. The dynamic video output generated from the provided text can be viewed from any camera location and angle, and can be composited into any 3D environment. MAV3D does not require any 3D or 4D data and the T2V model is trained only on Text-Image pairs and unlabeled videos. We demonstrate the effectiveness of our approach using comprehensive quantitative and qualitative experiments and show an improvement over previously established internal baselines. To the best of our knowledge, our method is the first to generate 3D dynamic scenes given a text description.
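
The 4D field at the core of the method can be pictured as a function that takes a 3D point and a time value and returns density and color. The sketch below shows only this query interface with a generic positional-encoded MLP; the actual MAV3D representation and the diffusion-guided optimization are not reproduced here.

```python
# Simplified sketch (an assumption, not the MAV3D architecture): a dynamic
# radiance field queried at a 3D point and a time, returning density and RGB.
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=6):
    # Map coordinates to sin/cos features at multiple frequencies.
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device)
    angles = x[..., None] * freqs               # (..., dims, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(start_dim=-2)            # (..., dims * 2 * num_freqs)

class Dynamic4DField(nn.Module):
    def __init__(self, num_freqs=6, hidden=128):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 4 * 2 * num_freqs               # encoded (x, y, z, t)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                 # density + RGB
        )

    def forward(self, xyz, t):
        # xyz: (N, 3) sample points, t: (N, 1) times in [0, 1].
        query = torch.cat([xyz, t], dim=-1)
        out = self.mlp(positional_encoding(query, self.num_freqs))
        density = torch.relu(out[..., :1])        # non-negative volume density
        color = torch.sigmoid(out[..., 1:])       # RGB in [0, 1]
        return density, color
```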

Go Beyond Imagination: Maximizing Episodic Reachability with World Models

Yao Fu, Run Peng, Honglak Lee

Abstract: Efficient exploration is a challenging topic in reinforcement learning, especially for sparse reward tasks. To deal with the reward sparsity, people commonly apply intrinsic rewards to motivate agents to explore the state space efficiently. In this paper, we introduce a new intrinsic reward design called GoBI – Go Beyond Imagination, which combines the traditional lifelong novelty motivation with an episodic intrinsic reward that is designed to maximize the stepwise reachability expansion. More specifically, we apply learned world models to generate predicted future states with random actions. States with more unique predictions that are not in episodic memory are assigned high intrinsic rewards. Our method greatly outperforms previous state-of-the-art methods on 12 of the most challenging Minigrid navigation tasks and improves the sample efficiency on locomotion tasks from DeepMind Control Suite.

Two sets of four renderings each, the first showing a set of 2D navigation tasks and the second showing a set of 3D control tasks.
Rendering of the environments used in this work. Left: 2D grid world navigation tasks that require object interactions. Right: DeepMind Control tasks with visual observations.
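
The episodic bonus described in the abstract can be sketched as follows: imagine one-step futures under random actions with the learned world model, and reward states whose imagined futures add something new to the episodic memory. The world model interface, state hashing, and reward scaling below are placeholders for illustration, not the paper's exact formulation.

```python
# Hedged sketch of the episodic reward idea: `world_model.predict` and
# `action_space.sample` are assumed, gym-like placeholders.
import numpy as np

def gobi_episodic_reward(state, world_model, action_space, episodic_memory,
                         num_samples=16,
                         state_hash=lambda s: tuple(np.round(s, 2))):
    # Imagine one-step futures with the learned world model and random actions.
    predicted = [world_model.predict(state, action_space.sample())
                 for _ in range(num_samples)]

    # Count imagined states that are not yet in the episodic memory.
    new_states = {state_hash(s) for s in predicted} - episodic_memory

    # Reward the agent for expanding its predicted reachable set,
    # then grow the memory so the same bonus is not paid twice.
    episodic_memory |= new_states
    return len(new_states) / num_samples
```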

Concept-based Explanations for Out-Of-Distribution Detectors

Jihye Choi, Jayaram Raghuram, Ryan Feng, Jiefeng Chen, Somesh Jha, Atul Prakash

Abstract: Out-of-distribution (OOD) detection plays a crucial role in ensuring the safe deployment of deep neural network (DNN) classifiers. While a myriad of methods have focused on improving the performance of OOD detectors, a critical gap remains in interpreting their decisions. We help bridge this gap by providing explanations for OOD detectors based on learned high-level concepts. We first propose two new metrics for assessing the effectiveness of a particular set of concepts for explaining OOD detectors: 1) detection completeness, which quantifies the sufficiency of concepts for explaining an OOD detector’s decisions, and 2) concept separability, which captures the distributional separation between in-distribution and OOD data in the concept space. Based on these metrics, we propose an unsupervised framework for learning a set of concepts that satisfy the desired properties of high detection completeness and concept separability, and demonstrate its effectiveness in providing concept-based explanations for diverse off-the-shelf OOD detectors. We also show how to identify prominent concepts contributing to the detection results, and provide further reasoning about their decisions.
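
To give a feel for what concept separability measures, the sketch below computes a simple Fisher-style proxy for how far apart in-distribution and OOD inputs sit in a learned concept space. The paper defines its own metrics; this stand-in is only illustrative, and the per-sample concept scores are assumed inputs.

```python
# Hedged sketch: a Fisher-style proxy for "concept separability", i.e., how
# well in-distribution and OOD inputs separate in the concept space. Not the
# paper's metric; it only illustrates the kind of quantity being measured.
import numpy as np

def concept_separability_proxy(concepts_id, concepts_ood):
    # concepts_id: (N_id, K) and concepts_ood: (N_ood, K) concept scores.
    mean_gap = concepts_id.mean(axis=0) - concepts_ood.mean(axis=0)
    pooled_var = concepts_id.var(axis=0) + concepts_ood.var(axis=0) + 1e-8
    # Per-concept separation, averaged over the K concepts.
    return float(np.mean(mean_gap ** 2 / pooled_var))
```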