publications
* denotes equal contribution and joint lead authorship.
2024
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models.
In Proceedings of the Forty-First IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024. Oral presentation.
Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space, recognizing instruments, and also retrieving prior knowledge. Recent work shows promise by decomposing such tasks using a large language model (LLM) into an executable program that invokes specialized vision models. However, generated programs are error-prone: they omit necessary steps, include spurious ones, and are unable to recover when the specialized models give incorrect outputs. Moreover, they require loading multiple models, incurring high latency and computation costs. We propose Visual Program Distillation (VPD), an instruction tuning framework that produces a vision-language model (VLM) capable of solving complex visual tasks with a single forward pass. VPD distills the reasoning ability of LLMs by using them to sample multiple candidate programs, which are then executed and verified to identify a correct one. It translates each correct program into a language description of the reasoning steps, which are then distilled into a VLM. Extensive experiments show that VPD improves the VLM’s ability to count, understand spatial relations, and reason compositionally. Our VPD-trained PaLI-X outperforms all prior VLMs, achieving state-of-the-art performance across complex vision tasks, including MMBench, OK-VQA, A-OKVQA, TallyQA, POPE, and Hateful Memes. An evaluation with human annotators also confirms that VPD improves model response factuality and consistency. Finally, experiments on content moderation demonstrate that VPD is also helpful for adaptation to real-world applications with limited data.
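A minimal sketch of the program-sampling-and-verification loop that produces the distillation data; the helpers generate_programs, execute_program, and program_to_rationale are hypothetical stand-ins for the LLM program generator, the vision-tool executor, and the rationale writer, not the paper's actual API:

```python
# Minimal sketch of the VPD data-generation loop; helper functions are
# hypothetical placeholders, not the paper's actual API.

from dataclasses import dataclass

@dataclass
class DistillationExample:
    image: str          # path or identifier of the image
    question: str
    rationale: str      # natural-language description of the reasoning steps
    answer: str

def generate_programs(question, n_samples=5):
    """Hypothetical: ask an LLM for n candidate vision programs."""
    raise NotImplementedError

def execute_program(program, image):
    """Hypothetical: run the program, invoking specialized vision models."""
    raise NotImplementedError

def program_to_rationale(program, trace):
    """Hypothetical: translate an executed program into a chain of reasoning steps."""
    raise NotImplementedError

def build_distillation_set(dataset):
    """For each (image, question, answer), keep the first sampled program whose
    execution reproduces the ground-truth answer, and convert it into a
    rationale that a VLM can be instruction-tuned on."""
    examples = []
    for image, question, answer in dataset:
        for program in generate_programs(question):
            result, trace = execute_program(program, image)
            if result == answer:           # verification step
                rationale = program_to_rationale(program, trace)
                examples.append(DistillationExample(image, question, rationale, answer))
                break                      # one verified program is enough
    return examples
```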
Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use.
In Proceedings of the Forty-First IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024.
From content moderation to wildlife conservation, the number of applications that require models to recognize nuanced or subjective visual concepts is growing. Traditionally, developing classifiers for such concepts requires substantial manual effort measured in hours, days, or even months to identify and annotate data needed for training. Even with recently proposed Agile Modeling techniques, which enable rapid bootstrapping of image classifiers, users are still required to spend 30 minutes or more of monotonous, repetitive data labeling just to train a single classifier. Drawing on Fiske’s Cognitive Miser theory, we propose a new framework that alleviates manual effort by replacing human labeling with natural language interactions, reducing the total effort required to define a concept by an order of magnitude: from labeling 2,000 images to only 100 plus some natural language interactions. Our framework leverages recent advances in foundation models, both large language models and vision-language models, to carve out the concept space through conversation and by automatically labeling training data points. Most importantly, our framework eliminates the need for crowd-sourced annotations. Moreover, our framework ultimately produces lightweight classification models that are deployable in cost-sensitive scenarios. Across 15 subjective concepts and across 2 public image classification datasets, our trained models outperform traditional Agile Modeling as well as state-of-the-art zero-shot classification models like ALIGN, CLIP, CuPL, and large visual question-answering models like PaLI-X.
Scaling Up LLM Reviews for Google Ads Content Moderation.
In Industry Day Proceedings of the 17th ACM International Conference on Web Search and Data Mining (WSDM) 2024.
Large language models (LLMs) are powerful tools for content moderation, but their inference costs and latency make them prohibitive for casual use on large datasets, such as the Google Ads repository. This study proposes a method for scaling up LLM reviews for content moderation in Google Ads. First, we use heuristics to select candidates via filtering and duplicate removal, and create clusters of ads for which we select one representative ad per cluster. We then use LLMs to review only the representative ads. Finally, we propagate the LLM decisions for the representative ads back to their clusters. This method reduces the number of reviews by more than 3 orders of magnitude while achieving 2x the recall of a baseline non-LLM model. The success of this approach is a strong function of the representations used in clustering and label propagation; we found that cross-modal similarity representations yield better results than uni-modal representations.
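A minimal sketch of the cluster-review-propagate scheme, with k-means as a stand-in for the clustering step and hypothetical embed_ad/llm_review placeholders for the cross-modal representation and the LLM reviewer:

```python
# Sketch of reviewing only one representative ad per cluster and propagating
# its decision; embed_ad() and llm_review() are hypothetical placeholders.

import numpy as np
from sklearn.cluster import KMeans

def embed_ad(ad):
    """Hypothetical cross-modal (text + image) embedding of an ad."""
    raise NotImplementedError

def llm_review(ad):
    """Hypothetical LLM review; returns True if the ad violates policy."""
    raise NotImplementedError

def review_with_propagation(ads, n_clusters):
    embeddings = np.stack([embed_ad(ad) for ad in ads])
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)

    decisions = [None] * len(ads)
    for c in range(n_clusters):
        members = np.flatnonzero(clusters.labels_ == c)
        # Representative: the member closest to the cluster centroid.
        dists = np.linalg.norm(embeddings[members] - clusters.cluster_centers_[c], axis=1)
        rep = members[np.argmin(dists)]
        verdict = llm_review(ads[rep])   # only the representative goes to the LLM
        for i in members:                # propagate its decision to the cluster
            decisions[i] = verdict
    return decisions
```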
2023
Benchmarking Robustness to Adversarial Image Obfuscations.
In Proceedings of the Thirty-Seventh Conference on Neural Information Processing Systems 2023.
Automated content filtering and moderation is an important tool that allows online platforms to build thriving user communities that facilitate cooperation and prevent abuse. Unfortunately, resourceful actors try to bypass automated filters in a bid to post content that violates platform policies and codes of conduct. To reach this goal, these malicious actors may obfuscate policy-violating images (e.g. overlaying harmful images with carefully selected benign images or visual patterns) to prevent machine learning models from reaching the correct decision. In this paper, we invite researchers to tackle this specific issue and present a new image benchmark. This benchmark, based on ImageNet, simulates the type of obfuscations created by malicious actors. It goes beyond ImageNet-C and ImageNet-C̄ by proposing general, drastic, adversarial modifications that preserve the original content intent. It aims to tackle a more common adversarial threat than the one considered by Lp-norm bounded adversaries. We evaluate 33 pretrained models on the benchmark and train models with different augmentations, architectures and training methods on subsets of the obfuscations to measure generalization. We hope this benchmark will encourage researchers to test their models and methods and try to find new approaches that are more robust to these obfuscations.
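For illustration only (this is not code from the benchmark), one of the simplest obfuscations of the kind described above can be produced by alpha-blending a benign overlay onto a source image, changing the pixels drastically while preserving the content intent:

```python
# Toy example of an overlay-style obfuscation: blend a benign overlay image
# into the source image. Paths and alpha value are illustrative assumptions.

from PIL import Image

def overlay_obfuscation(source_path, overlay_path, alpha=0.5):
    source = Image.open(source_path).convert("RGB")
    overlay = Image.open(overlay_path).convert("RGB").resize(source.size)
    # Blend: (1 - alpha) * source + alpha * overlay, per pixel.
    return Image.blend(source, overlay, alpha)

# Usage: obfuscated = overlay_obfuscation("dog.jpg", "texture.jpg", alpha=0.4)
```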
Agile Modeling: From Concept to Classifier in Minutes.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 2-6, 2023.
The application of computer vision methods to nuanced, subjective concepts is growing. While crowdsourcing has served the vision community well for most objective tasks (such as labeling a “zebra”), it now falters on tasks where there is substantial subjectivity in the concept (such as identifying “gourmet tuna”). However, empowering any user to develop a classifier for their concept is technically difficult: users are neither machine learning experts nor patient enough to label thousands of examples. In reaction, we introduce the problem of Agile Modeling: the process of turning any subjective visual concept into a computer vision model through real-time user-in-the-loop interactions. We instantiate an Agile Modeling prototype for image classification and show through a user study (N=14) that users can create classifiers with minimal effort in under 30 minutes. We compare this user-driven process with the traditional crowdsourcing paradigm and find that the crowd’s notion often differs from the user’s, especially as the concepts become more subjective. Finally, we scale our experiments with simulations of users training classifiers for ImageNet21k categories to further demonstrate the efficacy of the approach.
2021
Coarse-to-Fine Curriculum Learning.
arXiv preprint 2021.
When faced with learning challenging new tasks, humans often follow sequences of steps that allow them to incrementally build up the necessary skills for performing these new tasks. However, in machine learning, models are most often trained to solve the target tasks directly. Inspired by human learning, we propose a novel curriculum learning approach which decomposes challenging tasks into sequences of easier intermediate goals that are used to pre-train a model before tackling the target task. We focus on classification tasks, and design the intermediate tasks using an automatically constructed label hierarchy. We train the model at each level of the hierarchy, from coarse labels to fine labels, transferring acquired knowledge across these levels. For instance, the model will first learn to distinguish animals from objects, and then use this acquired knowledge when learning to classify among more fine-grained classes such as cat, dog, car, and truck. Most existing curriculum learning algorithms for supervised learning consist of scheduling the order in which the training examples are presented to the model. In contrast, our approach focuses on the output space of the model. We evaluate our method on several established datasets and show significant performance gains, especially on classification problems with many labels. We also evaluate on a new synthetic dataset which allows us to study multiple aspects of our method.
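A minimal PyTorch sketch of the level-by-level training described above, assuming a two-level fine-to-coarse label map is already available; the backbone, input shape, and dataset are simplified placeholders rather than the paper's actual setup:

```python
# Sketch: train on coarse labels first, then reuse the backbone for fine labels.

import torch
import torch.nn as nn

def make_model(num_outputs, backbone=None):
    body = backbone if backbone is not None else nn.Sequential(
        nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU())   # toy backbone
    return body, nn.Linear(256, num_outputs)

def train_level(body, head, loader, label_map=None, epochs=1):
    opt = torch.optim.Adam(list(body.parameters()) + list(head.parameters()), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            if label_map is not None:      # project fine labels to coarse ones
                y = label_map[y]
            opt.zero_grad()
            loss = loss_fn(head(body(x)), y)
            loss.backward()
            opt.step()
    return body

def coarse_to_fine(loader, fine_to_coarse, num_fine, num_coarse):
    # Level 1: learn to separate coarse classes (e.g. animals vs. objects).
    body, coarse_head = make_model(num_coarse)
    body = train_level(body, coarse_head, loader, label_map=fine_to_coarse)
    # Level 2: reuse the backbone, attach a fine head, train on fine labels.
    body, fine_head = make_model(num_fine, backbone=body)
    train_level(body, fine_head, loader)
    return body, fine_head

# fine_to_coarse would be a LongTensor of shape [num_fine], e.g. mapping
# {cat, dog} -> animal and {car, truck} -> object.
```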
2020
Modeling Task Effects on Meaning Representation in the Brain via Zero-Shot MEG Prediction.
In Proceedings of the Thirty-fourth Conference on Neural Information Processing Systems 2020.
How meaning is represented in the brain is still one of the big open questions in neuroscience. Does a word (e.g., bird) always have the same representation, or does the task under which the word is processed alter its representation (answering "can you eat it?" versus "can it fly?")? The brain activity of subjects who read the same word while performing different semantic tasks has been shown to differ across tasks. However, it is still not understood how the task itself contributes to this difference. In the current work, we study Magnetoencephalography (MEG) brain recordings of participants tasked with answering questions about concrete nouns. We investigate the effect of the task (i.e. the question being asked) on the processing of the concrete noun by predicting the millisecond-resolution MEG recordings as a function of both the semantics of the noun and the task. Using this approach, we test several hypotheses about the task-stimulus interactions by comparing the zero-shot predictions made by these hypotheses for novel tasks and nouns not seen during training. We find that incorporating the task semantics significantly improves the prediction of MEG recordings, across participants. The improvement occurs 475-550ms after the participants first see the word, which corresponds to what is considered to be the ending time of semantic processing for a word. These results suggest that only the end of semantic processing of a word is task-dependent, and pose a challenge for future research to formulate new hypotheses for earlier task effects as a function of the task and stimuli.
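A simplified sketch of the zero-shot setup described above: a ridge regression predicts MEG data from noun features, task features, and a simple interaction term, and is scored on noun/task combinations held out of training. The correlation score here is a simplified assumption standing in for the paper's full evaluation protocol:

```python
# Sketch: zero-shot prediction of MEG from noun and task semantics.

import numpy as np
from sklearn.linear_model import Ridge

def build_features(noun_vec, task_vec):
    # Concatenate noun semantics, task semantics, and an interaction term.
    return np.concatenate([noun_vec, task_vec, np.outer(noun_vec, task_vec).ravel()])

def zero_shot_eval(trials, held_out_nouns, held_out_tasks):
    """trials: list of (noun_vec, task_vec, meg_vector, noun_id, task_id)."""
    train = [(n, t, m) for n, t, m, ni, ti in trials
             if ni not in held_out_nouns and ti not in held_out_tasks]
    test = [(n, t, m) for n, t, m, ni, ti in trials
            if ni in held_out_nouns or ti in held_out_tasks]

    X_train = np.stack([build_features(n, t) for n, t, _ in train])
    Y_train = np.stack([m for _, _, m in train])
    model = Ridge(alpha=1.0).fit(X_train, Y_train)

    X_test = np.stack([build_features(n, t) for n, t, _ in test])
    Y_test = np.stack([m for _, _, m in test])
    preds = model.predict(X_test)
    # Simplified score: mean correlation between predicted and recorded MEG per trial.
    return np.mean([np.corrcoef(p, y)[0, 1] for p, y in zip(preds, Y_test)])
```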
Contextual Parameter Generation for Knowledge Graph Link Prediction.
In Proceedings of the Thirty-fourth AAAI Conference on Artificial Intelligence 2020.
We consider the task of knowledge graph link prediction. Recent approaches tackle this problem by learning entity and relation embeddings. However, they often constrain the relationship between these embeddings to be additive (i.e., the embeddings are concatenated and then processed by a sequence of linear functions and element-wise non-linearities). We show that this type of interaction significantly limits representational power, and instead propose to use contextual parameter generation to address this limitation. More specifically, we treat relations as the context in which entities are processed to produce predictions, by using relation embeddings to generate the parameters of a model operating over entity embeddings. We apply our method to two existing link prediction methods, including the current state-of-the-art, resulting in significant performance gains and establishing a new state-of-the-art for this task. These gains are achieved while also reducing training time by up to 28 times.
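A minimal PyTorch sketch of the idea: a small generator network maps the relation embedding to the weights of a linear transform applied to the subject entity embedding. This is a simplified illustration under assumed dimensions, not the exact architecture from the paper:

```python
# Sketch: relation embeddings generate the parameters applied to entity embeddings.

import torch
import torch.nn as nn

class CPGLinkPredictor(nn.Module):
    def __init__(self, num_entities, num_relations, dim=200):
        super().__init__()
        self.dim = dim
        self.entities = nn.Embedding(num_entities, dim)
        self.relations = nn.Embedding(num_relations, dim)
        # Generator: relation embedding -> parameters of a dim x dim linear map.
        self.generator = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim * dim))

    def forward(self, subject_idx, relation_idx):
        e_s = self.entities(subject_idx)                      # [batch, dim]
        r = self.relations(relation_idx)                      # [batch, dim]
        W_r = self.generator(r).view(-1, self.dim, self.dim)  # [batch, dim, dim]
        # Relation-conditioned transformation of the subject embedding.
        h = torch.bmm(W_r, e_s.unsqueeze(-1)).squeeze(-1)     # [batch, dim]
        # Score against every candidate object entity.
        return h @ self.entities.weight.t()                   # [batch, num_entities]

# Training would minimize a cross-entropy loss over these scores against the
# true object entity, as in standard 1-vs-all link prediction.
```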
Coarse-to-Fine Curriculum Learning for Classification.
In International Conference on Learning Representations (ICLR) Workshop on Bridging AI and Cognitive Science (BAICS) 2020.
When faced with learning challenging new tasks, humans often follow sequences of steps that allow them to incrementally build up the necessary skills for performing these new tasks. However, in machine learning, models are most often trained to solve the target tasks directly. Inspired by human learning, we propose a novel curriculum learning approach which decomposes challenging tasks into sequences of easier intermediate goals that are used to pre-train a model before tackling the original task. We focus on classification tasks, and design the intermediate tasks using an automatically constructed label hierarchy. We train the model at each level of the hierarchy, from coarse labels to fine labels, transferring acquired knowledge across these levels. For instance, the model will first learn to distinguish animals from objects, and use this acquired knowledge when learning to classify more fine-grained classes such as "cat", "dog", "car", and "truck". We evaluate our method on several established datasets and show performance gains of up to 7% increase in classification accuracy.
2019
Graph Agreement Models for Semi-Supervised Learning.
In Proceedings of the Thirty-third Conference on Neural Information Processing Systems 2019.
Graph-based algorithms are among the most successful paradigms for solving semi-supervised learning tasks. Recent work on graph convolutional networks and neural graph learning methods has successfully combined the expressiveness of neural networks with graph structures. We propose a technique that, when applied to these methods, achieves state-of-the-art results on semi-supervised learning datasets. Traditional graph-based algorithms, such as label propagation, were designed with the underlying assumption that the label of a node can be imputed from that of the neighboring nodes. However, real-world graphs are often noisy or have edges that do not correspond to label agreement. To address this, we propose Graph Agreement Models (GAM), which introduces an auxiliary model that predicts the probability of two nodes sharing the same label as a learned function of their features. The agreement model is used when training a node classification model by encouraging agreement only for the pairs of nodes it deems likely to have the same label, thus guiding its parameters to better local optima. The classification and agreement models are trained jointly in a co-training fashion. Moreover, GAM can be applied to any semi-supervised classification problem, by inducing a graph whenever one is not provided. We demonstrate that our method achieves a relative improvement of up to 72% for various node classification models, and obtains state-of-the-art results on multiple established datasets.
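A minimal PyTorch sketch of the agreement-weighted objective described above; the alternating co-training schedule is omitted and only a single loss computation is shown, with model sizes chosen arbitrarily for illustration:

```python
# Sketch: an agreement model weights a consistency penalty over graph edges.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Classifier(nn.Module):
    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_classes))
    def forward(self, x):
        return self.net(x)

class AgreementModel(nn.Module):
    """P(same label | features of the two endpoints of an edge)."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, x_i, x_j):
        return torch.sigmoid(self.net(torch.cat([x_i, x_j], dim=-1))).squeeze(-1)

def gam_loss(clf, agree, x, labels, labeled_mask, edges, reg=1.0):
    logits = clf(x)
    supervised = F.cross_entropy(logits[labeled_mask], labels[labeled_mask])
    # Consistency term: penalize disagreement between predictions at the two
    # ends of an edge, weighted by how likely the agreement model thinks the
    # endpoints share a label.
    i, j = edges[:, 0], edges[:, 1]
    p_i = F.softmax(logits[i], dim=-1)
    p_j = F.softmax(logits[j], dim=-1)
    disagreement = ((p_i - p_j) ** 2).sum(dim=-1)
    weights = agree(x[i], x[j]).detach()   # agreement model trained in its own step
    return supervised + reg * (weights * disagreement).mean()
```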
Efficient Multitask Feature and Relationship Learning.
In Proceedings of the 2019 Annual Conference on Uncertainty in Artificial Intelligence 2019.
We consider a multitask learning problem, in which several predictors are learned jointly. Prior research has shown that learning the relations between tasks, and between the input features, together with the predictor, can lead to better generalization and interpretability, which proved to be useful for applications in many domains. In this paper, we consider a formulation of multitask learning that learns the relationships both between tasks and between features, represented through a task covariance and a feature covariance matrix, respectively. First, we demonstrate that existing methods proposed for this problem present an issue that may lead to ill-posed optimization. We then propose an alternative formulation, as well as an efficient algorithm to optimize it. Using ideas from optimization and graph theory, we propose an efficient coordinate-wise minimization algorithm that has a closed form solution for each block subproblem. Our experiments show that the proposed optimization method is orders of magnitude faster than its competitors. We also provide a nonlinear extension that is able to achieve better generalization than existing methods.
Investigating Task Effects on Brain Activity During Stimulus Presentation in MEG.
In Human Brain Mapping Conference 2019.
Recorded brain activity of subjects who perceive the same stimulus (e.g. a word) while performing different semantic tasks (e.g. identifying whether the word belongs to a particular category) has been shown to differ across tasks. However, it is not well understood how precisely the task contributes to this brain activity. In the current work, we propose multiple hypotheses of how possible interactions between the task and stimulus semantics can be related to the observed brain activity. We test these hypotheses by designing machine learning models to represent each hypothesis, training them to predict the recorded brain activity, and comparing their performance. We show that incorporating task semantics improves the prediction of single-trial MEG data by an average of 10% across subjects.
Competence-based Curriculum Learning for Neural Machine Translation.
In Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics 2019.
Current state-of-the-art NMT systems use large neural networks that are not only slow to train, but also often require many heuristics and optimization tricks, such as specialized learning rate schedules and large batch sizes. This is undesirable as it requires extensive hyperparameter tuning. In this paper, we propose a curriculum learning framework for NMT that reduces training time, reduces the need for specialized heuristics or large batch sizes, and results in overall better performance. Our framework consists of a principled way of deciding which training samples are shown to the model at different times during training, based on the estimated difficulty of a sample and the current competence of the model. Filtering training samples in this manner prevents the model from getting stuck in bad local optima, making it converge faster and reach a better solution than the common approach of uniformly sampling training examples. Furthermore, the proposed method can be easily applied to existing NMT models by simply modifying their input data pipelines. We show that our framework can help improve the training time and the performance of both recurrent neural network models and Transformers, achieving up to a 70% decrease in training time, while at the same time obtaining accuracy improvements of up to 2.2 BLEU.
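A minimal sketch of the sampling rule, assuming sentence-length difficulty and a square-root competence schedule; these specific choices are simplifications (the framework also admits, for example, word-rarity difficulty and a linear schedule):

```python
# Sketch: sample only training examples whose difficulty is below the model's
# current competence.

import numpy as np

def length_difficulty(sentences):
    lengths = np.array([len(s.split()) for s in sentences], dtype=float)
    # Difficulty = rank of the sentence length, rescaled to [0, 1].
    return lengths.argsort().argsort() / (len(lengths) - 1)

def competence(step, total_steps, c0=0.01):
    t = min(step / total_steps, 1.0)
    return min(1.0, float(np.sqrt(t * (1 - c0 ** 2) + c0 ** 2)))

def sample_batch(rng, difficulties, step, total_steps, batch_size=64):
    c = competence(step, total_steps)
    eligible = np.flatnonzero(difficulties <= c)
    return rng.choice(eligible, size=batch_size, replace=True)

# Usage: at each training step, draw indices with sample_batch(...) and feed
# the corresponding sentence pairs to the NMT model's input pipeline.
```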
Subthalamic nucleus and sensorimotor cortex activity during speech production.
In The Journal of Neuroscience: the official journal of the Society for Neuroscience 2019.
The sensorimotor cortex is somatotopically organized to represent the vocal tract articulators, such as lips, tongue, larynx, and jaw. How speech and articulatory features are encoded at the subcortical level, however, remains largely unknown. We analyzed local field potential (LFP) recordings from the subthalamic nucleus (STN) and simultaneous electrocorticography recordings from the sensorimotor cortex of 11 human subjects (1 female) with Parkinson’s disease during implantation of deep brain stimulation (DBS) electrodes, while they read aloud three-phoneme words. The initial phonemes involved either articulation primarily with the tongue (coronal consonants) or the lips (labial consonants). We observed significant increases in high gamma (60-150 Hz) power in both the STN and the sensorimotor cortex that began before speech onset and persisted for the duration of speech articulation. As expected from previous reports, in the sensorimotor cortex, the primary articulators involved in the production of the initial consonants were topographically represented by high gamma activity. We found that STN high gamma activity also demonstrated specificity for the primary articulator, although no clear topography was observed. In general, subthalamic high gamma activity varied along the ventral-dorsal trajectory of the electrodes, with greater high gamma power recorded in the dorsal locations of the STN. Interestingly, the majority of significant articulator-discriminative activity in the STN occurred prior to that in sensorimotor cortex. These results demonstrate that articulator-specific speech information is contained within high gamma activity of the STN, but with different spatial and temporal organization compared to similar information encoded in the sensorimotor cortex. Clinical and electrophysiological evidence suggest that the subthalamic nucleus is involved in speech, however, this important basal ganglia node is ignored in current models of speech production. We previously showed that subthalamic nucleus neurons differentially encode early and late aspects of speech production, but no previous studies have examined subthalamic functional organization for speech articulators. Using simultaneous local field potential recordings from the sensorimotor cortex and the subthalamic nucleus in patients with Parkinson’s disease undergoing deep brain stimulation surgery, we discovered that subthalamic nucleus high gamma activity tracks speech production at the level of vocal tract articulators, prior to the onset of vocalization and often prior to related cortical encoding.
2017
BrainZoom: High Resolution Reconstruction from Multi-modal Brain Signals.
In Proceedings of the 2017 SIAM International Conference on Data Mining 2017.
How close can we zoom in to observe brain activity? Our understanding is limited by the resolution of imaging modalities that exhibit good spatial but poor temporal resolution, or vice-versa. In this paper, we propose BRAINZOOM, an efficient imaging algorithm that cross-leverages multi-modal brain signals. BRAINZOOM (a) constructs high resolution brain images from multi-modal signals, (b) is scalable, and (c) is flexible in that it can easily incorporate various priors on the brain activities, such as sparsity, low rank, or smoothness. We carefully formulate the problem to tackle nonlinearity in the measurements (via variable splitting) and auto-scale between different modal signals, and judiciously design an inexact alternating optimization-based algorithmic framework to handle the problem with provable convergence guarantees. Our experiments using a popular realistic brain signal simulator to generate fMRI and MEG demonstrate that high spatio-temporal resolution brain imaging is possible from these two modalities. The experiments also suggest that smoothness seems to be the best prior, among several we tried.
Efficient Multi-task Feature and Relationship Learning.
In Neural Information Processing Systems Workshop on Learning With Limited Data 2017.
We propose a multi-convex framework for multitask learning that improves predictions by learning relationships both between tasks and between features. Our framework is a generalization of related methods that learn either task relationships or feature relationships, but not both. We start with a hierarchical Bayesian model, and use the empirical Bayes method to transform the underlying inference problem into a multi-convex problem. To tackle the multi-convex optimization problem, we propose a block coordinate-wise minimization algorithm that has a closed form solution for each block subproblem. Naively these solutions would be expensive to compute, but by using the theory of doubly stochastic matrices, we are able to reduce the covariance learning subproblem to a minimum-weight perfect matching problem on a complete bipartite graph, and solve it analytically and efficiently. To solve the weight learning subproblem, we propose three different strategies that can be used no matter whether the instances are shared by multiple tasks or not. We demonstrate the efficiency of our method on both synthetic datasets and real-world datasets. Experiments show that the proposed optimization method is orders of magnitude faster than the previous projected gradient method, and our model is able to exploit the correlation structures among multiple tasks and features.
Understanding the neural basis of speech production using Machine Learning.
Master's Thesis at Carnegie Mellon University 2017.
Background. Understanding how neurons act together to produce speech is still an open problem. Several studies have attempted to decode different aspects of speech from cortical neural activity while a person is speaking [1, 8], but evidence from the medical domain [2] suggests that other brain regions, such as the subthalamic nucleus (STN), may also be involved in speech production.
Aim. We explore the problem of decoding properties of speech (e.g. volume, manner, articulators used) from neural activity recordings. From a neuroscience perspective, the goal is to understand which properties are encoded in different parts of the cortex and STN. From a machine learning (ML) perspective, we are interested in discovering what models are best at extracting relevant information from the scarce and noisy neural activity data.
Data. The brain signals were recorded from 14 human subjects while they read words out loud. The data consists of ECoG recordings from the ventral primary motor and primary sensory cortical areas, and Local Field Potentials from the STN.
Proposed Approach. We approach several decoding tasks: (1) predicting when a person is speaking or not, from their neural activity, (2) predicting the manner of speech (e.g. nasal, plosive) and what articulators (e.g. tongue, lips) are used, and (3) predicting the volume/loudness of the speech. For each of these tasks, we apply a series of ML methods, from simple regression models (e.g. ridge regression) to deep learning models (e.g. recurrent neural networks). We apply these models on different levels of preprocessing of the data, from electrode signals in time domain to particular frequency bands. The goal is to understand which models are able to discover interesting patterns, with different levels of domain knowledge required for preprocessing. Finally, we use our best models to discover which areas of the brain encode different kinds of information about speech production.
Results. Our analysis shows that machine learning models are able to discover different speech features from the neural activity. We were able to classify whether or not a subject is speaking from their neural activity in the primary motor and primary sensory cortex with up to 96% accuracy, and up to 80% accuracy from the STN (an area whose connection to speech production is not entirely understood). We were also able to decode certain features of speech (e.g. voicing, manner) and inspect the brain regions and time points that contribute to the prediction. From an ML perspective, we observe that even simple models overfit easily on our dataset due to the low-sample, high-dimensionality problem, and that parameter tuning and proper regularization methods are crucial in making accurate predictions. Finally, we recommend neural-network-based models if time and computation resources are available for tuning the parameters, and simple models otherwise.
Broader impacts. Our work has important practical applications in the medical domain. For example, a neural decoder can be used in neuroprosthetics to enable communication for the impaired. Moreover, understanding the involvement of the STN in speech can improve the deep brain stimulation techniques used for treating Parkinson’s disease patients. More generally, our work provides a useful overview of which ML models are suitable for dealing with different modalities of brain data, which can facilitate further neuroscience studies.
Keywords: neuroscience, speech production, ECoG, Local Field Potentials, machine learning.
2015
Multiple Frames Matching for Object Discovery in Video.
In Proceedings of the 26th British Machine Vision Conference (BMVC) 2015.
Automatic discovery of foreground objects in video sequences is an important problem in computer vision with applications to object tracking, video segmentation and classification. We propose an efficient method for the discovery of object bounding boxes and the corresponding soft-segmentation masks across multiple video frames. We offer a graph matching formulation for bounding box selection and refinement using second and higher order terms. Our objective function takes into consideration local, frame-based information, as well as spatiotemporal and appearance consistency over multiple frames. First, we find an initial pool of candidate boxes using a novel and fast foreground estimation method in video, based on Principal Component Analysis. Then, we match the boxes across multiple frames using pairwise geometric and appearance terms. Finally, we refine their location and soft-segmentation using higher order potentials that establish appearance regularity over multiple frames. We test our method on the large scale YouTube-Objects dataset and obtain state-of-the-art results on several object classes.
A multi-method driven evaluation of molecular imaging techniques.
In Poster presentation at the 10th annual meeting of the European Society for Molecular Imaging (ESMI) 2015.
In the past decade, we have witnessed significant advancements in image analysis methods. However, the analysis of molecular images such as fluorescent in situ hybridization (FISH) still relies heavily on manual evaluation by experts. For automated image analysis to be adopted by clinicians, there is a need for more reliable tools that can extract rich information from molecular images. The first step in developing such tools is to evaluate the performance of state-of-the-art image analysis methods. This assessment could be further used to develop a multi-method approach that would integrate partial information extracted by each constituent method, to build an augmented microscopic reality. We investigate a set of techniques which have shown promising results in the analysis of images that share many similarities with molecular images: namely, wavelets, which have successfully been used to describe structural patterns in astronomical images [1], and Markov Random Fields. We apply them to a benchmark of molecular FISH images and evaluate their performance with respect to two key tasks: (1) denoising and thresholding; (2) segmentation and detection of various structures (e.g. interphase nuclei, metaphase chromosomes, multi-colored probes). We further test whether data from high-throughput Chromosome Conformation Capture (Hi-C) could enrich the information obtained. We show the results of our analysis and identify which techniques are best for describing certain properties of images (e.g. intensity, edges). We further discuss methodologies for integrating complementary methods and data to improve the overall performance. The integration of multiple methods can result in overall improvement with respect to key tasks in the analysis of molecular images. Coupling a multi-method approach with Chromosome Conformation Capture data can help to extract additional information from images.
[1] Mertens, F., and Lobanov, A. (2014). Wavelet-based decomposition and analysis of structural patterns in astronomical images. arXiv preprint arXiv:1410.3732.