- Involvement in Scientific Societies
- Facebook research being presented at CVPR
- Scalable active learning for multiclass image classification.
- Scientific Involvement
Description: The two main motivators in computer vision research are to develop algorithms to solve vision problems and to understand and model the human visual system.
Specifically, our networks contain blocks that denoise the features using non-local means or other filters; the entire networks are trained end-to-end. When combined with adversarial training, our feature denoising networks substantially improve the state of the art in adversarial robustness in both white-box and black-box attack settings.
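The non-local means filtering mentioned above can be illustrated with a minimal sketch. It assumes features are given as a flat list of vectors (one per spatial position) and that similarity is a softmax over dot products; the function name `non_local_means` and the `temperature` parameter are hypothetical, and the sketch omits everything else a trainable denoising block would contain.

```python
import math


def non_local_means(features, temperature=1.0):
    """Replace each feature vector with a similarity-weighted mean of all vectors.

    Assumption: similarity is a softmax over dot products. This is one common
    choice and is not confirmed by the text as the one the networks use.
    """
    denoised = []
    for query in features:
        # Dot-product similarity between this position and every position.
        scores = [
            sum(q * k for q, k in zip(query, key)) / temperature
            for key in features
        ]
        # Numerically stable softmax over the similarities.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted mean of all feature vectors, computed per channel.
        denoised.append([
            sum(w * f[d] for w, f in zip(weights, features))
            for d in range(len(query))
        ])
    return denoised
```

Because every output is a convex combination of the inputs, isolated spikes in a feature map are pulled toward the values at similar positions, which is the smoothing effect a denoising block relies on.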
Graph-Based Global Reasoning Networks.
Globally modeling and reasoning over relations between regions can be beneficial for many computer vision tasks on both images and videos. Convolutional neural networks (CNNs) excel at modeling local relations by convolution operations, but they are typically inefficient at capturing global relations between distant regions and require stacking multiple convolution layers. In this work, we propose a new approach for reasoning globally in which a set of features is globally aggregated over the coordinate space and then projected to an interaction space where relational reasoning can be efficiently computed.
After reasoning, relation-aware features are distributed back to the original coordinate space for downstream tasks. We further present a highly efficient instantiation of the proposed approach and introduce the Global Reasoning unit (GloRe unit), which implements the coordinate-interaction space mapping by weighted global pooling and weighted broadcasting, and the relation reasoning via graph convolution on a small graph in interaction space.
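The three steps described above (weighted global pooling, graph convolution on a small graph, weighted broadcasting) can be sketched with plain matrix products. This is an illustrative sketch only: the function name `glore_unit`, the argument names, and the simplified graph convolution (adjacency times nodes times a weight matrix) are assumptions, and in a real unit the projection, adjacency, and weights would be learned rather than passed in.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def glore_unit(features, projection, adjacency, node_weights):
    """Sketch of global reasoning over L positions with C channels and N nodes.

    features:     L x C  (coordinate-space features)
    projection:   N x L  (pooling weights, hypothetical fixed values here)
    adjacency:    N x N  (graph over interaction-space nodes)
    node_weights: C x C  (per-node state update)
    """
    # Weighted global pooling: coordinate space -> interaction space (N x C).
    nodes = matmul(projection, features)
    # Graph convolution: mix information across nodes, then update node states.
    reasoned = matmul(matmul(adjacency, nodes), node_weights)
    # Weighted broadcasting: interaction space -> coordinate space (L x C).
    return matmul(transpose(projection), reasoned)
```

Since the graph has only N nodes, the reasoning step costs a fixed amount regardless of how many positions the feature map has, which is where the efficiency of working in the interaction space comes from.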
The proposed GloRe unit is lightweight, end-to-end trainable, and can be easily plugged into existing CNNs for a wide range of tasks.
Grounded Video Description.
Corso, and Marcus Rohrbach.
Video description is one of the most challenging problems in vision and language understanding due to the large variability on both the video and the language side. Models, hence, typically shortcut the difficulty in recognition and generate plausible sentences that are based on priors but are not necessarily grounded in the video.
In this work, we explicitly link the sentence to the evidence in the video by annotating each noun phrase in a sentence with the corresponding bounding box in one of the frames of a video. Our data set, ActivityNet-Entities, augments the challenging ActivityNet Captions data set with K bounding box annotations, each grounding a noun phrase. To generate grounded captions, we propose a novel video description model which is able to exploit these bounding box annotations. We demonstrate the effectiveness of our model on our data set, but also show how it can be applied to image description on the Flickr30k Entities data set.
We achieve state-of-the-art performance on video description, video paragraph description, and image description and demonstrate that our generated sentences are better grounded in the video.
Jawahar, and Manohar Paluri.
Road network extraction from satellite images often produces fragmented road segments, leading to road maps unfit for real applications.
Pixel-wise classification fails to predict topologically correct and connected road masks due to the absence of connectivity supervision and the difficulty of enforcing topological constraints. In this paper, we propose a connectivity task called Orientation Learning, motivated by the human behavior of annotating roads by tracing them at a specific orientation.
We also develop a stacked multi-branch convolutional module to effectively utilize the mutual information between orientation learning and segmentation tasks.
These contributions ensure that the model predicts topologically correct and connected road masks. We also propose a Connectivity Refinement approach to further enhance the estimated road networks. The refinement model is pretrained to connect and refine the corrupted ground-truth masks and later fine-tuned to enhance the predicted road masks. We demonstrate the advantages of our approach on two diverse road extraction data sets, SpaceNet and DeepGlobe.
People enjoy food photography because they appreciate food. Behind each meal there is a story described in a complex recipe and, unfortunately, by simply looking at a food image we do not have access to its preparation process. Therefore, in this paper we introduce an inverse cooking system that recreates cooking recipes given food images. Our system predicts ingredients as sets by means of a novel architecture, modeling their dependencies without imposing any order, and then generates cooking instructions by attending to both the image and its inferred ingredients simultaneously. We extensively evaluate the whole system on the large-scale Recipe1M data set and show that (1) we improve performance w.
Modern computer vision algorithms have brought significant advancement to 3D geometry reconstruction. However, illumination and material reconstruction remain less studied, with current approaches assuming very simplified models for materials and illumination.
We introduce Inverse Path Tracing, a novel approach to jointly estimate the material properties of objects and light sources in indoor scenes by using an invertible light transport simulation. We assume a coarse geometry scan, along with corresponding images and camera poses.
The key contribution of this work is an accurate and simultaneous retrieval of light sources and physically based material properties e. To this end, we introduce a novel optimization method using a differentiable Monte Carlo renderer that computes derivatives with respect to the estimated unknown illumination and material properties.
This enables joint optimization for physically correct light transport and material models using a tailored stochastic gradient descent.
Yu-Chuan Su and Kristen Grauman.
Given a source CNN for perspective images as input, the Kernel Transformer Network (KTN) produces a function parameterized by a polar angle and kernel as output. Distinct from all existing methods, KTNs allow model transfer: the same model can be applied to different source CNNs with the same base architecture. This enables application to multiple recognition tasks without retraining the KTN. Validating our approach with multiple source CNNs and data sets, we show that KTNs improve the state of the art for spherical convolution.
Current fully supervised video data sets consist of only a few hundred thousand videos and fewer than a thousand domain-specific labels. This hinders progress towards advanced video architectures.
This paper presents an in-depth study of using large volumes of web videos for pretraining video models for the task of action recognition. Our primary empirical finding is that pretraining at a very large scale (over 65 million videos), despite relying on noisy social-media videos and hashtags, substantially improves the state of the art on three challenging public action recognition data sets.
Further, we examine three questions in the construction of weakly supervised video action data sets. First, given that actions involve interactions with objects, how should one construct a verb-object pretraining label space to benefit transfer learning the most? Second, frame-based models perform quite well on action recognition; is pretraining for good image features sufficient or is pretraining for spatio-temporal features valuable for optimal transfer learning?
Finally, actions are generally less well localized in long videos vs.
We present LBS-AE, a self-supervised autoencoding algorithm for fitting articulated mesh models to point clouds. As input, we take a sequence of point clouds to be registered as well as an artist-rigged mesh, i. As output, we learn an LBS-based autoencoder that produces registered meshes from the input point clouds. To bridge the gap between the artist-defined geometry and the captured point clouds, our autoencoder models pose-dependent deviations from the template geometry.
During training, instead of using explicit correspondences, such as key points or pose supervision, our method leverages LBS deformations to bootstrap the learning process. To avoid poor local minima from erroneous point-to-point correspondences, we utilize a structured Chamfer distance based on part-segmentations, which are learned concurrently using self-supervision.
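To make the idea of a part-structured Chamfer distance concrete, here is a minimal sketch. It assumes points are tuples of coordinates and that each point already carries an integer part label; in the method described above the segmentation is learned concurrently, which this sketch does not attempt. The names `chamfer` and `structured_chamfer` are hypothetical, and the use of mean squared distances is one common convention rather than a detail taken from the text.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def one_sided(src, dst):
    """Mean squared distance from each source point to its nearest target."""
    return sum(min(squared_distance(p, q) for q in dst) for p in src) / len(src)


def chamfer(points_a, points_b):
    """Symmetric Chamfer distance between two point sets."""
    return one_sided(points_a, points_b) + one_sided(points_b, points_a)


def structured_chamfer(points_a, labels_a, points_b, labels_b):
    """Sum of per-part Chamfer distances.

    Nearest neighbors are searched only among points with the same part
    label. Parts present in only one set are ignored in this sketch.
    """
    total = 0.0
    for part in sorted(set(labels_a) & set(labels_b)):
        part_a = [p for p, l in zip(points_a, labels_a) if l == part]
        part_b = [p for p, l in zip(points_b, labels_b) if l == part]
        total += chamfer(part_a, part_b)
    return total
```

Restricting the nearest-neighbor search to matching parts prevents, for example, a point on one finger from being matched to a nearby point on a different finger, which is the kind of erroneous correspondence that leads to poor local minima.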
We demonstrate qualitative results on real captured hands and report quantitative evaluations on the FAUST benchmark for body registration.
Our method achieves performance that is superior to other unsupervised approaches and comparable to methods using supervised examples.
Highlight detection has the potential to significantly ease video browsing, but existing methods often suffer from expensive supervision requirements, where human viewers must manually identify highlights in training videos. We propose a scalable unsupervised solution that exploits video duration as an implicit supervision signal. Our key insight is that video segments from shorter user-generated videos are more likely to be highlights than those from longer videos, since users tend to be more selective about the content when capturing shorter videos.
Leveraging this insight, we introduce a novel ranking framework that prefers segments from shorter videos while properly accounting for the inherent noise in the unlabeled training data. We use it to train a highlight detector with 10M hashtagged Instagram videos. In experiments on two challenging public video highlight detection benchmarks, our method substantially improves the state of the art for unsupervised highlight detection.
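A ranking objective that prefers segments from shorter videos can be sketched as a pairwise hinge loss. This is a simplified illustration, not the framework described above: it assumes a scoring model has already produced one score per segment, the function name `pairwise_ranking_loss` and the `margin` parameter are hypothetical, and it does not model the label noise that the described method accounts for.

```python
def pairwise_ranking_loss(short_scores, long_scores, margin=1.0):
    """Mean hinge loss over all (short-video segment, long-video segment) pairs.

    A pair contributes zero loss once the short-video segment outscores
    the long-video segment by at least `margin`.
    """
    losses = [
        max(0.0, margin - s + l)
        for s in short_scores
        for l in long_scores
    ]
    return sum(losses) / len(losses)
```

For instance, `pairwise_ranking_loss([2.0, 3.0], [0.0, 0.5])` is zero because every short-video segment already outscores every long-video segment by the margin, while `pairwise_ranking_loss([0.0], [1.0])` is 2.0 because the ordering is violated.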
To understand the world, we humans constantly need to relate the present to the past, and put events in context.